Discussion on the development status of speech recognition technology at home and abroad

Speech recognition means converting the content of a person's speech into a machine-readable input such as a key press, a binary code, or a character sequence. It differs from speaker recognition, which aims to identify or verify the person who is speaking rather than the content of what is said. The purpose of speech recognition is to let machines understand human spoken language, and this covers two aspects: the first is to understand the speech verbatim and convert it into written text; the second is to comprehend the commands or requests contained in the spoken language and respond to them correctly, rather than merely converting every word correctly.

In 1952, Davis et al. at AT&T Bell Labs developed the Audrey system, the first speech recognition system capable of recognizing the ten English digits. In 1956, Olson and Belar of RCA Laboratories in Princeton, USA, developed a system that recognized 10 monosyllabic words, using spectral parameters obtained from a bank of bandpass filters as the recognition features. In 1959, Fry and Denes attempted to build a phoneme recognizer for 4 vowels and 9 consonants, using spectrum analysis and pattern matching to make decisions, which greatly improved the efficiency and accuracy of speech recognition. Since then, computer speech recognition has attracted the attention of researchers in many countries, who began to take up speech recognition research. In the 1960s, Matin et al. in the Soviet Union proposed endpoint detection for speech, which significantly raised the level of speech recognition, and Vintsyuk proposed dynamic programming, which proved indispensable for later recognition work. The important achievements of the late 1960s and early 1970s were linear predictive coding (LPC) and dynamic time warping (DTW), which effectively solved feature extraction and the matching of speech signals of unequal length; these were later followed by vector quantization (VQ) and hidden Markov model (HMM) theory. The combination of speech recognition and speech synthesis frees people from the constraints of the keyboard and replaces it with a natural, user-friendly input method such as voice input, which is gradually becoming a key human-machine interface technology in information technology.


One: Development Status of Speech Recognition Technology - Classification of Speech Recognition Systems

Speech recognition systems can be classified according to the restrictions placed on the input speech. If the dependence of the system on the speaker is considered, recognition systems can be divided into three categories:

(1) Speaker-dependent speech recognition systems, which only consider recognizing the voice of one specific person.

(2) Speaker-independent speech recognition systems. The recognized speech is independent of the speaker; such systems are usually trained on a large database of speech from many different speakers.

(3) Multi-speaker recognition systems, which can recognize the voices of a group of people; also called speaker-group-specific systems, they only require training on the voices of the group of people to be recognized.

If the speaking style is considered, recognition systems can also be divided into three categories:

(1) Isolated-word speech recognition systems, which require a pause after each word is spoken.

(2) Connected-word speech recognition systems, which require each word to be pronounced clearly, although some co-articulation begins to appear between words.

(3) Continuous speech recognition systems, which accept natural, fluent continuous speech containing a large amount of co-articulation and accent variation.

If the vocabulary size of the recognition system is considered, recognition systems can likewise be divided into three categories:

(1) Small-vocabulary speech recognition systems, which typically include dozens of words.

(2) Medium-vocabulary speech recognition systems, which typically include several hundred to several thousand words.

(3) Large-vocabulary speech recognition systems, which typically include several thousand to tens of thousands of words. As the computing power of computers and digital signal processors improves and the accuracy of recognition systems rises, the classification by vocabulary size keeps shifting: what counts as a medium-vocabulary system today may be regarded as a small-vocabulary system in the future. These different restrictions also determine the difficulty of a speech recognition system.


Two: Development Status of Speech Recognition Technology - A Summary Analysis of Speech Recognition Methods

At present, representative speech recognition methods mainly include dynamic time warping (DTW), the hidden Markov model (HMM), vector quantization (VQ), artificial neural networks (ANN), and support vector machines (SVM).

Dynamic time warping (DTW) is a simple and effective method for speaker-independent speech recognition. Based on the idea of dynamic programming, the algorithm solves the problem of matching templates of different lengths and is one of the earlier and more commonly used algorithms in speech recognition. When DTW is applied, the preprocessed and framed test speech signal is compared with the reference speech templates: the similarity between the two sequences is computed according to a chosen distance measure and the best alignment path is selected.
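
As a concrete illustration, the following is a minimal sketch of DTW-based isolated-word matching in Python. It assumes that the test utterance and each reference template have already been converted into sequences of feature frames (for example MFCC vectors) stored as NumPy arrays; the Euclidean local distance and the function names are illustrative choices rather than a prescribed implementation.

```python
import numpy as np

def dtw_distance(test: np.ndarray, template: np.ndarray) -> float:
    """Accumulated alignment distance between two frame sequences of unequal length."""
    n, m = len(test), len(template)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance measure: Euclidean distance between two frames.
            d = np.linalg.norm(test[i - 1] - template[j - 1])
            # Dynamic programming recursion over the three allowed path moves.
            cost[i, j] = d + min(cost[i - 1, j],      # step in the test sequence
                                 cost[i, j - 1],      # step in the template
                                 cost[i - 1, j - 1])  # step in both (match)
    return float(cost[n, m])

def recognize(test_frames, templates):
    """Return the word whose reference template aligns best with the test utterance."""
    return min(templates, key=lambda word: dtw_distance(test_frames, templates[word]))
```

For example, with templates = {"one": frames_one, "two": frames_two}, recognize(test_frames, templates) picks the word with the smallest accumulated distance along the best alignment path.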

The hidden Markov model (HMM) is a statistical model used in speech signal processing. It evolved from the Markov chain, so it is a statistical recognition method based on a parametric model. Its pattern library is not a set of pre-stored pattern samples but the model parameters that, through repeated training, match the training output signals with the highest probability; during recognition, the optimal state sequence corresponding to the maximum likelihood between the speech sequence to be recognized and the HMM parameters is taken as the recognition output. For these reasons it is an ideal model for speech recognition.
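
To make the "optimal state sequence" idea concrete, below is a minimal Viterbi decoding sketch for a discrete-observation HMM. The parameters pi, A, and B are assumed to come from prior training (for example with the Baum-Welch algorithm); this is an illustrative sketch, not a complete recognizer.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path and its log probability for a discrete HMM.

    obs : sequence of observation symbol indices
    pi  : (N,)   initial state probabilities
    A   : (N, N) state transition probabilities
    B   : (N, M) emission probabilities per state and symbol
    """
    pi, A, B = np.asarray(pi), np.asarray(A), np.asarray(B)
    eps = 1e-12                                 # guard against log(0)
    logpi, logA, logB = np.log(pi + eps), np.log(A + eps), np.log(B + eps)
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))                    # best log-score ending in each state
    psi = np.zeros((T, N), dtype=int)           # back-pointers to the best previous state
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # scores[i, j]: from state i to state j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    # Trace the optimal state sequence back from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[-1].max())
```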

Vector quantization (VQ) is an important signal compression method. Compared with HMM, vector quantization is mainly used for small-vocabulary, isolated-word speech recognition. The process quantizes groups of scalar values from the speech waveform or its feature parameters into vectors in a multi-dimensional space. The vector space is divided into several small regions, a representative vector is found for each region, and any vector that falls into a region during quantization is replaced by that region's representative vector. Designing a vector quantizer means training a good codebook from a large number of signal samples, choosing a suitable distortion measure based on practical results, and building the best possible vector quantization system so that the highest achievable average signal-to-noise ratio is obtained with the least amount of search and distortion computation.
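
The codebook design described above can be sketched with a plain k-means procedure. Practical systems often use the LBG splitting algorithm and a carefully chosen distortion measure; the simplified version below, with illustrative parameter values, only shows the idea of training a codebook from feature vectors and replacing each vector by the index of its nearest codeword.

```python
import numpy as np

def train_codebook(features: np.ndarray, codebook_size: int, iters: int = 20) -> np.ndarray:
    """Train a VQ codebook from training feature vectors (rows of `features`)."""
    rng = np.random.default_rng(0)
    # Initialise the codewords from randomly chosen training vectors.
    codebook = features[rng.choice(len(features), codebook_size, replace=False)].copy()
    for _ in range(iters):
        # Assign every vector to its nearest codeword (squared-error distortion).
        dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each codeword to the centroid of the vectors assigned to it.
        for k in range(codebook_size):
            members = features[labels == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook

def quantize(vectors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Replace every vector by the index of its nearest codeword."""
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)
```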

In practical applications, a variety of methods have also been studied to reduce complexity, including memoryless vector quantization, vector quantization with memory, and fuzzy vector quantization.

The artificial neural network (ANN) is a speech recognition method proposed in the late 1980s. It is essentially an adaptive nonlinear dynamical system that simulates the principles of human neural activity; its adaptivity, parallelism, robustness, fault tolerance, and learning ability, together with its powerful classification and input-output mapping capabilities, make it very attractive for speech recognition. The method is an engineering model that simulates the thinking mechanism of the human brain, and in this respect it is the opposite of HMM. Its classification and decision-making ability and its ability to describe uncertain information are widely recognized, but its ability to describe dynamic time signals is unsatisfactory: an MLP classifier can only solve static pattern classification problems and does not deal with time series. Although many structures with feedback have been proposed, they are still insufficient to characterize the dynamic properties of time series such as speech signals. Because ANNs cannot describe the temporal dynamics of speech well, they are often combined with traditional recognition methods so that the advantages of each can be exploited and the shortcomings of both HMM and ANN overcome. In recent years, significant progress has been made in recognition algorithms that combine neural networks with hidden Markov models; their recognition rates approach those of pure HMM systems, and they further improve the robustness and accuracy of speech recognition.
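
The "static pattern" limitation mentioned above can be illustrated with a small sketch based on scikit-learn's MLPClassifier: each variable-length utterance must first be collapsed into a fixed-length feature vector before the network can classify it. The feature reduction (averaging MFCC frames) and the network size are assumptions made only for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def utterance_to_vector(frames: np.ndarray) -> np.ndarray:
    """Collapse a variable-length sequence of feature frames into one fixed-length vector."""
    return frames.mean(axis=0)

def train_word_classifier(utterances, labels):
    """Train an MLP on fixed-length summaries of the training utterances."""
    X = np.stack([utterance_to_vector(u) for u in utterances])
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
    clf.fit(X, labels)
    return clf
```

Averaging the frames throws away the temporal structure of the utterance, which is exactly why ANN-only systems struggle with the dynamics of speech and are usually combined with HMMs in practice.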

The support vector machine (SVM) is a learning machine based on statistical learning theory. It adopts the principle of structural risk minimization (SRM), which effectively overcomes the shortcomings of traditional empirical risk minimization. By taking both training error and generalization ability into account, it performs very well on small-sample, nonlinear, and high-dimensional pattern recognition problems and has been widely applied in the field of pattern recognition.
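
As a brief illustration, the sketch below trains an SVM word classifier on fixed-length feature vectors using scikit-learn's SVC. The RBF kernel and the value of C are illustrative assumptions; the parameter C reflects the trade-off between training error and margin width that structural risk minimization formalizes.

```python
from sklearn.svm import SVC

def train_svm_classifier(feature_vectors, labels) -> SVC:
    """Train an RBF-kernel SVM on fixed-length utterance feature vectors."""
    # C balances training error against margin width (generalization ability).
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")
    clf.fit(feature_vectors, labels)
    return clf
```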


Three: Development Status of Speech Recognition Technology - Foreign Research

Speech recognition research can be traced back to the Audrey system developed at AT&T Bell Labs in the 1950s, the first speech recognition system able to recognize the ten English digits.

Real progress, however, was made only in the late 1960s and early 1970s, when speech recognition was taken up as an important research topic. This was firstly because the development of computer technology provided the hardware and software needed to implement speech recognition, and more importantly because linear predictive coding (LPC) and dynamic time warping (DTW) were proposed, effectively solving feature extraction and the matching of speech signals of unequal length. Speech recognition in this period was based mainly on the principle of template matching, and the research field was limited to speaker-dependent, small-vocabulary, isolated-word recognition: speaker-dependent isolated-word recognition systems based on linear predictive cepstral coefficients and DTW were implemented, and vector quantization (VQ) and hidden Markov model (HMM) theory appeared.

As the application fields expanded, the constraints of small vocabularies, specific speakers, and isolated words needed to be relaxed, and this brought many new problems. First, the expansion of the vocabulary made the selection and construction of templates difficult. Second, in continuous speech there are no obvious boundaries between phonemes, syllables, and words, and each pronunciation unit exhibits co-articulation strongly influenced by its context. Third, in speaker-independent recognition, the acoustic characteristics of the same utterance differ greatly between speakers, and even the same speaker produces large differences at different times and in different physiological and psychological states. Fourth, the speech to be recognized may contain background noise or other interference. For these reasons, the original template matching methods were no longer applicable.

A huge breakthrough in laboratory speech recognition research came in the late 1980s: researchers finally overcame the three major obstacles of large vocabulary, continuous speech, and speaker independence, and for the first time integrated these three features into one system. The typical example is Carnegie Mellon University's Sphinx system, the first high-performance speaker-independent, large-vocabulary continuous speech recognition system.

During this period, speech recognition research went a step further, and its distinctive feature was the successful application of the HMM and artificial neural networks (ANN) to speech recognition. The wide adoption of the HMM owes much to the efforts of scientists such as Rabiner at AT&T Bell Labs, who turned the originally pure mathematical HMM into an engineering tool that more researchers could understand, making statistical methods the mainstream of speech recognition technology.

Statistical methods shifted researchers' attention from the micro level to the macro level: rather than deliberately pursuing ever finer speech features, they sought to build the best speech recognition system from an overall, averaged (statistical) perspective. At the acoustic level, the HMM, a Markov-chain-based method for modeling speech sequences, can effectively handle the short-term stationarity and long-term time-varying characteristics of speech signals, and continuous speech sentence models can be constructed from a small set of basic modeling units, achieving both high modeling accuracy and great modeling flexibility. At the linguistic level, the ambiguities and homophones that arise during recognition are resolved by the word co-occurrence probabilities estimated from large real-world corpora, that is, by the N-gram statistical language model. In addition, artificial neural network methods and grammar-based language processing mechanisms have also been applied to speech recognition.
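
The N-gram idea can be illustrated with a toy bigram model: word co-occurrence counts gathered from a corpus give conditional probabilities P(w2 | w1) that a decoder can use to prefer plausible word sequences over acoustically similar but unlikely ones. The tiny corpus below is purely illustrative.

```python
from collections import Counter, defaultdict

corpus = [
    "call my office phone",
    "call my home phone",
    "show my phone bill",
]

bigram_counts = defaultdict(Counter)
context_counts = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    for w1, w2 in zip(words, words[1:]):
        bigram_counts[w1][w2] += 1
        context_counts[w1] += 1

def bigram_prob(w1: str, w2: str) -> float:
    """Maximum-likelihood estimate of P(w2 | w1); 0.0 for unseen contexts or pairs."""
    if context_counts[w1] == 0:
        return 0.0
    return bigram_counts[w1][w2] / context_counts[w1]

print(bigram_prob("my", "phone"))   # 1/3 in this toy corpus
```

Real systems use much larger corpora and smoothing to handle unseen word pairs, but the principle of resolving homophones through word co-occurrence probabilities is the same.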

In the early 1990s, many famous large companies such as IBM, Apple, AT&T, and NTT invested heavily in practical research on speech recognition systems. Speech recognition technology has a good evaluation metric, namely recognition accuracy, and this metric improved continuously in laboratory research during the mid-to-late 1990s. Representative systems include IBM's ViaVoice, Dragon Systems' NaturallySpeaking, Nuance's Nuance Voice Platform, Microsoft's Whisper, and Sun's VoiceTone.

Among them, IBM developed the Chinese ViaVoice speech recognition system in 1997 and in the following year released ViaVoice '98, a speech recognition system able to recognize regional accents such as Shanghainese, Cantonese, and Sichuanese. It comes with a basic vocabulary of 32,000 words that can be extended to 65,000 words, includes common office phrases, and has a "correction mechanism", with an average recognition rate of 95%. The system achieves high accuracy on news speech and is a representative Chinese continuous speech recognition system.

Four: Development Status of Speech Recognition Technology - Domestic Research

China's speech recognition research started in the 1950s but has developed rapidly in recent years, and the level of research has moved from the laboratory towards practical application. Since the launch of the National 863 Program in 1987, the National 863 Intelligent Computer Expert Group has maintained a special project for speech recognition research, issued on a rolling two-year cycle. China's research level in speech recognition is now basically in step with that of other countries, and in Chinese speech recognition it has its own characteristics and advantages and has reached the international advanced level. Research institutes such as the Institute of Automation and the Institute of Acoustics of the Chinese Academy of Sciences, Tsinghua University, Peking University, Harbin Institute of Technology, Shanghai Jiao Tong University, the University of Science and Technology of China, Beijing University of Posts and Telecommunications, and Huazhong University of Science and Technology have all conducted research on speech recognition; the most representative units are the Department of Electronic Engineering of Tsinghua University and the State Key Laboratory of Pattern Recognition at the Institute of Automation of the Chinese Academy of Sciences.

The Speech Technology and Special Chip Design Group of the Department of Electronic Engineering of Tsinghua University developed a speaker-independent continuous Chinese digit-string recognition system whose recognition accuracy reaches 94.8% for digit strings of indefinite length and 96.8% for strings of fixed length. With a 5% rejection rate, the recognition rate reaches 96.9% (indefinite-length strings) and 98.7% (fixed-length strings), which is among the best results reported internationally, and its performance is close to the practical level. The group also developed a 5,000-word speaker-independent continuous speech recognition system for postal package address checking, with a recognition rate of 98.73% and a top-three recognition rate of 99.96%; it can recognize both Mandarin and Sichuan dialect and meets practical requirements.

In 2002, the Institute of Automation of the Chinese Academy of Sciences and its affiliated company Pattek released their jointly developed "Tianyu" Chinese speech product series for different computing platforms and applications, Pattek ASR, ending the monopoly that foreign companies had held over Chinese speech recognition products since 1998.

Five: Development Status of Speech Recognition Technology - Current Problems to Be Solved

The performance of a speech recognition system is influenced by many factors, including differences in pronunciation between speakers, speaking style, environmental noise, fading in the transmission channel, and so on.

There are four specific problems to be solved:

1. Enhancing the robustness of the system: if the operating conditions become very different from those at training time, the performance of the system must not degrade abruptly.

2. Increasing the adaptability of the system: the system should adapt steadily to continuously changing conditions. Speakers differ in age, gender, accent, speaking rate, speech intensity, pronunciation habits, and so on, and the system should be able to eliminate these differences and achieve stable recognition of speech.

3. Finding a better language model: the system should obtain as many constraints as possible from the language model in order to cope with the effects of vocabulary growth.

4. Performing dynamic modeling: speech recognition systems assume that segments and words are independent of each other, but in fact the cues carried by words and phonemes need to be integrated in a way that reflects the motion of the vocal organs. Dynamic modeling should therefore be carried out to incorporate this information into the speech recognition system.

Six: Development Status of Speech Recognition Technology - The Latest Developments in Speech Recognition Systems

Speech recognition technology has now reached the point where the recognition accuracy of small- and medium-vocabulary speaker-independent systems exceeds 98%, and the accuracy of speaker-dependent systems is higher still. These technologies already meet the requirements of typical applications, and thanks to advances in large-scale integrated circuit technology, such complex speech recognition systems can be built into dedicated chips and mass-produced. In the economically developed Western countries, a large number of speech recognition products have entered the market and the service sector: some switchboards, telephones, and mobile phones already offer voice dialing, voice memos, voice-controlled smart toys, and other functions that combine speech recognition with speech synthesis, and people can use spoken-dialogue systems over the telephone network to query flight, travel, and banking information. Survey statistics show that as many as 85% of users are satisfied with the performance of speech recognition information query services. It can be predicted that in the next five years the application of speech recognition systems will become even more widespread and a variety of speech recognition products will continue to appear on the market.

The role of speech recognition in mail sorting is also becoming increasingly apparent, and its prospects are attractive. The postal sector in some developed countries already uses such systems, and speech recognition is gradually becoming a new technology for mail sorting: it can overcome the shortcomings of manual sorting, which relies solely on the sorter's memory, solve the problem of excessive staffing costs, and improve the efficiency of mail processing. In the field of education, the most direct application of speech recognition technology is helping users practice language skills better.

Another branch of development is telephone speech recognition, in which Bell Labs has been a pioneer. Telephone speech recognition will make possible telephone inquiries, automatic call routing, travel information services, and similar operations. With a voice query system built on speech understanding technology, a bank can provide customers with 24-hour telephone banking around the clock. In the securities industry, a telephone speech recognition system lets a user who wants to check the market simply say the stock name or code; after the system confirms the request, it automatically reads out the latest price, which greatly benefits the user. At present, directory assistance services such as the 114 hotline rely on a large amount of manual labor; with speech technology, the computer can automatically understand the caller's request and then play back the queried telephone number, saving human resources.
