A Probe into Spoken English Recognition in English Education Based on Computer-Aided Comprehensive Analysis

Abstract: Computer-aided spoken English learning is becoming increasingly popular among learners. Computer-aided comprehensive analysis technology can evaluate and correct learners' pronunciation, thereby improving it. Based on computer-aided comprehensive analysis, this paper explores automatic recognition and scoring methods for spoken English in English education. To this end, it studies how to match feedback information effectively with known pronunciation scoring results, and then develops a computer evaluation plug-in consisting of modules for user login, spoken English acquisition and recognition, speech evaluation, voice broadcast, and spoken dialogue. The results show that the plug-in compares the feature parameters extracted from the input speech with the standard features, scores the learner's speech, and gives the correct pronunciation so that the learner receives timely feedback. The focus of the recognition technology and the recognition algorithms applied vary with the stage of English learning. The findings provide theoretical and technical support for spoken English recognition, error correction and scoring.


Introduction
English is a universal language for global communication. As global integration accelerates, the demand for English learning has grown rapidly. Spoken language recognition has always been a difficult point in English education [1]. Influenced by their mother tongue, native Chinese learners tend to practice and recognize English using Chinese pronunciation habits despite the differences in pronunciation features between Chinese and English, which makes English pronunciation difficult to recognize [2][3]. Recognition technology for spoken English dates back to the 1950s. However, as the range of English applications continues to expand, the constraints on spoken language recognition, such as small vocabularies, speaker dependence and isolated words, need to be relaxed, and the larger vocabulary makes it difficult to select and build recognition templates [4]. Many methods of spoken language recognition are now used in English education. The most common are methods based on vocal tract models and phonetic knowledge, template matching, and artificial neural networks [5].
With the rapid development and widespread application of computer technology, the computer's powerful data analysis and processing capabilities and rich multimedia functions have greatly improved the efficiency of English learning, and computer-aided speech recognition and language learning have attracted increasing attention [6][7]. A computer-aided spoken English recognition system mainly studies how to measure indicators such as intonation, stress, speech rate, and prosody. It can detect and correct pronunciation errors in a given utterance by comprehensively evaluating pronunciation quality [8]. Spoken English recognition helps correct wrong pronunciations and improves the accuracy of English pronunciation, which lays the foundation for further learning [9]. Computer-aided comprehensive analysis can recognize spoken English without depending on a specific time or place. It can monitor pronunciation errors in real time in a targeted manner and, drawing on the computer's computing and data analysis capabilities, support effective English teaching [10]. In view of the above, this paper explores automatic recognition and scoring methods for spoken English based on computer-aided comprehensive analysis technology and effectively pairs the feedback information with the known pronunciation scoring results. The study provides theoretical and technical support for spoken English recognition, error correction and scoring.

Selecting feature parameters
Spoken English pronunciation is related to phonetic symbols, phonemes and so on. Pronunciation signals are analog signals that change over time, so preprocessing is generally required before spoken English can be recognized. Common preprocessing methods include pre-emphasis, framing and windowing [11]. Windowing is realized by weighting the signal with a movable finite-length window. Let $x(n)$ and $w(n)$ be the spoken language signal and the window function, respectively; then the windowing process can be expressed as:
$$s_w(n) = x(n)\,w(n) \tag{1}$$
Currently, there are three commonly used window functions: the rectangular window, the Hanning window and the Hamming window. Their expressions are as follows.
Rectangular window:
$$w(n) = \begin{cases} 1, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
Hanning window:
$$w(n) = \begin{cases} 0.5\left[1 - \cos\left(\dfrac{2\pi n}{N-1}\right)\right], & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{3}$$
Hamming window:
$$w(n) = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
where $N$ is the window size. The learner's level of spoken English pronunciation is fed back to the learner by the computer, so the computer must have a clear reference or standard for the measurement [12]. During spoken English recognition, feature comparison is used to compare the learner's spoken language features with those of standard English for scoring and error analysis [13]. The feature parameters selected for this purpose include the pitch period and the short-term average magnitude, which mainly relate to the phonetic symbols and phonemes of English [14].
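The preprocessing and feature steps described above can be illustrated with a minimal sketch. It uses the standard textbook definitions of the Hanning and Hamming windows, a first-order pre-emphasis filter, and an autocorrelation-based pitch estimate. The function names, the pre-emphasis coefficient of 0.97, and the frame sizes are illustrative assumptions and are not taken from the system described in this paper.

```python
import math


def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1] (coeff is an assumed value)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))]


def hanning(size):
    """Hanning window: w(n) = 0.5 * (1 - cos(2*pi*n / (N - 1)))."""
    if size == 1:
        return [1.0]
    return [0.5 * (1 - math.cos(2 * math.pi * n / (size - 1))) for n in range(size)]


def hamming(size):
    """Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if size == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1)) for n in range(size)]


def frame_and_window(signal, frame_len, hop, window):
    """Split the signal into overlapping frames and weight each frame by the window."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append([x * w for x, w in zip(frame, window)])
    return frames


def short_term_average_magnitude(frame):
    """Mean absolute amplitude of one frame."""
    return sum(abs(x) for x in frame) / len(frame)


def pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) with the largest autocorrelation inside [min_lag, max_lag].

    Requires len(frame) > max_lag.
    """
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


# Usage: a 100 Hz tone sampled at 8 kHz, 25 ms frames with a 10 ms hop (illustrative values).
rate = 8000
tone = [math.sin(2 * math.pi * 100 * n / rate) for n in range(rate // 10)]
frames = frame_and_window(pre_emphasis(tone), 200, 80, hamming(200))
magnitudes = [short_term_average_magnitude(f) for f in frames]
```

Each frame yields one short-term average magnitude and one pitch period estimate, so an utterance becomes a sequence of small feature vectors that can be compared with those of the standard pronunciation.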

Scoring mechanism and scoring parameters adjustment in feature comparison of spoken language
When spoken English is processed by machine, it is difficult to compare input speech features directly with standard speech features, because the language has regional and accent variations and the input signal is highly random [15]. Let a be the feature vector of the standard spoken language template and b the feature vector of the input speech. The recognition algorithm calculates the distance between the features, but this distance cannot be used directly as the pronunciation score. This study explores the relationship between the two, as shown in Equation 5:
$$\text{Score} = \frac{100}{1 + \alpha\,(\text{dist})^{\beta}} \tag{5}$$
where dist is the distance between a and b, and $\alpha$ and $\beta$ are adjustable scoring parameters. Figure 1 shows the scoring function obtained by feature comparison. As the distance increases, the score decreases. Equation 5 is applied to the feature vectors of the standard spoken language template and the input speech. In the actual calculation, the weighted average of the scores for the two feature parameters is taken. The adjusted formula is shown in Equation 6:
$$\text{Score} = w_1\,\text{Score}_1 + w_2\,\text{Score}_2 \tag{6}$$
where $w_1$ and $w_2$ are the weights of the two feature parameters, and $w_1 + w_2 = 1$.
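The mapping from feature distance to score can be sketched as follows. The sketch assumes a monotonically decreasing mapping of the form 100 / (1 + alpha * dist^beta) and a weighted average of two per-feature scores. The Euclidean distance, the parameter names `alpha` and `beta`, and their default values are assumptions made for illustration and would need to be tuned against human ratings.

```python
import math


def euclidean_distance(a, b):
    """Distance between the template feature vector a and the input feature vector b."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_score(distance, alpha=0.01, beta=2.0):
    """Map a distance >= 0 to a score in (0, 100]; a larger distance gives a lower score.

    alpha and beta are assumed tuning parameters, not values from the paper.
    """
    return 100.0 / (1.0 + alpha * distance ** beta)


def combined_score(score_1, score_2, w1=0.5):
    """Weighted average of two per-feature scores with w1 + w2 = 1."""
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1]")
    w2 = 1.0 - w1
    return w1 * score_1 + w2 * score_2


# Usage: score two features (e.g. pitch period and short-term average magnitude).
pitch_score = distance_to_score(euclidean_distance([80.0, 82.0], [86.0, 90.0]))
magnitude_score = distance_to_score(euclidean_distance([0.30, 0.25], [0.28, 0.27]))
final_score = combined_score(pitch_score, magnitude_score, w1=0.6)
```

A distance of zero maps to 100, and raising `alpha` or `beta` makes the score fall off more quickly, which is one way to adjust the strictness of the scoring for different learner levels.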

Speech recognition technology in spoken English learning
In spoken English learning based on speech recognition, pronunciation is the key. The focus of the recognition technology varies with the stage of English learning. Some spoken English recognition algorithms specifically target the pronunciation errors of beginners, while others address overall English pronunciation skills [16]. In English education, neither teachers nor parents are willing to let students use a spoken English learning system oriented toward recognition accuracy. Figure 2 shows the functional modules of the spoken English learning system, including login, speech recognition, speech evaluation, voice announcement, and spoken dialogue. The speech recognition module is subdivided into speech collection, speech data preprocessing, and feature extraction; the speech evaluation module is subdivided into speech recognition, phoneme scoring, and comprehensive scoring. Figure 3 shows a flow chart of spoken English recognition. After the spoken language signal is preprocessed, its acoustic parameters are analyzed; the match is then estimated with a distortion measure, and the recognition result is determined. Figure 4 shows a flow chart of the spoken English recognition modules: the learner's pronunciation and the standard pronunciation are first extracted separately; the reference phoneme model is then used for forced alignment in order to score the phonemes; finally, correction suggestions or comprehensive scores are given. In short, the spoken language recognition system consists of five main parts: feature value extraction, factor recognition, factor correlation, pronunciation evaluation, and error detection.
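The recognition flow, in which acoustic parameters are compared through a distortion measure before a result is chosen, can be illustrated with a small template-matching sketch. The paper does not name the distortion measure, so the sketch assumes dynamic time warping (DTW) over per-frame feature vectors with a Euclidean local distance. The names `dtw_distance` and `recognize` are hypothetical.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two per-frame feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def dtw_distance(template, utterance):
    """Accumulated DTW distortion between two sequences of feature vectors."""
    n, m = len(template), len(utterance)
    inf = float("inf")
    # cost[i][j] is the best accumulated cost aligning the first i and j frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = frame_distance(template[i - 1], utterance[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(utterance, templates):
    """Return the label of the reference template with the smallest distortion."""
    return min(templates, key=lambda label: dtw_distance(templates[label], utterance))


# Usage with toy one-dimensional features; real features would come from preprocessing.
templates = {
    "yes": [[0.0], [1.0], [2.0]],
    "no": [[5.0], [5.0], [5.0]],
}
spoken = [[0.0], [0.0], [1.0], [2.0]]  # same shape as "yes", but spoken more slowly
label = recognize(spoken, templates)
```

Because DTW tolerates differences in speaking rate, a learner who pronounces a word more slowly than the reference is not penalized for timing alone; the remaining distortion can then be passed on to the scoring stage.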

Application of speech recognition scoring system in spoken English
The speech recognition scoring system is one of the most important applications of computer-aided language learning, and its quality depends strongly on speech recognition technology. Figure 5 shows the flow chart of the scoring system for spoken English recognition. The system first extracts the feature parameters of the speech signal and then segments the signal. Finally, a score is produced through voice recognition and tone recognition of single syllables. With the corresponding scoring method, the system recognizes the spoken language and produces the related results. It can score an entire sentence, individual words, individual phonemes, or the overall rhythm. Figure 6 shows the flow chart of the scoring system. The scoring process can be divided into two parts. First, the spoken language features are extracted and a suitable recognition algorithm is selected to eliminate false scores caused by differences in standard pronunciation. Second, the speech is decomposed into phoneme-level scoring units according to the spoken language features in order to eliminate noise. Figure 6 also shows the operation flow of the spoken English recognition process, which performs speech recognition, phoneme scoring, and pronunciation evaluation in a single pass.
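Scoring at several granularities (phoneme, word, sentence) can be sketched as an aggregation over phoneme-level scoring units. The sketch below assumes that each phoneme already has a score on a 0 to 100 scale and a duration in frames, and that longer phonemes should weigh more. The duration weighting, the error threshold of 60, and all names are illustrative assumptions rather than the scheme used in the paper.

```python
from dataclasses import dataclass


@dataclass
class PhonemeScore:
    word: str       # word the phoneme belongs to (words assumed unique per sentence)
    phoneme: str    # phoneme label
    score: float    # phoneme-level score in [0, 100]
    frames: int     # duration of the phoneme in frames


def word_scores(phonemes):
    """Duration-weighted mean phoneme score for each word."""
    totals = {}
    for p in phonemes:
        weighted, frames = totals.get(p.word, (0.0, 0))
        totals[p.word] = (weighted + p.score * p.frames, frames + p.frames)
    return {word: weighted / frames for word, (weighted, frames) in totals.items()}


def sentence_score(phonemes):
    """Duration-weighted mean phoneme score over the whole sentence."""
    total_frames = sum(p.frames for p in phonemes)
    return sum(p.score * p.frames for p in phonemes) / total_frames


def flag_errors(phonemes, threshold=60.0):
    """Phonemes scoring below the (assumed) threshold, reported for correction."""
    return [(p.word, p.phoneme) for p in phonemes if p.score < threshold]


# Usage with made-up scores for the sentence "good cat".
units = [
    PhonemeScore("good", "g", 85.0, 8),
    PhonemeScore("good", "uh", 75.0, 12),
    PhonemeScore("good", "d", 90.0, 10),
    PhonemeScore("cat", "k", 90.0, 10),
    PhonemeScore("cat", "ae", 50.0, 30),
    PhonemeScore("cat", "t", 80.0, 10),
]
per_word = word_scores(units)
overall = sentence_score(units)
errors = flag_errors(units)
```

Keeping the phoneme as the basic unit means the same data can drive both the comprehensive score and the targeted correction feedback.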

Detailed design and implementation of spoken English recognition system
The core of spoken English education is the design and implementation of the spoken English recognition system. Whether the input is an English sentence or a word, the recognition process is based on single English words. Standard speech samples of single words were collected to provide data for the system's training templates and training codebooks. Based on computer-aided comprehensive analysis, the spoken English recognition system in this paper involves the sampling frequency, the number of states, the number of isolated-word utterances, and a sequence of isolated-word speech strings related to English. Figure 7 shows the core framework of the spoken English recognition system. The system first reads the input audio file and performs preprocessing, feature extraction, and feature matching on the voice signal; it then combines the reading and display of computer multimedia learning materials to support phonetic learning, word learning, pronunciation recognition, and pronunciation correction. In the initial stage of recognition, the speech data collected by the computer are raw signals that need to be preprocessed step by step. This process includes sampling and quantization of the analog signal, format conversion of the digital signal, silence removal at the beginning and end, and pre-emphasis of the voice data. Figure 8 shows a model of spoken English learning and recognition based on computer-aided comprehensive analysis. As Figure 8 shows, there are three ways to learn spoken English: graphic learning, video learning, and audio learning. Video learning relies on articulation demonstrations, sentence pronunciation demonstration and guidance, and imitation of the pronunciation, while audio learning contrasts the standard pronunciation with the learner's own so that learners can adjust their pronunciation promptly according to the feedback.
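Two of the preprocessing steps listed above, quantization and the removal of silence at the beginning and end of a recording, can be sketched as follows. The sketch assumes samples normalized to the range [-1, 1], 16-bit quantization, and a fixed magnitude threshold for deciding whether a frame is silent. The frame length and threshold are illustrative assumptions.

```python
def quantize_16bit(samples):
    """Map floats in [-1, 1] to signed 16-bit integers, clipping out-of-range values."""
    return [int(round(max(-1.0, min(1.0, x)) * 32767)) for x in samples]


def trim_silence(samples, frame_len=160, threshold=0.01):
    """Drop leading and trailing frames whose mean absolute amplitude is below threshold.

    Silence inside the utterance is kept; only the two ends are trimmed.
    """
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    active = [sum(abs(x) for x in f) / len(f) >= threshold for f in frames]
    if not any(active):
        return []
    first = active.index(True)
    last = len(active) - 1 - active[::-1].index(True)
    return samples[first * frame_len:(last + 1) * frame_len]


# Usage: 20 ms of silence, 10 ms of signal, 10 ms of silence at 16 kHz.
recording = [0.0] * 320 + [0.5] * 160 + [0.0] * 160
speech = trim_silence(recording)
pcm = quantize_16bit(speech)
```

A fixed threshold is the simplest choice; in a noisy classroom an adaptive threshold estimated from the first few frames would likely be more robust.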

Development of a computer evaluation plug-in
Spoken English learning is established through stimulus and response; it is a process of identification and feedback. With the widespread application of computer-aided learning and teaching tools, students can actively select and process information while learning and thereby construct internal mental representations. In other words, when learning with computer-aided systems, learners can use the spoken language recognition system to build their own knowledge system. To implement the spoken language recognition system based on speech recognition technology, the author developed a computer evaluation plug-in that helps learners standardize their English pronunciation. The plug-in comprises user login, spoken English acquisition and recognition, speech evaluation, voice broadcast, and spoken dialogue; the main parameters of speech acquisition include the sampling rate, the audio data format, and the audio source. Figure 9 shows the interface for spoken English acquisition and recognition. The acquisition module of the plug-in provides three functions: start recording, end recording, and broadcast. The internal evaluation module is based on a detailed comparison between the spoken language recognition algorithm and the phoneme-level scoring unit. Figure 10 shows the functional operation flow of the spoken language input module, which is controlled by three buttons: microphone on, microphone off, and voice broadcast. Together, spoken language input and broadcast form a complete feedback and correction loop for spoken English.
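The three-button behaviour of the input module (microphone on, microphone off, voice broadcast) can be modelled as a small state machine. The sketch below is hypothetical: the class names, the default sampling rate, and the format and source labels are assumptions, and real audio capture and playback are replaced by an in-memory buffer.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    IDLE = "idle"
    RECORDING = "recording"


@dataclass(frozen=True)
class AcquisitionConfig:
    sample_rate: int = 16000          # Hz (assumed default)
    audio_format: str = "pcm16"       # hypothetical format label
    audio_source: str = "microphone"  # hypothetical source label


@dataclass
class AcquisitionModule:
    config: AcquisitionConfig = field(default_factory=AcquisitionConfig)
    state: State = State.IDLE
    buffer: list = field(default_factory=list)

    def start_recording(self):
        """Microphone on: clear the previous take and begin buffering."""
        self.buffer = []
        self.state = State.RECORDING

    def feed(self, samples):
        """Append incoming samples; input is ignored unless recording."""
        if self.state is State.RECORDING:
            self.buffer.extend(samples)

    def end_recording(self):
        """Microphone off: stop buffering and return the captured samples."""
        self.state = State.IDLE
        return list(self.buffer)

    def broadcast(self):
        """Voice broadcast: return the last take for playback."""
        if self.state is State.RECORDING:
            raise RuntimeError("stop recording before playback")
        return list(self.buffer)


# Usage: record two chunks, stop, then play the take back.
module = AcquisitionModule()
module.feed([9])             # ignored, the microphone is still off
module.start_recording()
module.feed([1, 2])
module.feed([3])
take = module.end_recording()
playback = module.broadcast()
```

Separating the configuration from the recorder state keeps the sampling rate, format, and source fixed for a session while the buttons only change the state.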

Conclusion
Based on computer-aided comprehensive analysis, this paper explores automatic recognition and scoring methods for spoken English and effectively matches the feedback information with the known pronunciation scoring results. The specific conclusions are as follows:
• The functional modules of the spoken English learning system include login, speech recognition, speech evaluation, speech broadcast, and spoken dialogue; speech evaluation includes speech recognition, phoneme scoring, and comprehensive scoring.
• The quality of the scoring system depends strongly on speech recognition technology. First, the spoken language features must be extracted and a suitable recognition algorithm selected to eliminate false scores caused by differences in pronunciation. Second, the speech should be decomposed into phoneme-level scoring units according to the spoken language features in order to eliminate noise.
• There are three ways to learn spoken English: graphic learning, video learning, and audio learning. This paper develops a computer evaluation plug-in that helps learners standardize their English pronunciation. The plug-in comprises user login, spoken English acquisition and recognition, speech evaluation, voice broadcast, and spoken dialogue.