Training

Most Chinese learners are still held back by "mute" English. English learning software is ubiquitous, but most of it focuses on literacy and offers neither pronunciation evaluation nor corrective feedback. Improving the efficiency and quality of oral English learning has therefore become a matter of widespread concern. The maturing of Speech Recognition Technology (SRT) has opened up a new mode of oral English learning, in which learning software can evaluate pronunciation and give feedback. This paper examines speech signal feature extraction and pattern matching in SRT. To make learning convenient, it adopts the Android mobile phone as the platform, proposes a rating method based on Adaptive Parameters (AP), builds a mouth-shape correction function, and designs an intelligent spoken English pronunciation training and evaluation system. The paper describes the system implementation in detail and demonstrates the system's availability through testing.


Introduction
In the context of globalization, English, as one of the international languages of communication, attracts more and more Chinese learners. Spoken English is the main means of oral communication in English and the main measure of conversational competence. How to improve the efficiency and quality of oral English learning has therefore become a focus of current research. With the rapid development and popularization of computer technology, a great deal of computer-based English learning software has emerged. However, most of it focuses on improving English literacy. A few programs offer a simple repeat-after function, but they lack well-designed pronunciation evaluation and corrective feedback [1], which has become a bottleneck for intelligent spoken English learning.
In recent years, SRT has gradually matured. It makes communication between human and machine possible and allows learning software to provide evaluation and feedback mechanisms. As a result, more and more researchers are applying SRT to spoken English pronunciation training. In particular, the rapid development of Internet technology and the mass adoption of smartphones have made it possible to apply SRT in mobile, intelligent spoken English training systems [2], opening up a new model for oral English learning.
Speech recognition is the key to designing an intelligent spoken English pronunciation training and evaluation system. Based on an analysis of the basic theory and key algorithms of SRT, and using the Android mobile phone as the platform, this paper designs such a system. Making pronunciation training intelligent in this way is convenient for English learners and plays an important role in improving the efficiency of oral English learning.

2
Designing System Speech Recognition Algorithm

SRT
SRT can be understood simply as technology that converts human speech into input that a machine can recognize, so that human-machine communication becomes possible [3]. With the spread of the Internet, SRT has been applied in many areas. Advances in computer technology have led more and more researchers to study computer-aided speech learning based on speech recognition.
Speech recognition is the key to designing an intelligent spoken English pronunciation training and evaluation system. SRT has three types of speech recognition modes, which are listed in the accompanying table. Of these, the artificial neural network approach is more complicated to implement and is currently in the experimental phase. At present, most speech recognition systems adopt the more mature approach based on pattern matching. The flow chart of a speech recognition system based on pattern matching is shown in Fig. 1; it includes speech signal preprocessing, feature extraction, and pattern matching [5]. This paper focuses on speech signal feature extraction and pattern matching.

Extraction of speech signal feature
Feature extraction calculates a small number of parameters that reflect the characteristics of the signal and so describe the speech signal effectively. The three commonly used feature parameters and their characteristics are described in the accompanying table. Among them, the Mel-frequency cepstral coefficient (MFCC) has good recognition and anti-noise performance. Because changes in spoken English pronunciation and intonation do not affect its recognition performance, it meets the basic requirement of the intelligent English pronunciation training and evaluation system. The computation process of MFCC is shown in Fig. 2.

Fig. 2. MFCC parameter extraction process
The specific procedure is as follows. First, preprocessing converts the initial speech signal into single-frame short-term signals x(n).
Second, the fast Fourier transform (FFT), computed with the butterfly algorithm, converts the short-term signal x(n) into the linear spectrum X(k) according to formula (1), where N is the window length. Third, X(k) is filtered with the Mel triangular filter function to obtain the logarithmic energy S(i). The Mel triangular filter function is given in formula (2) and the logarithmic energy S(i) in formula (3), where f(i) is the center frequency of the i-th filter, 0 ≤ i ≤ M, and M is the number of filters.
Finally, the discrete cosine transform (DCT) maps the logarithmic energy S(i) into the cepstrum domain, giving the signal feature parameter MFCC(p), where p is the parameter order.
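The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's implementation: the Hz-to-mel mapping, the way filter edges are rounded to spectrum bins, the DCT indexing, and all function names and default values (20 filters, 12 coefficients) are assumptions rather than details taken from the paper. A naive DFT stands in for the FFT so that the sketch needs only the standard library.

```python
import cmath
import math


def hz_to_mel(freq_hz):
    # A widely used Hz-to-mel mapping; the paper does not say which one it uses.
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    # |X(k)|^2 for k = 0..N/2, with X(k) = sum_n x(n) * exp(-j*2*pi*n*k/N).
    # A naive O(N^2) DFT keeps the sketch dependency-free; an FFT gives the same values.
    n_len = len(frame)
    spectrum = []
    for k in range(n_len // 2 + 1):
        x_k = sum(frame[n] * cmath.exp(-2j * math.pi * n * k / n_len)
                  for n in range(n_len))
        spectrum.append(abs(x_k) ** 2)
    return spectrum


def mel_filterbank(num_filters, n_len, sample_rate):
    # Triangular filters whose centre frequencies f(i) are evenly spaced in mel.
    top_mel = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top_mel * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bins = [int(n_len * f / sample_rate) for f in edges_hz]
    bank = []
    for i in range(1, num_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        row = [0.0] * (n_len // 2 + 1)
        for k in range(left, centre):      # rising edge of the triangle
            row[k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling edge, 1.0 at the centre bin
            row[k] = (right - k) / (right - centre)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, num_filters=20, num_coeffs=12):
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # S(i): log energy at the output of each filter (floored to avoid log(0)).
    log_energy = [math.log(max(sum(h * p for h, p in zip(row, spectrum)), 1e-12))
                  for row in bank]
    # DCT of S(i) gives the cepstral coefficients MFCC(p), p = 1..num_coeffs.
    return [sum(s * math.cos(math.pi * p * (i + 0.5) / num_filters)
                for i, s in enumerate(log_energy))
            for p in range(1, num_coeffs + 1)]


# Usage: one 256-sample frame of a 1 kHz tone sampled at 8 kHz (synthetic data).
sample_rate, n_len = 8000, 256
frame = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(n_len)]
coeffs = mfcc(frame, sample_rate)
```

In practice the frame would first be pre-emphasized and windowed during preprocessing, and one such coefficient vector would be produced per frame of the utterance.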

Speech signal pattern matching
Dynamic time warping (DTW) and the hidden Markov model (HMM) are the two commonly used approaches to speech signal pattern matching. Because the algorithm should be simple and the intelligent spoken English pronunciation training and evaluation system should be practical, this paper adopts the DTW algorithm. The schematic diagram of the DTW algorithm is shown in Fig. 3 [8]. First, the frame-to-frame distances between the standard speech template and the test template are computed to form the frame matching distance matrix. The optimal path through this matrix is then found, that is, the path whose cumulative distance is the minimum (the matching distance). The path must run from the starting point (1, 1) to the end point (N, M) through the grid intersections. The standard speech template and the test template are compared by this minimum matching distance, which reflects the similarity of their speech features.
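As a rough illustration of the matching step, the following Python sketch computes the minimum cumulative distance between two templates by dynamic programming. The Euclidean frame distance and the three allowed moves (step in either template, or in both) are common choices assumed here; the paper does not specify its local path constraints, and the function names are hypothetical.

```python
import math


def frame_distance(frame_a, frame_b):
    # Euclidean distance between two feature vectors (e.g. MFCC frames).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)))


def dtw_distance(test_template, standard_template):
    """Minimum cumulative distance D(N, M) from grid point (1, 1) to (N, M)."""
    n_frames, m_frames = len(test_template), len(standard_template)
    inf = float("inf")
    # acc[i][j] holds the best cumulative distance of any path ending at (i, j);
    # row 0 and column 0 are padding so that (1, 1) has a valid predecessor.
    acc = [[inf] * (m_frames + 1) for _ in range(n_frames + 1)]
    acc[0][0] = 0.0
    for i in range(1, n_frames + 1):
        for j in range(1, m_frames + 1):
            d = frame_distance(test_template[i - 1], standard_template[j - 1])
            acc[i][j] = d + min(acc[i - 1][j],      # advance in the test template
                                acc[i][j - 1],      # advance in the standard template
                                acc[i - 1][j - 1])  # advance in both
    return acc[n_frames][m_frames]


# Usage with toy one-dimensional "feature vectors" (synthetic data).
standard = [[0.0], [1.0], [1.0], [2.0]]
slower_copy = [[0.0], [0.0], [1.0], [2.0], [2.0]]
distance = dtw_distance(slower_copy, standard)
```

Because the path may repeat frames of either template, an utterance spoken more slowly than the standard still aligns with it at low cost, which is why DTW suits templates of different lengths.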

3
System Design

System analysis and design
The development of mobile Internet technology and the popularization of smartphones have greatly facilitated people's lives and learning. This paper takes the widely used Android mobile phone as the application platform of the intelligent spoken English pronunciation training and evaluation system, so that English learners can study spoken English anytime and anywhere.
Requirement analysis of current spoken English learners shows that the SRT-based intelligent spoken English pronunciation training and evaluation system should include the following core functions:
Audio record and playback module: as the basis of the system, it is essential for human-computer interaction. The headset of the Android device is used for recording and playback.
Video-based pronunciation scoring module: as one of the core functions of the system, it uses SRT to rate and evaluate learners' spoken English pronunciation, so that they can recognize their mistakes and improve their spoken English.
Pronunciation formant graph display module: it provides the key feedback function for spoken English learning. Learners can correct their pronunciation errors by comparing the formant graph of the standard speech template with that of their own pronunciation.

Pronunciation rating method and process design
Pronunciation rating method. The system rates a learner's pronunciation against a standard speech template. To improve the accuracy and reliability of scoring, this paper proposes a rating method based on adaptive parameters (AP) that adapts to different Android mobile phones. In the AP rating algorithm [9], x and y are the adaptive parameters, and a separate rating parameter generation module is configured on each device. Before pronunciation rating, the learner performs adaptive training to generate the rating parameters: experts rate the training utterances by experience, and the best values of x and y are obtained by least squares. Here D is the frame matching distance, N is the frame length of the test template, and D(N, M) is the distance between the standard speech template and the test template.
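The rating formula itself is not reproduced here, so the sketch below assumes one plausible form, score = 100 / (1 + x * D^y), where D is the per-frame matching distance D(N, M) / N. This form, the function names, and the sample numbers are all assumptions for illustration and may differ from the formula in [9]. Under this assumption the model becomes linear after taking logarithms, ln(100 / score - 1) = ln x + y ln D, so x and y can be fitted to the expert scores by ordinary least squares.

```python
import math


def fit_rating_parameters(frame_distances, expert_scores):
    """Fit x and y of the assumed model score = 100 / (1 + x * D**y).

    Requires every distance > 0 and every expert score strictly between 0 and 100.
    The fit is least squares in the log domain, not on the raw scores.
    """
    us = [math.log(d) for d in frame_distances]
    vs = [math.log(100.0 / s - 1.0) for s in expert_scores]
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    s_uu = sum((u - mean_u) ** 2 for u in us)
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    y = s_uv / s_uu                        # slope of the fitted line
    x = math.exp(mean_v - y * mean_u)      # intercept is ln x
    return x, y


def rate(total_distance, num_test_frames, x, y):
    # total_distance plays the role of D(N, M); dividing by N normalizes for length.
    per_frame = total_distance / num_test_frames
    return 100.0 / (1.0 + x * per_frame ** y)


# Usage: hypothetical adaptive-training data (per-frame distances, expert scores).
training_distances = [0.5, 1.0, 2.0, 4.0]
training_scores = [90.0, 65.0, 35.0, 10.0]
x, y = fit_rating_parameters(training_distances, training_scores)
score = rate(total_distance=45.0, num_test_frames=30, x=x, y=y)
```

Once x and y have been fitted on a device, they can be stored and reused, so later ratings need only the DTW distance of the new utterance.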
Designing the speech rating process. The pronunciation rating process of the system is shown in Fig. 4. After the system preprocesses the speech signal, MFCC feature extraction and DTW pattern matching are performed to calculate the frame matching distance between the standard speech template and the tester's speech template. If the user is using the system for the first time, expert rating is also required to generate the adaptive parameters, because the system can rate pronunciation only after the rating function has been determined. This step runs only once: the system saves the relevant parameters automatically, and the user obtains pronunciation scores directly in later sessions.

Speech feedback and mouth-shape correction
The relationship between pronunciation formants and mouth and tongue shapes. Chinese learners tend to pronounce English using Chinese articulation habits, but the two differ greatly. Studies suggest that the main problems in spoken English concern vowels: Chinese is articulated at the front of the oral cavity, whereas English relies on "back vowel pronunciation" [10].
To give the intelligent spoken English training system a real feedback function, this paper proposes formant-graph-based mouth-shape correction, built on the relationship between mouth shape and formants [11]. The schematic diagram of this relationship is shown in Fig. 5 [15]. A vowel has three formants, F1, F2, and F3. F1 has the highest spectral energy and carries the basic characteristics of the speech formants, so this paper adopts F1 as the basis for evaluating articulation quality. The formant frequency reflects the positions of the oral cavity and the tongue: the higher the frequency, the lower the tongue and the wider the mouth opening.

Fig. 5. Different pronunciation resonance diagram
Formant-based mouth-shape correction. Based on the relationship between the articulation formants and the mouth and tongue shapes, the standard speech template and the tester's template can be compared to reveal the difference between their formants and so guide the tester in correcting the mouth and tongue shape. The comparison of formants is shown in Fig. 6 [12], where the black and red lines represent the formants of the learner and of the standard speech template, respectively. The difference between the two indicates how similar the learner's articulation is to the standard, and the mouth shape is corrected on that basis. In the case shown in Fig. 6, the F1 relationship described above implies that the learner should narrow the mouth and raise the tongue to match the standard articulation.
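The comparison logic can be illustrated with a deliberately crude Python sketch. It assumes that F1 can be approximated by the strongest spectral peak in a 200 to 1000 Hz band of a single vowel frame; that band, the 50 Hz tolerance, and the function names are hypothetical, and practical formant trackers usually rely on spectral-envelope methods such as linear prediction. The hint follows the rule stated above: a higher F1 corresponds to a lower tongue and a wider mouth opening.

```python
import cmath
import math


def estimate_f1(frame, sample_rate, low_hz=200.0, high_hz=1000.0):
    # Crude F1 estimate: frequency of the largest DFT magnitude inside the band.
    n_len = len(frame)
    k_low = int(math.ceil(low_hz * n_len / sample_rate))
    k_high = int(math.floor(high_hz * n_len / sample_rate))
    best_k, best_mag = k_low, -1.0
    for k in range(k_low, k_high + 1):
        x_k = sum(frame[n] * cmath.exp(-2j * math.pi * n * k / n_len)
                  for n in range(n_len))
        if abs(x_k) > best_mag:
            best_k, best_mag = k, abs(x_k)
    return best_k * sample_rate / n_len


def mouth_shape_hint(learner_f1, standard_f1, tolerance_hz=50.0):
    # Higher F1 means a lower tongue and a wider mouth, so correct the other way.
    if learner_f1 > standard_f1 + tolerance_hz:
        return "narrow the mouth and raise the tongue"
    if learner_f1 < standard_f1 - tolerance_hz:
        return "open the mouth wider and lower the tongue"
    return "mouth shape is close to the standard"


def tone(freq_hz, sample_rate, n_len, amplitude=1.0):
    # Synthetic stand-in for a vowel frame; real input would be recorded speech.
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_len)]


# Usage with synthetic frames: the "learner" peaks at 750 Hz, the "standard" at 500 Hz.
sample_rate, n_len = 8000, 256
learner = [a + b for a, b in zip(tone(750, sample_rate, n_len),
                                 tone(2000, sample_rate, n_len, 0.5))]
standard = tone(500, sample_rate, n_len)
hint = mouth_shape_hint(estimate_f1(learner, sample_rate),
                        estimate_f1(standard, sample_rate))
```

In this synthetic case the learner's peak lies above the standard's, so the hint asks the learner to narrow the mouth and raise the tongue, mirroring the correction described for the formant comparison.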
Designing the speech feedback module process. The speech feedback module extracts the formants from the tester's template and the standard speech template after preprocessing and FFT, and displays them graphically so that learners can see the gap between their own mouth shape and the standard one. The pronunciation feedback module process is shown in Fig. 7.
Designing the user interface structure. The system runs on Android mobile phones for ease of operation and learning. The user interface is designed to be simple and concise; its structure is shown in Fig. 8. From the main interface, the user can select vowel, consonant, or word pronunciation practice and click the corresponding button to start it. The phonetic symbol practice interface provides phonics demonstration, pronunciation, practice, repeat-after, and other functions. Beginners can find the rating parameter adaptation function in the menu, which also contains the Help and Launch functions.

System Implementation
The SRT-based intelligent spoken English pronunciation training and evaluation system is developed in the Eclipse integrated development environment [13] and runs on a physical Android mobile phone. This paper takes the main interface of the system and a single phonetic symbol pronunciation exercise as examples to introduce the implementation of the system functions.
The system interfaces are implemented by extending Activity [14]. Fig. 9 shows the main interface of the system, with the vowel, consonant, and word pronunciation function keys at the upper left. Clicking the vowel phonetics practice key opens the individual interface for vowel phonetics practice shown in Fig. 10. The main interface for vowel phonetics practice is shown in Fig. 11. The learner can click function keys such as articulation demo, repeat-after, contrast, and evaluation to enter the corresponding function interface. Clicking Pronunciation evaluation brings up the Scores dialog shown in Fig. 12. To correct his or her pronunciation, the learner clicks the View formant button, and the system displays a comparison of the learner's formants with those of the standard articulation, as shown in Fig. 13. Based on this comparison, the learner can adjust the mouth and tongue shape using the method described above and repeat the pronunciation test and comparison.

Conclusion
This paper explores SRT theory and proposes an SRT-based intelligent spoken English pronunciation training and evaluation system to improve the pronunciation learning efficiency of English learners. The specific results are as follows: based on SRT, a rating method using adaptive parameters (AP) and a formant map suited to spoken English pronunciation evaluation and feedback are obtained; the Android system is used as the application platform, and the SRT-based intelligent spoken English pronunciation training and evaluation system is designed in detail; the system runs on a physical device, and the availability test, illustrated with the main interface and individual pronunciation practice, shows that it helps improve learners' spoken English pronunciation to a certain extent.