Study

Abstract: The development of automatic speech recognition (ASR) technology makes human-computer interaction possible. Considering the current deficiencies in English pronunciation teaching, this paper applies the principles of speech recognition to propose the design of an automatic English pronunciation grading system for the English teaching field. It describes the system structure, functions and processes in detail, and introduces the key technologies and steps to implement the system: the dynamic time warping algorithm, the establishment of corpuses, consonant/vowel (C/V) segmentation technology and grading standards. Results of a small-scale test show that the system is useful for testing the English pronunciation of international students.


Introduction
Language is a main tool in interpersonal communication. In the context of increasing international economic and trade activities, more and more people are beginning to learn the languages of other countries, the most popular of which is undoubtedly English. At present, English teaching is becoming a hot spot in the education field. However, the traditional English teaching approach is not very effective and does not meet the requirements of good English education. In response, computer-aided language learning (CALL) technology has been developed and widely used in English education and learning [1]. However, in applying this technology, due to a lack of experience, many teachers put too much emphasis on English reading and writing and pay little attention to speaking. As a result, students are often learning "dumb English". People all over the world are actively learning English, and the market demand for English learning is growing, but English teachers are in short supply and cannot meet the teaching needs. Pronunciation training is an important part of language learning. Through this training, students can find their own pronunciation errors and correct them in a timely manner [2].
During classroom teaching, teachers can promptly find students' pronunciation errors and correct them in time. However, influenced by the traditional teaching philosophy, many English teachers have a certain "fault tolerance" for students' poor pronunciation. As a result, many students do not pronounce correctly. Although these international students can hold simple dialogues with their teachers and the communication is effective, they find it very difficult to communicate smoothly with ordinary foreigners [3][4][5]. Therefore, English pronunciation teaching needs a scientific and reasonable evaluation system to assess students' pronunciation.

ASR Technology
This technology mainly uses the pattern recognition principle to recognize speech, and its theoretical framework is shown in Fig. 1. As can be seen from Fig. 1, it mainly consists of units such as speech signal preprocessing, feature extraction and modeling, which are analyzed in detail below. The operation of an ASR system can be divided into two stages: training and recognition. Both stages require preprocessing the original voice input and extracting the corresponding features. The preprocessing module filters noise and background hiss out of the original speech signal, divides the signal into frames and processes them frame by frame [6][7][8]. The feature extraction module calculates the acoustic parameters of the speech in order to extract the key feature parameters for subsequent speech processing [9]. In speech processing, this system mainly uses parameters such as amplitude, energy, zero-crossing rate, cepstrum coefficients and short-term spectrum. Recognition is carried out on these parameters, so the choice of features is the core of the system construction.
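As an illustration of two of the parameters named above, the following minimal Python sketch splits a signal into frames and computes the short-time energy and zero-crossing rate of each frame. The function names, the frame length, the hop size and the toy sine signal are assumptions made for this example and are not taken from the system described in this paper.

```python
import math

def split_into_frames(samples, frame_len, hop):
    """Cut a list of samples into overlapping frames (a trailing partial frame is dropped)."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Sum of squared sample values within one frame."""
    return sum(x * x for x in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Toy input (assumed): a 50 Hz sine sampled at 1 kHz for 0.1 s.
signal = [math.sin(2 * math.pi * 50 * n / 1000) for n in range(100)]
frames = split_into_frames(signal, frame_len=20, hop=10)

# One (energy, zero-crossing rate) pair per frame.
features = [(short_time_energy(f), zero_crossing_rate(f)) for f in frames]
```

A real front end would typically add pre-emphasis, windowing and cepstral analysis; the sketch only shows how frame-level parameters are derived from the raw samples.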
In the training stage, the user first inputs a certain amount of training speech. The system performs preprocessing and feature extraction, determines the required feature vector parameters, and then establishes the corresponding speech reference model library based on these parameters. If the conditions are not sufficient to establish one, the system can instead take an existing model library and correct it as appropriate. In the recognition stage, the system computes similarity measures between the input feature vector parameters and the templates in the library, determines the type with the highest similarity, and outputs it as the result.
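The two stages can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: each utterance is reduced to a single fixed-length feature vector, a template is the mean of its training vectors, and plain Euclidean distance stands in for the similarity measure. The labels and numbers are invented for the example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def train(labelled_examples):
    """Training stage: build a library mapping each label to its mean feature vector."""
    grouped = {}
    for label, vec in labelled_examples:
        grouped.setdefault(label, []).append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()}

def recognize(library, vec):
    """Recognition stage: return the label of the closest (most similar) template."""
    return min(library, key=lambda label: euclidean(library[label], vec))

# Hypothetical training data: two examples of "yes", one of "no".
library = train([("yes", [1.0, 0.0]), ("yes", [0.8, 0.2]), ("no", [0.0, 1.0])])
result = recognize(library, [0.9, 0.2])
```

In practice the templates are sequences of frame vectors of varying length, which is why a time-alignment method is needed before distances can be compared.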

Automatic English Pronunciation Grading System
Considering the language-learning pattern, pronunciation learning with this model should proceed from easy to difficult, and the learning process should be divided into the following stages: the first step is to learn the basic pronunciation, mainly including English vowels and consonants; the second step is to learn the pronunciation of words and combinations of vowels and consonants and the tone and intonation of sentences; and the third step is mainly to train the pronunciation of phrases and sentences, and teach students to master co-articulation and rhythm. Specifically, this system can be divided into four modules, which are highly independent of each other. Details of the modules and their functions are as follows.
Standard model training module: it is mainly used to establish a standard corpus. This system compares the learner's pronunciation against the pronunciation in the standard corpus to measure the similarity and gives a corresponding score based on the comparison results. Therefore, the training and establishment of a standard model is one of the main parts of the work on this system. During the establishment, standard phonetic materials are usually needed to train the standard corpus in order to ensure the scientific validity of the results [13].
Expert grading module: it is mainly designed to evaluate the pronunciation grade of the learner. The score given out by this system may not be directly comprehensible, as it cannot measure the actual pronunciation level of the learner in a convenient manner. Therefore, it is necessary to establish grades for pronunciations in the non-standard corpus against the standard pronunciations, and evaluate the pronunciation based on these grades to make the results easier to understand.
Automatic grading module: it is mainly designed to compare and analyze the pronunciations in the standard corpus and those of the learner. Based on speech recognition, it employs a similarity algorithm to determine the indicators of certain pronunciations. This module collects the learner's pronunciation, preprocesses it and determines its features, and then segments the voice into a number of elements according to the established element model. After that, it grades the learner's pronunciation based on the similarity algorithm according to English pronunciation features [14][15][16].
Error analysis module: it is mainly designed to determine the score of the learner's pronunciation based on the expert knowledge base, and give advice on how to correct some of the pronunciation errors. Students can use this module to correct their own pronunciation problems and get some good advice and help. However, this process requires a lot of expertise, so it is still under improvement. The flow chart of this system is shown in Fig. 2.

Key Technologies to Implement the System

DTW (dynamic time warping) algorithm
In the recognition process, directly comparing the reference template with the input template without any time alignment does not give good results. This is mainly because speech signals are random to a certain degree: different people pronounce differently, and even the same person may pronounce the same word differently at different moments and with different durations [17][18][19][20]. It is therefore necessary to carry out time correction, that is, to use dynamic programming to convert the global optimization problem into simple local optimization problems and then solve these one by one. Below is a detailed discussion of this process.
For convenience of analysis, the feature vector sequence of the reference template is expressed as $X = \{X_1, X_2, \ldots, X_I\}$, and the corresponding input speech feature vector sequence is $Y = \{Y_1, Y_2, \ldots, Y_J\}$. Let the time warping function be $C = \{c(1), c(2), \ldots, c(N)\}$, where $N$ is the length of the corresponding path and $c(n) = (i(n), j(n))$ pairs frame $i(n)$ of the reference with frame $j(n)$ of the input; $d(X_{i(n)}, Y_{j(n)})$ is the local matching distance between the two. The DTW algorithm uses local optimization to minimize the sum of the weighted distances along the path:

$$D = \min_{C} \sum_{n=1}^{N} w(n)\, d\big(X_{i(n)}, Y_{j(n)}\big) \qquad (1)$$

where $w(n)$ is the weight of the $n$-th step on the path.

When we use the DTW algorithm in speech computation, we first need to determine the optimal time warping function, and then use this function to map the timeline of the speech under test onto the reference model so that the accumulated distortion is minimized. This algorithm was initially used in the pattern recognition field and was later widely applied in English speech recognition. It has since become a classic algorithm that can recognize isolated words quickly and efficiently without much computation, which is why it has been extensively applied in the speech recognition field.
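The dynamic programming idea can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the system: it assumes uniform step weights, the three basic steps (horizontal, vertical, diagonal), no slope constraints and Euclidean local distance. The function names and the toy sequences are chosen for the example.

```python
import math

def euclidean(u, v):
    """Local matching distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(x, y, dist=euclidean):
    """Accumulated distance of the best warping path between sequences x and y."""
    inf = float("inf")
    rows, cols = len(x), len(y)
    # acc[i][j] = minimum accumulated distance aligning x[:i] with y[:j]
    acc = [[inf] * (cols + 1) for _ in range(rows + 1)]
    acc[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            cost = dist(x[i - 1], y[j - 1])
            # Local optimization: extend the cheapest of the three predecessors.
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[rows][cols]

# Toy one-dimensional "utterances" (assumed data).
reference = [[0.0], [1.0], [2.0], [1.0], [0.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0], [1.0], [0.0]]  # same shape, spoken slower
different = [[2.0], [2.0], [2.0], [2.0], [2.0]]

same_word = dtw_distance(reference, slow)
other_word = dtw_distance(reference, different)
```

The slower rendition of the same contour aligns with zero accumulated distance, while the flat sequence does not, which is the behaviour that makes DTW tolerant of differences in speaking rate.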

Establishment of corpuses
When building the English pronunciation unit model and the corresponding grading model, we need corpuses for training. Two corpuses are used: a standard pronunciation corpus and a non-standard pronunciation corpus. The former is mainly used to train pronunciation units. To build this corpus, we choose 10 foreign teachers and Chinese international students whose pronunciations are standard to do the pronunciation unit training, so that the pronunciations provided by this corpus are not only standard but also easier for students to accept. Practical experience shows that if a corpus only contains the pronunciations of native speakers, international students do not get high scores and therefore lose interest in English learning. An appropriate model therefore needs to be built based on actual conditions to achieve a better matching effect.
In building the standard corpus, we choose the robust model training approach, that is, to compare the pronunciations of a word by multiple persons, average the corresponding vectors along the DTW path and obtain the final model. The establishment and training process of the corpus is as follows. For a specific word, let $X_1$ and $X_2$ represent the feature vector sequences of the first and the second pronunciations respectively, and determine the distortion score $d(X_1, X_2)$ of the two through the DTW algorithm. If the result is smaller than the corresponding threshold value, the pronunciation feature vectors of the two can be deemed to meet the consistency requirement. Then determine the time-warped average of the two and obtain a new model $Y$. Each element $y_k$ is calculated as follows. If $T_y$ is the optimal path length in the DTW algorithm, the corresponding path sequence is expressed as:

$$C = \{c(1), c(2), \ldots, c(T_y)\}, \quad c(k) = (i(k), j(k)) \qquad (2)$$

The components of the new model can then be determined through the following formula:

$$y_k = \frac{1}{2}\big(x_{1,i(k)} + x_{2,j(k)}\big), \quad k = 1, 2, \ldots, T_y \qquad (3)$$

where $x_{1,i(k)}$ and $x_{2,j(k)}$ are the frames of $X_1$ and $X_2$ paired by the $k$-th step of the path.

The non-standard corpus is designed to help build an appropriate manual grading model and provide a reference. In this paper, in order to build a non-standard grading model, we choose the speeches of 50 foreign international students and foreign teachers with significantly different English pronunciation levels. Teachers then sort and grade these pronunciations.
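The robust training step for the standard corpus can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the DTW uses uniform weights and the three basic steps, consistency is tested with a strict "smaller than threshold" comparison, and the merged model has one averaged frame per step of the optimal path. The function names, the threshold values and the toy data are hypothetical.

```python
def dtw_path(x, y):
    """Return (accumulated distance, optimal path) for two sequences of vectors."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

    inf = float("inf")
    n, m = len(x), len(y)
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = dist(x[i - 1], y[j - 1]) + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1))
    path.reverse()
    return acc[n][m], path

def merge_templates(x1, x2, threshold):
    """Average two pronunciations along their DTW path if they are consistent."""
    distance, path = dtw_path(x1, x2)
    if distance >= threshold:
        return None  # the two pronunciations are too different to merge
    return [[(a + b) / 2 for a, b in zip(x1[i], x2[j])] for i, j in path]

# Toy one-dimensional pronunciations of the same word (assumed data).
first = [[0.0], [2.0], [4.0]]
second = [[1.0], [3.0], [5.0]]
model = merge_templates(first, second, threshold=4.0)
rejected = merge_templates(first, second, threshold=2.0)
```

Here the two toy pronunciations have a DTW distance of 3.0, so they are merged under the looser threshold and rejected under the stricter one. With more than two speakers, the same step could be repeated by merging each new pronunciation into the current model.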

C/V segmentation algorithm
C/V segmentation is used within the English speech evaluation system. This system normalizes the learner's speech and the speech in the standard corpus, obtains consistent speech feature parameters, performs matching computation of the similarity between the two on a C/V-segmented basis, and finally determines the Euclidean distance between the two to reflect the pronunciation level of the learner. For convenience, however, a system usually performs global matching computation on the two speeches as a whole. In an English syllable, the proportions of vowel and consonant are quite different, and the proportion of consonant is far smaller than that of vowel. As a result, some light and short consonants are very likely to be "drowned" by the stressed and long vowels. In this case, the consonant feature of a syllable is lost, causing an incorrect matching result. To solve this problem, the system has to separate the vowels and consonants first, then perform matching computation on the two separated parts respectively, and finally determine the matching result on a weighted basis [21][22].
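A small numerical sketch in Python shows why separate, weighted matching helps. The function names, the frame counts, the per-part distances and the consonant weight of 0.5 are all hypothetical values chosen for illustration; the paper does not specify the weights.

```python
def frame_weighted_distance(consonant_dist, n_consonant, vowel_dist, n_vowel):
    """Global matching: each part counts in proportion to its number of frames."""
    total = consonant_dist * n_consonant + vowel_dist * n_vowel
    return total / (n_consonant + n_vowel)

def cv_weighted_distance(consonant_dist, vowel_dist, consonant_weight=0.5):
    """Separate matching: consonant and vowel distances combined with fixed weights."""
    if not 0.0 <= consonant_weight <= 1.0:
        raise ValueError("consonant_weight must lie in [0, 1]")
    return consonant_weight * consonant_dist + (1.0 - consonant_weight) * vowel_dist

# Assumed syllable: a badly pronounced 2-frame consonant (mean distance 8.0)
# followed by a well pronounced 18-frame vowel (mean distance 1.0).
global_score = frame_weighted_distance(8.0, 2, 1.0, 18)
separated_score = cv_weighted_distance(8.0, 1.0, consonant_weight=0.5)
```

With global matching the consonant error contributes only 2 of 20 frames and the distance stays low (1.7), whereas the separated score (4.5) reflects the mispronounced consonant much more clearly.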
The segmentation principle in this algorithm is as follows: within one phoneme, the feature vectors of speech frames are highly similar, but across different phonemes, the feature vectors of speech frames are significantly different. We can use this characteristic to distinguish vowels and consonants, and we can measure the speech differences by the distance between phonetic segments.
For convenience of analysis, we call the overall difference between two speech segments the distance between the two, i.e. the distance between segments. Calculating this distance is a way of measuring similarity, and it is carried out on time frames: each phonetic segment consists of a certain number of continuous speech frames, and the distance between segments reflects the feature difference between the frames of the two segments.
Let a and b be two segments in one continuous speech signal. The numbers of time frames in these two segments are $N_a$ and $N_b$, and the two segments may overlap. The feature vectors of the time frames in these two segments are $X_i$ and $Y_j$ respectively, and the Euclidean distance between $X_i$ and $Y_j$ is $d_{ij}$. The distance between segments a and b can then be calculated as the average of these frame-to-frame distances:

$$D_{ab} = \frac{1}{N_a N_b} \sum_{i=1}^{N_a} \sum_{j=1}^{N_b} d_{ij} \qquad (4)$$

Based on the distance between segments, we can determine the difference between the two speech segments. If two adjacent speech segments belong to one phoneme, the distance between them will be short; otherwise, the distance will be long. In this way, we can separate the vowel and the consonant in a syllable based on the distance between segments.
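The segmentation idea can be sketched in Python as follows. The sketch assumes that the distance between segments is the mean of all pairwise Euclidean frame distances, and that the C/V boundary is the position where two adjacent fixed-size windows differ most. The window size, the function names and the toy frames are assumptions made for the example.

```python
import math

def segment_distance(seg_a, seg_b):
    """Mean Euclidean distance over all frame pairs, one frame from each segment."""
    total = sum(math.dist(x, y) for x in seg_a for y in seg_b)
    return total / (len(seg_a) * len(seg_b))

def find_boundary(frames, window):
    """Frame index at which the two adjacent windows are farthest apart."""
    best_index, best_distance = None, -1.0
    for t in range(window, len(frames) - window + 1):
        d = segment_distance(frames[t - window:t], frames[t:t + window])
        if d > best_distance:
            best_index, best_distance = t, d
    return best_index

# Toy syllable (assumed data): 4 consonant-like frames, then 6 vowel-like frames.
syllable = [[0.0, 0.0]] * 4 + [[3.0, 4.0]] * 6
boundary = find_boundary(syllable, window=2)
consonant, vowel = syllable[:boundary], syllable[boundary:]
```

On this toy input the largest segment distance occurs at frame 4, so the first four frames are treated as the consonant and the remaining six as the vowel. Real speech would need smoothing and a minimum-distance threshold to avoid spurious boundaries.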

Grading standards
The English pronunciation grading system plays an important role in English learning. It can determine the learner's pronunciation level and give appropriate tips. To grade a learner's pronunciation, we can take the standard pronunciation as the basis, calculate the distance between the standard pronunciation and the learner's, and give a pronunciation score based on this distance. This approach is convenient, but it still has some problems: the score is arbitrary and unstable, and the result depends on the learner's personal condition, is difficult to understand and is inconsistent with people's perception. Therefore, the expert grading method is usually used, which maps the measured pronunciation score to an expert's score and gives the final pronunciation result on a comparative basis. This method classifies the pronunciation quality into grades such as "very good", "good", "average" and "poor" based on the scores that experts give using their expertise. Its merits are clear meaning, conformity to perception habits and stable results, but the results can also be subjective. When this method is used to evaluate pronunciation, indicators such as articulation and naturalness should be chosen as the evaluation criteria.
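The mapping from a measured distance to an expert-style grade can be sketched in Python as a simple threshold lookup. The boundary values below are hypothetical placeholders; in a real system they would be calibrated against the expert-graded pronunciations in the non-standard corpus.

```python
import bisect

# Hypothetical upper bounds on the distance for each grade except the last one.
GRADE_BOUNDS = [10.0, 20.0, 35.0]
GRADES = ["very good", "good", "average", "poor"]

def grade_from_distance(distance):
    """Map a pronunciation distance (smaller is better) to a grade label."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    # bisect_right counts how many bounds are <= distance, which is the grade index.
    return GRADES[bisect.bisect_right(GRADE_BOUNDS, distance)]

labels = [grade_from_distance(d) for d in (4.2, 15.0, 28.5, 60.0)]
```

A lookup of this kind keeps the reported result in terms the learner understands, while the underlying distance remains available for finer analysis.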

Conclusions
This paper mainly studies the application of speech recognition technology to English pronunciation learning. In order to verify the consistency between the scores given by the learning software and by experts, we selected 20 Chinese international students at varying English levels as the research subjects and tested their pronunciations. We chose junior- and intermediate-level English textbooks as the testing materials and selected over a thousand words covering English syllables and tones from these materials. We tested the students in a language lab. Each test subject read 20 random words, which were recorded and graded by the system. At the same time, teachers graded their pronunciations. Comparing the results of the two grading methods, we found that the system's evaluations of students at different levels are consistent with the teachers' scores, which indicates that the system is helpful in training English pronunciation.
The test results also reveal some problems in this system that need improvement. For example, the system is not stable enough: it lacks standard training speech data recorded in an environment with interference, so the score for some rarely used words is often zero. Judging from actual practice, the grading mechanism used in this system is still not scientific enough and the scores are not clear enough, so the results are not stable. In future research, we need to work on the consistency and stability of scoring, establish a more standardized expert system and at the same time improve its functions so that the system can correct learners' mistakes while grading their pronunciations. This will improve the applicability of the system and lay a good foundation for its promotion in the future.

Author
Hongyan Zhao, postgraduate and associate professor, is an English teacher at Xi'an International University, 408 Zhangba Road, Shaanxi Xi'an, China. She has been engaged in English teaching for almost ten years and has taught more than ten courses such as English writing, advanced English, marketing English, business English, college English, etc. Her major research orientations are business English and English language teaching. Article submitted 05 May 2017. Published as resubmitted by the author 13 June 2017.