Design of Students’ Spoken English Pronunciation Training System Based on Computer VB Platform

Abstract: Spoken English is the most common medium of international communication. However, inaccurate pronunciation is a key factor holding back English learners in China. In view of the generally low level of spoken English proficiency in China, this paper designs a spoken English pronunciation training system that provides guidance and help for learners' pronunciation. The system is built on the Visual Basic platform. The paper first studies the theory of voice recognition in depth, discusses algorithms for voice scoring and pronunciation correction, and puts forward a more practical and convenient AP-based scoring method, which provides theoretical support for the system design. Then, through functional analysis and design, it implements a system for scoring and correcting spoken English pronunciation on the VB platform. The basic functions of the system include follow-along practice for English phonetic symbols and words, real-time voice evaluation, and pronunciation error correction. In testing, the similarity between the system's scores and the experts' scores exceeds 90% and the efficiency of pronunciation error correction reaches 80%, so the system can play a role in improving learners' spoken English.


Introduction
With the advance of global integration and the implementation of the "Belt and Road" and "going global" strategies, people in China have more and more opportunities to communicate directly with foreigners in spoken English, and more and more people are learning English [1]. However, pronunciation is a weakness for most Chinese learners of English and is a key factor preventing them from taking part in English communication.
Many translation programs offer basic pronunciation aids, such as word pronunciation playback, to help with inaccurate spoken English. However, they cannot evaluate a learner's speech or correct pronunciation errors for individual learners [2]. Voice recognition technology is now mature and has been applied in mobile devices and computers. Advances in voice recognition and in pronunciation feedback technology have made automatic voice evaluation and pronunciation error correction possible.
Spoken English proficiency is generally low in China for three reasons: English and Chinese pronunciation differ greatly, domestic schools lack teachers with accurate pronunciation, and opportunities for spoken English practice are far from sufficient [3]. To raise the level of spoken English pronunciation, scholars have begun to study computer-assisted pronunciation learning based on voice recognition technology. However, existing applications are too limited in scope to be suitable for wide adoption, and problems remain in their voice evaluation and error correction methods [4].
This paper studies the theory of voice recognition technology in depth and, to address the inaccurate scoring of voice recognition technology in spoken language learning, puts forward an AP-based scoring algorithm. Combined with the Visual Basic platform, it provides a feasible technical solution for spoken English training. The system's input and output modules, scoring module, voice feedback module and user interface are analyzed and designed in detail. Finally, the spoken English pronunciation training system is implemented and achieves good results in actual testing. The system not only offers spoken English learners a new learning aid, but also helps promote voice recognition scoring and error correction technology in spoken English evaluation.

VB computer technology
Visual Basic (VB) is a general-purpose, object-based programming language developed by Microsoft. It is a structured, modular, object-oriented visual programming language with an event-driven development environment [5].
Visual Basic originated from the BASIC programming language. VB has a graphical user interface (GUI) and a rapid application development (RAD) system that makes it easy to connect to a database through DAO, RDO or ADO, or to create ActiveX controls, so that type-safe, object-oriented applications can be produced efficiently. Programmers can use the components provided by VB to build an application quickly. In traditional programming languages, the interface of an application (such as its appearance and layout) is designed by writing code, and the actual effect of the interface cannot be seen while designing. In Visual Basic 6.0, by contrast, object-oriented programming encapsulates programs and data as objects, each of which is visual [6]. In interface design, developers can use the Visual Basic 6.0 toolbox to "draw" objects such as windows, menus and command buttons on the screen and set the properties of each object. Developers then only need to write code for the objects' event procedures, which greatly improves programming efficiency.
This paper chooses the VB platform as the human-computer interface of the spoken English pronunciation training system. With the relevant voice recognition algorithms and evaluation feedback methods embedded and the corresponding programming completed, it serves as a simple, practical and efficient platform for spoken English evaluation.

2.2 Design of language recognition algorithm
Introduction of recognition algorithm: Recognizing the learner's pronunciation is the prerequisite of an intelligent pronunciation training system: only after the voice has been recognized can further evaluation and feedback be carried out. In essence, voice recognition is the recognition of natural human speech by a machine [7].
At present, there are three common voice recognition methods. Their characteristics are shown in Table 1.

Table 1. Common voice recognition methods and their characteristics
Voice recognition method | Characteristic
Method based on sound channel model | The acoustic model and voice knowledge are too complicated; has not reached the practical stage
Pattern matching method | Has reached the practical stage; commonly used techniques include dynamic time warping (DTW), hidden Markov models (HMM) and vector quantization (VQ)
Artificial neural network method | Too complex to implement; still in the experimental research stage

As the table shows, the artificial neural network and sound channel model methods have not yet reached the practical stage, so pattern matching is used for automatic voice recognition. The basic flow chart of the technique is shown in Figure 1. The main steps of voice recognition are pretreatment, feature extraction, pattern matching and measure estimation. The input voice signal passes through these steps to produce the recognition result.
Pretreatment of voice signal: A voice signal is an analog signal with varying amplitude, and a computer can recognize it only after it has been digitized.
According to the Nyquist theorem, the sampling rate must be at least twice the signal bandwidth. The frequency of a voice signal generally lies between 300 and 3400 Hz, so a sampling rate of at least 6800 Hz is required. This paper uses a sampling rate of 8000 Hz with 16-bit quantization.
Pre-emphasis of voice signal: The power spectrum of a voice signal falls off as frequency rises: each time the frequency doubles, the amplitude of the power spectrum drops by about 6 dB. To compensate, the high-frequency part of the signal is emphasized with a first-order high-pass filter whose transfer function is:
$$H(z) = 1 - a z^{-1} \tag{1}$$
In the time domain, the pre-emphasized signal is:
$$\tilde{s}(n) = s(n) - a\,s(n-1) \tag{2}$$
where $s(n)$ is the input signal, $\tilde{s}(n)$ is the pre-emphasized signal, and $a$ is the pre-emphasis coefficient. Its value is generally close to 1; this system uses $a = 0.97$.
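The pre-emphasis step can be sketched in a few lines. This is a minimal illustration, assuming the standard first-order difference form of the filter (each output sample is the input sample minus a times the previous input sample) and the coefficient 0.97 stated above. The function name `pre_emphasize` is hypothetical and is not taken from the system itself.

```python
def pre_emphasize(samples, a=0.97):
    """Apply first-order pre-emphasis: out[n] = s[n] - a * s[n-1].

    The first sample is passed through unchanged, which is the same as
    assuming the signal was zero before it started.
    """
    if not samples:
        return []
    out = [float(samples[0])]
    for n in range(1, len(samples)):
        out.append(samples[n] - a * samples[n - 1])
    return out


if __name__ == "__main__":
    # A constant (zero-frequency) signal is almost cancelled after the
    # first sample, which is the high-pass behaviour the filter is for.
    print(pre_emphasize([1.0, 1.0, 1.0, 1.0]))
```

Because the filter only needs the previous sample, it can be applied to the recorded signal in a single pass before framing.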
Framing and windowing of voice signal: Next, the voice signal is divided into frames of 256 sampling points each, using continuous framing. To reduce discontinuities at frame boundaries, each frame is usually multiplied by a window function. Commonly used window functions include the rectangular window and the Hamming window, defined for a frame of $N$ samples as follows.
Rectangular window function:
$$w(n) = \begin{cases} 1, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{3}$$
Hamming window function:
$$w(n) = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
The choice of window function affects the short-term features of the signal. It involves two main considerations: window shape and window length. The system uses a rectangular window for time-domain endpoint detection and a Hamming window for short-time frequency transform processing.
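The framing and windowing step might look like the sketch below. It assumes the usual Hamming coefficients (0.54 and 0.46) and reads "continuous framing" as back-to-back frames with no overlap. That reading is an assumption, so the `hop` parameter is exposed to allow overlapping frames as well. All function names are hypothetical.

```python
import math


def hamming_window(length):
    """Return Hamming coefficients w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def split_frames(samples, frame_len=256, hop=None):
    """Split samples into frames; a trailing partial frame is dropped.

    hop is the distance between frame starts. By default it equals
    frame_len, which gives back-to-back frames (an assumed reading of
    "continuous framing").
    """
    if hop is None:
        hop = frame_len
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def apply_window(frame, window):
    """Multiply a frame element-wise by the window coefficients."""
    return [s * w for s, w in zip(frame, window)]
```

A rectangular window corresponds to using the frames from `split_frames` unchanged, while `apply_window(frame, hamming_window(256))` tapers each frame toward its edges before a frequency transform.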
Voice signal endpoint detection: Endpoint detection finds the starting and ending points of each segment of the voice signal by means of digital processing techniques and related algorithms. Accurate endpoints are key to the accuracy of signal detection [8]. This paper adopts an endpoint detection method that combines short-time energy with the short-time zero-crossing rate. The method is simple, computationally light and highly reliable. As the voice endpoint detection diagram in Figure 2 shows, combining short-time energy with the short-time zero-crossing rate works very well in determining the starting and ending points of each segment, which prepares the signal for further processing [9].
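One way to combine the two measures is sketched below. The decision rule and the three thresholds (`energy_high`, `energy_low`, `zcr_min`) are illustrative assumptions; the paper does not state the thresholds or the exact rule it uses. The idea is that voiced sounds show high energy, while unvoiced consonants show modest energy but a high zero-crossing rate.

```python
def short_time_energy(frame):
    """Sum of squared sample values in one frame."""
    return sum(s * s for s in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def detect_endpoints(frames, energy_high, energy_low, zcr_min):
    """Return (first, last) indices of frames judged to be speech, or None.

    A frame counts as speech when its energy exceeds energy_high, or when
    its energy exceeds energy_low and its zero-crossing rate exceeds
    zcr_min. The thresholds are assumptions to be tuned on real recordings.
    """
    active = []
    for i, frame in enumerate(frames):
        energy = short_time_energy(frame)
        zcr = zero_crossing_rate(frame)
        if energy > energy_high or (energy > energy_low and zcr > zcr_min):
            active.append(i)
    if not active:
        return None
    return active[0], active[-1]
```

In practice the thresholds would typically be derived from the background noise level measured in the first few frames of a recording.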
Feature extraction of voice signal: Feature extraction calculates the key parameters that reflect the characteristics of the voice signal, so that the signal is described effectively by a small number of parameters and subsequent processing is easier. MFCC (Mel-frequency cepstral coefficient) extraction is a commonly used feature extraction method that both reflects the characteristics of the voice and reduces the influence of noise. The steps of MFCC feature parameter extraction are shown in Figure 3. MFCC converts the frequency domain to the Mel frequency domain, which smooths the voice spectrum and reduces the effect of harmonics. As a result, recognition is less affected by the tone or volume of the input voice, which suits spoken English pronunciation recognition.
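The conversion to the Mel frequency domain can be illustrated with the widely used formula mel = 2595 log10(1 + f / 700). The sketch below covers only this conversion and the placement of filter centers, not the full MFCC pipeline. The number of filters (24) and the upper limit of 4000 Hz (half the 8000 Hz sampling rate) in the usage note are assumptions, and the function names are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Convert a Mel value back to Hz (inverse of hz_to_mel)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, low_hz, high_hz):
    """Center frequencies in Hz of filters spaced evenly on the Mel scale."""
    low_mel = hz_to_mel(low_hz)
    high_mel = hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (k + 1)) for k in range(num_filters)]
```

Calling `mel_center_frequencies(24, 0.0, 4000.0)` gives centers that are closely spaced at low frequencies and spread out at high frequencies, which mirrors how the ear resolves pitch.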
Pattern matching of voice signal: In voice recognition evaluation, the similarity between the voice to be evaluated and the reference standard voice is measured by comparing their feature parameters. However, because the two differ in pronunciation length and speaking rate, they cannot be matched directly. A matching method is therefore needed to align the feature parameters. Dynamic time warping (DTW) is a nonlinear normalization method that combines time warping with distance calculation. The distance between the eigenvectors expresses how well the template to be tested matches the reference template: the larger the distance, the lower the matching similarity. The distance between the eigenvectors T(n) and R(m) is usually expressed by the Euclidean distance:
$$d\big(T(n), R(m)\big) = \sqrt{\sum_{i=1}^{p} (t_i - r_i)^2} \tag{5}$$
where $t_i$ and $r_i$ are the i-th components of the eigenvectors T(n) and R(m) respectively, and $p$ is the order of the eigenvector. DTW requires a time warping function m = w(n), which maps the time axis n of the template to be tested non-linearly onto the time axis m of the reference template so that the total matching distance is minimized.
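A textbook form of DTW is sketched below for illustration. It uses the Euclidean distance between feature vectors and the basic step pattern (match, insertion, deletion) with no path constraints. The paper does not specify its step pattern or constraints, so those choices are assumptions, as are the function names.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal order."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(test, ref):
    """Minimum accumulated distance aligning two feature-vector sequences.

    test and ref are lists of feature vectors (one vector per frame) and
    may have different lengths.
    """
    n, m = len(test), len(ref)
    inf = float("inf")
    # cost[i][j] is the best accumulated distance for test[:i] and ref[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(test[i - 1], ref[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # test frame repeated
                                 cost[i][j - 1],      # ref frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

With this formulation, a word spoken more slowly than the reference still yields a small distance as long as the sequence of sounds is the same, because repeated frames are absorbed by the warping path.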

Pronunciation Feedback Evaluation Methods and Technologies
The scoring method in this paper is based on adaptive parameters (AP) and provides scoring feedback on spoken English pronunciation. The AP-based scoring algorithm is given by formula (6), where x and y are adaptive parameters whose values are not fixed in advance and can adapt to the computer or hardware settings [10]. The AP-based scoring algorithm is shown in Figure 3. Before scoring, the spoken English scoring system generates the adaptive parameters through a separate scoring parameter generation module. Learners pronounce different sounds, and experts score each pronunciation based on experience. The expert scores correspond one to one with the MFCC frame matching distances, giving n pairs of data whose relationship follows formula (6) (Saastamoinen, J et al., 2005). The estimated values of the parameters x and y are calculated from three scoring samples. With parameters obtained in this way, the system's scores are highly similar to the expert scores, which makes the system more accurate and valuable for scoring spoken pronunciation [11].
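To make the idea of fitting two adaptive parameters from expert scores concrete, the sketch below assumes a hypothetical mapping score = 100 / (1 + x d^y), where d is the matching distance. This form is an assumption chosen for illustration and is not confirmed as the paper's formula (6). Under that assumption, ln(100/score - 1) = ln(x) + y ln(d) is a straight line in ln(d), so x and y can be estimated by least squares from a few (distance, expert score) pairs. The function names are hypothetical.

```python
import math


def ap_score(distance, x, y):
    """Hypothetical mapping from matching distance to a 0-100 score."""
    return 100.0 / (1.0 + x * distance ** y)


def fit_adaptive_parameters(distances, expert_scores):
    """Estimate x and y by least squares in log space.

    Assumes score = 100 / (1 + x * d**y), which is an illustrative form
    and not necessarily the paper's formula. Requires d > 0,
    0 < score < 100, and at least two distinct distances.
    """
    us = [math.log(d) for d in distances]
    vs = [math.log(100.0 / s - 1.0) for s in expert_scores]
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    var_u = sum((u - mean_u) ** 2 for u in us)
    cov_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    y = cov_uv / var_u
    x = math.exp(mean_v - y * mean_u)
    return x, y
```

For example, `fit_adaptive_parameters([2.0, 5.0, 10.0], expert_scores)` uses three samples, in line with the three scoring samples mentioned above, and the fitted pair can then be passed to `ap_score` for new pronunciations.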

Design and Realization of Students' Spoken English Pronunciation Training System
The main function of the system is to support learning and training of the pronunciation of English phonetic symbols and words through multimedia output, and to give evaluative feedback on learners' pronunciation while guiding them to keep training and improve. The basic functions of the system include pronunciation demonstration, follow-along pronunciation, pronunciation comparison, pronunciation scoring and graphical output of pronunciation results [12].

System function module analysis and design
I/O module design: This module sets the input and output modes of the system. The system records the voice signal with an audio record component and outputs the corresponding voice signal through an audio track component. The final audio format of the system is as follows. Sampling frequency: 8000 Hz; channels: mono; sample size: 16 bits. The system provides a pronunciation demonstration for every phonetic symbol.
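The audio format above (8000 Hz, mono, 16 bits) can be illustrated with Python's standard `wave` module. This sketch only shows how such a format is declared and written. It is not the system's VB implementation, and the sine tone is stand-in audio in place of a real recording. The helper names are hypothetical.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000   # sampling frequency in Hz
NUM_CHANNELS = 1     # mono
SAMPLE_WIDTH = 2     # bytes per sample, i.e. 16 bits


def write_wav(target, samples):
    """Write floats in [-1.0, 1.0] as 16-bit mono PCM at 8000 Hz.

    target may be a file path or a writable, seekable binary file object.
    """
    with wave.open(target, "wb") as wav:
        wav.setnchannels(NUM_CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        # Clip to the valid range, then scale to signed 16-bit integers.
        pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
        wav.writeframes(struct.pack("<%dh" % len(pcm), *pcm))


def make_tone(freq_hz, seconds):
    """Generate a sine tone, used here only as stand-in audio."""
    count = int(SAMPLE_RATE * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(count)]


if __name__ == "__main__":
    buffer = io.BytesIO()
    write_wav(buffer, make_tone(440.0, 0.5))
    print(len(buffer.getvalue()), "bytes written")
```

At this format one second of audio takes 16000 bytes of sample data, which keeps the stored reference pronunciations small.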
Scoring module design: The scoring module uses the AP pronunciation scoring technique and consists of scoring parameter generation and pronunciation scoring. The voice signal is first processed in the order described in Section 2.2, and the pronunciation is then scored with the adaptive scoring parameters x and y.
Feedback module design: The feedback module graphically contrasts the standard reference pronunciation with the learner's practice pronunciation, so that the differences between the two can be judged qualitatively.
A simple, clear interface is very important for making the system easy for spoken English learners to use [13]. Figure 5 shows the structure of the user interface. After the start-up screen, the user enters the main interface of the system. According to the learner's learning content, the main interface offers dedicated practice for "vowel phonetic", "consonant phonetic" and "pronunciation of words" [14]. The system also provides adaptive scoring and a help interface to assist learners in completing pronunciation training and word scoring.

Realization of system
Operating environment of system: Hardware: i3 or better processor, 2 GB memory, 500 GB hard disk. Operating system and software: Windows XP, Windows 7 or another Windows operating system; Visual Basic 6.0 or above; Office and other common office software.
Audio input device: general-purpose noise-cancelling microphone.
Realization of system: The relevant controls are written in VB and embedded with the voice recognition technology and the AP adaptive scoring function, and the corresponding expert pronunciations are stored for comparison with the input voice. After the controls are programmed, the system functions and interface are built according to the user interface structure diagram. Unregistered users can click the training button to start training and, after entering the main interface, become familiar with the system's operating functions and training effects. Users who are satisfied with the system can save training records and set up personalized pronunciation training.
Figure 7 shows a set of word training samples with which learners can compare and score their pronunciation after recording it. For words with low pronunciation scores, users can repeat after the reference pronunciation to improve accuracy through training.
[Figure: English spoken pronunciation training system]

System test of spoken English pronunciation training system
The test sample consists of 20 vowels, 24 consonants and 12 words. As shown in Table 2, the system is tested on three aspects: pronunciation success rate, scoring accuracy and error correction efficiency. The table shows that the system achieves good results and can meet learners' needs for spoken English pronunciation training.

Conclusion
Improving the accuracy of spoken English pronunciation is an urgent need for Chinese learners of English, who lack dedicated, systematic pronunciation training and guidance. As internationalization advances, spoken English is attracting more and more attention. Through an in-depth analysis of voice recognition technology and pronunciation scoring algorithms, combined with the computer VB platform and full consideration of the functional requirements, this paper designs a spoken English pronunciation training system for students based on the VB platform. The conclusions and significance of the system design are as follows:
• The AP-based voice-scoring algorithm in this paper has high accuracy and is well suited to scoring spoken English pronunciation.
• The system provides a convenient spoken English training platform that can significantly improve the accuracy of spoken English pronunciation.