Development of Indonesian Text-to-Audiovisual Synthesis System Using Syllable Concatenation Approach to Support Indonesian Learning

Abstract—This study aims to develop an Indonesian text-to-audiovisual synthesis system using a syllable concatenation approach to support Indonesian learning. The system visualizes syllable pronunciation synchronized with the speech signal, so that it can provide a realistic illustration of the articulator movements when each phoneme is pronounced. The syllable concatenation approach is used to realize a realistic visualization by assembling articulation and coarticulation in the form of syllables. In developing the system, we recorded a speech database in syllable form that follows the patterns of syllables in Indonesian. The syllable concatenation approach is used to concatenate the viseme of each phoneme and to form the visualization of syllable pronunciations, which is synchronized with the corresponding speech from the speech database. Evaluation of the system is conducted based on "lip-reading" of 10 Indonesian sentences entered into the system. Ratings are based on the degree of correspondence between the syllable pronunciation and the speech produced. The assessment of all respondents is calculated using the MOS (Mean Opinion Score). The calculation results show that the Indonesian text-to-audiovisual system produces a more realistic and smoother pronunciation visualization.

1 Introduction

A realistic talking-head is one important part of character facial animation. In general, a talking-head is developed using visemes synchronized with the speech and the phonemes spoken. Several different phonemes are visualized by the same viseme, such as the phonemes 'm', 'p' and 'b'; therefore, these phonemes can be grouped into one class. In fact, the visualization of a phoneme can be differentiated based on the coarticulation that follows. For example, the visualization of the phoneme 'b' is different in the word 'buku' (book) and 'baca' (read), since the coarticulation that follows the articulation of the phoneme 'b' is different, as shown in Fig. 1. A phoneme pronunciation visualization of this type is called a dynamic viseme. The use of dynamic visemes is one way to produce a realistic talking-head. A text-to-audiovisual synthesis visualizes the mouth movements that occur when someone is communicating with others [5]. One part of such a system is the transcription of a text into phonemes. A text consists of prosodic units, such as phrases, clauses, and sentences. The combination of the phonetic transcript and information on prosodic units can be used to create a symbolic unit representing linguistic entities, and is considered the front-end of text-to-speech. Synthesizers use these symbols to represent linguistic entities and convert them into speech.

Related Works
There are several studies on dynamic visemes [5][6][7][8]. However, studies on Indonesian dynamic visemes are rarely conducted. A study of a visual speech synthesis method based on Chinese dynamic visemes is conducted by [5]. The dynamic viseme is constructed based on the parameters of the mouth features, which are classified using a clustering algorithm. This process obtains 40 basic static visemes combined with the types of consonants and vowels. The experiments show that the visual speech generated from the dynamic visemes is smoother and more realistic. An Indonesian text-to-audiovisual synthesis resulting from a study conducted by [6] is constructed based on Indonesian static viseme models. A model of static viseme classes is formed based on the results of a clustering process on a dataset of 2D images of mouth movements. Visualization transitions among viseme units are arranged using the viseme morphing method, so that the visualization of the phoneme pronunciation is smoother.
In the study conducted by [8], dynamic visemes are also applied to visual speech animation by assembling simple viseme units. Dynamic visual speech gestures are generated from the movement of the visual speech articulators. A subjective evaluation is performed to compare static and dynamic visemes. The evaluation results show that dynamic visemes can generate visual speech animation more accurately.
We have conducted a study on a text-to-audiovisual synthesis based on Indonesian dynamic viseme models. In our study, a dynamic viseme model is obtained based on a combination of viseme classes of consonant phonemes and vowel phonemes that form syllables which are in accordance with the Indonesian syllable pattern.

Indonesian Language
Indonesian was declared the national language in 1928. Indonesian is the interethnic liaison language (lingua franca) that unifies the many tribes in Indonesia. At first, Indonesian was written in a Latin-Roman alphabet that followed Dutch spelling. In 1972, Indonesian spelling was standardized using the Enhanced Indonesian Spelling System (Indonesian: Ejaan Yang Disempurnakan, abbreviated EYD). Indonesian uses units such as phrases and sentences. A sentence is the smallest unit, in verbal or written form, that reveals a complete thought. A sentence can consist of several elements, such as subject, predicate, object, complement, and adverbial. A combination of these elements forms a sentence that has meaning.

Phonology
Phonology is the branch of linguistics concerned with sound: the production of sound, the instruments of sound, and so on. There are two parts of phonology, namely phonetics and phonemics [9]. Phonetics is the part of phonology that studies how the sounds of a language are produced by the human vocal organs. Meanwhile, phonemics is the part of phonology that studies speech sounds according to their function in distinguishing meaning [9]. Phonology is also related to the terms phone, phoneme, consonant, and vowel. A phone is a neutral speech sound that has not yet been proven to distinguish meaning.

Definition of Phoneme
A phoneme is the smallest unit of sound of a language that distinguishes meaning [11]. For example, the letter 'h' in the word 'harus' (must) is a phoneme. If the letter 'h' in the word is omitted, it becomes 'arus' (current). The word 'harus' differs from 'arus', so the presence of the letter 'h' distinguishes the meaning. Indonesian phonemes consist of vowels and consonants. A vowel is a speech sound that does not meet an obstruction when expelled from the lungs. Vowels are divided into single vowels (monophthongs), which consist of 'a', 'i', 'u', 'e', 'o', and double vowels (diphthongs), which consist of 'ai', 'au', 'oi'. Meanwhile, a consonant is a speech sound that meets an obstruction when the air is expelled from the lungs.
A grapheme is the smallest distinguishing unit in a writing system [12]. A grapheme is the epitome of the letter: it refers to a letter or combination of letters as the unit symbolizing a phoneme in spelling. The relation of grapheme to phoneme can be one-to-one, as in the word 'kursi' (chair), which consists of the graphemes 'k', 'u', 'r', 's', 'i' and whose pronunciation also consists of the 5 phonemes 'k', 'u', 'r', 's', 'i'. The relation of grapheme to phoneme can also be many-to-one, as in the word 'ladang' (field), which consists of the graphemes 'l', 'a', 'd', 'a', 'n', 'g', whereas its pronunciation consists of the phonemes 'l', 'a', 'd', 'a', 'ng'. Here the graphemes 'n' and 'g' are represented by the single phoneme 'ng'. Table 1 shows the letters of the alphabet in Indonesian and Table 2 shows the phonemes in Indonesian.
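This grapheme-to-phoneme relation can be illustrated with a short sketch. This is a hypothetical greedy longest-match mapping, not the conversion used by the system described in this paper, and the digraph set is an assumption:

```python
# Hypothetical sketch: greedy longest-match grapheme-to-phoneme mapping.
# Digraphs such as 'ng' map to a single phoneme (many-to-one relation).
DIGRAPHS = {"ng", "ny", "kh", "sy"}  # assumed Indonesian digraph set

def graphemes_to_phonemes(word):
    phonemes = []
    i = 0
    while i < len(word):
        # Prefer a two-letter digraph when it matches.
        if word[i:i + 2] in DIGRAPHS:
            phonemes.append(word[i:i + 2])
            i += 2
        else:
            phonemes.append(word[i])
            i += 1
    return phonemes

print(graphemes_to_phonemes("kursi"))   # one-to-one: ['k', 'u', 'r', 's', 'i']
print(graphemes_to_phonemes("ladang"))  # many-to-one: ['l', 'a', 'd', 'a', 'ng']
```

The greedy two-letter lookahead is what turns the grapheme pair 'n', 'g' into the single phoneme 'ng' in 'ladang'.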

The Syllable Patterns in Indonesian
A syllable is a part of a spoken word pronounced in one breath and generally consists of several phonemes. Every syllable in Indonesian is characterized by a vowel (abbreviated V) that can be followed or preceded by a consonant (abbreviated C). The number of syllables is determined by counting the number of vowels in a word: if a word contains 3 vowels, the word is composed of 3 syllables. For example, the word 'cepat' (fast) is composed of two syllables, namely 'ce' and 'pat'. Each syllable contains a vowel sound, namely the sounds 'e' and 'a'. Every syllable must consist of at least a vowel sound or a combination of vowel and consonant sounds. Based on these rules, there are several patterns of syllables in Indonesian; the types of Indonesian syllable patterns are shown in Table 3.

There are several stages in this study: creating a speech database; performing a clustering process on consonant phonemes to obtain consonant viseme classes; building a viseme model; and synthesizing text-to-audiovisual through a process that synchronizes text, speech, and phonemes. Overall, these steps can be seen in Fig. 2.
The dataset used in the clustering process consists of 2D visual-speech images resulting from the extraction of a video containing scenes of a person saying 200 sentences in Indonesian. The sentences used in the recording cover all syllable patterns and phonemes of Indonesian. The focus in recording the video is on the mouth movements of the person pronouncing the sentences in Indonesian.

Formation of Viseme Model
The formation of the Indonesian viseme model is based on the result of the clustering process on the 2D visual-speech image dataset. We made a video with a duration of 6 minutes and 36 seconds containing the mouth movements of a person saying 200 sentences in Indonesian. The video is extracted into 10,000 frames of 2D images. We choose unique frames that represent certain consonant visemes by considering the phonemes before and after. This process results in 315 unique frames. Next, we change the image color format from RGB to grayscale, and crop every image frame to focus on the mouth area, with the same size for all images.
We use the Subspace LDA method to extract features and reduce the dimension. This method is a combination of the PCA and LDA methods. The PCA method projects the data in the directions with the largest variance, indicated by the eigenvectors corresponding to the largest eigenvalues of the covariance matrix [12]. Specifically, the task of the PCA method is to reduce the dimensions by performing a linear transformation of a high-dimensional space into a low-dimensional space. The LDA method, in contrast, aims to find a linear subspace that maximizes the between-class scatter (S_B) and minimizes the within-class scatter (S_W). The result of this process is classes that are linearly separated from each other.
The result of the feature extraction and dimension reduction process is the dataset used in the clustering process. Given an image dataset of M images, each of dimension row x column pixels, the images are projected into a two-dimensional matrix (T). To calculate the row average of the matrix T, we use (2).
where M_i is the number of data in the i-th row and x̄_i is the average of the data in the i-th row.
Next, the matrix ATrain, containing the difference between the image data in T and the row average values, is calculated using (3), where x̄_i is the average value of row x_ij. The covariance matrix S_T (total scatter matrix S_T) is then defined using (4).
The eigenvalues (D) and eigenvectors (V) are calculated from the covariance matrix S_T. An eigenvalue is a characteristic value of a square matrix, and the eigenvectors retained are those whose eigenvalues are greater than 0. In this study, the eigenvalues (D) and eigenvectors (V) are calculated using a built-in Matlab function. The eigenfaces, which characterize the image data, are calculated using (5).
The next task of the PCA method is to reduce the features still contained in the image data. Data dimensions whose characteristics are not essential are removed and not used in the next process. The result of this process is the PCA projection matrix, which is calculated using (6).
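The PCA steps above (mean-centering, covariance, eigendecomposition, and retaining the leading eigenvectors) can be sketched as follows. This is a minimal numpy illustration on assumed toy data, not the authors' Matlab implementation:

```python
import numpy as np

def pca_projection(T, n_components):
    """Project rows of T (one image per row) onto the top principal axes."""
    mean = T.mean(axis=0)                    # dataset average
    A = T - mean                             # difference matrix (ATrain)
    S_T = np.cov(A, rowvar=False)            # covariance / total scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S_T)   # eigh: S_T is symmetric
    order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
    V = eigvecs[:, order[:n_components]]     # keep the leading eigenvectors
    return A @ V                             # PCA-projected data (PCA_Train)

rng = np.random.default_rng(0)
T = rng.normal(size=(20, 8))                 # 20 toy "images", 8 features each
P = pca_projection(T, 3)
print(P.shape)                               # (20, 3)
```

Only eigenvectors with the largest eigenvalues are kept, which is the feature-reduction step that produces the low-dimensional PCA projection matrix.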
The projection matrix PCA_Train is used to calculate the projection matrix LDA. The scatter matrix of the within-class distribution (S W ) and the scatter matrix of the between-class distribution (S B ) are defined as (7) and (8).
iJET - Vol. 12, No. 2, 2017

where c is the number of classes and N_i is the number of data in class A_i, while μ_i is the average value of each class and x_i is the PCA_Train data taken from each class.
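A minimal sketch of the within-class and between-class scatter matrices S_W and S_B defined in (7) and (8), using hypothetical toy data:

```python
import numpy as np

def scatter_matrices(X, labels):
    """Compute within-class (S_W) and between-class (S_B) scatter matrices."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]                  # data belonging to class c
        mu_c = Xc.mean(axis=0)               # class mean
        diff = Xc - mu_c
        S_W += diff.T @ diff                 # spread within the class
        m = (mu_c - overall_mean).reshape(-1, 1)
        S_B += len(Xc) * (m @ m.T)           # spread of class means

    return S_W, S_B

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 1, 1])
S_W, S_B = scatter_matrices(X, labels)
print(S_W.shape, S_B.shape)                  # (2, 2) (2, 2)
```

LDA then seeks directions that maximize S_B relative to S_W, which is what separates the classes linearly.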
The LDA projection matrix is used as the dataset in the clustering process using K-Means. The K-Means algorithm clusters n data points, based on specific attributes, into k partitions, where k < n [11]. The first step of the clustering process using the K-Means algorithm is to determine the number of clusters. The next step is determining the centroid values. In the initial iteration, the centroid values are determined randomly; in subsequent iterations, the centroid values are determined by calculating the average value of each cluster using (9).
where c_ij is the centroid of the i-th cluster for the j-th variable, n_i is the number of data in the i-th cluster, and x_kj is the k-th data point for the j-th variable.
Calculation of the distance between the centroid and each of the data uses a Euclidean distance method as shown in (10).
where d_e is the Euclidean distance, i is the index of the data, (x, y) are the data coordinates, and (s, t) are the centroid coordinates.
Data are grouped based on the minimum Euclidean distance. This process is repeated until the centroid values are fixed and cluster members no longer move to another cluster. Each cluster then consists of data that are more similar to one another than to the members of other clusters.
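The K-Means procedure described above (random initial centroids, Euclidean assignment, centroid update as cluster means, stopping when members no longer move) can be sketched as:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch: random initial centroids, Euclidean
    assignment, centroid update as cluster means, stop when stable."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assign each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                          # members no longer move: converged
        labels = new_labels
        # Recompute each centroid as the mean of its cluster members.
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return labels, centroids

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centroids = kmeans(X, k=2)
print(labels)
```

On the toy data the two tight pairs of points end up in separate clusters, mirroring the convergence criterion described in the text.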
The quality of the clustering results is measured using the Sum of Squared Error (SSE), as shown in (11). A smaller SSE value indicates better clustering quality [14].
where k is the number of clusters, p is a data point belonging to cluster C_i, and dist(p, m_i) is the distance from data point p to the centroid m_i of the i-th cluster.

Fig. 3. The viseme class models for consonant phonemes
The quality of a cluster can also be seen from the comparison of between-class variation (BCV) and within-class variation (WCV). BCV is the average of the distances between the centroids, whereas WCV is equal to the Sum of Squared Error [12]. A greater ratio of BCV to WCV indicates better clustering quality. The ratio of BCV to WCV is calculated using (12).
In (12), BCV is the average distance among the centroids. Table 4 shows the calculated SSE values and BCV/WCV ratios from several experiments [10]. The best cluster quality is obtained at k = 9, with the smallest SSE value and the greatest BCV/WCV ratio. The classes formed by the clustering process are used as the basis for establishing the viseme class models for the consonant phonemes, as shown in Fig. 3.
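The two cluster-quality measures, SSE as in (11) and the BCV/WCV ratio as in (12), can be sketched as follows (toy data for illustration only):

```python
import numpy as np
from itertools import combinations

def sse(X, labels, centroids):
    """Sum of squared distances of each point to its cluster centroid."""
    return sum(np.sum((X[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))

def bcv_wcv_ratio(X, labels, centroids):
    """BCV: mean pairwise centroid distance; WCV: SSE. Larger is better."""
    pair_dists = [np.linalg.norm(a - b)
                  for a, b in combinations(centroids, 2)]
    bcv = np.mean(pair_dists)
    wcv = sse(X, labels, centroids)
    return bcv / wcv

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
print(sse(X, labels, centroids))            # 1.0
print(bcv_wcv_ratio(X, labels, centroids))
```

Compact clusters shrink the SSE (WCV) while well-separated centroids grow the BCV, so the best k maximizes the ratio, as reported for k = 9.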

Stringing visemes using Syllable Concatenation Approach
The data used in developing the text-to-audiovisual system consist of a speech database covering all syllables in Indonesian, viseme class models for consonant phonemes, viseme models for vowel phonemes ('a', 'i', 'u', 'e', 'o', 'é'), and viseme models for diphthong phonemes ('ai', 'au', 'oi'). The visualization of the mouth movements of syllable pronunciation can be generated by stringing visemes together. In this study, we use a syllable concatenation approach to concatenate the visemes. The syllable concatenation approach is a method of developing text-to-speech or text-to-audiovisual systems based on syllables as the smallest unit of the speech database [15]. The visual units are generated by stringing together the viseme of each phoneme to form the visualization of the mouth movements of every syllable. This unit is then synchronized with the speech database to produce the audiovisual system. The viseme class model obtained from the clustering process on consonant phonemes is used as the basis for the process of stringing visemes. The formation of the viseme classes aims to reduce the number of mouth-shape visualization variants for each phoneme. Several different phonemes can be visualized by the same viseme; for example, the phonemes 'b', 'p', 'm' are visualized by viseme class #2 (see Fig. 3). One viseme class is a visual representation of one phoneme or of several different phonemes. Fig. 4 shows the visualization result of stringing visemes for the syllables 'ma-ta' and 'pa-da'. Although the syllables begin with different phonemes, i.e., 'm', 't', 'p', and 'd', the visualization can be represented by the same viseme.

Text-to-Syllable Conversion
A text-to-syllable conversion is a process of separating text into words and splitting words into syllables. Generally, input texts are not well structured, so a preprocessing step is needed. The preprocessing step consists of case folding and normalization. Case folding changes all letters in the text document to lowercase, whereas text normalization removes punctuation characters and changes numbers into series of letters [16].
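The preprocessing step (case folding plus normalization) can be sketched as follows; the digit-to-word expansion shown here is a simplified, hypothetical illustration:

```python
import re

# Assumed Indonesian number words for a toy digit-to-word expansion.
DIGIT_WORDS = {"0": "nol", "1": "satu", "2": "dua", "3": "tiga", "4": "empat",
               "5": "lima", "6": "enam", "7": "tujuh", "8": "delapan",
               "9": "sembilan"}

def preprocess(text):
    text = text.lower()                                  # case folding
    text = re.sub(r"[^\w\s]", "", text)                  # remove punctuation
    # Spell out digits as words (digit-by-digit, for illustration only).
    text = re.sub(r"\d", lambda m: DIGIT_WORDS[m.group()] + " ", text)
    return " ".join(text.split())                        # collapse whitespace

print(preprocess("Saya membeli 2 buku!"))  # saya membeli dua buku
```

A production normalizer would expand multi-digit numbers as whole number words rather than digit by digit; the sketch only shows where normalization sits in the pipeline.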
A word-to-syllable conversion can use simple conversion rules implemented with a conversion table containing the patterns of syllables in Indonesian. The next process converts syllables into phonemes to generate phoneme codes and the duration and pitch values of each phoneme. The syllable-to-phoneme conversion, in general, can be seen in Fig. 5 [18].
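A word-to-syllable conversion along these lines can be sketched with a simplified rule set (V-CV and VC-CV splits only; real Indonesian syllabification would also need the conversion table's handling of digraphs and consonant clusters):

```python
VOWELS = set("aiueo")

def syllabify(word):
    """Toy Indonesian syllabifier: V-CV and VC-CV splits, no digraphs."""
    syllables, start = [], 0
    i = 0
    while i < len(word):
        if word[i] in VOWELS:
            # Count consonants between this vowel and the next one.
            j = i + 1
            while j < len(word) and word[j] not in VOWELS:
                j += 1
            if j == len(word):
                break                 # trailing consonants join this syllable
            n_cons = j - i - 1
            # V-CV: break right after the vowel; VC-CV: keep one consonant.
            cut = i + 1 if n_cons <= 1 else i + n_cons
            syllables.append(word[start:cut])
            start = cut
            i = j
        else:
            i += 1
    syllables.append(word[start:])
    return syllables

print(syllabify("cepat"))   # ['ce', 'pat']
print(syllabify("mata"))    # ['ma', 'ta']
```

Each resulting syllable contains exactly one vowel, matching the V/CV/CVC patterns described above.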
Generally, words in Indonesian can be converted into phonemes with simple rules. However, there are some irregular conditions; for example, the letter 'e' can be pronounced as 'e' or 'é', so it must be converted into different phonemes under different conditions. This conversion requires a conditional process that takes into account the series of letters before and after that meet certain requirements, so that the correct phonemes can be obtained. This condition can be formulated as in (13).

Fig. 5. Text-to-phonemes conversion
The duration value of each phoneme is obtained from this process. It is used to determine the number of frames when stringing together the pronunciation visualization.

The Synchronization Process
The synchronization process aims to align the viseme models, speech, and syllables, and to create a visual flow of the series of frames undergoing viseme transitions [17]. One important part of the synchronization process is stringing visemes together. The duration value of each phoneme is one of the factors that determines the number of frames for each phoneme in the viseme-stringing process. The number of frames for every phoneme affects the results of the pronunciation visualization. In this study, we use (14) to calculate the number of frames for each phoneme. The duration value of each phoneme differs between the end and the beginning of the pronunciation of certain phonemes. These duration values can be obtained from feature files extracted from the speech pronunciations of the phonemes.
where FeP is the number of frames per phoneme, dur_end is the duration value at the end of the pronunciation of a certain phoneme, dur_beginning is the duration value at the beginning of the pronunciation of that phoneme, and frame_rate is the frame rate used in developing the animation. For example, if the duration values at the end and beginning of the pronunciation of the phoneme 'a' are 425 ms and 355 ms respectively, and the frame rate is 24 fps, then the number of frames is 2.9 (rounded to 3). The implementation of the number of frames of each phoneme in making the animation is illustrated in Fig. 6. The duration values of the phonemes in a syllable are combined, so a smoother visualization of the syllable pronunciation can be generated. Fig. 7 shows an illustration of the synchronization process and the stringing of visemes to form the pronunciation visualization of a syllable based on the text input.
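The worked example can be reproduced as follows; the unit handling of (14) is inferred from the paper's own numbers (425 ms, 355 ms, and 24 fps yielding 2.9 frames), so treat the formula as a reconstruction rather than the authors' exact code:

```python
def frames_per_phoneme(dur_end_ms, dur_begin_ms, frame_rate):
    """Number of frames per phoneme, reconstructed from the paper's example:
    (425 - 355) / 24 = 2.9, rounded to 3 frames."""
    return round((dur_end_ms - dur_begin_ms) / frame_rate)

print(frames_per_phoneme(425, 355, 24))  # 3
```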

Experimental Results
The system is tested by entering 10 sentences in Indonesian. We observe the correspondence between the spoken syllables, the visualized mouth movements, and the output speech. The Indonesian texts entered in the experiment are listed in Table 5. The input texts consist of syllable patterns that encompass all syllable patterns in Indonesian.
The experimental results show that the system can visualize the pronunciation of phonemes and syllables precisely and smoothly. The animated transitions between the mouth shapes of the phonemes and syllables are presented subtly, and the visualizations of the mouth shapes do not look disjointed from one another, as illustrated in Fig. 8. This system can be used by learners of Indonesian, including foreign learners. In general, learners are divided into beginner, intermediate, and advanced levels [19]. At the beginner level, learners can use this system to visualize words of greeting and simple active, passive, negative, and prepositional sentences, while at the intermediate and advanced levels, learning is distinguished by the complexity of the sentences. In this study, the visualization of phoneme pronunciation is built based on the syllable concatenation approach. The visualization of a phoneme's pronunciation within a syllable pattern is strongly influenced by the phoneme before or after it. Fig. 8 (b) shows that the visualization of the pronunciation of the phonemes 'b' and 'r' is influenced by the following phonemes. As a result, the syllable pronunciation visualization looks smoother, because the transition between phoneme visualizations does not occur as drastically as in the Indonesian text-to-audiovisual synthesis system based on a phoneme speech database [6][20]. This method is an implementation of the dynamic visualization of phoneme pronunciation. The system is tested on 30 respondents to measure the degree of correspondence between the pronunciation visualization and the speech. The respondents are people who understand Indonesian phonology, namely students and lecturers of Indonesian language departments. The respondents evaluate the degree of correspondence between the pronunciation visualization and the speech by assigning the criterion values shown in Table 6.
The assessment results of all respondents are shown in Table 7. They are averaged using the MOS (mean opinion score) as in (15). The result of the MOS calculation on the assessment data from all respondents is 4.283. This shows that the degree of correspondence between the pronunciation visualization and the speech of this system is good.
where x_i is the i-th sample value, k is the number of weights, and N is the number of respondents.
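The MOS averaging of (15) can be reproduced for a hypothetical set of ratings (the 4.283 figure reported above comes from the actual 30-respondent data, which is not shown here):

```python
def mean_opinion_score(ratings):
    """MOS: arithmetic mean of all respondents' ratings (1-5 scale)."""
    return sum(ratings) / len(ratings)

ratings = [5, 4, 4, 5, 3, 4]                  # toy ratings, not the paper's data
print(round(mean_opinion_score(ratings), 3))  # 4.167
```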
In this experiment, we also compare the testing results between this system and the Indonesian text-to-audiovisual system that uses a phoneme-based speech database from our previous study [6]. Fig. 9 illustrates the testing results of the Indonesian text-to-audiovisual systems based on both the phoneme-based speech database and the syllable-based speech database. The test results show that, in general, the Indonesian text-to-audiovisual system based on the syllable speech database produces a better degree of correspondence in the pronunciation visualization.

Fig. 9. Comparison of the testing results of the Indonesian text-to-audiovisual system based on the syllable speech database and the Indonesian text-to-audiovisual system based on the phoneme speech database