Multimodal Interaction System for Home Appliances Control

This paper proposes a way to control home appliances using a multimodal interaction system with speech, gestures, and a smartphone application. The Kinect sensor is used to capture Indonesian speech and gestures from users. The dialogue system, speech recognition, and gesture recognition are processed with a finite state machine, Google Cloud Speech, and K-Means clustering, respectively. Users can also use the smartphone application to remotely control home appliances through mobile devices that are connected directly to the real-time database. There are two output responses from this system: an audio response generator that provides feedback to the user through the computer speaker, and an action to control home appliances using the ESP8266. The average accuracy levels of interaction using the dialogue system and gestures are 92.5% and 79.25%, respectively. Interaction using the dialogue system is better than using gestures. Smartphone applications can control home appliances properly.

Keywords—Multimodal interaction, speech recognition, gesture recognition, smartphone application, home appliances.


Introduction
Humans communicating with each other do not rely on speech alone; they use different modes or ways, such as gestures, hand expressions (sign language), facial expressions (gaze/eye movements), touch screens, keyboards, or pointing devices [1]. The rapid development of technology plays a role in increasing the comfort of home dwellers. Almost all home appliances use electricity, so they can be controlled automatically using Internet of Things technology [2].
Today's home automation systems are widely used and popular. Home automation, also known as domotics, is building automation for a home [3]. Home automation systems control entertainment systems, climate, lighting, and other electric home appliances [4]. Smart home technology offers a new opportunity to improve people's comfort with computing technology that provides enhanced communication through a variety of multimodal inputs, specifically speech, gestures, and mobile applications. This communication is translated into actions that help the smart home system complete its tasks. To design a multimodal interface, users must be given two things: a way to instruct the smart home system using a dialogue system, and feedback about what action is taken on the home appliances [5].
Dialogue is a conversation between two or more people, oral or written, that aims to exchange and share information or to resolve an argument. Dialogue between humans and a system is called a dialogue system, where the system can interact with humans in natural language [6]. Dialogue management is important for a dialogue system to set the flow of communication between users and the system in natural language. A challenge in natural language processing (NLP) is relating the speech to the intent behind the user's desires. Building a dialogue management system is challenging work in speech recognition technology for understanding speech from users. This research uses dialogue in the Indonesian language. The system for understanding words in the flow of the dialogue relies on artificial intelligence (AI) with a machine learning approach that depends on data or information. Dialogue classification, that is, the decision made to understand intent, is very important for understanding the user's speech in a dialogue [7].

A multimodal interaction system for home appliances control has been successfully developed. The multimodal inputs used are speech, gestures, and a smartphone application, making it easy for users to control all of their home appliances. This smart home system is also equipped with a dialogue system so humans can communicate their intents, such as controlling the room temperature and lights. The novelty of this research is the multimodal interaction algorithm using a smartphone application, speech, and gesture recognition, equipped with a dialogue system so that machines can interact with humans.

Related Work
The development of interaction systems for home automation is advancing very fast. Existing home automation technology allows people to control home appliances using computers connected to the local network. However, the real challenge is to control home appliances naturally and comfortably, giving users greater freedom and flexibility.
Home automation interaction systems can use smartphone applications to remotely control home appliances through mobile devices such as tablets or smartphones [4][8]. Home automation can also use speech control systems, where speech is converted to text using automatic speech recognition such as the Microsoft Speech API or Google Cloud Speech API and then sent to the server using Wi-Fi. The server system converts the incoming text data into a form that can be used to handle home appliances [9][10]. Another interaction system uses hand gesture recognition [11] or combines speech and gestures to control home appliances [12]. This research proposes a new way to control home appliances using multimodal interaction such as speech recognition, hand gesture recognition, and a smartphone application equipped with a dialogue system, so users can interact with the smart home system. The comparison of related work on interaction systems for home appliances control is described in Table 1.

Figure 1 describes the overall architecture proposed for the multimodal interaction to control home appliances. The Microsoft Kinect 2.0 is used as a sensor to capture speech and gestures from the user. The smartphone application is used to remotely control home appliances using mobile devices such as tablets or smartphones. Speech recognition is processed with the Google Cloud Speech API integrated into Microsoft Visual Studio 2015. Gesture recognition is processed with the K-Means clustering method using C# code. Natural language understanding and the dialogue system are processed with the finite state machine method using Python code. Users can also use the smartphone application to remotely control home appliances through mobile devices such as tablets or smartphones that are connected directly to the real-time database. There are two output responses from this system: an audio response generator that provides feedback to the user through the sound of the computer speaker, and an action to control home appliances using the ESP8266 (NodeMCU).

Kinect v2 sensor
Many researchers and practitioners in robotics, electronic engineering, and computer science use the Kinect to develop new ways to interact with computers or machines. The Kinect Software Development Kit (SDK) has the potential to change human-machine interactions in various industries, such as education, health, retail, transportation, and so on [13]. The second generation, the Microsoft Kinect v2, was released in 2013. The Kinect v2 utilizes the time-of-flight (ToF) principle and offers a larger field of view and resolution than the Kinect v1 [14]. Figure 2 shows the Kinect v2 sensor. The Kinect v2 has three main components that work together for the gesture recognition process. The RGB camera captures the image in front of it. In Kinect version 2, the captured image has a greater resolution than in version 1, namely 1920 x 1080 pixels. The IR emitters transmit infrared particles onto the object in front of them, and the depth sensor reads the distance of each particle on the object's surface. From this reading, the coordinates of objects in 3D space can also be obtained, including the detection of each point of a person's skeleton. The detected skeleton can then be mapped onto the captured image to fit the actual position. In Kinect version 2, the data processing is faster and the accuracy is higher than in version 1 [16].

Speech recognition using Google Cloud Speech
Speech recognition is the process of converting human voice signals into written language (text). Speech recognition is the most important part of this dialogue system, because reliable speech recognition makes it possible to conduct a dialogue that follows the scenario. Cloud API development has been growing; one example is the Google Cloud Speech API, which currently supports 120 languages, including Indonesian. We chose Google because it is the most popular search engine and provides a freely accessible API for its cloud-based speech recognizer [17]. The Google Cloud Speech API (Application Programming Interface) is a speech-to-text service provided by the largest search engine, Google, which developers can integrate into their applications. A cloud API determines how application software can interact with a cloud-based platform [18]. Google Cloud Speech services recognize voice transmitted in requests and can also work with audio stored on Google Cloud Storage. Google Cloud Speech applies neural network algorithms to recognize the user's voice with high accuracy. Google's speech accuracy itself increases over time as Google improves its speech recognition technology and internal software. The method used by the Google Cloud Speech API is synchronous recognition: audio data in the form of a wav file are sent to the Speech-to-Text API, which recognizes the data and returns the results after all of the audio is processed [19]. Figure 3 shows the speech recognition architecture using the Google Cloud Speech API. Recorded voice data are processed through software on a personal computer (PC). Sound files are recorded in real time and transmitted to the Google Cloud server; after the Google Cloud Speech platform recognizes the sound from the received sound package, it sends the converted text back to the user [20].
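As a minimal sketch of this synchronous recognition flow (the paper integrates the API through Microsoft Visual Studio; the Python client library, file name, sample rate, and encoding below are assumptions used for illustration only):

```python
# Hypothetical sketch: synchronous Speech-to-Text recognition of an
# Indonesian wav file with the google-cloud-speech client library (v2.x).
from google.cloud import speech

client = speech.SpeechClient()

with open("perintah.wav", "rb") as f:           # assumed recording of a spoken command
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,  # assumed 16-bit PCM wav
    sample_rate_hertz=16000,                                   # assumed sample rate
    language_code="id-ID",                                     # Indonesian
)

# Synchronous recognition: the transcript is returned only after all audio is processed.
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```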

Natural Language Understanding (NLU)
The purpose of NLU systems is to enable communication between humans (users) and machines. The NLU system identifies the user's intention from natural language by extracting words that contain information and issuing queries to the back-end database to fulfill user requests [21].
The processes conducted in the NLU system are stemming, slot filling, and understanding intent using rule-based methods. Stemming is the processing of a sentence to obtain the root word by separating each word into its base word, prefix, and suffix. For example, the words "closes", "closed", and "closer" are clustered with the stem "close". Indonesian stemming is more complex than English stemming, so the order of the stemming rules requires careful consideration [22].
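As an illustrative sketch only (the affix lists and root-word dictionary below are simplified assumptions, not the stemmer used in this research), a rule-based Indonesian stemmer that strips suffixes before prefixes could look like this:

```python
# Minimal rule-based Indonesian stemming sketch (illustrative assumptions only).
ROOT_WORDS = {"hidup", "mati", "naik", "turun", "suhu", "lampu"}  # tiny assumed dictionary
PREFIXES = ["meng", "men", "me", "di", "ter", "ke"]               # common Indonesian prefixes
SUFFIXES = ["kan", "an", "i", "nya"]                              # common Indonesian suffixes

def stem(word: str) -> str:
    """Strip suffixes, then prefixes, until a known root word is found."""
    if word in ROOT_WORDS:
        return word
    for suf in SUFFIXES:                       # suffix first, e.g. "hidupkan" -> "hidup"
        if word.endswith(suf) and word[:-len(suf)] in ROOT_WORDS:
            return word[:-len(suf)]
    for pre in PREFIXES:                       # then prefixes, e.g. "menaikkan" -> "naikkan"
        if word.startswith(pre):
            rest = word[len(pre):]
            if rest in ROOT_WORDS:
                return rest
            for suf in SUFFIXES:               # prefix + suffix, e.g. "naikkan" -> "naik"
                if rest.endswith(suf) and rest[:-len(suf)] in ROOT_WORDS:
                    return rest[:-len(suf)]
    return word                                # fall back to the original word

print(stem("hidupkan"))    # -> "hidup"
print(stem("menaikkan"))   # -> "naik"
```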
After the stemming process is done, each word is labeled with a part of speech based on the root words in the Indonesian dictionary corpus. The number of root words used in this research corpus is 28,526 words. The part of speech in Indonesian has seven types, namely nouns, verbs, adjectives, pronouns, adverbs, number words, and task words. After labeling the part of speech, the next step is slot filling. The main task of language understanding is to automatically classify the domain of a user request along with the domain-specific intent and fill in a set of slots to form the sentence meaning (semantics). The popular IOB (in-out-begin) format is used to represent sentence slot tags [23], as shown in Figure 4. After slot filling is done, the user's intent is understood using rule-based methods.
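For illustration, one of the commands used later in this paper, "Naikkan suhu AC 2 derajat", could be tagged in IOB format as sketched below; the slot label names (verb, attribute, appliance, number) are assumptions based on the slot types described in this paper:

```python
# Illustrative IOB slot tags for "naikkan suhu ac 2 derajat"
# ("increase the AC temperature by 2 degrees"); label names are assumptions.
tokens   = ["naikkan", "suhu",        "ac",          "2",        "derajat"]
iob_tags = ["B-verb",  "B-attribute", "B-appliance", "B-number", "O"]

def iob_to_slots(tokens, iob_tags):
    """Collect B-/I- tagged token spans into a slot dictionary."""
    slots, label, words = {}, None, []
    for token, tag in zip(tokens, iob_tags):
        if tag.startswith("B-"):                   # a new slot span begins
            if label:
                slots[label] = " ".join(words)
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label:       # continuation of the current span
            words.append(token)
        else:                                      # "O": outside any slot
            if label:
                slots[label] = " ".join(words)
            label, words = None, []
    if label:
        slots[label] = " ".join(words)
    return slots

print(iob_to_slots(tokens, iob_tags))
# {'verb': 'naikkan', 'attribute': 'suhu', 'appliance': 'ac', 'number': '2'}
```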

Dialogue system using Finite State Machine (FSM)
FSM is used to model control and sequencing processes in a system with a finite number of states. In particular, the actions of the system depend not only on the current input to the system but also on what happened earlier in the system. State machines are very important for specifying systems whose behavior depends on significant circumstances [24]. Figure 5 shows the structure of the dialogue scenario. The dialogue system uses a finite state machine driven by user requests. The system continues from the current state and responds based on the previous transition and circumstances [25]. This dialogue system has text processing to understand the intent of the user. The system filters certain words to recognize them as slots. The dialogue system requires input slots in the form of verbs, appliances, adverbs, attributes, or adjectives to be able to move from one state to another. The idle state is the initiation state, entered when the program is newly run or a command has been successfully executed. After the slots are filled, the FSM can handle direct or indirect commands (a minimal sketch of this state handling follows the examples below), such as:
1. Direct command:
   a) "Tolong hidupkan lampu (Please turn on the light)." Verb slot: hidupkan (ON); appliance slot: lampu (lamp).
   b) "Naikkan suhu AC 2 derajat (Increase the AC temperature by 2 degrees)." Verb slot: naikkan (increase); appliance slot: AC; attribute slot: suhu (temperature); number: 2.
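The following is a minimal finite-state-machine sketch of such handling; the state names, transitions, and replies are illustrative assumptions, not the exact scenario of Figure 5:

```python
# Minimal FSM sketch for the dialogue flow; states and replies are assumptions.
class DialogueFSM:
    def __init__(self):
        self.state = "idle"                     # initiation state

    def step(self, slots: dict) -> str:
        if self.state == "idle":
            if "verb" in slots and "appliance" in slots:
                self.state = "execute"          # direct command: all required slots present
                return f"Executing: {slots['verb']} {slots['appliance']}"
            if "appliance" in slots:
                self.state = "ask_action"       # indirect command: ask what to do
                return f"What should I do with the {slots['appliance']}?"
            return "Sorry, I did not understand."
        if self.state == "ask_action":
            if "verb" in slots:
                self.state = "execute"
                return f"Executing: {slots['verb']}"
            return "Please tell me the action."
        if self.state == "execute":
            self.state = "idle"                 # return to idle after the command is done
            return "Command completed."
        return "Unknown state."

fsm = DialogueFSM()
print(fsm.step({"verb": "hidupkan", "appliance": "lampu"}))  # direct command example
```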

Gesture recognition using K-Means clustering
The Kinect sensor captures skeletal joints on the right hand and left hand. These skeletal joints are divided into three parts, namely the thumb, elbow, and wrist, each having a position value for each axis (x, y, z). The data of each joint are then statistically processed to obtain the average, variance, sum, and median. The K-means algorithm computes the Euclidean distance so that each joint movement can be recognized from the clustering results.
Clustering is a data analysis method that groups data with the same characteristics into the same area and data with distinct characteristics into distinct areas [26]. Figure 6 shows the proposed gesture recognition system using K-Means clustering.

Fig. 6. Proposed gesture recognition system using K-Means Clustering
A cluster is considered in terms of both its individual points and its centroid or mean. The centroid represents all points in the cluster if the data points gather around the centroid. The standard measure of the spread of a group of points around its average is the variance, that is, the sum of squared distances between each point and the average. If the data points are close to the average, the variance is small. A generalization of variance, in which the centroid is replaced by a reference point that may or may not be a centroid, is used in cluster analysis to indicate the overall partition quality [27].
Skeletal joint position values for each axis (x, y, z) are obtained from the Microsoft SDK, which comes with a hand utility named Coordinate-Mapper. Coordinate-Mapper is used to identify how a point is located in 3D space. As input data, we compute the difference between each pair of axes; formulas (1), (2), and (3) are used to find the distance between the axes so that no two values are equal. These values are then used to calculate statistical data, namely the average, variance, sum, and median. This applies to every skeletal joint of the right hand and left hand, i.e., the thumb, wrist, and elbow, as shown in Figure 7. After obtaining the distance value of each axis, the mean, variance, sum, and median are calculated from the statistical data obtained [28].
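As an illustrative sketch of this per-joint feature computation, assuming for the purpose of the example that Eqs. (1)-(3) are pairwise differences of the x, y, and z coordinates (the original formulas are not reproduced here, so this is an assumption):

```python
# Sketch of per-joint statistical features (mean, variance, sum, median).
# Treating Eqs. (1)-(3) as pairwise axis differences is an assumption.
import numpy as np

def joint_features(xyz: np.ndarray) -> np.ndarray:
    """xyz: (n_frames, 3) array of x, y, z positions of one skeletal joint."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    diffs = np.concatenate([x - y, y - z, x - z])   # assumed Eqs. (1)-(3)
    return np.array([diffs.mean(), diffs.var(), diffs.sum(), np.median(diffs)])

# Example: stacking the features of three joints (thumb, wrist, elbow)
# gives one 12-column row of the gesture feature matrix described below.
trajectory = np.random.rand(30, 3)                  # fake joint trajectory (30 frames)
row = np.concatenate([joint_features(trajectory) for _ in range(3)])
print(row.shape)                                    # (12,)
```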
The K-means unsupervised learning method is used to recognize hand gestures. It requires motion data for four gesture commands. Each movement has four joint axes (x, y, z) for the right hand and four joint axes (x, y, z) for the left hand, and each joint has four features (mean, variance, sum, and median). This yields a matrix of 12 columns and 40 rows, which is the input for the clustering process.
The data are arranged by entering the statistical data of each axis for the first movement through the fourth movement. Each movement is sampled ten times to obtain the raw clustering data. K-Means clustering is applied to hand gesture recognition with four clusters based on the (x, y, z) axes obtained from the Kinect hand joints. K-Means clustering minimizes the within-cluster sum of squares [29]:

$\arg\min_{S}\sum_{i=1}^{k}\sum_{x\in S_i}\lVert x-\mu_i\rVert^{2}$ (14)

This gesture recognition system runs in real time using the Microsoft Kinect v2, and K-Means clustering is applied whenever there is an input data source, with K = 4. The centroids are initialized to 0 and the maximum number of iterations is 1000. The selection of centroid values affects the outcome of clustering: different initial centroid values give different results. The next step is to compute the distance of each data point to the nearest centroid. The distance of a data point to a centroid in two dimensions is defined as [30]

$d(x,c)=\sqrt{(x_{1}-c_{1})^{2}+(x_{2}-c_{2})^{2}}$,

where $x$ is the input source data and $c$ is the centroid point. The distance between the input source data and the centroid of each cluster is computed, and each point is assigned to the nearest cluster to obtain new cluster means. The centroid of each cluster is then recalculated as the mean of its members, and this is repeated until the members of each cluster no longer change (convergence); the steps then stop and the clustering results are obtained. Figure 8 shows the flowchart of gesture recognition using K-Means clustering for home appliances control using the Kinect v2.
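As a minimal clustering sketch under these settings (the paper implements the algorithm in C#; scikit-learn is used here purely for illustration, and the feature matrix is fabricated):

```python
# Sketch: cluster 40 gesture feature rows (4 gestures x 10 samples, 12 features)
# into K = 4 clusters; K-means minimizes the within-cluster sum of squares, Eq. (14).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(40, 12)                       # fabricated feature matrix

kmeans = KMeans(n_clusters=4, max_iter=1000, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)                   # cluster index of each training sample

def recognize(row: np.ndarray) -> int:
    """Assign a live feature row to the nearest cluster centroid (Euclidean distance)."""
    dists = np.linalg.norm(kmeans.cluster_centers_ - row, axis=1)
    return int(np.argmin(dists))

print(recognize(np.random.rand(12)))             # cluster index mapped to a gesture command
```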

Results and Discussion

Overview of the system
Figure 9 shows the structure of integrating the software and hardware. The hardware system for controlling home appliances uses a personal computer (PC) as the server and the ESP8266 (NodeMCU) as the microcontroller. Commands from the speech and gesture recognition results are sent from the PC to Firebase using Wi-Fi. The action result from Firebase is sent to the ESP8266 using Wi-Fi to control the home appliances.
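A minimal sketch of the PC-to-Firebase path, assuming the firebase-admin Python SDK on the server side (the credential file, database URL, and node names are assumptions; the paper's actual server code is not shown here):

```python
# Hypothetical sketch: write a recognized command to the Firebase Realtime
# Database; the NodeMCU (ESP8266) would listen on the same node to switch
# the appliance. Credential file, URL, and node layout are assumptions.
import firebase_admin
from firebase_admin import credentials, db

cred = credentials.Certificate("serviceAccountKey.json")           # assumed key file
firebase_admin.initialize_app(cred, {
    "databaseURL": "https://smart-home-example.firebaseio.com/",   # assumed database URL
})

def send_command(appliance: str, state: str) -> None:
    """Update the state of one appliance, e.g. the lamp or the AC."""
    db.reference(f"appliances/{appliance}").set(state)

send_command("lampu", "ON")    # e.g. after recognizing "Tolong hidupkan lampu"
```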

Smartphone application
In this research, users can also use the smartphone application to remotely control home appliances through mobile devices such as tablets or smartphones. Figure 12 shows the smartphone application flowchart. The smartphone application is developed to control home appliances that are connected to the NodeMCU (ESP8266) microcontroller. The application connects directly to the real-time database (Firebase) to change the status of home appliances, such as on/off or up/down. Figure 13 displays the features of the application.

Testing and implementation
The test was carried out in three stages: first, testing the speech recognition and dialogue system; second, testing the gesture recognition system; and third, testing the smartphone application. The testing environment is a room that allows for the smallest possible noise disturbance. The intensity of the light in the room matches the general conditions of a room during the day, which is between 300 and 400 lux. The Kinect sensor is placed at a static location, with the user at a distance of about 150 centimeters. The multimodal interaction system was tested by twenty people of different genders, age ranges, and dialects, because Indonesia consists of many cultural tribes with different dialects. The profiles of the respondents are shown in Figure 14. The dialogue system was tested on 12 dialogue scenarios in Indonesian. Each dialogue was tested 10 times for the 12 dialogue sentences, so each respondent spoke 120 dialogues. Table 2 shows the testing scenarios between the user and SITI; SITI is the name of the smart home system. Figure 17 shows the accuracy level from the testing results. The average accuracy of the human-machine dialogue system interaction, from a total of 2,400 tests (20 respondents x 12 dialogues x 10 times), is 92.5%. The highest level of accuracy occurs for the word "daftarkan" (D07), at 96.5%. All test samples pronounced "daftarkan" clearly enough that it can easily be translated by the speech recognition system. The lowest level of accuracy occurs in the dialogue "Halo SITI", at 87.5%. The pronunciation of "SITI" is often translated by the speech recognition system as the word "Kitty".
The gender and dialect of the respondents did not affect the accuracy of the dialogue system significantly. However, the pronunciation of respondents older than 50 years did affect the accuracy of the dialogue system significantly. The respondents older than 50 years, as many as 3 people, had an average accuracy rate of 81.7%, or 10.83% lower than the average accuracy of the entire test sample, which is 92.5%.

Fig. 17. The accuracy level from the dialogue system testing results

Figure 18 shows the gesture recognition dataset using K-Means clustering for the commands: turn on the light, turn off the light, turn on the air conditioner, and turn off the air conditioner.
Each respondent performs every gesture command 10 times to control home appliances: turn on the lights, turn off the lights, turn on the air conditioner, and turn off the air conditioner. The total test data is 800 gestures (20 test samples x 4 gesture commands x 10 times). Figure 19 shows the accuracy level from the testing results for interaction using gestures. The average level of testing accuracy is 79.25%. The highest is 90%, for the gesture command "Turn on the air conditioner". The lowest is 70%, for the gesture command "Turn off the air conditioner". Respondents between 10 and 20 years old, who were shorter than the other respondents, as many as two people, had an average accuracy rate of 65%, or 14.25% lower than the average accuracy of the entire test sample, which is 79.25%. This is because the training data were collected from adults with an age range of 31-40 years old.
Black box testing is used to verify that the smartphone application controls home appliances properly [31]. The results show that every button runs appropriately, as displayed in Table 3.

Comparison with previous work
Table 4 shows the comparison between previous work on interaction systems for home appliances control and this experiment.

Conclusion
The designs and steps that have been developed are expected to serve as a new proposal for controlling home appliances using multimodal interaction such as speech, gestures, and smartphone applications. The average accuracy levels of interaction using the dialogue system and gestures are 92.5% and 79.25%, respectively. Interaction using the dialogue system is better than using gestures. Smartphone applications can control home appliances properly. In the future, face detection and skeleton tracking can be added to the multimodal interaction so that human-machine interactions can run more naturally.