Web Based Recognition and Translation of American Sign Language with CNN and RNN

Individuals with hearing impairment use sign language to exchange their thoughts, generally communicating among themselves through hand movements. However, there are limitations when they communicate with people who cannot understand these hand movements, so a mechanism is needed that can act as a translator between them. It would be easier for these people to interact if a direct infrastructure existed that could convert signs to text and voice messages. In recent years, many such frameworks for sign language recognition have been developed, but most of them handle either static gesture recognition or dynamic gesture recognition alone. Since sentences are generated using combinations of static and dynamic gestures, it would be far more helpful for hearing-impaired individuals if automated systems could detect both static and dynamic gestures together. We propose a design and architecture for American Sign Language (ASL) recognition with convolutional neural networks (CNN). This paper uses a pretrained VGG-16 architecture for static gesture recognition; for dynamic gesture recognition, spatiotemporal features are learnt with a deeper architecture comprising a bidirectional convolutional Long Short-Term Memory network (ConvLSTM) and a 3D convolutional neural network (3DCNN), which is responsible for extracting 2D spatiotemporal feature maps.
Keywords—ASL, CNN, VGG-16, 3DCNN, ConvLSTM


Introduction
Gestures are of utmost importance in the daily life of humans as a form of nonverbal language. From an industrial point of view, they are also important in sign language recognition, virtual reality, Human Robot Interaction (HRI) and Human Computer Interaction (HCI). In this paper, we aim to recognize hand gestures specifically for hearing-impaired people: when they talk with normal people, they usually use hand gestures to communicate, which normal people most of the time cannot understand. There is a pressing need for an aid that allows the hearing impaired to interact with normal people.
Background research shows that considerable groundwork has gone into designing wearable devices that help recognize sign language, for example, hand gloves with flex sensors or accelerometer sensors [1]. A further development in this direction is the use of a webcam and a Kinect as an alternative to wearable devices [2]. However, all the above-mentioned techniques are expensive and cannot be used by everyone. The solution to this problem is a cost-effective system that is accessible to anyone and helps reduce the gap between deaf-mute people and other people. In this paper, we have developed a system that recognizes and understands static and dynamic gestures of American Sign Language (ASL).
We have implemented an ASL recognition system: a real-time system capable of translating real-time video into audio and text, which enables dynamic communication. We had three objectives in mind: 1. Obtain a video of the user performing actions, which serves as the input. 2. Convert each frame of the video to a specific letter for static gesture recognition, and classify a set of continuous frames to a word for dynamic gesture recognition, based on the classification score of the neural network in both cases. 3. Create and display the entire sentence. This problem is a major challenge because of certain issues from a computer vision perspective: • Environmental issues (position of the camera, lighting, sensitivity to the background) • Detecting the boundary of a sign (the end of one sign is the beginning of the next) • Coarticulation (when the preceding or succeeding sign affects the current sign). Although ASL letter recognition has been done by training neural networks in the past, many of those systems need 3D capture, which requires motion-tracking gloves or a Microsoft Kinect. Such solutions are limited in scalability and are not very feasible because of the extra hardware requirements.
Our system contains a video pipeline in which users sign a gesture for a word/number/letter via a web application. For static gesture recognition, we extract individual video frames and generate probabilities of letters (A through Y, excluding J, since the gestures for J and Z require hand movement) and numbers (0 through 9) for each frame with the trained CNN. The letter/number with the highest probability is given as the prediction only if it has held the highest probability for at least 5 seconds. For dynamic gesture recognition, we capture 36 continuous frames, which serve as input to the trained model for prediction.

Related Work
Over the past few decades, a great deal of research has been done on gesture recognition, as it can be used in various application domains such as smart home applications, Human Computer Interaction (HCI), gaming, medical systems, etc. Solutions proposed by different researchers are of two types: hardware-based solutions and software-based solutions.
Hardware-based solutions include gesture recognition using gloves, wristbands, etc. These solutions contain sensors, which are necessary to track hand movements. Google has developed wristbands that recognize gestures by tracking hand movements; the user can hear the recognized word/sentence through a mobile device connected to the wristband [3]. Glove-based solutions have also been developed in recent years. CyberGlove was unable to detect all finger positions associated with ASL gestures because of its limited number of sensors, and for the same reason it could not differentiate between gestures in which wrist positions are almost similar, e.g., R and U, or G and H [4]. In another proposed method, data captured by gloves is sent to a neural network and processed for classification [5]. The InerTouchHand system is proposed for Human Machine Interaction (HMI) and uses distributed inertial sensors and vibro-tactile stimulators [14]. Glove-based systems may give wrong results over time, depending on sensor quality.
Software-based solutions include gesture recognition using Support Vector Machines (SVM), Neural Networks (NN), Hidden Markov Models (HMMs), etc. Software-based solutions require image processing before classifying gesture images. Amazon Alexa is also able to respond to sign language gestures [5]. But in that system, you have to capture yourself repeatedly performing each sign every time you launch the site in the browser, which is a very tedious task, and such systems are not affordable for everyone. Histogram of Oriented Gradients (HOG) and Scale Invariant Feature Transform (SIFT) features are extracted from images of hand gestures and fed to Support Vector Machines (SVM) for training, which is then used to classify new hand gesture images [7]; the authors used a dataset containing images of different orientations for accurate classification. HOG and Local Binary Pattern (LBP) features have been used together to classify hand gestures, and that system attained an accuracy of 92% [8]; however, it takes longer execution and detection time. HOG features along with Principal Component Analysis (PCA) have been used to propose a solution for detecting continuous Indian Sign Language [9]. That system extracts key frames from a real-time continuous stream and classifies only those key frames so as to reduce the classifier request rate. HOG features are not robust against different lighting conditions, and SIFT features are more useful in identification tasks than in classification tasks. Thus, to deal with classification, research has turned to hand gesture recognition using Convolutional Neural Networks (CNN). A Faster-RCNN with five neural network layers has been proposed to classify hand gestures with an accuracy of 99.2% [9]. A 3DCNN has been used to classify continuous dynamic gestures, achieving 83.8% accuracy [11]; that system collected data using color, depth and stereo-IR sensors. Hidden Markov Models (HMMs) are useful for capturing temporal patterns. An HMM with 3-D gloves that track hand gestures was used by Starner and Pentland [12]. This HMM model uses time-series data to identify hand gestures and classifies them by recognizing the hand position in recent frames; they achieved 99.2% accuracy on the test set. A vision-based solution on a Raspberry Pi embedded platform has been used to detect hand gestures of elderly people [15]. That system is trained only on dynamic gestures rather than static gestures, as some elderly people might not be able to keep their hands steady because of physical problems (numbness or shaking hands), and it can classify only 6 hand gestures. A system for recognizing Arabic Sign Language has also been developed [16], but it recognizes only static gestures for the Arabic alphabet, because dynamic hand gestures vary across different Arabic-speaking countries.

Method
In this section, we describe the architecture and training of our models for static and dynamic gesture recognition.

3.1
Architecture of VGG16 model
VGG16 is made up of thirteen convolutional layers, five max-pooling layers, two fully connected layers and one softmax output layer. VGG16 was developed by the Visual Geometry Group (VGG) at Oxford University and used in the 2014 ILSVRC (ImageNet) competition. The VGG16 architecture is shown in Figure 1.
Specifications of the layers from which the VGG16 network is made are as follows: • The input to the network is a 224x224 RGB image • Convolutional layers use 3x3 filters with stride 1 and same padding • Max-pooling layers use 2x2 windows with stride 2 • The two fully connected layers have 4096 nodes each • The softmax layer outputs the final class probabilities

Training VGG16 model for static gesture recognition
In our approach, Convolutional Neural Networks (CNNs) are used for classification of ASL letters from A to Z (except J and Z as they are dynamic gestures) and digits from 0 to 9.
For training the number-recognition model, we used transfer learning. With transfer learning we can take a pretrained model and train it on a more specific dataset to produce specific results. We do this by adjusting some of the weights of the pretrained model and altering or reinitializing the weights of the later layers while training the model on a new dataset. Using this technique, we can train models in less time and with a smaller amount of data. The disadvantage of transfer learning stems from the contrast between the data the model was originally trained on and the new data being classified: when there are larger differences between the original data on which the VGG16 model was trained and the new data we want to classify, it is often necessary to re-initialize weights or increase learning rates for the deeper layers of the network.
In Keras, each layer has a parameter called "trainable". We can set this parameter to False to freeze the weights, indicating that the layer should not be trained. For training the new model on hand gestures for digits, we froze the weights of the first 3 layers and added a new layer with 10 nodes as the final output layer for prediction.
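As a rough illustration, the following Keras sketch freezes the first three layers of a pretrained VGG16 and attaches a new 10-node softmax output for the digit gestures. The 256-unit dense layer, the optimizer and the other hyperparameters are assumptions made for this example, not values taken from our experiments.

```python
# Minimal transfer-learning sketch: freeze early VGG16 layers, add a 10-class head.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze the first 3 layers so their pretrained weights are not updated.
for layer in base.layers[:3]:
    layer.trainable = False

# New classifier head: flatten the feature maps and end in a 10-node softmax.
x = Flatten()(base.output)
x = Dense(256, activation="relu")(x)          # assumed intermediate layer size
outputs = Dense(10, activation="softmax")(x)  # one node per digit gesture 0-9

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```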
Transfer learning is useful for classification tasks where only a small amount of data is available for training a model. Since adequate data of ASL gesture images was available, we did not use transfer learning for the model trained on ASL letter gestures. Instead, we trained the VGG16 model without freezing the weights of any layer of the pretrained model.

Mathematical equation for transfer learning
CNNs can learn features along with the weights corresponding to each feature, which makes transfer learning practical. CNNs use loss functions to optimize parameter values. Here, a softmax-based loss function is used:

$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$

where $f_j$ is the score for class $j$ and $y_i$ is the correct class of the $i$-th sample. This differs from another well-known choice, the SVM loss: an SVM classification head would produce scores for every ASL letter that do not map directly to probabilities. The softmax loss provides these probabilities, and we can use them to classify more accurately with the trained models.
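For illustration, a small numerical sketch of this loss (with made-up class scores) looks as follows.

```python
# Softmax probabilities and cross-entropy loss for one set of class scores.
import numpy as np

scores = np.array([2.0, 1.0, 0.1])   # raw scores f_j from the network (example values)
correct_class = 0                    # index y_i of the true ASL label

# Numerically stable softmax.
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()

# Softmax (cross-entropy) loss: L_i = -log(p_{y_i}).
loss = -np.log(probs[correct_class])
print(probs, loss)
```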

3DCNN-LSTM model
3DCNN-LSTM network: 3D ConvNets are an obvious choice for video classification since they inherently apply convolutions (and max poolings) in 3D space, where the third dimension in our case is time. Long Short-Term Memory networks, usually called "LSTMs", are a special kind of RNN capable of learning long-term dependencies. LSTMs are explicitly designed to avoid the long-term dependency problem; remembering information for long periods of time is practically their default behavior. Short-term spatiotemporal features are learnt with the help of the 3DCNN, while long-term spatiotemporal features are then learnt with the help of the bidirectional convolutional LSTM. Afterwards, based on the learnt 2D long-term spatiotemporal feature maps, higher-level spatiotemporal features are learnt using a 2DCNN for the final gesture recognition.
We propose to first learn short-term spatiotemporal features using a shallow 3DCNN, then learn long-term spatiotemporal features using a bidirectional convolutional LSTM, and finally recognize gestures using a 2DCNN on the basis of the learnt 2D spatiotemporal feature maps. LSTM: Long Short-Term Memory (LSTM) networks are a kind of recurrent neural network (RNN) able to learn long-term dependencies. They were introduced by Hochreiter and Schmidhuber (1997), and were refined and popularized by a number of people in subsequent work.
The LSTM mainly consists of three gates, as shown in Figure 2: 1. Input gate 2. Forget gate 3. Output gate. The above-mentioned gates are simply sigmoid activation functions, so their output lies between 0 and 1, and in the vast majority of cases the output value is close to either 0 or 1.

Graph 1. Sigmoid activation function
Sigmoid functions are used for the gates since we need a gate to produce only positive values, and the function should tell us clearly whether to keep a particular feature or discard it.
"0" signifies the gates are blocking everything.
The gate activations are computed as:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

where $\sigma$ = sigmoid function, $h_{t-1}$ = output of the previous LSTM block (at timestamp t-1), $W_x$ = weights for the respective gate (x) neurons, $x_t$ = input at the current timestamp, $b_x$ = biases for the respective gates (x), $o_t$ = output gate, $f_t$ = forget gate, and $i_t$ = input gate.
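To make the overall architecture concrete, the following Keras sketch wires a shallow 3DCNN, a bidirectional ConvLSTM and a 2DCNN head in the order described above. The filter counts, frame resolution (112x112) and number of dynamic gesture classes are assumptions for the example and not the exact configuration used in our experiments.

```python
# Sketch of the 3DCNN -> bidirectional ConvLSTM -> 2DCNN pipeline for 36-frame clips.
from tensorflow.keras import layers, models

num_classes = 10                                   # assumed number of dynamic gestures
inputs = layers.Input(shape=(36, 112, 112, 3))     # (frames, height, width, channels)

# Shallow 3DCNN: short-term spatiotemporal features.
x = layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu")(inputs)
x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
x = layers.Conv3D(64, (3, 3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling3D(pool_size=(2, 2, 2))(x)

# Bidirectional ConvLSTM: long-term spatiotemporal features.
# return_sequences=False leaves a single 2D feature map per direction.
x = layers.Bidirectional(
    layers.ConvLSTM2D(64, (3, 3), padding="same", return_sequences=False)
)(x)

# 2DCNN on the learnt 2D spatiotemporal feature maps, then classification.
x = layers.Conv2D(128, (3, 3), padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```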

Dataset
ASL consists of a set of 26 signs for letters from A to Z, as shown in Figure 3. We have implemented ASL recognition and number recognition (numbers 0 to 9) using a web application. For static gesture recognition, the dataset contains 26 classes: gestures for the 24 letters A to Y (excluding J) in American Sign Language along with gestures for "space" and "del". This dataset was downloaded from Kaggle and contains a total of 78000 images, with 3000 images per class. The dataset for gestures of numbers 0 to 9 consists of a total of 2050 images, with 205 images per class; it was also downloaded from Kaggle.

Proposed System
The system's frontend consists solely of HTML and JavaScript, and a Flask app written in Python works as a server running in the background; thus, the system's backend consists of a Python script. The Flask app containing the models for static gesture recognition runs continuously in the background on port 5000. The trained VGG16 models for ASL letters as well as for hand gestures of numbers are loaded inside this Flask app, and when an image is sent from the web app to the server, the label of the gesture with the highest probability is returned to the user.
The Flask app containing the model for dynamic gesture recognition runs continuously on port 8000; a neural network consisting of the 3DCNN and LSTM is loaded inside this Flask app, and when 36 continuous frames are sent from the web app, the label of the gesture with the highest probability is returned to the user. The Flask apps use the Keras, TensorFlow, Matplotlib, OpenCV and NumPy libraries and some sub-packages of these libraries.
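A minimal sketch of the static-gesture Flask server on port 5000 is shown below. The endpoint name, request format, model file names and the class-label ordering are assumptions made for illustration; only the overall structure (models loaded once, prediction returned per request) reflects the system described above.

```python
# Sketch of the Flask app serving the static-gesture VGG16 model on port 5000.
import numpy as np
from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model

app = Flask(__name__)
letter_model = load_model("vgg16_asl_letters.h5")   # assumed file name
LETTER_LABELS = list("ABCDEFGHIKLMNOPQRSTUVWXY") + ["space", "del"]  # assumed order

@app.route("/predict", methods=["POST"])
def predict():
    # The web app sends a preprocessed 224x224x3 frame as a flat list of pixels.
    frame = np.array(request.json["image"], dtype="float32").reshape(1, 224, 224, 3)
    probs = letter_model.predict(frame)[0]
    return jsonify({"label": LETTER_LABELS[int(np.argmax(probs))],
                    "probability": float(np.max(probs))})

if __name__ == "__main__":
    app.run(port=5000)
```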
Following are the steps for sign language recognition: 1. Capture hand gesture of user and preprocess image 2. Feature Extraction and classification using trained model

Capture hand gesture of user and preprocess image
A webcam is used for capturing hand gestures. The webcam takes color pictures, which are then converted into grayscale format; the main reason for sticking to grayscale is the extra amount of processing required to deal with color images. Most gesture-capturing software uses external hardware such as depth sensors to detect hand motion. However, these frameworks are sometimes slower and their space consumption can be higher. To overcome this, we defined a rectangular region in the frame in which the user has to place his/her hand and perform the gestures.
The web app opens in static mode by default. If we want a prediction for a number gesture, we have to switch to that mode by clicking the button named Number. Starting the web app enables the webcam to capture images at intervals of 1 second in static mode. Each image is sent to the server running on port 5000 for prediction, resized to 224x224, and converted to a NumPy array before being passed to the trained VGG16 model.
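A rough sketch of this preprocessing with OpenCV and NumPy is given below. The ROI coordinates and the channel replication are assumptions of the sketch (the paper specifies only grayscale conversion, the 224x224 resize and the NumPy conversion).

```python
# Preprocess one webcam frame for the static-gesture VGG16 model.
import cv2
import numpy as np

def preprocess(frame):
    """frame: BGR image from the webcam. Returns a (1, 224, 224, 3) array."""
    roi = frame[100:400, 100:400]                 # rectangular region where the hand is placed (assumed coordinates)
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)  # drop colour information
    resized = cv2.resize(gray, (224, 224))        # VGG16 input size
    # VGG16 takes 3 channels, so the grayscale frame is repeated (assumption of this sketch).
    x = np.stack([resized] * 3, axis=-1).astype("float32") / 255.0
    return np.expand_dims(x, axis=0)
```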
If we want the output of a dynamic gesture, we have to switch to dynamic mode in the web app and click the Start button to begin real-time streaming. When 36 continuous frames have been captured, they are sent to the web server running on port 8000 and converted to a NumPy array. These frames are resized and fed to the joint 3DCNN and LSTM model.
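A short sketch of how the 36 frames might be packed into a clip for the joint model is shown below; the frame size is an assumption matching the earlier architecture sketch.

```python
# Stack 36 captured frames into a (1, 36, H, W, 3) clip for the 3DCNN + LSTM model.
import cv2
import numpy as np

def prepare_clip(frames, size=(112, 112)):
    """frames: list of 36 BGR images. Returns a normalized (1, 36, H, W, 3) array."""
    resized = [cv2.resize(f, size) for f in frames]
    clip = np.stack(resized).astype("float32") / 255.0
    return np.expand_dims(clip, axis=0)
```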

5.2
Feature extraction and classification using trained model
a) Static gesture recognition: In this phase, the main problem was the image capture rate. We overcame this by balancing the network's computation speed with the speed at which images are sent to the network. Captured frames are sent to the fine-tuned VGG16 model and the label with the highest probability is temporarily set as the output. Only if the predicted label of the gesture is the same for 5 continuous frames is the label sent to the user as the prediction. The predicted label text is then converted into the system's voice. All single labels are continuously cached into a temporary word variable. When the user makes the gesture for "space", the cached word is given to the user as output and the word variable is reinitialized to an empty string for the next word.
The system flow for static gesture recognition is shown in Figure 4.
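An illustrative sketch of this stabilisation and word-buffering logic follows: a letter is accepted only after it has been the top prediction for 5 consecutive frames, and a "space" gesture flushes the buffered word. The handling of "del" (removing the last cached letter) is an assumption of the sketch, as the paper does not specify it.

```python
# Accept a label only after it has been the top prediction for N consecutive frames.
class SentenceBuilder:
    def __init__(self, required_repeats=5):
        self.required = required_repeats
        self.last_label = None
        self.count = 0
        self.word = ""

    def update(self, label):
        """Feed the top-probability label of the current frame.
        Returns the completed word when a 'space' gesture is confirmed, else None."""
        self.count = self.count + 1 if label == self.last_label else 1
        self.last_label = label
        if self.count != self.required:
            return None
        if label == "space":
            word, self.word = self.word, ""
            return word
        if label == "del":
            self.word = self.word[:-1]   # assumed behaviour: drop the last letter
        else:
            self.word += label
        return None
```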
In the case of hand gesture recognition for digits as well, the gesture label is predicted only if that label occurs with the highest probability for 5 continuous frames.
b) Dynamic gesture recognition: The 36 captured frames are fed to the joint 3DCNN and LSTM model. The label of the gesture with the highest prediction probability is sent to the user and converted into the system's voice. Predicted labels are written inside a text area on the web app. The system flow for dynamic gesture recognition is shown in Figure 5.

Results

Performance of VGG16 model trained on ASL letters from A to Y (except J and Z) and gestures for "delete" and "space"
The confusion matrix for this model is shown in Figure 7. The VGG16 model trained on ASL letters attains an accuracy of 98.75% on test samples.

6.3
Performance of VGG16 model trained on digits from 0 to 9
The confusion matrix for this model is shown in Figure 8. Output is given both in text and voice format.

Conclusion
For American Sign Language (ASL), we proposed a web-based sign language recognition system. It translates sign language into text and predicts the labels using pictures captured from the web camera. We proposed a novel 3DCNN-LSTM classifier for dynamic gesture recognition. Accuracy on the test dataset of ASL letters is 97.50% and 99.5% on the test dataset of digits. On the validation dataset of dynamic gestures, the model attains an accuracy of 98.81%. Accuracy tested against hand gestures for ASL letters captured by webcam in real time is around 90%. The precision of the framework can be improved with appropriate illumination; it varies with complex backgrounds.