Leveraging Sensor Fusion and Sensor-Body Position for Activity Recognition for Wearable Mobile Technologies

— Smart devices such as smartphones and smartwatches have made the world smarter. These wearable devices are the product of complex research aimed at making them more usable and interactive for their users. Interactive mobile applications such as augmented reality (AR), virtual reality (VR), and mixed reality (MR) depend heavily on the built-in sensors of smart devices. Sensors such as the accelerometer and gyroscope enable many services: physical activities such as walking, jogging, and sitting can be analyzed with artificial intelligence for tasks such as health-state prediction and exercise-duration tracking. In this paper, we implement machine learning and deep learning algorithms to detect and recognize eight activities, namely walking, jogging, standing, walking upstairs, walking downstairs, sitting, sitting-in-a-car, and cycling, with a maximum accuracy of 99.3%. Some activities, such as sitting and sitting-in-a-car, are almost identical in action and difficult to distinguish, which makes the prediction task more challenging. We hypothesize that with more sensors (sensor fusion) and more data-collection points (sensor-body positions), a wider range of activities can be recognized and recognition accuracy can be increased. Finally, we show that combining all the sensor data from both the pocket/waist and the wrist allows a wide range of activities to be recognized accurately. The proposed methodologies hold significant promise for future mobile technologies. The adoption of recent deep learning algorithms such as the convolutional neural network (CNN) and bi-directional Long Short-Term Memory (Bi-LSTM) demonstrates the credibility of the presented methods.


Introduction
We are living in an age of technological advancement. Human-centric computing is an emerging field of research through which we can understand the nature of human behavior, habits, interests, and more. Human activity recognition aims to recognize or predict human movement by studying and analyzing human-computer interaction (HCI), video surveillance, wearable devices, or sensors. The challenging part is to detect and predict activities with high accuracy. In our work, we recognize human activity using the built-in sensors of smartphones. These sensors produce high-frequency data every second as different parts of the body move: a smartphone kept in the pocket does not produce the same data as one held in the hand. Precisely predicting human activities from such a wide range of sensor data is therefore a challenging task.
Human activity recognition can be defined as "determining or recognizing human activity using technological means". The literature offers several ways to recognize activities such as sitting, walking, running, walking upstairs, and walking downstairs. Among these, image processing [1][2][3][4] has been widely adopted, and the reviews [5][6][7][8][9] survey the various techniques focused on the prediction task. Feature-extraction and dimensionality-reduction approaches were used to obtain significant classifier performance. Sensor-based approaches [2,6] are not new in this area, but the usability of smartphone built-in sensors is still under active research.
For this paper, we placed one smartphone on the wrist and another in the pocket to capture data for different activities. The sensors used for this task are the accelerometer and gyroscope of each smartphone. The contributions of this paper are:
─ Integration of multiple sensor streams into one multi-modal dataset through a sensor fusion technique, for use in further processing.
─ Consideration of sensor-body positions (wrist, pocket, and wrist-pocket combined) when collecting sensor data.
─ Application of classification algorithms, e.g., KNN, CNN, bi-directional LSTM, and SVM, to recognize the different activities.
We also collected data for sitting in a car and cycling, which are novel activities in this domain; integrating them adds robustness to the training and testing procedure.

Related works
Intense research has been conducted over the past two decades on human-computer interaction, particularly human activity recognition. Several attempts have been made to detect human activity from sensors attached to different points on the body. In [10], accelerometer and microphone data were used to recognize activities in a certain environment; the aim was to maintain good accuracy even when the device's location on the body changes. Another attempt [11] used only the tri-axial accelerometer of a phone to detect activities, performing the experiment with the phone both on the wrist and in the pocket and comparing the two models' accuracy using both individual classifiers and classifier combinations; their results were promising. Vinh [12] used accelerometer data from both hip and waist, aiming to detect activity using low-powered, low-cost devices. Bao [13] used bi-axial accelerometer data from the wrist, ankle, thigh, elbow, and hip. While collecting data they did not ask the users where the devices should be placed during the activities, so the data came from those five different body locations. After testing and comparing several algorithms, they showed that thigh and waist data combined can perform close to all five data points combined; the decision tree was the best classifier in their experiment. Jatoba [14] and his team worked on monitoring patient activities, analyzing data from a micro-accelerometer placed on the patient's chest; with KNN and CART they obtained decent accuracy.
Most of the work has been done on accelerometer data taken from either the waist or the wrist, but Zhu [15] used accelerometer data from the foot and waist; their best-performing algorithm was the HMM. They reduced the complexity of the dataset by fusing the data collected from foot and waist, which overcame the HMM's need for a strong displacement of sensors to work well. Some attempts have also been made in a discriminative manner [16]. The works closest to ours are those of San-Segundo-Hernández [17] and E. Bulbul [18]. In [17], accelerometer data from the wrist and pocket were used, and the authors claimed the accelerometer provided better results than the gyroscope. Instead of analyzing accelerometer and gyroscope data separately, we treat both sensors' data as features and train our model on them; this way a wider range of activities can be recognized accurately, as the model has more features to work on. We demonstrate this by analyzing different combinations of sensor data in Section 3.2. Bulbul [18] used both sensors together to build a model and achieved very good results, but used only pocket data. Our work includes both sensors from two devices, one in the pocket and one on the wrist, so the model can be trained to distinguish among a wide range of activities even when they are quite similar, such as sitting in a moving car versus sitting in a chair in a stationary place. Similarly, hip-joint-based hand activity recognition was proposed in [38], where the distances between the hip joint and the hand joints were extracted using the Microsoft Kinect skeleton-tracking system. Different experiments have used different sensors placed at one or more body locations, and many algorithms have been implemented, such as SVM, ANN, HMM, CNN, and LSTM, but only a few perform well enough.
It is evident from the comparison above that when different sensors are combined as features, or when more than one sensor is placed at multiple locations on the human body, better performance and accuracy are achieved [33]. In the context of artificial intelligence, utilizing big data for green and sustainable technologies such as human activity recognition models is crucial [34][35]. For the betterment of health technologies, merging big data and green technologies is essential, and their significance is discussed in [36][37].
Machine learning algorithms, especially classification algorithms, have been utilized for problems similar to human activity recognition. Some significant problems addressed with machine learning include detecting malicious links on the world wide web (www) [21], handwritten digit recognition [22], depression detection from image and video analysis [23], and air temperature prediction [24]. Accordingly, we applied seven different classification algorithms: logistic regression, random forest, K-nearest neighbors (k-NN), support vector machine (SVM), gradient boosting, convolutional neural network (CNN), and bi-directional LSTM. The experimental results from these classifiers are highlighted in the results and analysis section.
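As an illustration of applying several of the listed classifiers, the following scikit-learn sketch fits two of them on synthetic 12-feature data; it is a minimal sketch only, as the real experiments train on the windowed sensor dataset described later.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Illustrative sketch: fit two of the listed classifiers on synthetic
# 12-feature data; the real pipeline uses the windowed sensor dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 12))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # synthetic two-class labels

scores = {}
for clf in (RandomForestClassifier(random_state=0),
            KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X[:300], y[:300])                   # train on the first 300 rows
    scores[type(clf).__name__] = clf.score(X[300:], y[300:])
```

The same `fit`/`score` pattern extends to the other listed classifiers (logistic regression, SVM, gradient boosting) by swapping in the corresponding estimator.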

Research subject and instrument
Nowadays, almost every person carries at least one smartphone, and many people have started wearing wristbands instead of analog watches. These gadgets contain various built-in sensors, of which accelerometers and gyroscopes are the most common, and these two sensors have many uses. In this paper, we focus on human activity detection and recognition using accelerometer and gyroscope sensor data. We detect eight activities: walking, standing, sitting, walking upstairs, walking downstairs, jogging, cycling, and sitting in a car. We collected the dataset using three Android smartphones: a Samsung Galaxy S10 Plus, a Redmi Note 7s, and a Huawei Y9. We used an application called "Sensor Data" to collect the required data.

Data collection procedure
Collecting high-frequency data properly was a challenging task. First, we recruited 17 volunteers to perform different activities for specific time periods. The Android application used to collect data was set to a 50 Hz sampling rate, so data were recorded at 50 samples per second. We used two smartphones at a time: one in the pocket and the other tied to the wrist, where it captures data similar to a wristwatch. For each activity, we thus obtained data from two locations (wrist and pocket), each with accelerometer and gyroscope readings that are three-dimensional, with data points along the x, y, and z axes. In total this gives 12 columns for the 4 sensors, each with 3 axes. We stored each distinct stream, such as 'Activity_Hand_accelerometer', 'Activity_Hand_gyroscope', 'Activity_Pocket_accelerometer', and 'Activity_Pocket_gyroscope', in text files. That completes our time-series data collection procedure. After the data collection phase, we pre-processed the raw data into a dataset suitable for the different algorithms. First, we removed unnecessary strings from the text files, because the mobile application generated some irrelevant strings at the beginning of each file. We then converted each text file into a comma-separated values (CSV) file, giving 12 input columns for each activity and 1 output column labelled with the activity name; for each activity, we took 44,000 samples. Finally, we merged all the CSV files into one complete dataset containing a total of 352,000 samples, or instances, which are fed into the pre-processing step.
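The column-wise fusion of the four sensor streams into one labelled frame can be sketched as follows. This is a minimal sketch: the helper and column names are illustrative assumptions, and random numbers stand in for the rows that the real pipeline reads from the 50 Hz text files; only the resulting 12-input-column, 1-label layout mirrors the description above.

```python
import numpy as np
import pandas as pd

def build_activity_frame(activity, n=1000, seed=0):
    """Simulate fusing the four sensor streams (wrist/pocket x accel/gyro),
    each with x, y, z axes, into one 12-column frame plus a label column."""
    rng = np.random.default_rng(seed)
    parts = []
    for device in ("Hand", "Pocket"):
        for sensor in ("accelerometer", "gyroscope"):
            cols = [f"{device}_{sensor}_{ax}" for ax in "xyz"]
            # In the real pipeline these rows come from the 50 Hz text files.
            parts.append(pd.DataFrame(rng.normal(size=(n, 3)), columns=cols))
    frame = pd.concat(parts, axis=1)   # 12 input columns, fused column-wise
    frame["activity"] = activity       # 1 output (label) column
    return frame

# Merging the per-activity frames row-wise yields the complete dataset.
dataset = pd.concat([build_activity_frame(a) for a in
                     ("walking", "jogging", "sitting")], ignore_index=True)
```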

Data pre-processing
In the feature-engineering step, a number of data-manipulation operations were needed. We removed null values using mean and median imputation. The dataset contained noise that had to be removed, so we applied a Butterworth filter to remove the noise and smooth the dataset. We then performed label encoding: since our output label was categorical, we converted it to numerical form using a label-encoding function. Furthermore, we split the dataset into input and output columns, and into a training set and a test set at a ratio of 4:1, i.e., an 80%-20% distribution. The test set was kept aside for the final testing stage, and K-fold cross-validation was used to train the models. This method is well established and widely accepted in the machine learning research community.
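The smoothing step can be sketched with a low-pass Butterworth filter on one 50 Hz sensor axis. The filter order (3) and cutoff (10 Hz) below are illustrative assumptions, not values reported here:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 50.0        # sampling rate in Hz, matching the data collection setup
cutoff = 10.0    # assumed cutoff: keep body-motion frequencies, drop noise
b, a = butter(N=3, Wn=cutoff / (fs / 2), btype="low")

t = np.arange(0, 4, 1 / fs)                       # one 4-second stretch
clean = np.sin(2 * np.pi * 1.5 * t)               # slow "activity" component
noisy = clean + 0.3 * np.random.default_rng(1).normal(size=t.size)
smooth = filtfilt(b, a, noisy)                    # zero-phase filtering
```

`filtfilt` runs the filter forward and backward, so the smoothed signal stays aligned in time with the raw one, which matters for windowed time-series labels.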
After splitting the dataset, we applied feature scaling to bring all data points into a specific range, using the RobustScaler and StandardScaler. We fit and transformed the scalers on the training set but only applied the transformation to the test set. This prevents data leakage, so the test set remains completely unseen by the model.
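A minimal NumPy sketch of this fit-on-train, transform-on-test pattern (the synthetic data and the plain standardization are illustrative; the pipeline itself uses scikit-learn's scalers, which follow the same fit/transform distinction):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 12))  # 12 sensor features
split = int(0.8 * len(X))                            # the 80/20 split
X_train, X_test = X[:split], X[split:]

mu = X_train.mean(axis=0)         # "fit": statistics from training data only
sigma = X_train.std(axis=0)
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma  # "transform": no test statistics leak in
```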
Since we have time-series data, we had to specify a fixed-size window, also known as a sliding window. We tested different window sizes, such as 2 seconds and 4 seconds, and observed the performance. Choosing the window size involves a trade-off: a bigger window generally yields better results, but too big a window will overfit the model, and processing becomes heavy because each window contains many data points, which causes further problems when activities are detected in real time. A smaller window, on the other hand, is fast to process but yields poorer results than a comparatively larger one. Considering this trade-off, we used a window size of 4 seconds. We also defined a hop size, also known as the stride. The sliding window moves forward by the stride, and at each step it creates one activity sample of dimension 200×12. The hop size used was large enough to reduce overlap, ensuring diversity among the windowed samples.
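The windowing described above can be sketched as follows; a 4 s window at 50 Hz gives 200 rows × 12 columns per sample, and the no-overlap hop of 200 rows used here is an illustrative assumption:

```python
import numpy as np

def make_windows(data, window=200, hop=200):
    """Slide a fixed-size window over the time series with the given stride,
    stacking each 200x12 slice as one activity sample."""
    starts = range(0, len(data) - window + 1, hop)
    return np.stack([data[s:s + window] for s in starts])

stream = np.random.default_rng(0).normal(size=(2000, 12))  # 40 s of data
samples = make_windows(stream)  # one 3D array of windowed samples
```

Shrinking `hop` below `window` would produce overlapping samples; the larger hop keeps neighboring windows more diverse, as described above.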
Using the sliding window, our dataset becomes three-dimensional. As mentioned earlier, 50 rows of data are recorded each second, so a 4-second window yields 200 rows and 12 columns, with each window representing one activity sample. This data shape suits neural-network models, so we used it directly for our CNN and bi-directional LSTM models. For traditional machine learning models, however, we needed to flatten the dataset: each 200×12 activity sample becomes 2,400 columns after flattening. So many columns can lead to overfitting, so we used Principal Component Analysis (PCA) to reduce the dimensionality.
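The flatten-then-PCA step for the classical models can be sketched via the SVD; the 50-component target below is an illustrative assumption, not a setting reported here:

```python
import numpy as np

rng = np.random.default_rng(0)
windows = rng.normal(size=(300, 200, 12))    # 300 windowed activity samples
flat = windows.reshape(len(windows), -1)     # flatten: (300, 2400)

# PCA via SVD of the mean-centered matrix: project onto the top-k
# right singular vectors (the principal axes).
centered = flat - flat.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
k = 50
reduced = centered @ Vt[:k].T                # principal-component scores
```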
In the feature-scaling step there were also some problems with 3D data when implementing the deep learning models, so we added an alternative option to reshape the whole training dataset by iterating over all the activity samples and reshaping them. This completes the procedure of our data collection, storage, cleaning, and pre-processing.

Data analysis
While performing any activity, the orientation of the phone and smartwatch is expected to differ across activities, and in most cases this difference does occur. However, two or more activities can often produce similar data, meaning the data points overlap in the same Cartesian region; in that case, models may mispredict. This can be seen in the Figure 3 visualization, where some of the data are plotted: in the pocket accelerometer data, the walking-downstairs and walking-upstairs data have similar dispersion, range, and orientation in some regions.

(Figure 3 panels: Walking Upstairs, Walking Downstairs, Sitting, Sitting in a Car)

Fig. 3. Comparative visualization of Accelerometer and Gyroscope data points in different activities
Therefore, machine learning algorithms might often struggle to distinguish between those activities. In such cases, wrist data adds more variation. This is hard to prove through visualization alone, so here we can only build intuition; in Section 3.2, however, the hypothesis is confirmed by analyzing different sensor combinations, which show that the highest accuracy is reached when all the sensor data are taken as features. Looking at the wrist accelerometer data, the dispersion and orientation differ between the upstairs and downstairs activities. Adding the gyroscope provides further information that the model can distinguish easily: in the gyroscope visualization, the upstairs and downstairs data differ considerably. So, visually, with these 4 sensors (2 on the wrist and 2 at the waist) the margin of error appears very low even for activities that produce almost identical data.
Sitting in a car and sitting in a chair at home are almost identical actions and hard to distinguish. From the Figure 3 visualization, the accelerometer data for these two activities are hard to differentiate: the data points have almost the same dispersion and orientation. Therefore, any machine learning algorithm will make many errors when predicting whether someone is sitting in a car or in a chair at home. In this scenario, too, the gyroscope data adds variation that can easily be differentiated. From these two scenarios, we can see that when data from the waist accelerometer and gyroscope and the wrist accelerometer and gyroscope are analyzed together, we can predict activities very accurately even when they are very similar in action. Putting all the activities together in a single graph further visualizes the problem; here only waist data are plotted, which is enough to give an intuition for the wrist data as well.
All the data plotted in these two figures (Figure 4 and Figure 5) come from the same instance; in other words, they cover the same specific 10 seconds of the dataset, so we can compare how the data points are distributed across the different activities.
We can clearly see that the upstairs and downstairs data, shown in yellow and cyan respectively, overlap in Figure 4 and Figure 5. So even with those 3 sensors we might often be unable to distinguish upstairs from downstairs, and some errors will eventually occur. From Figure 5, however, cyan and yellow are easily distinguished, so these activities can be separated quite accurately. In the same way, with those 4 sensors, other activities will be differentiated even if they are almost identical in action. That is why we are able to achieve an accuracy of 99.3% even though some of the activities are very similar to each other.

Experimental results and discussion
In the past two decades, there have been many attempts to recognize human activities precisely. Our experiments show that when we combine the accelerometer and gyroscope data from the pocket and the wrist as feature columns, a wide variety of activities can be detected and recognized with precision. The dataset we built and experimented with supports our hypothesis. We compare the accuracies obtained from different combinations of sensors and from different device positions when the devices are not used together, making it evident how the accuracy is affected if we remove one or more sensors or one device from the model. Here, by lower bound we mean the lowest accuracy achieved by a specific model among the eight activities, and by upper bound the highest.
When the accelerometer and gyroscope data from a device placed in the pocket are used, Table 2 shows that most of the algorithms perform well, achieving 96% to 97% accuracy. Looking carefully, however, the lower-bound score did not reach the expected level. Although this is better than recognition from accelerometer and gyroscope data of a device placed on the wrist, we can still improve the result by fusing all the data. So the models do not perform very well on this portion of our dataset, which we built very casually, mimicking real-life actions. Among the traditional classification algorithms, random forest showed quite satisfactory accuracy (Table 3); it is evident that in many similar cases [25] random forest is a good model, although SVM has proven to work better in some special cases, such as fault classification in smart distribution networks [26], ozone prediction [27], cyberbullying identification [28], and harmonic source identification [29]. In Table 2, we can see that CNN performed well, predicting all the activities with decent accuracy. At the individual-activity level, however, the lowest f1-score is only 0.77 when the device is placed on the wrist. So although some activities were detected quite accurately, with an upper-bound f1-score of up to 1.00, CNN struggled to detect others, even though CNN is an extremely powerful model that has shown satisfactory performance in many cases, such as license-plate detection [30] and rice false smut detection [31]. The conclusion from Table 2 is that data from only the wrist or only the pocket is not sufficient to predict the different activities accurately. If only accelerometer data is used (Table 3) from the two devices placed in the pocket and on the wrist, the models do not improve at all; the best-performing algorithms here are CNN and random forest.
Although some activities were recognized almost without error, with a perfect upper-bound f1-score of 1.00, the lower-bound f1-scores are still very low, whereas our aim is to recognize all the activities with the least margin of error.
The same holds for the models when gyroscope data alone is analyzed from those two locations: they are in no way better than the previously analyzed situations. What we can understand from these scenarios is that if we put all the data together, the streams add complementary information wherever the data of two or more activities overlap in one sensor or device, so the accuracy will improve, as claimed in our hypothesis. To prove this point, the final models are built using accelerometer and gyroscope data from pocket and wrist together as features, providing 12 features in total. Looking at Table 3, all the models improved substantially. Our best-performing model is the bi-directional LSTM, with a lower-bound f1-score of 0.99, which is very accurate for a model intended to recognize human activities with high precision. Our dataset includes activities that are very similar in action and hard to distinguish: sitting in a chair at home and sitting in a car are very similar, and sitting and standing are both very stationary. Yet all of these were recognized very accurately by the bi-directional LSTM.
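A hedged sketch of such a bi-directional LSTM classifier over the fused 200×12 windows, assuming TensorFlow/Keras; the layer sizes and optimizer settings are illustrative assumptions, as the exact architecture is not specified here:

```python
import numpy as np
import tensorflow as tf

# Illustrative bi-directional LSTM: input is one windowed sample
# (200 time steps x 12 fused sensor features), output is a probability
# distribution over the eight activity classes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(200, 12)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(8, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Forward pass on two dummy windows to confirm the expected shapes.
preds = model.predict(np.zeros((2, 200, 12)), verbose=0)
```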

Conclusion
Activities can be recognized in different ways using different algorithms and sensors, with devices placed at different locations on the human body. The aim, however, is to recognize activities accurately, and most importantly to detect complex activities that may be similar in action. We have introduced an effective way to recognize activities using accelerometer and gyroscope sensors from the pocket and wrist. This helped us accurately identify the eight activities, some of which are difficult to distinguish because of their similarity. Different traditional and deep learning algorithms were applied to our dataset, as found useful in several machine-learning-related works [39][40][41][42]. Among them, the bi-directional LSTM gives the best accuracy, 99.3%, which is comparatively better. Thus, the idea of sensor fusion prevails, and the sensor-body-position mechanism has produced quite significant results in terms of accuracy. Our proposed methods could be applied in health-related solutions; for example, a diabetes patient's wearable sensors [32] could be fused, and sensor-body positions may affect the proper recognition of health conditions.