Effective and Efficient Video Summarization Approach for Mobile Devices

Abstract: In the context of mobile computing and multimedia processing, video summarization plays an important role in video browsing, streaming, indexing, and storing. In this paper, an effective and efficient video summarization approach for mobile devices is proposed. The goal of this approach is to generate a video summary (static and dynamic) based on the Visual Attention Model (VAM) and a new Fast Directional Motion Intensity Estimation (FDMIE) algorithm for mobile devices. The VAM simulates the Human Vision System (HVS) to extract the salient areas, those with higher attention values, from the video content. The evaluation results demonstrate an effectiveness rate of up to 87% with respect to the manually generated summary and in comparison with state-of-the-art approaches. Moreover, the efficiency of the proposed approach makes it suitable for online and mobile applications.


I. INTRODUCTION
The increasing processing power, camera resolution, and memory size of mobile devices have resulted in an explosive growth of video capturing and streaming. Video has become an important medium for entertainment and communication between mobile users. Video is a complex form of multimedia composed of a sequence of images, audio tracks, and textual information. Moreover, video content is large and contains a lot of redundant information [1]. On mobile devices, browsing, indexing, retrieving, streaming, and storing such large video content is considerably more difficult than for other media formats, such as audio and text [1,2,3]. Therefore, video summarization is an important approach for quick browsing, fast streaming, efficient storage, and quick retrieval of video content [4,5,6].
Video summarization is the process of extracting the most important information from a video while reducing the amount of redundant information. The input video must be processed carefully in order to extract only the most useful content [7]. Generating a good video summary, however, requires a full understanding of the video, which is still a research challenge. Many video summarization approaches have been introduced in the literature [8,9,10]. Farouk et al. [11] presented an analysis and a comparative study of various mobile video summarization techniques according to a proposed set of criteria (for example, content structure, final summary representation, summarization features, summarization speed, summarization purposes, targeted devices, adaptability, and complexity). The comparative study showed that most of these approaches rely on low-level features, such as color and motion, to generate the summary. Unfortunately, these approaches are not effective enough because they do not take into account the human perception of the video content. In other words, there is a gap between the low-level features of the video and its semantic meaning [12,13].
Recently, the Visual Attention/saliency Model (VAM) has been widely used in computer vision and multimedia processing research and applications. By detecting the salient content, visual attention can reflect the user's interest in particular content and support user-targeted applications according to user preferences. In video processing, there are several VAM-based applications, such as video compression, summarization, retrieval, advertising, and recognition [14].
The advantages of video summarization include, but are not limited to, enhanced browsing, streaming, storage, and quick retrieval of video content. For example, people often use mobile devices to capture events and celebrations, and then publish the captured video to social networks (e.g., Facebook) or save it using a personal storage service, such as private or public cloud storage (e.g., Dropbox). If the captured video is large, however, transferring it across networks consumes a lot of time and bandwidth. In this case, video summarization can be used to reduce the size even further than video compression does, while preserving the main content, before the video is published.
In this paper, we propose an effective and efficient video summarization approach for mobile devices based on VAM. In this approach, VAM is applied to bridge the gap between the low-level video features and their semantic interpretation by the HVS. Moreover, we introduce a Fast Directional Motion Intensity Estimation (FDMIE) algorithm to calculate the motion intensity between consecutive frames. We implemented a prototype on the Android platform to test our approach. Any mobile device with Android version 4.0 or higher can run this prototype. We carried out experiments to measure the effectiveness and efficiency of the proposed approach. The results show that the proposed system is more effective and efficient than other related approaches.
This paper is organized as follows. Section II introduces related work on the visual attention model. The proposed approach is presented in Section III. Section IV presents the experiments and results of our approach. Finally, Section V concludes the paper and suggests future work.

II. RELATED WORK
The human brain receives a huge amount of information every second. About 80% of this information (up to 10 billion bits) is received by our vision system. Furthermore, the computational power of the human brain is not sufficient to perform complex analysis of all the input visual information. Therefore, the human vision system (HVS) applies a visual attention/saliency mechanism. In this mechanism, the HVS concentrates on the important visual information, which is called salient information. The brain processes this salient information quickly and with higher priority than non-salient information in order to increase the processing efficiency [14].
Therefore, some researchers try to design algorithms based on the visual saliency mechanism to develop intelligent and efficient applications. Although it is difficult to fully simulate the human attention mechanism, research in this direction has advanced significantly, guiding computers and devices to process information quickly, as the human brain does [15].
The visual saliency mechanism can provide a user-targeted service according to the users' preferences and interests by emphasizing the salient content. Recently, the visual saliency mechanism has played an important role in intelligent computer vision and multimedia applications. One of the important applications of VAM is video summarization. Ma et al. presented a user attention model to summarize the video based on the visual saliency mechanism [16]. This model was then extended into a generic framework of user attention that includes various attention models, such as the motion attention model, the static attention model, the face attention model, the camera attention model, and the speech attention model. These models are merged by a nonlinear fusion scheme [17]. Unfortunately, this framework is computationally expensive, and combining visual, aural, and linguistic features is a difficult task.
Therefore, Peng and Xiaolin [18] applied some improvements to the framework of Ma et al. [17]. They first use a color histogram and the K-means algorithm to cluster the frames. Then, the key-frame candidates with the highest Visual Attention Index (VAI) descriptor are selected from each cluster. Because of the use of the K-means algorithm, the output keyframes do not reflect the temporal order or the structure of the video. Lai and Yi addressed this problem by using a time-constrained clustering algorithm to preserve the sequential order of the video frames [12].
A comparative study between static and dynamic saliency is introduced in [19]. Two observations are derived from this study. Firstly, the image saliency is often different from the video saliency. Secondly, camera motions, such as zooming, panning, or tilting, have a significant effect on dynamic saliency detection.
Ejaz et al. [6,20] presented an efficient aggregated visual attention model for key frame extraction. This technique reduces the computational cost by using temporal-gradient-based motion saliency detection instead of the traditional optical flow methods. It then uses a non-linear weighted fusion method to merge the static and dynamic visual attention values.

III. THE PROPOSED APPROACH
In this section, we give a detailed description of the proposed approach. We introduce the approach architecture in Section III A. Then, Section III B shows how to compute the static attention model. In Section III C, we show how to compute the motion attention model. Section III D describes the fusion of static and motion attention models to generate the final attention curve. Finally, Section III E discusses the extraction of static and dynamic (skims) video summary based on this attention curve.

A. The Proposed Approach Architecture
The goal of this approach is to generate a video summary (static and dynamic) based on VAM for mobile devices. The proposed architecture is shown in Fig. 1. It consists of four modules. The first, frame sampling and resizing, reduces the computational complexity of the following modules. The aim of the pre-sampling and resizing module is to avoid redundant frames and to reduce the computational complexity so that the overall algorithm is efficient. The frame sampling approach is based on the assumption that consecutive frames are visually redundant. Therefore, instead of analyzing all the video frames, only some frames are analyzed, based on a predefined sampling rate. The sampling rate can be defined as a number of frames per second, as in [21,22], or as one frame per fixed number of frames, as in [23]. Based on the sampling size, the number of video frames to be analyzed is reduced. The smaller the sampling size, the shorter the video summarization time. Nevertheless, a smaller sampling size can lead to the loss of important information from the video and thus affect the quality of the summary. Therefore, the sampling size must be defined carefully to keep the important frames [21]. In our approach, the sampling rate is set to one frame per second. After that, each selected frame is resized to w/4 × h/4, where w and h are the width and height of the original frame.
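The pre-sampling and resizing step can be illustrated with a minimal Python sketch. It is not the prototype's code: the function names are hypothetical, frames are modeled as plain lists of rows, and the resizing uses simple decimation (keeping every fourth pixel), which is an assumption because the interpolation method is not stated.

```python
def sampled_frame_indices(total_frames, fps, samples_per_second=1.0):
    """Indices of the frames kept when sampling at `samples_per_second`."""
    if fps <= 0 or samples_per_second <= 0:
        raise ValueError("fps and samples_per_second must be positive")
    step = max(1, round(fps / samples_per_second))
    return list(range(0, total_frames, step))


def downscale_by_4(frame):
    """Shrink a grayscale frame (list of rows) to about w/4 x h/4.

    Assumption: plain decimation, i.e. every 4th row and every 4th column
    is kept; no smoothing is applied.
    """
    return [row[::4] for row in frame[::4]]


# A 10-second clip at 30 FPS keeps 10 frames at one frame per second.
indices = sampled_frame_indices(total_frames=300, fps=30)

# A synthetic 320x240 frame becomes 80x60 after resizing.
frame = [[(x + y) % 256 for x in range(320)] for y in range(240)]
small = downscale_by_4(frame)
print(len(indices), len(small[0]), len(small))
```

With a 320×240 input, every later module operates on 4800 pixels per sampled frame instead of 76800, which is where most of the saving comes from.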

B. The Static Attention Module
Static areas in the video may attract the user's attention as much as moving areas do. When users watch a video, interesting static areas (salient areas) can attract them (e.g., the traffic signs on a road) [12]. Therefore, the static attention module was developed to extract the important or interesting frames from the video content. Psychological studies suggest that the HVS is sensitive to the difference between a target area and its neighborhood. Therefore, the contrasts of color, texture, and shape features are important for visual saliency detection [12,17,24]. Consequently, we applied the generic contrast definition proposed in [17] to compute the color contrast. For each frame $F_t$ at time $t$, the contrast value $C_{i,j}$ of a pixel $p_{i,j}$ is computed as in (1).
$$C_{i,j} = \sum_{q \in \Theta} d(p_{i,j}, q) \qquad (1)$$
The symbol $p_{i,j}$ denotes the descriptor at the pixel (such as its color value), and $q$ is a pixel belonging to the 8-neighborhood $\Theta$ of $p_{i,j}$. The distance measure $d$ between two pixels may be any suitable distance measure. In this approach, $d$ is computed as the Euclidean distance.
Because the HVS is more sensitive to luminance (gray level) than to color [25], and in order to reduce the complexity, we take the luminance value of the pixel $p_{i,j}$ as the descriptor. After normalizing the contrasts at all pixels to [-128, 127], a saliency map is created, as shown in Fig. 2(c).
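A minimal Python sketch of this contrast computation is given below, assuming that the contrast of a pixel is the sum of its distances to the pixels in its 8-neighborhood. With a scalar luminance descriptor the Euclidean distance reduces to an absolute difference. The function names and the treatment of border pixels (neighbors outside the frame are ignored) are assumptions made for illustration.

```python
def contrast_map(lum):
    """Sum of distances between each pixel and its 8-neighborhood.

    `lum` is a list of rows of luminance values. Neighbors that fall
    outside the frame are ignored (an assumption).
    """
    h, w = len(lum), len(lum[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if (di or dj) and 0 <= ni < h and 0 <= nj < w:
                        total += abs(lum[i][j] - lum[ni][nj])
            out[i][j] = total
    return out


def normalize(values, lo=-128.0, hi=127.0):
    """Linearly rescale a 2-D list to [lo, hi]; constant input maps to lo."""
    flat = [v for row in values for v in row]
    vmin, vmax = min(flat), max(flat)
    span = vmax - vmin
    if span == 0:
        return [[lo for _ in row] for row in values]
    return [[lo + (v - vmin) * (hi - lo) / span for v in row] for row in values]


# A single bright pixel on a dark background is the most salient location.
lum = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
raw = contrast_map(lum)
saliency = normalize(raw)
```

In this toy frame the center pixel differs from all eight neighbors, so it receives the largest contrast and maps to the top of the normalized range, while every other pixel maps to the bottom.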
A saliency map is a gray image which contains attended/salient areas (bright areas) and unattended/non-salient areas (dark areas). The attended areas usually attract the user attention. In order to extract the attended areas of the saliency map, we use the following method.
Each Saliency Map (SM) is divided into non-overlapping Macroblocks (MB). Each MB is a two-dimensional array of size $D \times D$, denoted $MB_{x,y}$; the number of macroblocks depends on the frame size. Each $MB_{x,y}$ has a location $(x, y)$ defined by the location of the upper-left pixel of $MB_{x,y}$ in the SM. Algorithm 1 adds $MB_{x,y}$ to the attended set $A$ when the block passes the threshold test; otherwise, it adds $MB_{x,y}$ to the unattended set $U$. Accordingly, each SM is represented by two sets, $A$ and $U$. The set $A$ is the set of all non-overlapping attended blocks (areas). Similarly, $U$ is the set of all non-overlapping unattended blocks (areas). The two sets $A$ and $U$ are defined as in (2) and (3), respectively.
$$A = \{\, MB_{x,y} : \overline{MB}_{x,y} \ge T_{MB} \,\} \qquad (2)$$
$$U = \{\, MB_{x,y} : \overline{MB}_{x,y} < T_{MB} \,\} \qquad (3)$$
Here $\overline{MB}_{x,y}$ is the average gray level (brightness) of the pixels in the block $MB_{x,y}$. A threshold $T_{MB}$ is used to control the membership of $MB_{x,y}$ in the set $A$ or $U$. According to Algorithm 1, a saliency map is computed and the probability of attended areas $A$ in each SM for a given threshold $T_{MB}$ is obtained by (4), where $|A|$ denotes the cardinality of the set $A$.
$$P(A) = \frac{|A|}{|A| + |U|} \qquad (4)$$
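The block classification can be sketched in Python as follows. This is an illustrative reading of the description, not a transcription of Algorithm 1: the function names are hypothetical, a block is treated as attended when its mean gray level is greater than or equal to the threshold (whether the comparison is strict is an assumption), and partial blocks at the border are skipped.

```python
def classify_blocks(sm, block, threshold):
    """Split a saliency map into attended and unattended macroblocks.

    Returns two lists of (row, col) locations of the upper-left pixel of
    each block. Blocks that would run past the border are skipped.
    """
    h, w = len(sm), len(sm[0])
    attended, unattended = [], []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            total = sum(sm[y + r][x + c] for r in range(block) for c in range(block))
            mean = total / (block * block)
            if mean >= threshold:
                attended.append((y, x))
            else:
                unattended.append((y, x))
    return attended, unattended


def attended_probability(attended, unattended):
    """Share of blocks that are attended: |A| / (|A| + |U|)."""
    n = len(attended) + len(unattended)
    return len(attended) / n if n else 0.0


# 4x4 map with one bright 2x2 block in the upper-left corner.
sm = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
A, U = classify_blocks(sm, block=2, threshold=128)
p_attended = attended_probability(A, U)
print(A, p_attended)
```

One of the four blocks is attended here, so the attended-area probability for this map is 0.25; computing this value per sampled frame yields the raw static attention curve.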
After normalizing the attended-area probability $P(A)$ of each frame to [0, 1], a static attention curve ($SC$) is obtained, as shown in Fig. 3. The horizontal parts of the curve mean that the corresponding frames have the same attended-area probability and contain almost the same information. On the other hand, sudden changes in the curve mean that the content of the corresponding frames differs. The complexity of the visual static attention detection algorithm (Algorithm 1) for each frame is …,
where $D$ is the macroblock size.

C. The Motion Attention Module
The motion feature is important. It often increases the intensity of users' attention and keeps them locked on significant features and objects [6]. Therefore, most of the visual attention based video summarization approaches are based on the motion attention in different ways [6,12,20,26].
Motion estimation is a classical problem with a long research history. The two key families of motion estimation algorithms are optical flow and Block Matching Algorithms (BMAs), which have received attention from researchers because of their simplicity and efficiency. BMAs are usually less complex than optical flow algorithms, because optical flow algorithms process individual pixels while BMAs process blocks [27]. Yaakob et al. [28] introduced a comparative study of several BMAs in terms of their efficiency and quality. They concluded that FDGDS is a balanced algorithm that produces a high prediction quality at a low computational cost.
In this approach, we introduce a Fast Directional Motion Intensity Estimation (FDMIE) algorithm. FDMIE is an adapted version of the FDGDS algorithm [29] and was introduced to detect the Motion Intensity (MI) between consecutive frames. In general, motion estimation is a computationally intensive task, especially if it is performed for all regions in each frame of a video sequence. However, there are two ways of improving the efficiency of a motion estimation algorithm: one is to decrease the number of matching points, and the other is to choose an efficient block matching measure to reduce the complexity [30].
Therefore, in this approach, the motion intensity estimation is applied only to the regions in each frame that could potentially attract the users' attention due to motion (i.e., the attended areas), which decreases the computational cost significantly. Also, the Sum of Absolute Differences (SAD) is used to measure the match between two blocks. SAD is widely used because it offers good precision at a low computational cost [30,31].
According to the FDMIE algorithm (Algorithm 2), the motion intensity between the saliency maps is computed. For each block in a saliency map, FDMIE computes the current minimum distortion ($D_{min}$) between this block and the corresponding block in the saliency map of the adjacent frame using (5). Then, FDMIE searches the eight directions around the target block (shown in Fig. 4) for the directional minimum distortion ($D_{dir}$). The Relative Distortion Ratio (RDR) between $D_{min}$ and $D_{dir}$ is defined as in (6).
A threshold $T$ is used in FDMIE to control the convergence speed of the algorithm. If the RDR is lower than $T$, the other directional searches are skipped and a new round of search is started. The FDMIE output is a numeric value that represents the motion intensity of the frame. After normalizing the motion intensity value of each frame to [0, 1], a motion attention curve ($MC$) is obtained, as shown in Fig. 5. The complexity of the FDMIE algorithm (Algorithm 2) for each frame is …
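The two building blocks named above, SAD matching and the relative distortion check over eight directions, can be illustrated with the following Python sketch. It is not FDMIE itself. Several details are assumptions made only for illustration: the RDR is taken to be the directional minimum divided by the current minimum, each direction is probed with a one-pixel step, only a single search round is performed, and the threshold value and all function names are hypothetical.

```python
# The eight search directions around the target block as (dy, dx) offsets.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def sad(cur, ref, y, x, dy, dx, block):
    """Sum of absolute differences between the block of `cur` at (y, x) and
    the block of `ref` displaced by (dy, dx); None if it leaves the frame."""
    h, w = len(ref), len(ref[0])
    ry, rx = y + dy, x + dx
    if ry < 0 or rx < 0 or ry + block > h or rx + block > w:
        return None
    return sum(
        abs(cur[y + r][x + c] - ref[ry + r][rx + c])
        for r in range(block)
        for c in range(block)
    )


def directional_search(cur, ref, y, x, block, threshold=0.5):
    """One search round: return (best displacement, best distortion).

    Scanning stops early once the assumed ratio
    directional_min / current_min falls below `threshold`.
    """
    d_min = sad(cur, ref, y, x, 0, 0, block)
    best, best_d = (0, 0), d_min
    if d_min == 0:
        return best, best_d  # perfect match at zero displacement
    for dy, dx in DIRECTIONS:
        d_dir = sad(cur, ref, y, x, dy, dx, block)
        if d_dir is None or d_dir >= best_d:
            continue
        best, best_d = (dy, dx), d_dir
        if d_dir / d_min < threshold:
            break  # large improvement: skip the remaining directions
    return best, best_d


# A bright 2x2 block sits one pixel further right in `ref` than in `cur`.
cur = [[0, 0, 0, 0], [0, 9, 9, 0], [0, 9, 9, 0], [0, 0, 0, 0]]
ref = [[0, 0, 0, 0], [0, 0, 9, 9], [0, 0, 9, 9], [0, 0, 0, 0]]
displacement, distortion = directional_search(cur, ref, y=1, x=1, block=2)
print(displacement, distortion)
```

In this toy example the zero-displacement SAD is 18, and the search stops at the right-hand neighbor, where the SAD drops to 0, without probing the three remaining directions. Restricting such a search to attended blocks is what keeps the per-frame cost low.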

D. Attention Curve and Summary Extraction
After the static and motion curves are obtained separately, the two curves need to be merged in a meaningful way to construct the final attention curve ($AC$). In this approach, the final attention curve is constructed using the linear fusion scheme defined in (7), where $SC$ and $MC$ are the static and motion attention curves and $w_s$ and $w_m$ are their weights.
$$AC = w_s \cdot SC + w_m \cdot MC \qquad (7)$$
Since the human vision system is more sensitive to motion information than to static information [19,32], we chose $w_s$ = … and $w_m$ = …. Figure 6 shows an example of the final attention curve created by our approach during the experiments.
The peaks of the attention curve indicate the video frames that are most likely to attract the users' attention [17]. Based on this curve, static and dynamic (skim) video summaries are extracted around the curve peaks. If the length of the static summary (number of keyframes) $L$ is specified by the user, the $L$ frames with the highest attention values are selected from the sorted candidate keyframes. If $L$ is not specified, the 5-15% of the candidate frames with the highest attention values are selected. The dynamic video skimming problem can be defined as selecting an optimal set of clips that minimizes the distortion between the original video and its skim [33]. Based on the attention curve, dynamic video skim generation also becomes much simpler [17]. Given a skimming length or ratio, skim clips are selected around the peaks of the attention curve. If we have $Z$ pre-sampled frames, then the total complexity of our approach is computed as in (9).
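The fusion and keyframe selection steps can be sketched in Python as follows. The weights in the example (0.4 and 0.6) are illustrative values only, chosen so that motion weighs more than static attention; they are not the values used in the experiments. Treating strict local maxima as the candidate keyframes and the function names are likewise assumptions.

```python
def fuse(static_curve, motion_curve, w_static, w_motion):
    """Linear fusion of two attention curves of equal length."""
    if len(static_curve) != len(motion_curve):
        raise ValueError("curves must have the same length")
    return [w_static * s + w_motion * m for s, m in zip(static_curve, motion_curve)]


def peak_candidates(curve):
    """Indices of strict local maxima; end points compare to one neighbor."""
    peaks = []
    for i, v in enumerate(curve):
        left = curve[i - 1] if i > 0 else float("-inf")
        right = curve[i + 1] if i < len(curve) - 1 else float("-inf")
        if v > left and v > right:
            peaks.append(i)
    return peaks


def select_keyframes(curve, candidates, count):
    """Keep the `count` candidates with the highest attention, in time order."""
    ranked = sorted(candidates, key=lambda i: curve[i], reverse=True)
    return sorted(ranked[:count])


static = [0.2, 0.8, 0.1, 0.4, 0.3]
motion = [0.1, 0.9, 0.2, 0.6, 0.0]
attention = fuse(static, motion, w_static=0.4, w_motion=0.6)  # illustrative weights
candidates = peak_candidates(attention)
keyframes = select_keyframes(attention, candidates, count=1)
print(candidates, keyframes)
```

Sorting the selected indices at the end keeps the keyframes in temporal order, so the static summary follows the structure of the original video.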

IV. EXPERIMENTAL RESULTS
Quality of Service (QoS) requirements are essential to multimedia and mobile applications. QoS is commonly defined as the capability of a system to provide better service to users with a high degree of satisfaction. Several metrics are used to evaluate and measure QoS, including delay, jitter, packet loss ratio, throughput, error rates, and service availability [34,35].
This section presents the experiments on the proposed approach in terms of quality and efficiency, with a discussion of the results. More QoS metrics will be considered in a subsequent paper.

A. Data set and Testing Devices
This experiment was carried out on 5 video files from the standard data set used by many authors and available at the VSUMM web site [36]. The descriptions of these videos are listed in Table I. All videos are in MPEG-1 format with a resolution of 352×240. Because the input format of our approach is H.264/AVC, each video is first transcoded to H.264/AVC format with a resolution of 320×240 to match the standard format of mobile videos. We implemented a prototype on the Android platform to test our approach. Any mobile device with Android version 4.0 or higher can run this prototype. Table II shows the characteristics of the mobile devices that were used in this experiment.

B. Evaluation Strategy
The evaluation strategy is based on the popular metrics of Recall (R), Precision (P), and F-measure (F) [6]. In this strategy, the quality of the summary generated automatically by the approach is compared with the summaries of the same video generated manually by three different users. The metrics R, P, and F are then computed as in (10), (11), and (12).
$$R = \frac{N_{TM}}{N_{TM} + N_{FN}} \qquad (10)$$
$$P = \frac{N_{TM}}{N_{TM} + N_{FP}} \qquad (11)$$
$$F = \frac{2 \cdot P \cdot R}{P + R} \qquad (12)$$
Here the number of true match frames ($N_{TM}$) is the number of frames that are chosen as key frames both manually and automatically by the new approach. The number of false positive frames ($N_{FP}$) is the number of frames that are chosen as key frames by the approach but not manually. The number of false negative frames ($N_{FN}$) is the number of frames that are chosen as key frames manually but not by the approach. The recall metric represents the probability that a relevant key frame is selected by the approach, whereas the precision metric represents the probability that an extracted key frame is relevant. Recall and precision are complementary metrics, and the highest summary quality is achieved when both values are high. The F-measure combines recall and precision as their harmonic mean, so a higher F-measure indicates a higher summary quality.
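The three metrics can be computed with the short Python sketch below. It assumes the standard definitions of recall, precision, and F-measure (the harmonic mean of the two) and, for simplicity, treats two key frames as matching only when their frame indices are identical; a real evaluation would usually match frames by visual similarity. The function name and the example indices are hypothetical.

```python
def summary_metrics(auto_frames, manual_frames):
    """Return (recall, precision, f_measure) for two key-frame selections."""
    auto, manual = set(auto_frames), set(manual_frames)
    n_tm = len(auto & manual)   # chosen both automatically and manually
    n_fp = len(auto - manual)   # chosen automatically only
    n_fn = len(manual - auto)   # chosen manually only
    recall = n_tm / (n_tm + n_fn) if n_tm + n_fn else 0.0
    precision = n_tm / (n_tm + n_fp) if n_tm + n_fp else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# Hypothetical key-frame indices for one video.
auto_summary = [10, 40, 75, 120]
manual_summary = [10, 40, 90, 120, 150]
recall, precision, f_measure = summary_metrics(auto_summary, manual_summary)
print(recall, precision, round(f_measure, 4))
```

In this example three frames match, one automatic frame is a false positive, and two manual frames are missed, giving a recall of 0.6, a precision of 0.75, and an F-measure of about 0.667.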

C. Quality Evaluation
In order to evaluate the quality of the proposed approach, we compare it with other static video summary approaches. The compared approaches include Video SUMMarization (VSUMM) [21] and STIll and MOving video storyboard for the web scenario (STIMO) [37], which are video summarization approaches that are not based on visual attention. We also compare the proposed approach with other visual-attention-based video summarization approaches, namely those of Lai and Yi [12] and Ejaz et al. [6].
The comparative results are provided in Table III, and an example is shown in Fig. 7. The results demonstrate that the proposed approach achieved an average F-measure of 0.87 with respect to the manually generated summary. Moreover, the results indicate that the proposed approach has high values for both the R and P metrics in comparison with the other approaches.

D. Efficiency Evaluation
Efficiency evaluation is an important issue when comparing similar approaches. Because the source code of most video summarization approaches is not available, and because the time required to produce a video summary (static or skim) depends on the particular hardware and the adopted features, it is almost impossible to produce a fair efficiency comparison among these approaches [11]. Therefore, the efficiency of the proposed approach is evaluated by counting the number of frames that can be processed per second. This includes the partial decoding/encoding time of each frame. This study was carried out on the first mobile phone described in Table II and on all the videos listed in Table I. According to those experiments, the proposed approach can process an average of 14 FPS, as shown in Table IV. For online applications, based on a maximum waiting time of 39 s [38], the proposed approach can process an average of 546 frames in 39 s. With a sampling rate of one frame per second, our approach can therefore be used for videos with a duration of up to 9 min (about 16200 frames at 30 FPS). Consequently, the proposed approach can be used for online applications with video segmentation and a small initial delay. It is important to note that these results depend on the computational power of the target mobile device.
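The arithmetic behind these figures can be checked with a few lines of Python. The function name is hypothetical; the inputs (14 frames per second, a 39 s waiting time, one sampled frame per second, and 30 FPS source video) are the values quoted above.

```python
def max_video_duration_seconds(throughput_fps, max_wait_s, sampling_rate_fps=1.0):
    """Longest video whose sampled frames fit into the waiting-time budget."""
    frames_in_budget = throughput_fps * max_wait_s
    return frames_in_budget / sampling_rate_fps


budget_frames = 14 * 39                                # frames processed in 39 s
duration_s = max_video_duration_seconds(14, 39, 1.0)   # seconds of video covered
duration_min = duration_s / 60
frames_in_9_min = 9 * 60 * 30                          # source frames at 30 FPS
print(budget_frames, duration_min, frames_in_9_min)
```

The budget of 546 sampled frames corresponds to 546 s of video, or about 9.1 minutes, which is consistent with the stated limit of roughly 9 minutes (16200 source frames at 30 FPS).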

V. CONCLUSIONS
This paper proposes an effective and efficient video summarization approach that is suitable for mobile devices and online applications. The approach is summarized as follows. Firstly, the static attention module is applied to generate a static attention curve. Secondly, we introduce a Fast Directional Motion Intensity Estimation (FDMIE) algorithm to calculate the motion intensity between consecutive frames. The motion intensity values are then used to construct a motion attention curve. Thirdly, the static and motion attention curves are merged to form a final attention curve. Finally, static and dynamic video summaries are extracted based on this attention curve. Our evaluation is experimental. We measure the quality and efficiency of our approach and compare it with other similar approaches. The results show that our approach achieves high quality (up to 87%) and efficiency relative to similar approaches.
In the future, we intend to build a content-aware video summarization and streaming system based on the proposed approach. For this, the QoS requirements will have to be taken into consideration.