A Sentiment Analysis Tool for Determining the Promotional Success of Fashion Images on Instagram

Sentiment Analysis (SA), or Opinion Mining, is the process of analysing natural language texts to detect an emotion or a pattern of emotions towards a certain product in order to make a decision about that product. SA is a topic within the text mining, Natural Language Processing (NLP) and web mining disciplines. Research in SA is currently at its peak given the amount of data generated from social media networks. Consumers express exactly what they need, want and expect from a product, yet companies lack the tools to analyse and understand these feelings and to satisfy consumers accordingly. One of the applications that generates a high rate of reactions and sentiments in social networks is Instagram. This study focuses on analysing the reactions generated by the top 50 fashion houses on Instagram, using the 20 images with the highest number of likes for each house. The approach taken in this study is to qualify the visual aesthetics of fashion images and to establish why some succeed on social media more than others. The basic question asked in this paper is whether certain visual aesthetics appeal more to the user and are therefore more successful on social media than others, as determined by a measure we introduce, ‘Social Value’. To do so, a sentiment analysis tool is developed to measure the proposed social value of each image. The comments on each image are taken as input. Each comment goes through a preprocessing phase, and each word is then looked up in a lexicon to identify whether it is positive or negative. The output of the lexicon is a score assigned to each comment that indicates its degree of positivity or negativity, or that it has no effect on the social value. In addition, the number of likes and shares is taken into consideration when quantifying the image’s value. A cumulative result is then produced to determine the social value of an image.


Introduction
With the rise and dominance of social media in almost every activity of daily life, the need to understand its power and the potential it can offer has become more pressing. Because social media enables user-generated content, opinion mining becomes critical for investigating that content and identifying feedback towards a certain product, political campaign, initiative, etc. There are different levels of opinion mining, each depending on the domain in question. The level of most interest with regard to social media is opinion mining at feature level. This paper focuses on Instagram and the effect of the pictures that major fashion houses post on the actual buying decisions of customers. This entails extracting the features of the object in question (whether a comment or an image) and then determining whether the opinion is positive or negative. This paper is organized as follows. Section two reviews the literature to introduce the different approaches used in sentiment analysis, gives a brief overview of machine learning and lexicon-based approaches, and finally discusses two recent techniques employed on social media platforms: localized Twitter opinion mining using sentiment analysis, and unsupervised sentiment analysis in social media. Section three introduces the proposed application, its workflow and the modules it is composed of. A brief summary follows, and then the future work for this application. This paper is the first of a series of papers tackling sentiment analysis on Instagram.

Literature Review
To determine the orientation of a particular opinion, two types of techniques can be used: lexicon-based methods and machine learning methods. Machine learning can be further divided into supervised, semi-supervised and unsupervised learning. Supervised learning generally refers to the use of classification techniques. A number of such techniques are used in opinion mining, including Naïve Bayes, Support Vector Machines, and the Multi-layer Perceptron (Miwari, Singh, & Srivastava, 2015). With such techniques, the training data is already labelled with classes (in this case, the nature of a particular comment), and the labelled data is used to develop a model that is later used to identify the class of unknown or unlabelled examples.
Where the provided documents are unlabelled, unsupervised techniques are used. Algorithms of this kind include Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA) (Arora, Patil, & Correia, 2015). These algorithms extract hidden topics from within the document's text, and the topics are treated as features of the document (Benevenuto, Araujo, & Riberio, 2015). The challenge with such techniques is that a large amount of data is needed for training in order to provide valid and beneficial information.
Semi-supervised techniques attempt to handle the disadvantages of both supervised and unsupervised methods. This is a relatively new approach that learns from both labelled and unlabelled data: a small amount of labelled data is used for learning, and what is learned is then applied to the unlabelled data (Arora, Patil, & Correia, 2015).
Other techniques that can be used in opinion mining are lexicon-based approaches. These techniques belong to one of two families: dictionary-based methods and corpus-based methods. Dictionary-based methods find the opinion words within a document or sentence and then search the dictionary for their corresponding synonyms and antonyms (Feldman, 2013). Corpus-based methods, on the other hand, are provided with a list of opinion words and search the entire corpus for similar words with relevant context. This can be done using statistical or semantic methods (Arora, Patil, & Correia, 2015; Hridoy, Ekram, Islam, Ahmed, & Rahman, 2015).
Extensive work has been done using both approaches. Paltoglou & Thelwall (2012) proposed a lexicon-based classifier that predicts the degree of emotional valence in text so as to provide predictions that are intended to tackle the issue of sentiment analysis (Xiaowen, Bing, & Philip, 2008). The classifier is enhanced by adding "an extensive list of linguistically driven functionalities", which further improves the predictions provided (Paltoglou & Thelwall, 2012). This work is considered unsupervised because it uses no reference corpus and needs no training. The proposed solution is intended to determine the polarity of a text, being positive (+1), negative (-1) or neutral (0), together with the intensity of the emotion. The level of valence is expressed as two separate ratings, one indicating the positivity (1…5) and the other indicating the negativity (-1…-5) of the comment. The ratings 1 and -1 indicate the absence of the emotion in question. The score is expressed as follows: {Cpos, Cneg}. Within a given comment or sentence, the numbers of positive and negative tokens are taken into consideration, and the class associated with the maximum value is selected. Given a particular document, the algorithm identifies all the emotional words (by searching through an emotional dictionary) and identifies their polarity and intensity. The initial scores are then modified according to the linguistic features identified around them (negation, capitalization, exclamations and emoticons, intensifiers, and diminishers). Each of these categories has a different weight. The assignment of weights is explained in detail in Paltoglou & Thelwall (2012). Applying the above steps then yields the total score (Paltoglou & Thelwall, 2012).
The work was tested on three real data sets, and the results showed that the proposed classifier outperforms machine learning techniques in the majority of cases.
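The scoring scheme described above can be illustrated with a minimal sketch. It is not the classifier of Paltoglou & Thelwall (2012): the lexicon, the word lists, the weights and the function names are all hypothetical, and capitalization, exclamations and emoticons are left out for brevity. It only shows how a pair of ratings {Cpos, Cneg} might be derived and how the class with the larger magnitude is then selected.

```python
import re

# Toy emotional dictionary (hypothetical): word -> valence, 2..5 or -2..-5.
LEXICON = {"love": 4, "good": 3, "nice": 2, "hate": -4, "bad": -3, "ugly": -3}
NEGATIONS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1, "really": 1}       # assumed weight added to strength
DIMINISHERS = {"slightly": 1, "somewhat": 1}  # assumed weight removed from strength


def clamp(value, low, high):
    return max(low, min(high, value))


def score_comment(text):
    """Return (c_pos, c_neg, label) for one comment."""
    tokens = re.findall(r"[a-z']+", text.lower())
    c_pos, c_neg = 1, -1  # 1 and -1 mean that the emotion is absent
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        valence = LEXICON[token]
        sign = 1 if valence > 0 else -1
        strength = abs(valence)
        previous = tokens[i - 1] if i > 0 else ""
        if previous in INTENSIFIERS:
            strength += INTENSIFIERS[previous]
        elif previous in DIMINISHERS:
            strength -= DIMINISHERS[previous]
        elif previous in NEGATIONS:
            sign = -sign  # simplification: negation flips the polarity
        strength = clamp(strength, 1, 5)
        if sign > 0:
            c_pos = max(c_pos, strength)
        else:
            c_neg = min(c_neg, -strength)
    # Select the class whose rating has the larger magnitude; a tie is neutral.
    if c_pos > -c_neg:
        label = 1
    elif -c_neg > c_pos:
        label = -1
    else:
        label = 0
    return c_pos, c_neg, label
```

With this toy lexicon, a comment with no emotional words keeps the ratings (1, -1) and is labelled neutral, which matches the convention that 1 and -1 denote the absence of an emotion.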
Hridoy et al. (2015) proposed another, more specific technique that allows for the "utilization and interpretation of twitter data to determine public opinion". A crucial step in this solution was the extraction and processing of the data, given its unstandardized nature. The data was obtained from Twitter's API and was cleaned up using Java. The Stanford Natural Language Processing (SNLP) tool was then used to label the data, since SNLP provides the grammatical relations between words. Since not all relations are meaningful, a selection of 50 relations was taken into consideration to identify the pieces of useful information (Hridoy, Ekram, Islam, Ahmed, & Rahman, 2015).
A numeric value must be used to indicate the sentiment in a particular tweet. To that end, SentiWordNet was used to assign scores to the tweets. Taking into consideration each word and its part of speech within the sentence in question, SentiWordNet assigns a numeric score to each word, and these scores are added together to give the score for the entire tweet. The assigned score is a value between -1 and 1. The lower the value, the more negative the sentiment, and vice versa (Hridoy, Ekram, Islam, Ahmed, & Rahman, 2015). Since SentiWordNet can only identify words and not sentences, part-of-speech tagging, which is also provided by the SNLP tool, is used to differentiate the various sentences. A custom program was implemented for the tagger, since SentiWordNet only recognizes verbs, adjectives, adverbs and nouns. The method can be further generalized to investigate comments related to a particular location, gender, etc. The experimental results showed that the adopted method provided meaningful and relevant information for the problem in question and that it can be easily generalized to fit other problem definitions (Hridoy, Ekram, Islam, Ahmed, & Rahman, 2015).
Upon investigating the surveyed literature, it is evident that lexicon-based approaches are better suited when unsupervised learning is in order. In both the surveyed works, the proposed lexicon-based approaches have proved to be more efficient than machine learning-based techniques with respect to unsupervised learning. As such, the solution proposed in this work follows a lexicon-based approach.

Proposed Application
The main purpose of this paper is to propose an application that can produce the social value, or impact, of a brand in the market through Instagram. Data about the brand will be collected from the images uploaded by the brand. Sentiment analysis will be applied to the comments on these images to identify the impact each image has on the brand, and the results will then be accumulated into a single social value.
Based on the research done on sentiment analysis, there are various levels and techniques by which the social value can be extracted from the data provided. The proposed application focuses on feature-level mining to determine whether the value of the data is positive, negative, or neutral, and by what percentage. The extraction technique used is the lexicon-based approach, which is based on natural language processing and utilizes part-of-speech tagging and WordNet. In this section we describe the model of the proposed application presented in Fig. 1.

Data Module
The data required for the proposed application is collected from Instagram. The collected data consists of the comments written by the followers of a certain brand. This data is then pre-processed to prepare it for the sentiment analysis module. This part describes both the comment extraction and the comment pre-processing components.

A. Comment Extraction
Comment extraction will be performed through the PHP API provided by Instagram. As a security measure, Instagram allows only the last 150 comments of each selected image to be extracted (ref Instagram). The comments will be extracted and placed in a MySQL database, where each comment is linked to the image it was extracted from. Each image record also holds extra information, such as the total number of likes, the total number of comments, and the brand name, which will be used later.
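The storage layout described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the MySQL database so that the example is self-contained, and the table names, column names and helper functions are hypothetical. The call to the Instagram API is not shown; the comments are assumed to have been fetched already.

```python
import sqlite3

# Hypothetical schema: one row per image, one row per comment linked to it.
SCHEMA = """
CREATE TABLE image (
    image_id      TEXT PRIMARY KEY,
    brand_name    TEXT NOT NULL,
    like_count    INTEGER NOT NULL,
    comment_count INTEGER NOT NULL
);
CREATE TABLE comment (
    comment_id INTEGER PRIMARY KEY AUTOINCREMENT,
    image_id   TEXT NOT NULL REFERENCES image(image_id),
    body       TEXT NOT NULL
);
"""

MAX_COMMENTS_PER_IMAGE = 150  # extraction limit stated in the text


def open_store(path=":memory:"):
    """Create the (stand-in) database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_image(conn, image_id, brand_name, like_count, comment_count, comments):
    """Store an image and its most recent comments; return how many were kept."""
    recent = comments[-MAX_COMMENTS_PER_IMAGE:]
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO image VALUES (?, ?, ?, ?)",
            (image_id, brand_name, like_count, comment_count),
        )
        conn.executemany(
            "INSERT INTO comment (image_id, body) VALUES (?, ?)",
            [(image_id, body) for body in recent],
        )
    return len(recent)


def comments_for(conn, image_id):
    """Return the stored comment texts of one image, oldest first."""
    rows = conn.execute(
        "SELECT body FROM comment WHERE image_id = ? ORDER BY comment_id",
        (image_id,),
    )
    return [row[0] for row in rows]
```

Keeping the like and comment totals on the image row, separate from the comment rows, means the totals remain available even when only the last 150 comments are stored.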

B. Comment Pre-Processing
After the data required for processing has been collected, it needs to be pre-processed to remove any irrelevant comments that are not opinionated. The Stanford Natural Language Processing (SNLP) tool will be used in the pre-processing stage. SNLP is an open-source natural language processing tool developed by Stanford University. The tool will aid in retrieving parts of speech and can identify the grammatical relationships between the words in a sentence.
The first step is to normalize every word in all comments, using the SNLP tool to stem the words and to remove any emojis, punctuation, or extra spaces. Emojis are removed to reduce the complexity of the pre-processing phase. Secondly, the tool is used to tag each word with its part of speech. After completing this task, we need to identify whether each comment is opinionated or not. To do so, we need to identify the relationships between the words, and the SNLP tool helps with this task. There are 50 dependencies integrated in the tool that define the relationships between words, but only nsubj, amod, and dobj will be used to eliminate non-opinionated comments.
The nsubj dependency will be used to identify the relations between the nouns and the adjectives or verbs in a sentence that complement the noun. The amod dependency will be used to identify the adjectives that modify the noun phrase. The final dependency, dobj, will be used to identify the direct objects that a verb refers to in a sentence. The elimination process targets those comments that do not contain at least one of the dependencies listed above (Hridoy, Ekram, Islam, Ahmed, & Rahman, 2015).
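The cleaning and elimination steps can be sketched as follows. The sketch assumes that the dependency parse of each comment has already been produced (for example by the SNLP tool) and is available as a list of (relation, governor, dependent) triples; that data structure and the function names are hypothetical. Stemming and part-of-speech tagging are not shown, and the cleaning rule assumes English comments, since it keeps only ASCII letters and digits.

```python
import re

# The three dependency relations that mark a comment as opinionated.
OPINION_RELATIONS = {"nsubj", "amod", "dobj"}


def clean_comment(text):
    """Drop emojis and punctuation, then collapse repeated whitespace."""
    letters_only = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", letters_only).strip()


def is_opinionated(dependencies):
    """True if at least one triple uses nsubj, amod or dobj."""
    return any(relation in OPINION_RELATIONS for relation, _gov, _dep in dependencies)


def filter_comments(parsed_comments):
    """Keep the text of comments whose parse has a relation of interest.

    parsed_comments: list of (comment_text, dependencies) pairs.
    """
    return [text for text, deps in parsed_comments if is_opinionated(deps)]
```

For instance, a comment parsed as [("amod", "dress", "beautiful")] would be kept, while one whose only relation is ("det", "dress", "the") would be eliminated.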

Sentiment Analysis Module
After processing the data and filtering out unwanted comments, we can assign a sentiment value to each comment and define the social value of each brand. This module works in three phases, defined by the comment analyser, image evaluator, and brand evaluator components. In this section we define each component and describe how the components are linked together.

A. Comment Analyser
This component is responsible for scoring each comment with a cumulative sentiment score. Built into the SNLP tool is the SentiWordNet tool, which can classify a word as positive, negative or neutral by scoring it on a scale from -1 to 1, with -1 being most negative, 1 being most positive, and 0 being neutral, that is, not affecting the comment in any manner. The tool takes as input a word and its part-of-speech tag and produces a sentiment score. The scores of all the words in a comment are then summed to give the total sentiment score of that comment.
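The accumulation step can be sketched as follows. The lookup table is a hypothetical stand-in for the lexicon: the entries and their scores are invented for illustration and are not taken from SentiWordNet, and the single-letter part-of-speech tags are likewise an assumption.

```python
# Hypothetical scores keyed by (word, part-of-speech tag); values lie in [-1, 1].
TOY_SCORES = {
    ("beautiful", "a"): 0.75,
    ("love", "v"): 0.5,
    ("ugly", "a"): -0.625,
    ("hate", "v"): -0.75,
}


def word_score(word, pos, scores=TOY_SCORES):
    """Score of one tagged word; unknown words are neutral (0.0)."""
    return scores.get((word.lower(), pos), 0.0)


def comment_score(tagged_words, scores=TOY_SCORES):
    """Sum of the word scores of one comment.

    tagged_words: list of (word, pos) pairs for the comment.
    """
    return sum(word_score(word, pos, scores) for word, pos in tagged_words)
```

Note that, because the comment score is a sum, it is not confined to the range from -1 to 1 even though each individual word score is.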

B. Image Evaluator
As stated before, each brand has a set of images from which the comments were extracted. In this component we evaluate each image by computing its own sentiment score using Equation 1.
The score of an image is defined as the cumulative score of all the comments related to the image divided by the total number of comments related to the image. 'i' represents the image that will be scored, 'n' represents the total number of comments related to image 'i', and 'j' represents the current comment.
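From this definition, Equation 1 can be written out as below. The symbol names are assumptions introduced here for readability: S_img(i) denotes the score of image i, and S_com(i, j) denotes the sentiment score of the j-th comment on image i.

```latex
\begin{equation}
S_{\mathrm{img}}(i) = \frac{1}{n} \sum_{j=1}^{n} S_{\mathrm{com}}(i, j)
\end{equation}
```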

C. Brand Evaluator
The final component is responsible for outputting the final social value of each brand. After the module has evaluated each image in the previous component, this component takes the average of the image scores related to the brand using Equation 2.
Here 'i' represents the brand at hand, 'n' represents the total number of images related to brand 'i', and 'j' represents the current image out of the 'n' images selected.
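From this definition, Equation 2 can be written out as below. The symbol names are assumptions introduced here for readability: S_brand(i) denotes the social value of brand i, and S_img(j) denotes the score, as defined by Equation 1, of the j-th image selected for that brand.

```latex
\begin{equation}
S_{\mathrm{brand}}(i) = \frac{1}{n} \sum_{j=1}^{n} S_{\mathrm{img}}(j)
\end{equation}
```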

Output
The application will output the final social value of any selected brand. In addition, brand experts will be able to fully analyse the social value that was produced by going through all the selected images and reviewing which ones had the lowest scores, in order to modify or remove them. Conversely, they can also see which images had the highest scores, in order to promote those images further and gain more traction in the market.
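The two averaging steps and the ranking offered to brand experts can be sketched together as follows. The function names and the input structure (a mapping from an image identifier to the list of its comment scores) are hypothetical. Treating an image with no remaining comments as neutral (0.0) is an assumption of this sketch, not something the method specifies.

```python
from statistics import mean


def image_score(comment_scores):
    """Mean of the comment scores of one image (the image evaluator)."""
    # Assumption: an image with no surviving comments scores 0.0 (neutral).
    return mean(comment_scores) if comment_scores else 0.0


def brand_report(images):
    """Summarise one brand.

    images: dict mapping an image id to the list of its comment scores.
    """
    if not images:
        raise ValueError("a brand needs at least one image")
    scores = {image_id: image_score(cs) for image_id, cs in images.items()}
    ranked = sorted(scores, key=scores.get)  # lowest-scoring image first
    return {
        "social_value": mean(list(scores.values())),  # the brand evaluator
        "image_scores": scores,
        "ranked": ranked,
    }
```

The first entries of the ranked list are the candidates for modification or removal, and the last entries are the candidates for further promotion.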

Conclusion
By calculating a newly proposed value for pictures, called "Social Value", the impact of a picture can be better quantified in terms of how people react to it. Such quantification provides a strong baseline on which companies (in the case of this paper, fashion houses) can build their future marketing campaigns, given the social impact of their previous images and how their followers reacted to them. The aim is to arrive at a self-sufficient application that can recommend to users the best features to include in their pictures to better suit the tastes of their various followers on social media.

Future Work
The results produced by this application will be presented to domain experts to assess how accurate they are. After feedback has been received from the domain experts, modifications can be made to improve the accuracy of the application. The next step for the application is to include sentence-level opinion mining in order to extract the specific features that the brand's followers have targeted, as well as to add a machine learning technique to further increase the accuracy of the social value of each brand. This paper is the first of a series of papers on sentiment analysis of Instagram specifically.