ECharacterize: A Novel Feature Selection-Based Framework for Characterizing Entrepreneurial Influencers in Arabic Twitter

Social media are widely used as communication platforms in the world of business. Twitter, in particular, offers valuable opportunities for collaboration due to its open nature. Many entrepreneurs therefore use Twitter for different purposes, such as mobilizing financial resources, obtaining funding, and increasing their innovation capabilities, and they keep looking for local entrepreneurial accounts to help them. Messages from entrepreneurial influencers (opinion leaders) increase the diffusion of information to entrepreneurs, helping them to find more opportunities. Discovering the characteristics of entrepreneurial influencers in Twitter networks is therefore important, since it reveals how to reach entrepreneurs. In the present paper, we propose a novel framework called ECharacterize, based on feature selection techniques, to discover the characteristics of entrepreneurial influencers in the Saudi context in a robust manner. The framework extracts a wide range of influencer features and then employs seven state-of-the-art ranking methods to determine the most relevant influencer characteristics. It then aggregates the resulting lists into an accurate final list using Robust Rank Aggregation. The framework was examined on 233,018 real-life Arabic tweets. The results show the ability of the proposed method to distinguish influencers by their popularity, reliability, and activity level.
Keywords: Twitter, characteristics of influencers, entrepreneurial influencers, robust ranked list.


Introduction
In the last decade, a variety of social media platforms have opened up a new world of information, starting with Myspace, which faded with the rise of Facebook and Twitter. Later, life-sharing social networks such as Instagram, Snapchat [1], and others gave us the ability to share our voice, make new friends, become more social in a positive way, and share cultural information and build an extensive [...] state-of-the-art algorithms for machine learning, supervised prediction models to prove its efficiency and correctness.
The rest of this paper is organized as follows: Section 2 reviews the theoretical literature. Section 3 explains the phases of the framework and its evaluation. Section 4 discusses the obtained results and interprets them. Finally, Section 5 presents the conclusions and future work.

Literature Review
This section reviews features that can be used to characterize Twitter users in subsection 2.1. All the discussed features are extracted to characterize entrepreneurial influencers in the ECharacterize framework. The ranking methods embedded in the ECharacterize framework are then explained in subsection 2.2, followed by an explanation of the aggregation method in subsection 2.3.

Features
The features are grouped into five categories. The categorization of features does not follow any standard, so authors usually categorize them thematically. The following paragraphs describe the features in detail by group.
User profile: The first group gathers features related to user profiles. Feature 1 (Verified) indicates whether the user's account is verified by Twitter [13]. Feature 2 (Description length) is the number of characters written by the user to describe themselves. This feature is considered a good indicator of the user's presence on Twitter and online in general; corporate accounts and professional bloggers tend to fill in their profiles [14]. Features 3, 4, and 5 (URLs, usernames (mentions), and hashtags) count these elements in the textual profile description. Previous studies [14] [15] show that some users use them to signal their professional or distinguished roles and to gain visibility in a specific area. Feature 6 (Profile age) could be related to the user's visibility on Twitter, since it takes time to reach an influential position [14].
Activities and publications: This category focuses on how influencers behave when publishing tweets. Feature 7 (Tweet count) is the total number of tweets the user has posted, while Feature 8 (Topic Tweet) is the number of tweets the user has posted on entrepreneurial issues. Together, tweet count and topic tweet represent the user's activity on Twitter [14] [15].
Interaction and responsiveness: This category describes how the user interacts with other people. Features 9, 10, and 11 relate to the reactions caused by the user's tweets. They can be used as indicators of tweet quality, since a high-quality tweet may trigger many reactions from others. Feature 9 (Retweet) is the total number of retweets of the user's tweets [15]. Feature 10 (Favorite) is the number of times the user's tweets were marked as favorites [15]. Feature 11 (Reply) is the number of times the user's tweets were replied to by others [15]. Feature 12 (User Favorites Count) captures another type of interaction: the total number of tweets the user has marked as favorites [15].
User relationship: This category describes how popular the user is on Twitter and how the user is connected to other Twitter users. Features 13 and 14 indicate how much others value the user's tweets. Feature 13 (Follower) is the number of the user's followers [16], while Feature 14 (List) is the number of lists that include the user's account [16]. Feature 15 (Friends), on the other hand, reflects how much the user seeks information from others [16].
Lexical aspects: The features of this category capture the lexical aspects of user profiles. They help to distinguish users by the way they describe themselves on Twitter: if users belonging to the same class tend to describe themselves in the same way, these features allow them to be identified. Feature 16 focuses on Parts of Speech (POS), while Feature 17 focuses on Named Entity Recognition (NER). POS tagging and NER are Natural Language Processing (NLP) techniques [17]. NLP is a field at the intersection of linguistics, computer science, and artificial intelligence that is concerned with the interaction between computers and human languages [17]. Part-of-Speech (POS) tagging is the process of labeling each word in a sentence with its part of speech. In general, there are eight main parts of speech: adjectives, interjections, prepositions, nouns, adverbs, verbs, conjunctions, and pronouns, as cited in [17]. A profile may include more than one part of speech. The output of this stage is the set of tagged profiles (T.P), as shown by equation 1.
$$T.P = \{V_{1 \ldots n}, N_{1 \ldots n}, \mathrm{Adv}_{1 \ldots n}, \ldots, \mathrm{Adj}_{1 \ldots n}\} \quad (1)$$
Named Entity Recognition (NER) aims to classify the named entities mentioned in a text into predefined categories such as "cities", "companies", "organizations", "individuals", "products", and others. NER adds knowledge and meaning to the given text, making it easier to interpret. These features can therefore be used to discover the relation between the named entities mentioned in the profiles and the users' influence. The output of NER is the set of Named Entities in Profiles (N.E.P), shown by equation 2.

Ranking Methods
Ranking is one of the central problems in information retrieval: it assigns a score to each object in a set (for example, documents), and the scores are then used to sort the objects. In feature ranking, a score is assigned to each feature in order to identify the most relevant ones for a specific study. Depending on the application, the ranking may indicate the relevance or importance of each feature for the studied case [11]. Several feature ranking methods have been proposed in the literature [11]. Based on the state of the art, SVM-RFE, correlation, information gain, chi-squared, gain ratio, symmetrical uncertainty, and random forest are chosen to rank the entrepreneurial features. The following paragraphs briefly describe these methods.
Random Forest: In data science, Random Forests (RF) are considered accessible, accurate, robust, and easy-to-use machine learning methods. RF has proven effective in assessing feature importance. It relies on decision trees, which rank features according to their contribution to improving node purity, that is, to decreasing impurity over all trees. The resulting feature importance identifies the most influential variables in the dataset [18].
In a decision tree, each node is a condition on a single feature that splits the dataset into two subsets so that similar response values end up in the same subset. While training a tree, one can therefore compute how much each feature decreases the impurity. For classification, the impurity measure is typically Gini impurity or information gain/entropy; for regression trees, it is the variance. Finally, the features are ordered by this measure [18].
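The impurity decrease contributed by a single split can be illustrated with a minimal sketch. It assumes Gini impurity and one threshold split on one feature; a random forest would average such decreases over many trees and nodes. The toy data and the names `followers` and `is_influencer` are hypothetical and not taken from the paper's dataset.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(values, labels, threshold):
    """Weighted Gini decrease obtained by splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

# Hypothetical toy data: follower counts and labels (1 = influencer).
followers = [120, 300, 15000, 22000, 80, 18000]
is_influencer = [0, 0, 1, 1, 0, 1]
print(impurity_decrease(followers, is_influencer, threshold=1000))
```

A feature whose splits produce large decreases across the trees would appear near the top of the RF ranking.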
SVM-Recursive feature elimination: Support Vector Machine Recursive Feature Elimination (SVM-RFE) is a well-known approach to feature ranking. As reported by Guyon et al. [19], this approach has shown superior classification results compared with other methods. It is generally used to evaluate the importance of each variable, and it can also find the combination of features that yields the best classification performance [20]. The method works recursively: it trains an SVM on the samples, removes the least useful features, and repeats, balancing the trade-off between accuracy and the number of features [19].
Information Gain: Information Gain (IG) is a ranking method that weights a feature by measuring the information it provides about the class. It performs feature selection based on Claude Shannon's information theory [21], using the information value of the analyzed message. The formula can be expressed as follows:
$$IG = H(Y) - H(Y \mid X) \quad (3)$$
where H(Y|X) is the uncertainty about Y given X and H(Y) is the entropy of Y. IG is a symmetrical measure: the information gained about Y after observing X equals the information gained about X after observing Y. IG is biased toward features with many branches (values), even when they are not more informative. Because of this bias, it is recommended to select a large number value for the attributes before performing the IG method.
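As an illustration, information gain can be computed for discrete variables as the entropy of Y minus the conditional entropy of Y given X. The sketch below assumes already discretized features and base-2 logarithms; the toy vectors `verified` and `influencer` are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(y, x):
    """H(Y|X): entropy of y within each group of x, weighted by group size."""
    n = len(x)
    total = 0.0
    for x_value, count in Counter(x).items():
        subset = [yv for xv, yv in zip(x, y) if xv == x_value]
        total += (count / n) * entropy(subset)
    return total

def information_gain(y, x):
    """IG = H(Y) - H(Y|X)."""
    return entropy(y) - conditional_entropy(y, x)

# Hypothetical toy data: a binary feature and the class label.
verified = [1, 1, 0, 0, 1, 0, 0, 0]
influencer = [1, 1, 0, 0, 1, 0, 0, 1]
print(information_gain(influencer, verified))
```

Swapping the two arguments gives the same value up to floating-point error, which reflects the symmetry of IG noted above.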
Gain Ratio: The gain ratio (GR) is an extension of IG with less bias, since it takes into consideration the size and number of branches when choosing a feature [22]. This is done by normalizing IG by the "intrinsic information" of a split, that is, the information generated by splitting the dataset into n portions. Gain ratio is given by equation 4:
$$GR = \frac{IG}{H(X)} \quad (4)$$
where H(X) is the intrinsic information. GR is biased toward unbalanced splits in which one partition is much smaller than the others.
Symmetrical Uncertainty: The symmetrical uncertainty (SU) criterion, given by equation 5, compensates for the inherent bias of IG by dividing it by the sum of the entropies of X and Y [22].
$$SU = 2\,\frac{IG}{H(X) + H(Y)} \quad (5)$$
The values of SU are normalized to the range [0, 1]. A value of 1 means that knowledge of one variable completely predicts the other, while a value of 0 means that X and Y are uncorrelated. Like GR, SU is biased toward features with fewer values.
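Gain ratio and symmetrical uncertainty can both be sketched as normalizations of information gain. The code below assumes discrete variables, takes the intrinsic information of a split to be the entropy H(X) of the feature, and returns 0 when a denominator is zero (a convention chosen here for illustration).

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(y, x):
    """IG = H(Y) - H(Y|X) for discrete sequences."""
    n = len(x)
    h_y_given_x = sum(
        (count / n) * entropy([yv for xv, yv in zip(x, y) if xv == x_value])
        for x_value, count in Counter(x).items()
    )
    return entropy(y) - h_y_given_x

def gain_ratio(y, x):
    """IG normalized by the intrinsic information H(X) of the split."""
    h_x = entropy(x)
    return information_gain(y, x) / h_x if h_x > 0 else 0.0

def symmetrical_uncertainty(y, x):
    """2 * IG / (H(X) + H(Y)), which lies in [0, 1]."""
    denom = entropy(x) + entropy(y)
    return 2.0 * information_gain(y, x) / denom if denom > 0 else 0.0

# Hypothetical toy data: a feature that fully determines the class,
# and one that carries no information about it.
label = [0, 0, 1, 1]
print(symmetrical_uncertainty(label, [5, 5, 9, 9]))
print(symmetrical_uncertainty(label, [0, 1, 0, 1]))
```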
Correlation: Correlation-based feature selection builds on symmetrical uncertainty (SU) [23]. SU is a symmetrical measure that can be used to measure the correlation between features. Its value ranges from 0 to 1: a value of 1 indicates that one variable (either X or Y) completely predicts the other, while a value of 0 indicates that the two variables are entirely independent. The Pearson correlation coefficient used to predict Y is defined by equation 6:
$$r(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} \quad (6)$$
where cov and var denote the covariance and the variance, respectively.
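The Pearson coefficient can be written directly from the covariance and the variances. This is a minimal sketch using population statistics; returning 0 for a constant variable is a convention assumed here, since the coefficient is undefined in that case.

```python
import math

def pearson(x, y):
    """Pearson correlation: cov(X, Y) / sqrt(var(X) * var(Y))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        return 0.0  # assumed convention for constant inputs
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: a feature that grows with the target.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```

For feature ranking, the absolute value of the coefficient between each feature and the class label would typically be used as the score.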

Chi-squared: Chi-squared is one of the standard methods used for feature selection [24]. As described in equation 7, this method evaluates a feature by calculating its chi-squared statistic. It starts from a null hypothesis H0, which assumes that there is no relation between a set of (two or more) features, and performs the test using the following formula:
$$\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \quad (7)$$
where O_ij is the observed frequency and E_ij is the expected (theoretical) frequency asserted by the null hypothesis. The higher the value of χ2, the stronger the evidence against H0.
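The chi-squared statistic for one categorical feature against the class label can be sketched from a contingency table of observed counts, with expected counts derived from the row and column totals under independence. The function name and toy data are illustrative.

```python
from collections import Counter

def chi_squared(feature, labels):
    """Chi-squared statistic between a categorical feature and class labels."""
    n = len(labels)
    observed = Counter(zip(feature, labels))  # O_ij
    row_totals = Counter(feature)
    col_totals = Counter(labels)
    stat = 0.0
    for f_value, row_total in row_totals.items():
        for c_value, col_total in col_totals.items():
            expected = row_total * col_total / n  # E_ij under H0
            stat += (observed[(f_value, c_value)] - expected) ** 2 / expected
    return stat

# Hypothetical toy data: a feature identical to the label, and an unrelated one.
label = [1, 1, 0, 0]
print(chi_squared([1, 1, 0, 0], label))
print(chi_squared([0, 1, 0, 1], label))
```

Features are then ranked by decreasing statistic, since larger values give stronger evidence against independence from the class.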

Ranking aggregation
Ranking aggregation is the process of combining many ranked lists generated by individual rankers into one ranked list, which yields a better rank and re-sorts the list accordingly [12]. In general, Rank Aggregation (RA) is an ensemble-based method for feature selection. This technique gives more accurate results across different kinds of data, as reported in [12]. RA can be performed in both supervised and unsupervised settings, but unsupervised RA methods are the ones mainly used in the literature [12]. Many aggregation techniques have been used to rank features, for example median, highest rank, sum, mean, and lowest rank aggregation [25]. Robust Rank Aggregation (RRA) is an aggregation method proposed by Dittman et al. 2013 [25] to aggregate the results of many ranking methods in an unbiased manner. RRA is considered one of the statistically stable and computationally efficient algorithms. The authors proposed RRA to prioritize gene lists in genomic data analysis applications. RRA assigns an importance score to each gene, providing a robust way to retain only the relevant genes in the final list. It looks at how a feature is positioned in the ranked lists and compares this to the baseline case in which all ranked lists are randomly shuffled. RRA then assigns a P-value to each feature, which is used to decide its significance and to re-rank the features.
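The comparison against randomly shuffled lists can be illustrated with a simplified sketch of the order-statistics idea behind RRA. It assumes every list ranks the same full set of features, normalizes ranks to (0, 1], and uses a Bonferroni-style correction (multiplying by the number of lists) as an upper bound on the P-value. The function names and toy lists are hypothetical, and published RRA implementations handle further cases such as partial lists.

```python
import math

def order_stat_pvalue(x, k, m):
    """P(k-th smallest of m uniform(0,1) draws <= x), as a binomial tail sum."""
    return sum(math.comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(k, m + 1))

def rra_scores(ranked_lists):
    """Map each item to a corrected score; smaller means more consistently top-ranked."""
    n_items = len(ranked_lists[0])
    m = len(ranked_lists)
    scores = {}
    for item in ranked_lists[0]:
        # Normalized rank of the item in each list, sorted ascending.
        norm = sorted((lst.index(item) + 1) / n_items for lst in ranked_lists)
        # Best evidence over all order statistics, relative to random ranks.
        rho = min(order_stat_pvalue(r, k, m) for k, r in enumerate(norm, start=1))
        scores[item] = min(1.0, rho * m)
    return scores

# Hypothetical output of three rankers over four features.
lists = [
    ["followers", "lists", "tweets", "verified"],
    ["followers", "tweets", "lists", "verified"],
    ["lists", "followers", "tweets", "verified"],
]
scores = rra_scores(lists)
final_ranking = sorted(scores, key=scores.get)
print(final_ranking[0], scores[final_ranking[0]])
```

With only three short lists no score falls below 0.05; the number of lists and features determines how small the scores can become.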

ECharacterize Framework
This research proposes the ECharacterize framework in order to discover the traits that make certain users more influential in the entrepreneurial ecosystem on Twitter. ECharacterize assigns an importance score to each influencer feature; the features are then evaluated by prediction validation. The feature scores are generated by aggregating the ranked lists created by seven state-of-the-art feature ranking methods. Figure 1 shows the components of the ECharacterize framework. The following subsections explain the components in detail.

Data Collection
A real dataset was collected from Twitter. The Twitter Search API was used to crawl the data from the Saudi entrepreneurial hashtag "startups_saudi_forum" from Jan 2, 2018, to Dec 31, 2018. Based on the collected tweets, the Twitter REST API was used to retrieve the users' profile data. As a result, we ended up with a total of 233,018 tweets from 656 users.

Features Extraction
All seventeen features discussed in section 2.1 were extracted. Stakeholder, official, and contact channels are new features added for the purpose of this paper; they have not been discussed before in the literature. The stakeholder feature represents the entrepreneurial stakeholder category to which a Twitter account belongs. Stakeholders are grouped into six categories based on Andonova et al. 2019 [26]: government sector, universities, startups, entrepreneurs, accelerators and incubators, and unofficial accounts such as news and initiatives. The official feature indicates whether the account is official or not. Entrepreneurial influencers must be trustworthy, because they tweet about crucial issues such as funding and government regulations. Therefore, this paper assumes that users in the entrepreneurial ecosystem are influenced by official accounts. The contact channels feature represents the availability of a contact channel in the profile, which increases the profile's reliability. We placed the two new features "official" and "contact channels" in the profile category, while the stakeholder feature is treated as a separate feature.
http://www.i-jim.org

Fig. 1. ECharacterize Framework
MADAMIRA was used to apply NER and POS tagging [27]. MADAMIRA is a state-of-the-art, accurate, and fast morphological analysis tool for Arabic text. It can find named entities in three categories, Person (PER), Organization (ORG), and Location (LOC), and we consider each category a separate feature. Regarding POS, MADAMIRA can find all eight parts of speech, which yields eight new features. The number of tagged users and the number of hashtags were calculated by counting the @ and # symbols in the tweets. To extract the profile description length, the words and spaces were kept and everything else was removed; then the number of letters was counted. This step must be done after extracting features such as hashtags and tagged users, because it also removes '#' and '@'. In the end, we obtained 35 features, which are listed in Table 1.
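The symbol-counting and description-length steps described above can be sketched as follows. The regular expression and function names are assumptions made for illustration; the paper does not specify the exact cleaning rules. In Python 3, `\w` is Unicode-aware for `str` patterns, so Arabic letters are kept by the cleaning step.

```python
import re

def count_mentions(text):
    """Number of tagged users, counted as occurrences of the '@' symbol."""
    return text.count("@")

def count_hashtags(text):
    """Number of hashtags, counted as occurrences of the '#' symbol."""
    return text.count("#")

def description_length(text):
    """Keep word characters and spaces, drop the rest, then count letters."""
    cleaned = re.sub(r"[^\w\s]", "", text)
    return sum(1 for ch in cleaned if ch.isalpha())

# Hypothetical profile description.
profile = "Founder & CEO @ExampleStartup #startups #funding"
# Count symbols first: the cleaning step below removes '@' and '#'.
mentions = count_mentions(profile)
hashtags = count_hashtags(profile)
length = description_length(profile)
print(mentions, hashtags, length)
```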

User's Annotation
To ensure reliability, three expert coders were hired to annotate the top 200 users, chosen according to the number of retweets they had gained. The first two coders independently annotated the users as entrepreneurial influencers or non-influencers. Cohen's kappa was used to measure their agreement [28]. It showed a 'good' agreement, with a kappa value of 0.633, reflecting 85% agreement between the two annotators. The third expert independently annotated the users on whom the first two coders had disagreed. Based on the three coders' judgments, the dataset contained 28 influencers.
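Cohen's kappa corrects the raw agreement rate for the agreement expected by chance, which explains how 85% raw agreement can correspond to a kappa of 0.633 when one class dominates. A minimal sketch follows; the annotation vectors are hypothetical and are not the study's data.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(coder_a)
    observed = sum(1 for a, b in zip(coder_a, coder_b) if a == b) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement from each coder's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in set(coder_a) | set(coder_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical annotations (1 = influencer, 0 = non-influencer).
coder_1 = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
coder_2 = [1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(cohens_kappa(coder_1, coder_2))
```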

Data preprocessing
Data preprocessing transforms the raw data for further processing [29]. The collected dataset has no missing data, and we did not remove outliers since they reflect some influencers' characteristics. This research used normalization and encoding for data preprocessing.
• Normalization is the process of transforming data of different ranges onto a uniform scale so that they can be compared [30]. The Z-score was used to scale the features due to its ability to handle outliers.
• Encoding is the process of converting categorical variables into numerical ones. Binary encoding was used for the verified and official features: '0' represents an account that is not verified or not official, while '1' represents a verified or official account. The stakeholder feature was encoded using the one-hot technique. One-hot encoding is a binary style of categorization in which each categorical variable has one element per label; the element of the matching label is 1 and all other elements are 0.
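The two preprocessing steps can be sketched as below. The sketch assumes the population standard deviation for the Z-score (the paper does not state which variant was used), and the category labels in `STAKEHOLDERS` are hypothetical names for the six stakeholder groups.

```python
import statistics

def z_score(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def one_hot(values, categories):
    """Encode each value as a 0/1 vector with a single 1 at its category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical labels for the six stakeholder categories.
STAKEHOLDERS = [
    "government", "university", "startup",
    "entrepreneur", "accelerator_incubator", "unofficial",
]

followers_scaled = z_score([120, 300, 15000])
stakeholder_encoded = one_hot(["startup", "government"], STAKEHOLDERS)
print(followers_scaled)
print(stakeholder_encoded)
```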

Features ranking and aggregation
In this paper, researchers consider seven commonly used feature ranking methods, based on learning algorithms, statistics, and entropy, that have shown excellent performance in various domains: random forest, SVM-RFE, information gain, gain ratio, symmetrical uncertainty, correlation, and chi-squared [11]. The Robust Rank Aggregation (RRA) algorithm was used to aggregate the seven lists produced by the ranking methods. RRA returns the final aggregated list with the associated P-value of each feature. The P-value is used to decide the significance of each feature and thus to re-rank the features. Figure 2 shows the aggregated results and their P-values. The P-value becomes significant (smaller than 0.05) as the features become more important. Table 2 shows the results of all the ranking and aggregation methods. The numbers indicate the position of the feature in each list, and the final column shows the associated RRA score (P-value).

Evaluation
To evaluate the final aggregated list, researchers used the concept of incremental feature selection (IFS) [31]. In IFS, supervised machine learning algorithms are used to evaluate the features, which are sorted according to their importance. It works as follows: the algorithm is trained on only the best feature, then on the top 2, then on the top 3, and so on until all features have been included. In each iteration, the algorithm returns a performance score. In this paper, we used precision as the evaluation metric [32]. As shown in equation 8, precision is the number of true positives (the number of correctly predicted influencers) divided by the total number of elements classified as the positive class (the sum of correctly and incorrectly predicted influencers) [32]. We used it due to its ability to deal with imbalanced class distributions: in this research, there are 28 influencers out of 200 users.
$$\mathrm{Precision} = \frac{\mathrm{True\_Positive}}{\mathrm{True\_Positive} + \mathrm{False\_Positive}} \quad (8)$$
Three different state-of-the-art algorithms were trained in a train-test fashion: Support Vector Machine (SVM), Naïve Bayes (NB), and Random Forest (RF). The algorithms were fed the aggregated list incrementally. Figure 3 shows the precision results of all iterations. Each number represents the number of features in the iteration. For example, '1' means the best feature (most significant P-value), while '2' means the two best features. The significant features are the first nine features.
As shown in the figure, the performance of NB starts at 0.86896 in the first iteration and increases incrementally until it reaches its highest value of 0.948365 in the ninth iteration, after which it remains stable. SVM provides better performance, reaching 0.95254 in the first and second iterations, but its performance declines in the third iteration to 0.8711111 and then remains stable until the final iteration. Compared with SVM and NB, RF starts with the lowest performance, 0.8292397; its performance then increases incrementally until it reaches its highest value of 0.910169 in the ninth iteration, after which it declines and becomes stable. Table 3 shows the precision of the three algorithms for the nine significant features.
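The incremental feature selection loop and the precision metric can be sketched together. To keep the example self-contained, a simple nearest-centroid classifier stands in for the SVM, NB, and RF models used in the paper; it is a hypothetical substitute, as are the toy data and function names.

```python
def precision(y_true, y_pred, positive=1):
    """True positives divided by all elements predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0

def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier: assign each row to the class with the closest mean."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(centroids, key=lambda c: squared_distance(row, centroids[c]))
            for row in test_x]

def incremental_feature_selection(train_x, train_y, test_x, test_y, ranked_idx):
    """Precision after training on the top-1, top-2, ... ranked features."""
    results = []
    for k in range(1, len(ranked_idx) + 1):
        cols = ranked_idx[:k]
        tr = [[row[i] for i in cols] for row in train_x]
        te = [[row[i] for i in cols] for row in test_x]
        results.append(precision(test_y, nearest_centroid_predict(tr, train_y, te)))
    return results

# Hypothetical scaled features: column 0 is informative, column 1 is noise.
train_x = [[2.0, 0.1], [1.8, -0.3], [-0.5, 0.2], [-0.7, -0.1]]
train_y = [1, 1, 0, 0]
test_x = [[1.9, 0.0], [-0.6, 0.3]]
test_y = [1, 0]
curve = incremental_feature_selection(train_x, train_y, test_x, test_y, [0, 1])
print(curve)
```

Plotting such a curve against the number of features gives the kind of per-iteration view that Figure 3 reports for the three algorithms.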

Discussion
Only the first nine features, those with a significant P-value, are considered the essential features of entrepreneurial influencers, since 0.05 is used as the cutoff for significance. The 'number of followers' is the most important characteristic of entrepreneurial influencers, followed by the number of lists. These two features reflect the importance of an entrepreneurial influencer's popularity. A user's popularity may be increased by activity level, and accordingly we found the influencer's activity, 'All Tweet', to be the third most essential feature. This result is in agreement with Asadi et al. 2018 [16], who found that most influencers' conversations ranged across different topics such as personal experiences, travel, or politics.
The ranking of 'Favorite' as the fourth feature of entrepreneurial influencers reflects that much of an influencer's audience is made up of followers who act as observers rather than participants in the conversation. 'User Favorite Account' and 'Reply' are ranked as the fifth and sixth most essential features, reflecting the importance of the influencers' interaction level. This also agrees with Asadi et al. 2018 [16], who reported that the majority of influencers spent their time interacting with their audience. The tweet quality feature, 'Retweet', is the seventh feature that distinguishes entrepreneurial influencers. This is a logical result, since entrepreneurial users, especially beginner entrepreneurs and founders of Small and Medium Enterprises (SMEs), usually look for information to guide them. This result corresponds with that of Kuffo et al. 2018 [33], who found that entrepreneurs rely more on local sources for information. The activity level of influencers again proves its importance through the 'Tweet' feature, the number of an influencer's tweets related to entrepreneurial issues, which is ranked as the eighth most important feature distinguishing entrepreneurial influencers. This finding agrees with Kuffo et al. 2018 [33], who found that entrepreneurship-focused sources are more popular among entrepreneurs. Finally, the profile feature 'verified' is ranked ninth, reflecting how reliable an influencer's account must be.

Conclusions and future work
In this paper, researchers focused on the problem of detecting valuable features of entrepreneurial influencers on Twitter, in particular Saudi influencers. In the first stage, a wide range of features was collected for investigation. These features come from several research domains, such as social media analysis, natural language processing, and information retrieval. The paper then proposed a robust framework called ECharacterize to rank the most relevant features that distinguish Saudi entrepreneurial influencers. Three state-of-the-art supervised machine learning algorithms were used to evaluate the final results to ensure correctness and efficiency. Based on the experiments, we can highlight the following main results. First, entrepreneurial influence is based on the number of followers and the number of followers who have added those influencers to a list. Second, the level of activity distinguishes these accounts, in terms of both entrepreneurial tweets and general tweets. Third, influencers who sustain conversations with their audience keep a strong influence, and passive members who participate by liking tweets are also taken into account. Finally, influence is also related to the reliability of the account.