Text Pre-Processing for The Frequently Mentioned Criteria from Online Community Homebuyer Dataset

The Malaysian residential market is competitive, and the large number of residential projects offering almost identical features makes purchasing decisions difficult for homebuyers. Homebuyers today are selective and careful, and they need more time to decide because of the high number of abandoned and problematic residential projects in Malaysia. As a result, a high number of unsold residential projects has been reported. Understanding homebuyers' criteria in a residential purchase is therefore crucial to the long-term success of Malaysian residential projects. This paper identifies and prioritizes homebuyers' criteria in mainland Penang, Malaysia, from user-generated data in online property forums. A total of 6000 comments were extracted using RapidMiner software. Once the data were processed, statistical analysis was used to determine and prioritize the homebuyers' criteria. The criteria were classified by real estate experts. The results provide fresh insight into homebuyers' criteria and should offer developers, government, potential homebuyers, and real estate agents a better understanding of homebuyers' criteria in Penang, Malaysia.
Keywords: Residential, Criteria, Purchase, User-Generated Data, Text Analysis


Introduction
The residential purchasing decision is widely acknowledged as complex: it is a high-stakes, infrequent, and mostly irreversible decision for homebuyers [1]-[4]. It is also regarded as one of the most crucial decisions homebuyers make, because it has a long-term influence on their lives for several reasons. For example, according to [5], the decision involves one of the largest investments homebuyers make, approximately three or four times annual income, which leads to a long-term financial commitment. Moreover, most developers require 60%-70% of the total transaction as an advance payment before the residential project begins. If problems occur later, buyers might lose the advance payment or have to wait indefinitely for the project to be completed [6]. The high number of abandoned and unfinished residential projects has also increased the complexity of the decision. In Malaysia, for example, 188 abandoned residential projects were reported between 1999 and 2019. Unlike many other types of purchase, a housing purchase decision is riskier and sometimes even 'traumatic' for homebuyers [7]-[9]. Homebuyers' limited knowledge and experience, together with a lack of available market information, lead to uncertain outcomes. Numerous studies have therefore sought to assist homebuyers in the residential purchase decision-making process. These studies mostly focused on the development of decision-making tools [5], [10]-[13] and on the identification and prioritization of residential purchase criteria [6], [14]-[19]. This paper centers on the criteria that influence the residential purchase decision-making process. Many criteria come into play when purchasing a residence such as a house, apartment, condominium, or flat.
Furthermore, past research has shown that a housing purchase involves numerous criteria and that there is no fixed set of criteria for buyers [1], [7], [11], [20], [21]. From the homebuyer's viewpoint, price is no longer the main consideration in residential purchasing; multiple other criteria need to be taken into account when comparing and assessing alternatives [5].
With the rapid development of the Malaysian residential industry and the high number of residential projects on the market, homebuyers have become very selective in their residential decision making. As of February 2009, unsold residential property in Malaysia was valued at RM 29.47 billion. This was attributed to indiscriminate building by developers and a lack of market and financial feasibility studies. Identifying the right set of residential purchase criteria is therefore necessary to reflect market demand from homebuyers [5], [10]-[13]. The outcome of this study is beneficial to homebuyers, real-estate practitioners, developers, government policymakers, and academic institutions.

Residential Related Criteria
Identifying the right criteria has been acknowledged as the most crucial stage, and one that strongly influences housing purchase decision making [14], [22]-[26]. As a result, research on residential purchasing criteria has developed over the last few decades [5], [10]-[13]. The literature shows that various residential criteria are relevant to homebuyers, and past studies have investigated the topic from different theoretical angles and perspectives. For instance, several studies have focused on a single criterion such as location, developer reputation, or price [27], [28]. Others have examined residential criteria in terms of homebuyers' demographics and socio-economic status. Instead of focusing on the homebuyers' view, some studies have investigated residential purchasing criteria from the developer's view [22]. Several studies have also classified residential criteria into main categories. For example, [29] conducted a review of American and British valuation and econometric literature and a pilot survey among buyers. The authors identified 25 criteria, both qualitative and quantitative, that are frequently used by users and valuers in housing purchase decision making, and clustered them into four categories: property variables, distance, environmental characteristics, and financial characteristics. In contrast, [3] organized the criteria into three categories: price, actual house, and resale value. Other authors divided the criteria into two main categories, internal (intrinsic) and external (extrinsic) [30]. Internal criteria refer to size, interior layout, design, space, features, and number of rooms, whereas external criteria include exterior design and appearance, building quality, and materials.
Another focus in this area is criteria prioritization. Differences in homebuyers' criteria preferences have been reported in the literature. For example, some findings revealed that location is far more important than other criteria such as financial, neighborhood, interior, the developer itself, exterior design, and family life cycle. The evidence suggests that, in theory, the majority of homebuyers may aspire to live in the best location. In an actual housing purchase decision, however, other criteria such as neighborhood, price, affordability, and quality take precedence. For example, past research found that first-time homebuyers regarded location-related criteria as least important [25]. Meanwhile, a study carried out in 2017 found that price was chosen as the most important criterion for housing purchase, followed by safety and security, public facilities, location, quality, design, and developer reputation [31]. Other authors, such as [17], also reported a different order of criteria importance in house purchase decision making: people in Jakarta, Indonesia, believed that physical quality is the most important criterion in a house purchase compared with price and location. Besides location, quality, and price, criteria such as developer reputation are increasingly mentioned by homebuyers. The criteria preferences that drive residential purchasing are constantly changing. Understanding consumer preferences will therefore be key to solving residential problems such as unsold projects.
Most of these criteria were investigated through surveys, questionnaires, and literature reviews. Beyond these methodologies, advances in Information Technology (IT) have made more dynamic and information-rich residential data available. Web 2.0 platforms built on context-rich and easy-to-use concepts, such as online property forums [59], blogs, and social network sites, allow people to share and exchange experiences on a one-to-world platform rather than a one-to-one platform [32]. A survey carried out by the National Association of Realtors shows that 90% of homebuyers now search the internet [58] as an information source in the residential purchasing process [33]. Homebuyers can post questions and share experience and knowledge about specific residential projects without limitations of time and space. In Malaysia, established developers such as IOI Property, YTL Land Development, and Paramount Corporation have already implemented interactive services such as online forums as part of their business strategies [34]. User-generated data has proved effective as a source of customer satisfaction data, especially in product development [35]-[38]. Although there is an existing body of literature on homebuyers' criteria based on surveys and questionnaires, the findings of this study could provide interesting insight from user-generated data.

Research Approach and Findings
The main aim of this study is to identify and prioritize the most mentioned criteria that influence the decision to buy residential property in Pulau Pinang, particularly on the mainland. Pulau Pinang is one of the main cities in Malaysia and is undergoing rapid growth in residential and urban development. Surveys, questionnaires, and interviews are the most common tools for measuring homebuyer/investor sentiment in real estate, but they are costly and labor-intensive. Instead of a questionnaire or survey approach, this study extracts online residential-related data from user comments in a Penang online property forum. A total of 6000 user comments were extracted from numerous residential projects in the online forum using RapidMiner software. The framework of the proposed approach consists of two parts: data extraction and processing, and measurement of criteria.

Data extraction and processing
First, user comments on posts about several residential projects, such as houses, apartments, flats, and condominiums, in online property forums were crawled and collected using RapidMiner. Generally, user comments mention residential criteria in the form of questions or evaluations about the residential project. A total of 6500 comments were extracted from online property forums and Google reviews based on project name, developer name, and location keywords. After extraction of the raw data, only 4285 comments were found to be related to this study. Some of these comments were in other languages such as Bahasa Malaysia and Chinese; these were translated into English using Google Translate. Next, the data were processed in RapidMiner through a sub-operator called the process operator. In this process, raw data from the forum are cleaned and formatted through several operators: tokenize, transform cases, filter token, filter stopwords, stem (Porter), replace token, extract length, extract token number, and aggregate token length. The main operators function as follows:
• Tokenize: splits the text of a document into a sequence of tokens.
• Transform cases: transforms all characters in a document to either lower case or upper case.
• Filter token: filters tokens by their length.
• Filter stopwords: removes English stopwords such as the, is, are, and, or, etc. from a document.
• Stem: stems English words using the Porter stemming algorithm to identify root words. For example, 'explanation' is reduced to a stem corresponding to the root word 'explain'.
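The operators described above can be sketched in plain Python to make the order of the steps concrete. This is only an illustration, not the RapidMiner process used in the study: the stopword list, the length bounds, and the `naive_stem` helper are assumptions, and `naive_stem` is a simple suffix stripper standing in for the Porter algorithm.

```python
import re

# Small illustrative stopword list (an assumption; RapidMiner's English list is longer).
STOPWORDS = {"the", "is", "are", "and", "or", "a", "an", "of", "to", "in", "it", "this"}

# Suffixes removed by the stand-in stemmer, checked in this order.
SUFFIXES = ("ing", "es", "s")


def naive_stem(token: str) -> str:
    """Strip one common suffix. A stand-in for Porter stemming, NOT the Porter algorithm."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(comment: str, min_len: int = 3, max_len: int = 25) -> list[str]:
    """Apply tokenize, transform cases, filter by length, filter stopwords, and stem."""
    tokens = re.findall(r"[A-Za-z]+", comment)                      # tokenize on non-letters
    tokens = [t.lower() for t in tokens]                            # transform cases
    tokens = [t for t in tokens if min_len <= len(t) <= max_len]    # filter tokens by length
    tokens = [t for t in tokens if t not in STOPWORDS]              # filter stopwords
    return [naive_stem(t) for t in tokens]                          # stem


if __name__ == "__main__":
    print(preprocess("The schools and parks are near this area"))
```

Stemming runs last so that the stopword filter sees the original word forms. A real Porter implementation would produce different stems for many words.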

Measurement of residential criteria
Text mining measures such as term frequency were adopted in this study to identify the most mentioned residential criteria [57]. Term Frequency (TF) measures how frequently a term occurs in a document [39], [40]. Figure 4 below illustrates the overall criteria gathered in this study.
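As a minimal sketch of the frequency measurement, the following Python counts how often each term occurs across a set of pre-processed comments. It reports raw counts and, as one common normalization, each count divided by the total number of tokens. The function names and the toy input are hypothetical, and the exact TF formula used in the study is not reproduced here.

```python
from collections import Counter
from typing import Iterable


def count_terms(documents: Iterable[list[str]]) -> Counter:
    """Sum raw term occurrences over all tokenized comments."""
    counts: Counter = Counter()
    for tokens in documents:
        counts.update(tokens)
    return counts


def relative_frequency(counts: Counter) -> dict[str, float]:
    """Divide each raw count by the total number of tokens (one common TF normalization)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


if __name__ == "__main__":
    # Toy tokenized comments, invented for illustration only.
    docs = [["park", "school", "park"], ["school", "price"]]
    counts = count_terms(docs)
    print(counts.most_common())
    print(relative_frequency(counts))
```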

Fig. 5. Term frequency for each criterion
Based on Table 1, our findings indicate that there are 33 criteria related to a residential purchase. The analysis assumes that a higher frequency score indicates a higher level of importance to users. The criteria in this study are categorized as 'location', 'facilities', 'house itself', 'financial', 'developer', 'safety', and 'investment', based on previous studies. The classification was made by real estate practitioners located in Penang through interviews. Table 1 lists these 7 main criteria. Of the 33 criteria, 9 fall under location, 6 under facilities, 8 under house itself, 4 under financial, 1 under developer, 3 under safety, and 2 under investment. This study found that residential location is the most mentioned criterion among online users, with a total of 2071 occurrences. Location was also found to be the most important criterion in past studies [11], [21], [24]. The experts categorized criteria such as area, neighborhood, road, foreign, workplace, view, surround, traffic, and environment under location. Facilities ranked second as the most mentioned criterion, with a total of 852 occurrences. Criteria such as park, shop, mall, pool, school, and garden were classified under facilities. With only slightly fewer occurrences than facilities, house itself (house attributes) ranks as the third most important residential criterion in Penang, with 825 occurrences across all comments. In this study, this criterion comprises sub-criteria such as storage, design, quality, car_park, room, balcony, concept, and kitchen. The financial criterion is the fourth most important criterion in this study, with 472 occurrences.
This criterion comprises sub-criteria such as loan, price, maintenance, and affordability. The three least important criteria in this study were developer, safety (guard, gate, and security), and investment (value and rent).
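The roll-up from sub-criteria to main criteria described above can be sketched as a lookup followed by a sum and a sort. The mapping below follows the expert categories listed in the text. The `rank_categories` helper and the term counts in the usage example are hypothetical and are not the study's data. In practice the keys would have to match the stemmed form of each term.

```python
# Sub-criteria grouped under the seven main criteria, as listed in the text.
CRITERIA = {
    "location": ["area", "neighborhood", "road", "foreign", "workplace",
                 "view", "surround", "traffic", "environment"],
    "facilities": ["park", "shop", "mall", "pool", "school", "garden"],
    "house itself": ["storage", "design", "quality", "car_park", "room",
                     "balcony", "concept", "kitchen"],
    "financial": ["loan", "price", "maintenance", "affordability"],
    "developer": ["developer"],
    "safety": ["guard", "gate", "security"],
    "investment": ["value", "rent"],
}


def rank_categories(term_counts: dict[str, int]) -> list[tuple[str, int]]:
    """Sum sub-criterion counts per main criterion and sort from most to least mentioned."""
    totals = {
        category: sum(term_counts.get(term, 0) for term in terms)
        for category, terms in CRITERIA.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical counts for illustration only; these are not the study's results.
    toy_counts = {"area": 5, "traffic": 1, "park": 3, "price": 2, "guard": 1}
    for category, total in rank_categories(toy_counts):
        print(f"{category}: {total}")
```

Ranking by summed occurrences encodes the study's stated assumption that a higher frequency implies higher importance. A category with more sub-criteria has more opportunities to accumulate mentions, which is worth keeping in mind when comparing totals.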

Discussion and Conclusion
The purpose of this study was to investigate and prioritize homebuyers' criteria for residential purchasing in Penang, Malaysia, based on user-generated content. As evidenced by the literature, delivering residences that match homebuyers' requirements is crucial for providing quality and successful residential projects. Given the high number of unsold and abandoned residential projects in Malaysia at present, gaining a deeper understanding of homebuyers' criteria for residential purchasing is important in helping to devise a transparent residential solution that is successful in the long term. A qualitative study was conducted with 4285 user-generated comments from the Penang online property forum. The comments are drawn from posts on numerous residential projects (house, condominium, flat, etc.) in the forum. RapidMiner was used to extract the data, and text analysis was used to measure the criteria most mentioned by homebuyers in the data.
Location was identified as the most important criterion for residential purchase in mainland Penang, Malaysia. This finding is in line with past studies in Malaysia, which indicate that the location criterion is particularly important to homebuyers [8], [11], [41], [42]. Moreover, the location criterion is also positively related to neighborhood, safety, environment, crime, and closeness to the workplace, family, and friends [12], [28], [43]. Other authors highlighted that homebuyers are willing to pay extra for a good neighborhood and a quality environment [44]. As one of the cities in Malaysia undergoing economic and urban development, Penang has a high number of foreign workers. The presence of foreign workers/immigrants in the neighborhood is considered a crucial issue [45], [46]. Many companies rent residences, especially apartments and flats, as hostels for foreign workers [47]. Beyond residents' perceptions, it has also been reported that foreign workers/immigrants are linked to crimes such as sexual harassment, robbery, quarrels, and vandalism [48].
Interestingly, facilities ranked second and play an important role in residential decisions in Penang. The online community in Penang appeared more concerned with facilities criteria related to parks, shops, malls, pools, schools, and gardens. This study appears to support the view that facilities criteria are more important in urban areas [49]. For example, the existence of gardens and green space is now evolving into a top-priority residential criterion, and homebuyers are willing to invest more for it [50].
House itself criteria such as storage, design, quality, car park, room, balcony, concept, and kitchen ranked third overall among the online community. In many other studies, design and quality criteria such as external and internal finishing are key criteria for homebuyers [50]. Moreover, past studies suggested that building material is the most crucial criterion in homebuyer decision making and has a strong relationship with house purchase intention. The next sub-criterion is room. This criterion might refer to the number or size of rooms such as bedrooms and bathrooms. The number and size of these rooms have been identified as the most critical criteria in the Malaysian home buying market [14], [15], [20], [31]. The car park is also an important sub-criterion within the house itself criteria. This finding is consistent with the high car dependency in Penang, Malaysia. Penang homebuyers may therefore be particularly concerned with problems related to a lack of parking space.
The financial criterion was the fourth most mentioned and consists of loan, price, maintenance fees, and affordability. This finding is supported by past studies reporting that price is no longer the first preference among homebuyers [1], [23], [51]. However, for affordable residential types aimed at first-time homebuyers, price, which also includes sub-criteria such as loan and maintenance fees, is still the dominant criterion compared with other criteria [20]. The dominance of price among first-time homebuyers has been largely propelled by the stricter rules for mortgage loan approval from banks since 2014 [16].
Meanwhile, the developer criterion ranked fifth in this study. This is in line with research in other states in Malaysia such as Melaka [15] and Johor Bahru [52]. Moreover, studies in other countries such as Slovenia [24], Indonesia [22], China [19], Turkey [26], and Finland [53] have highlighted that developer criteria such as background, brand, image, and previous projects were prioritized by homebuyers. Based on these studies, homebuyers believe that a reputable and experienced developer is a crucial foundation for quality and on-time delivery of residential projects. With only a slightly different number of occurrences from the developer criterion, safety, which relates to guard, gate, and security, ranked sixth. This is supported by past studies in which some homebuyers prefer a residential area with secure gates and guarded areas to ensure their safety and privacy [54], [55]. The desire for security is especially strong among homebuyers living in metropolises with a high number of crimes. [54] found that homebuyers are willing to pay more for a gated and guarded residence to ensure their safety, well-being, and a peaceful life. Recognizing the need for safety among homebuyers, residential developers must provide surveillance systems and patrol services to remain competitive [56].
Compared with other criteria in this study, the value criterion was the least mentioned. Value in a residential purchase refers to reinvestment value, which involves rental payment and capital gain from the increasing value of the property [16]. According to [22], value is always closely related to other criteria like price, developer reputation, and location. Questions such as what will happen to the area in the future and whether the government has any project to develop the area can determine the future of the residential project. The low ranking could be explained by the perception and belief of Penang homebuyers, who are very much concerned about location, that a good location automatically leads to value in the residential project.
The outcome of this study is expected to provide a deeper understanding of the residential criteria that influence homebuyers' decisions by prioritizing the criteria, especially in mainland Penang, Malaysia. Furthermore, this study has extended the existing methodology of residential criteria studies. Since most studies on the identification and prioritization of residential criteria were based on surveys, questionnaires, and interviews, this study has introduced more dynamic and rich property-related data, in line with the development of Information Technology (IT). Such user-generated online property data could better reflect homebuyers' residential criteria. As mentioned in the introduction, the findings are beneficial to homebuyers, real-estate practitioners, developers, government policymakers, and academic institutions.

Limitations of Study
This study focused only on online comments, whose authors might be potential homebuyers, first-time homebuyers, or industry practitioners. The demographics of homebuyers were not measured. This study also covered a variety of residential project types such as flats, condominiums, and houses. Moreover, it only involved residential projects on the mainland of Penang, Malaysia. The findings could be richer and more transparent if more states in Malaysia were covered.