Towards Architecture-Based Ensemble Methods for Online Social Network Sensitive Data Privacy Protection

Abstract: In 2014, the world woke up to a giant data breach that leveraged users' personal information taken from one of the world's biggest social network platforms. Based on the literature, this was possible because of the centralised architecture-based approach to protecting the privacy of users' online data. Although the literature is inundated with decentralised approaches, to the best of our knowledge none uses an ensemble of methods and draws on a consensus mechanism to address the challenges caused by the centralised approach. This paper presents a decentralised approach that adopts and adapts an ensemble of methods, including cryptographic, hashing, and Plenum Byzantine Fault Tolerance algorithms, which together provide a consensus platform, protocol, and mechanism for using blockchain technology in a novel manner as a significant contribution. The paper adopts a descriptive presentation, as the usable implementation of the proposal is near completion, with issues of computational overhead addressed. Preliminary results show promise of supporting agreement of up to 75% among participants in the chain when making changes.


Introduction
The emergence of Web 2.0 and the development of the Internet introduced a new paradigm in the exchange of user-generated content [1,3]. Web 2.0 remains a critical network infrastructure and knowledge platform for entities (man, machine, group, and even brain-like computers) to exchange and share information, knowledge, wisdom and data [3]. One of the most remarkable phenomena that blossomed in the Web 2.0 era is Online Social Networks (OSNs), which include Facebook and MySpace. This paper also addresses the challenges of the free-service provisioning capability of the OSN model, which supports targeted and retargeted marketing and thereby makes it easy for malicious users to target users' sensitive data [11][12][13]. Without addressing these challenges, OSN users are left with only a false hope of control over the privacy of their data [5]. Data breaches such as the one at the beginning of 2014, in which millions of Facebook profiles (users' personal information) were harvested and stolen for political advertisement, can thus be checked [14].

The contribution in this paper is significant since the proposed method uses low-cost, user-action-oriented attributes. These attributes are personal information of users delineated in three categories: personally identifiable user information, potentially personally identifiable user information, and users' posts. They are easy to come by since they appear innocuous on the wall of social media platforms, yet they are the basic items needed to compromise and leak user data and commit identity theft [15]. The rest of the paper is structured as follows: Sections 2.0, 3.0, 4.0, 5.0 and 6.0 are dedicated to the review of literature, the methodology, results and discussion, the system implementable formalism, and the paper's conclusion respectively.

Literature Review

State-of-the-art in online social network sensitive data protection
In the past few years, there has been significant growth and improvement in the services offered by OSNs [15][16][17]. Some studies [18][19] have shown that users' private lives are often in jeopardy whenever they post sensitive information on any OSN space. It is even more worrisome that a great number of OSN users are unaware of the importance of protecting their privacy [20]. Reports have shown that information shared on OSNs can reveal content that is meant to be private and is sensitive enough not to be published, since malicious users can use it to invade individuals' privacy [21][22][23]. Both security and privacy concerns have been highlighted in the literature as a major challenge with OSNs, since they are built on a centralized architectural philosophy [24][25]. Efforts are rife in the literature to logically decentralize the functionalities of OSNs and mitigate privacy issues. Based on the literature, decentralized architectures can be implemented using multiple independent and trusted servers [2,13,30]. Some of these efforts used federated architectures, as proposed in [26]. This approach used an architectural framework that protects users' privacy by shielding users' personal posts or messages from service providers and other third-party applications not authorized by users to view the content. In a similar effort by [27], a federated and decentralized social network used users' profiles to help individual users decide for themselves where their information should be stored. Every item of user-generated content was encrypted with a random key, which in turn was distributed to every authorized user. The approaches applied by [27] and [26], using the attributes of users' personal posts and users' profiles respectively, were leveraged in [28] and [29] on an open-source basis to offer microblogging functionalities. In these efforts, users' identity played a major role as the attribute employed in the formulation of their federated architectural frameworks. However, the Federated Architectural Approaches (FAA) are vulnerable to information leakage: the FAA is porous, intruders can easily carry out data breaches, and the risk of malicious attacks and abuse by central service providers is high.
In the literature, an alternative approach using the popular P2P architecture has been applied to decentralize OSNs, with trust as the main challenge addressed. In a recent work, [31] proposed a decentralized approach that is built upon an overlay and relies on trusted nodes to ensure the security of the network. However, this approach is weak in detecting unauthorized users who could use fake profiles and spam messages to initiate security breaches. In another related work, [32] developed a trust-aware model to securely share knowledge using a Distributed Hash Table (DHT) and a predecessor replication technique that relies on social trust. The model allows trusted friends to be admitted while providing continuous security against unsuspected malicious nodes. The DHT with a static replication technique has also been used to secure the storage of static bulk data (videos, photo albums) apart from users' basic profile information or social glue [30]. An ensemble of encryption, decentralization and direct data exchange has been applied to solve privacy and connectivity problems [33]. Open-DHT is a variant of the DHT method whose implementation is suitable for look-up service provision. Based on the literature, the DHT has been a useful technique to mitigate attacks from malicious nodes [34]. Sometimes the technique is adapted to support users by anonymizing communications as well as replicating content and profile information to trusted nodes. With the DHT, low latency and high data availability depend on the number of trusted friends within a social connection, which is a critical issue; the drawback is that it is difficult for users who maintain few social connections to maximize its potential [34]. Aside from these successes, some other research efforts concerning the protection of sensitive data have attempted to use both the federated and P2P methods. [35] developed an approach that uses the Ciphertext-Policy Attribute-Based Encryption (CP-ABE) toolkit and Google Drive to hold encrypted messages. A cloud-backed P2P system with decentralization and encryption capabilities for personalized online social networking was proposed by [36]. In a related work, [25] developed an infrastructure surrogate with a symmetrically random content key that is in turn encrypted with a proper ABE key. However, based on the findings in [5], third-party platforms like the cloud do not guarantee satisfactory privacy of user data. The same goes for federated and hybrid P2P architectures that also rely on third-party policies of cloud providers with fake privacy assurances. Although the P2P method allows users' data to be stored on the DHT and a home gateway, trust remains a concern, along with high latency challenges among peers [5].

Review of blockchain techniques
Blockchain Technology (Bloc-Tech) has been used as an immutable distributed ledger to resolve the trust issue in the P2P method [37]. Successful research efforts in the literature highlight satisfactory promise concerning the use of Bloc-Tech to ensure privacy, trust and availability of sensitive data among untrusted peers in a network. The work of [38] demonstrated this by proposing a decentralized technique based on Bloc-Tech that uses the Ethereum blockchain and the Proof of Work (PoW) consensus algorithm. This approach was used, with a 51% success rate in warding off attacks, to manage a photo group on a decentralized social media Web-based photo-sharing application. The potential of Bloc-Tech was also exhibited in the research work of [39], who proposed a decentralized approach that uses a Delegated Proof of Stake (DPoS) consensus protocol. Similarly, [40] leveraged the blockchain technique to develop a social networking service provisioning system with irrevocable peer review records and a traceable reputation structure to distribute content. It was found from these research works (e.g. [37][38][39][40]) that the bitcoin-based Bloc-Tech used, although it employs a proof-of-reputation consensus algorithm, is permission-less with public inclinations. Therefore, it is prone to weak consistency and low transaction throughput, and is vulnerable to malicious attacks that include double-spending attacks, eclipse attacks, and selfish mining. The alternative PoW consensus algorithm employed by the Bloc-Tech implementations already stated encourages high computational power wastage and is subject to selfish mining by intelligent miners. The DPoS consensus protocol in [39] was supposed to ensure decentralization, but in reality this was traded for scalability, which can only support a few more users. The need for a consensus technique that is robust and scalable like Bloc-Tech is thus overarching. In [45], a secured network solution that enforces data control and overcomes privacy concerns and security compromises through blockchain is proposed. The solution is decentralized, based on the descriptions presented; however, the technique for enforcing the solution was not provided. In contrast, we employ the Indy genre of Hyperledger to enforce self-sovereign identity [50]. Unlike the model in [45], which may provide a dais for authorities to assist users with privacy, this work applies the Indy technicalities to forestall this. The plausibility of the method in [45] remains doubtful since no clue was presented regarding its validation.

The sensitive data protection model architecture
The Sensitive Data Protection Model (SDPM) is presented as an ensemble architectural-based method. This archetypical model is a Decentralised Application (DApps) that enables online users to communicate with the blockchain to manage the state of network actors. At the backend of the DApps, the model's business logic is represented by one (or several) smart contracts that interact with the blockchain technology. The frontend is made up of decentralized storage networks hosted on an InterPlanetary File System (IPFS). To manage cryptographic keys, a wallet is used to house the distributed identifier and the blockchain addresses (see Figure 1). The DApps interacts back and forth with the CP-ABFHE module, where a hybrid ciphertext-policy and fully homomorphic encryption algorithm encrypts the data of users stored in the Local Database (L-Dbase). Friend Recommendation (FR) is an essential part of OSN platforms and is implemented in the FRM. This module uses an attribute-based community detection algorithm built on community discovery and attribute dependency to satisfy the FR requirements of OSNs and to allow collaboration between trusted friends. The FRM therefore interacts with the CP-ABFHE module and the L-Dbase, and through the L-Dbase with the BlockChain Module (BCM). The BCM houses the Hyperledger Indy BlockChain (HIBC). The HIBC interacts with the ChainCode (C-C), a "smart contract" that creates transactions while running on the peers and updates the World state of the Assets (WsotA). The WsotA resides in the Global Database (G-Dbase). A Secure Hashing Algorithm (SHA-256) is applied to strengthen the security of the already encrypted user data from the CP-ABFHE module, passed through the L-Dbase and stored in the G-Dbase in the HIBC module. The SHA-256 also acts as a compressing technique that reduces the size of the encrypted data stored in the G-Dbase. The choice of SHA-256 is premised on it being computationally infeasible for potential malicious nodes (or users) on the network to invert. The role of the Plenum Byzantine Fault Tolerance (PBFT) algorithm is to provide the consensus mechanism, voting based on a consensus protocol in the HIBC module to add validated transactions to the blockchain. The PBFT algorithm enables validator nodes to take part in the process of voting to bring in the next block until there is a consensus; more than two-thirds of the validator nodes must agree before a new block is added to the chain. The choice of the Hyperledger blockchain technique stems from its permission-orientedness as a distributed ledger and from the tools, libraries, and reusable components it provides, which are purpose-built to allow the decentralization of identity.
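To make the interplay of hashing and consensus described above concrete, the following minimal Python sketch illustrates the idea; it is not the Hyperledger Indy/chaincode implementation. An already-encrypted record is digested with SHA-256, and a new block is appended only when more than two-thirds of a hypothetical validator pool votes to accept it. All names (digest_encrypted_record, append_block, the vote list) are illustrative assumptions, not components of the SDPM.

```python
# Illustrative sketch only: a simplified stand-in for the HIBC flow described
# above. The real system uses Hyperledger Indy chaincode and Plenum BFT.
import hashlib
import json
import time

def digest_encrypted_record(encrypted_record: bytes) -> str:
    """Fixed-length SHA-256 digest of an already-encrypted user record."""
    return hashlib.sha256(encrypted_record).hexdigest()

def two_thirds_agreement(votes: list[bool]) -> bool:
    """Plenum/PBFT-style rule: more than two-thirds of validators must agree."""
    return sum(votes) * 3 > 2 * len(votes)

def append_block(chain: list[dict], encrypted_record: bytes, votes: list[bool]) -> bool:
    """Append a block to the ledger only if the validator quorum is reached."""
    if not two_thirds_agreement(votes):
        return False
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    block = {
        "index": len(chain),
        "timestamp": time.time(),
        "data_digest": digest_encrypted_record(encrypted_record),
        "prev_hash": prev_hash,
    }
    block["hash"] = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
    chain.append(block)
    return True

# Example: 4 validators, 3 approvals (75%) exceeds the two-thirds threshold.
ledger: list[dict] = []
print(append_block(ledger, b"ciphertext-from-CP-ABFHE", [True, True, True, False]))  # True
```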

Model validation of proposed method
The blockchain model from the SDPMA was validated for effectiveness using the Evaluation Framework for Blockchain Hyperledgers (EFBH), based on the provisions in the literature [46,47]. Following documented best practices regarding the use of EFBH [46,47,48], throughput, execution time (during query and invoke transactions), and block size were evaluated. Up to 10,000 transactions were experimented with. The transactions in the simulation, using a modified version of Hyperledger Caliper [49], were measured based on the submission from the consensus transaction by simulated peers. The execution time covers the time required to successfully execute a transaction after it is added. The throughput of the model captures the number of successful transactions per second [46,47]. The evaluated block size captures the number of transactions per block, which is an important design parameter [47]. With the execution time, it is possible to observe the behaviour of the model during query and invoke transactions vis-à-vis the chaincode, which is important since the Indy distributed ledger of the Hyperledger project is employed to enforce a better decentralized ledger solution that supports self-sovereign identity.
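As an illustration of how the two headline metrics are derived, the short sketch below computes throughput (successful transactions per second) and average execution time (submit-to-commit delay) from a handful of hypothetical transaction records. The actual experiments rely on the modified Hyperledger Caliper harness rather than this code, and the record format shown is an assumption.

```python
# Illustrative sketch only: deriving throughput and execution time from
# simulated transaction timestamps (a simplified analogue of Caliper's report).
from statistics import mean

# Each record: (submit_time_s, commit_time_s, succeeded)
simulated_txns = [
    (0.00, 0.18, True),
    (0.05, 0.21, True),
    (0.10, 0.40, True),
    (0.12, 0.55, False),  # failed transaction, excluded from throughput
]

successful = [t for t in simulated_txns if t[2]]
duration = max(c for _, c, _ in simulated_txns) - min(s for s, _, _ in simulated_txns)

throughput_tps = len(successful) / duration            # successful transactions per second
avg_exec_time = mean(c - s for s, c, _ in successful)   # seconds from submit to commit

print(f"throughput: {throughput_tps:.1f} tps, avg execution time: {avg_exec_time*1000:.0f} ms")
```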

Sensitive data scenario protection modelling
Sensitive data are data that must be protected against unwanted disclosure. Therefore, protecting them from unauthorized access to safeguard their privacy and security is of paramount importance. This conception guided the modelling of the sensitive data scenario. Given the scenario, let the sensitive data be a = (a1, a2, ..., an), a vector of data attributes (e.g., name, address, posts) that can possibly be requested by a recipient r (another user) in order to create relationships or interactions, such that s(aj, r) ∈ [0, 1] is a user-specified level of sensitivity of sharing the information relating to the j-th data attribute with recipient r.
The request vector consists of rj = 1 if the j-th data attribute is requested by recipient r, and rj = 0 otherwise. Mathematically, the SDPM (sd) is represented as a 6-tuple defined in Equation (3) as follows:

sd = (u, a, e, l, f, β)    (3)

where
u = user
a = sensitive data
e = encryption algorithms
l = local database
f = friend recommendation algorithm
β = blockchain based on the Hyperledger Indy framework
Additionally, the blockchain (β) is a 3-tuple as shown in Equation (4):

β = (c, p, s1)    (4)

where
c = chaincode
p = consensus mechanism
s1 = hash of the encrypted data

Whenever a user communicates with the SDPM, the DApps is downloaded and set up with user registration to create a profile (for new users), while existing users log in to perform interactions (i.e. transactions) such as posts, likes, follows, comments, etc. The downloaded DApps contains both the L-Dbase and the HIBC, while each user owns a wallet containing the decentralised identifier that enables them to generate private keys using the public key in their wallet.
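The tuples in Equations (3) and (4) can be transcribed almost directly into data structures. The sketch below does so in Python and adds a hypothetical sharing check based on the sensitivity function s(aj, r) and the request indicator rj defined earlier; the field types and the 0.5 sensitivity threshold are illustrative assumptions rather than part of the model's specification.

```python
# Illustrative sketch only: a direct transcription of the tuples in Equations
# (3) and (4); types and the threshold below are assumptions, not the SDPM spec.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Blockchain:            # β = (c, p, s1)
    chaincode: str           # c: smart-contract logic
    consensus: str           # p: consensus mechanism (Plenum BFT)
    hashed_ciphertext: str   # s1: hash of the encrypted data

@dataclass
class SDPM:                  # sd = (u, a, e, l, f, β)
    user: str                            # u
    sensitive_data: dict[str, str]       # a: attribute name -> value
    encrypt: Callable[[str], bytes]      # e: encryption algorithm(s)
    local_db: dict                       # l: L-Dbase
    friend_recommender: Callable         # f: FR algorithm
    blockchain: Blockchain               # β

def may_share(sensitivity: dict[str, float], requested: dict[str, int],
              attribute: str, threshold: float = 0.5) -> bool:
    """Share attribute aj with recipient r only if it was requested (rj = 1)
    and the user-specified sensitivity s(aj, r) is below the threshold."""
    return requested.get(attribute, 0) == 1 and sensitivity.get(attribute, 1.0) < threshold

# Example: 'posts' is requested and rated low sensitivity, so it is sharable.
print(may_share({"name": 0.9, "posts": 0.2}, {"posts": 1}, "posts"))  # True
```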

Results and Discussion
It was important to choose these metrics (throughput, execution time and block size) since the proposed method is novel in that it provides not just a decentralized solution, as descriptively presented in [45], but a user-centric, self-sovereign identity-based solution. From the preliminary results obtained, it was observed that on all fronts the execution time, throughput and block size results follow the pattern documented in the literature [46][47][48][49][50]. Tables 1, 2, 3, and 4 show the simulation results of the throughput, block size, and execution time for query and invoke of the method suggested using the SDPMA. The results in Tables 1 to 4 are also presented graphically in Figures 2 to 5.

From Figure 2, the average throughput of the model is observed to be clearly higher, since it processed up to 300 transactions per second compared to the 40 transactions per second obtained in previous work [51]. As shown in Figure 3, only a few transactions per block were identified to have an undesirable influence on throughput. Although there was a quick increase of throughput at 10 transactions per block, the performance increase was observed to diminish. The maximum throughput of 350 transactions per second plots at around 100 transactions per block and thus did not exceed the recommended block size, generation, and mining time, which is consistent with the requirements highlighted in [47,52,53]. The same pattern of using more time for more transactions found in the literature [46,51] can be observed in Figures 4 and 5. This validates the proposed method in this paper as plausible. Since execution time is the time required for a method like the one presented in this paper to execute a transaction after adding one successfully [46], the results in Figures 4 and 5 are consistent with what is found in the literature [46,51] and highlight a good consensus provision. It can be inferred, based on the provisions in [52], that the proposed method would show capability of supporting agreement of up to approximately 75% among participants in the chain when making changes. This is a good performance and consistent with the behaviour described for Hyperledger-based model solutions [46,48,51,52,53].
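For context on the agreement figure, the small calculation below shows, for a few pool sizes, the smallest validator quorum that exceeds the two-thirds rule stated in the architecture section. Under the assumption of a four-validator pool (not a setting reported in the paper), that quorum is 3 of 4 nodes, i.e. 75%.

```python
# Illustrative arithmetic only: smallest vote count strictly greater than
# two-thirds of n validators; the four-node example is an assumption.
from math import floor

def quorum(n: int) -> int:
    """Smallest vote count strictly greater than two-thirds of n."""
    return floor(2 * n / 3) + 1

for n in (4, 7, 10):
    q = quorum(n)
    print(f"n={n}: quorum={q} ({q / n:.0%})")
# n=4: quorum=3 (75%)   n=7: quorum=5 (71%)   n=10: quorum=7 (70%)
```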

System Implementable Formalism
This section presents implementable UML object-oriented formalisms to contemplate in the implementation of the SDPMA. Three of these formalisms are presented, as shown in Figures 6, 7, and 8. The rationale for this is to present a variety of algorithms, which can be complex and difficult to present and understand, in an easy and simplified way. This is particularly important when an ensemble of methods is proposed, as done in this paper. Cognizant of this, the block diagram in Figure 6 is presented to show the relevant modules and their descriptions, indicating their roles vis-à-vis their responsibilities. The block diagram helps to visualize the detailed flow and communication between existing components and shows the convenience of implementing the proposed processes involved in protecting the privacy of users' sensitive data. Similarly, the Activity Flow Diagram (AFD) was applied to show both the user-based functionalities and the blockchain operation in the SDPMA. The models in Figures 7 and 8 show the flow of control from activity to activity, thus shifting the focus from object orientation, as shown in the model in Figure 6, to specific activities, as shown in the models in Figures 7 and 8. The dynamic nature of the Executable and Implementable System (E&IS) from the architecture presented in Figure 1 is presented using the AFD (see Figures 7 and 8). The behaviour of the E&IS in dynamic terms, showing the concurrent as well as sequential processes, is shown using the model in Figure 7.

Conclusion
The main goal of this research work is to develop a Sensitive Data Protection Architecture-based Model (SDPA-bM) that delivers secure solutions. This aim was achieved through the proposal of a sensitive data protection model-based architecture, which preceded the modelling of a sensitive data protection scenario and the presentation of OO-UML-based formalisms to implement the proposed SDPA-bM. This paper contributes an architecture-based ensemble of methods that uses the blockchain technique to protect the privacy of users' sensitive data. The architecture-based ensemble of methods is used to integrate trust into the network itself, enabling identity owners to have sovereignty over their identity and control access to their records while ensuring integrity and content availability. The ensemble-of-methods approach presented applies a fully distributed and secure methodology to offer high-quality services with no operational cost, despite running on unreliable, insecure and sometimes malicious user devices. The paper employs a novel approach that uses blockchain technology in synergy with cryptographic techniques, hashing and a consensus mechanism to enforce privacy, trust and availability of data among untrusted peers on OSNs. However, the research work that resulted in the proposal reported in this paper is still ongoing, with implementation of the prototype model for deployment in a real social network environment already at an advanced stage. Based on preliminary results, the computational overhead incurred by applying the ensemble method is significantly low. This is consistent with the belief in the literature (e.g. [41][42][43][44]) that ensemble methods are computationally feasible with low resource use and computational cost, since the computational time scales linearly.