Unique and Universal Proteins in Human Genome

A major limitation of comparative analyses between human and other species is that only similar amino acid sequences are selected for analysis. To find the relationships among species and to identify the unique, the common and the universal proteins, the entire genomes of 40 species are compared with the human genome, which serves as the reference genome. More than 11 billion pairwise alignments are performed using blastp. Among the findings: 330 proteins are unique to the human genome, with insignificant hits in all tested genomes; 82 universal proteins in the human genome are conserved in all tested species; and 180 proteins are common to vertebrate genomes but have insignificant hits in the other tested species. In contrast to previous studies, which use a selected set of genes and do not consider whole genomes, this study finds that the similarity between human and chimpanzee is only 94.8%.

Keywords: Genome, Species, blastp, unique protein, universal protein.


Introduction
The past two decades have seen an explosion of genetic data. Countless DNA sequences and genotypes have been produced, prompting noteworthy biomedical advances and providing new insights into biology [13]. This information has also significantly expanded our understanding of patterns of genetic variation among individuals and populations [4]. Interpreting a given genomic sequence is one of the central challenges of science today. Perhaps the most promising approach to this problem is based on pairwise alignment and multiple sequence alignment methods. For example, protein-coding subsequences tend to be conserved between species, so a straightforward strategy for recognizing a functional exon is to look for its homologue in related species using whole genome alignment. Hence, interest in faster, approximate, or heuristic (rather than optimal) alignment algorithms has increased [6] [8] [15]. Two of the best-known heuristic alignment procedures are implemented in the FASTA and BLAST packages.
Comparisons of full genome sequences enable scientists to ask questions that were out of reach with small subsequences. Large-scale comparisons can uncover the genetic basis of speciation and variation, increase our understanding of the biological processes in living cells, identify shared biochemical functions, expand our knowledge of human diseases and offer important information about the evolutionary histories of extinct and living species [3] [9]. Whole genomes are used in several kinds of studies, such as using data from one genome to understand another, identifying potential orthologs, comparing genome content, genome alignment, and genome signature analysis based on di-nucleotide abundance, among others [1] [10] [11] [12].
Aligning genomes means identifying the differences generated by mutational changes. When considering genome modifications, one distinguishes between three important evolutionary operations: DNA mutations, genome rearrangements, and content alterations [2] [5]. DNA mutations affect one or a few nucleotides, while genome rearrangements act on larger genomic subsequences and change the orientation and the order of genes. Lastly, content alterations are the outcome of gene losses and duplications. Genome duplication has clearly permitted the development of more complex life forms; it equips an organism with a cornucopia of extra gene copies, which are free to change to fill unique needs. While one copy evolved for use in the brain, say, another evolved for use in the liver or was adapted for a novel purpose. Duplicated genes therefore allow for increased sophistication and complexity [7]. In this study, we used 40 full genomes from 11 groups of organisms to find the relationships between the species and to discover the unique, the special, the common and the universal proteins. To trace the genes using a top-down approach, the human genome is used as the reference genome.

Data collection and preprocessing
To find the distinguishing genes and quantify sequence similarities, the full genomes of 40 species from 11 groups of organisms are downloaded from the KEGG site (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/catalog). The species are selected to represent various branches of the phylogenetic tree of life and to provide adequate coverage of the main groups within the evolutionary tree: seven bacteria, three protists, three fungi, three archaea, seven mammals, three birds, three fishes, five insects, a tick, a mollusk and four plants. Tables 1-6 summarize the names of the selected species, the number of proteins and the average protein length (number of amino acids) for each one.

Genomes comparisons and mining
Before comparing the human genome with the other genomes, similar proteins within the human genome (hits below 10^-10) are removed, which reduces the total number of human proteins to 16614. To align two proteins, blastp is downloaded and called from Matlab as follows:

    system(['blastp -query human.fa -db sp1 -out results.out -evalue .01 -num_align...

where human.fa is the query, formatted as a FASTA file, which is compared with the genome sp1. The results for each pair with expectation value < 0.01 are saved in an NCBI-format file, and are then parsed and stored as a matrix:

    M=ParseNCBI('results.out');

Four values are extracted for each pair of compared sequences: the score, the expectation value, the percentage of identities and the match.
Here num, the number of species in which a human protein has a significant hit, is equal to 40 for universal genes, more than 38 for near-universal genes and less than 3 for special and unique genes. Algorithm 2 is used to find the proteins that are common in one organism but not in the other organisms.
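The parsing and selection steps described above can be sketched in Python. This is an illustration, not the authors' Matlab code: it assumes blastp is asked for tabular output (`-outfmt 6`) rather than the report format used in the study, the helper names `blastp_command`, `best_hits` and `classify` are hypothetical, the match count is approximated from the identity percentage and alignment length, and the sample rows are made up.

```python
# Column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def blastp_command(query, db, out, evalue=0.01):
    """Build a blastp argument list (tabular output is an assumption)."""
    return ["blastp", "-query", query, "-db", db, "-out", out,
            "-evalue", str(evalue), "-outfmt", "6"]


def best_hits(tabular_text, max_evalue=0.01):
    """Return {query_id: hit}, keeping the highest-scoring significant hit."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, line.split("\t")))
        identity = float(row["pident"])
        length = int(row["length"])
        hit = {
            "score": float(row["bitscore"]),
            "evalue": float(row["evalue"]),
            "identity": identity,
            # Approximate number of identical positions in the alignment.
            "match": round(identity * length / 100),
        }
        if hit["evalue"] >= max_evalue:
            continue  # not significant at the chosen threshold
        query = row["qseqid"]
        if query not in best or hit["score"] > best[query]["score"]:
            best[query] = hit
    return best


def classify(num, n_species=40):
    """Label a protein by the number of species with a significant hit."""
    if num == n_species:
        return "universal"
    if num > 38:
        return "near-universal"
    if num < 3:
        return "special or unique"
    return "other"


# Made-up rows for illustration only; P2 fails the E-value threshold.
SAMPLE = "\n".join([
    "P1\tS1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t410",
    "P1\tS2\t40.0\t150\t90\t2\t5\t154\t9\t158\t1e-12\t75.5",
    "P2\tS3\t30.0\t80\t56\t1\t1\t80\t3\t82\t0.5\t28.1",
])
hits = best_hits(SAMPLE)
```

Running `best_hits` once per species and counting, for each human protein, the species in which it appears gives the num value passed to `classify`.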

Unique and universal proteins
Five algorithms are implemented using Matlab and the blastp package, with the human genome as the reference genome. The algorithms compare the proteins, interpret the results, and find the common, the universal and the most identical proteins. The human genome contains 16614 proteins, while the chimpanzee genome contains 79947 proteins. Hence, comparing the two genomes requires 16614×79947 pairwise alignments, which took 25.6 hours on a 2.3 GHz dual-core CPU. Mining all the selected genomes required more than 11 billion pairwise alignments and took about 36 days. Figure 1 shows the scores of the first 100 proteins of the human genome after alignment to the chimpanzee and wheat genomes, which illustrates the relationship between each of the two species and the human proteins. The following are some important findings:
• 330 unique proteins are found in the human genome that have insignificant hits in all tested genomes, such as protein IDs 99032 and 107876.
• The number of significant proteins with p-value < 10^-10 that are conserved in all tested species is 82 (universal proteins), such as protein ID 25020. The number of significant proteins with p-value < 10^-50 that are conserved in all tested species is 3, namely protein IDs 7833, 10309 and 25020. The corresponding protein names according to the NCBI site are signal recognition particle, beta-enolase isoform 2, and tRNA ligase. These proteins seem to carry out core biological functions in all living cells. Figure 2 shows the number of matched amino acids for each species when aligned to protein ID 10309; around 98% of this protein is the same in all the mammals. The following is the amino acid sequence of protein ID 10309 in FASTA format:
• There are 180 proteins common to vertebrate genomes that have insignificant hits in the other tested species, such as protein IDs 99123, 91265 and 36. The octopus seems to be the closest species to the vertebrates among the invertebrate species, with 19 proteins in common with the vertebrates. However, more studies should be carried out and more genomes should be included to decide which species is closest to the vertebrates or to the mammals. Figure 3 compares the scores of a universal protein (ID 25020), a vertebrate protein (ID 36) and a mammal protein (ID 540).
The proteins conserved in the mammals are compared with the other organisms. The following results are obtained with expectation value < 10^-50:
• Three proteins are common to mammals and birds and do not exist in the other tested species.
• Four proteins are common to mammals and fishes.
• 127 proteins are common to all the tested species except the plant genomes.
• Two proteins are common to mammals, birds, fishes, insects and plants.
Figure 4 shows the number of matched amino acids in the first 50 proteins when aligned to chimpanzee, coelacanth, octopus and wheat.
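Findings such as "common to vertebrates but insignificant elsewhere" amount to set operations on the per-species lists of significant hits. The following Python sketch illustrates the idea under stated assumptions: it is not the authors' Algorithm 2, the function names are hypothetical, and the species labels and protein IDs in the toy data are invented.

```python
def conserved_in(hits_by_species, species):
    """Proteins with a significant hit in every listed species."""
    sets = [hits_by_species[name] for name in species]
    return set.intersection(*sets) if sets else set()


def group_specific(hits_by_species, group):
    """Proteins conserved in the whole group and absent from all others."""
    inside = conserved_in(hits_by_species, group)
    outside = set()
    for name, proteins in hits_by_species.items():
        if name not in group:
            outside |= proteins  # any hit outside the group disqualifies
    return inside - outside


# Toy data: species -> human protein IDs with a significant hit (invented).
toy_hits = {
    "chimp": {1, 2, 3},
    "chicken": {1, 2},
    "zebrafish": {1, 2, 4},
    "octopus": {1, 4},
    "wheat": {1},
}
vertebrates = ["chimp", "chicken", "zebrafish"]

universal = conserved_in(toy_hits, list(toy_hits))
vertebrate_only = group_specific(toy_hits, vertebrates)
```

In the toy data, protein 1 is hit in every species, while protein 2 is hit in all three vertebrates and nowhere else.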

The species similarity
Figure 5 shows the number of accepted proteins in the human genome when aligned to each species and each category (a protein is accepted if the expectation value of the alignment is less than 10^-50). The histogram suggests that the bacterial genomes (ID: 1-7), protist genomes (ID: 8-10) and archaeal genomes (ID: 14-16) have the lowest homology, while the mammalian genomes (ID: 18-23) have the highest homology with the human genome and contain the most conserved proteins. Figure 6 can be used to sort the families of species by their distance from the human genome: the closer families, in ascending order of distance, are the mammals, fishes, birds, mollusks and then the insects. The farther families are the plants, fungi, protists and bacteria, and the farthest are the archaea. This ordering suggests that the order of appearance of these groups may be similar. Figure 7 shows the match percentage of amino acids for each species over all proteins in the genomes. In contrast to previous studies, which use a selected set of genes and do not consider whole genomes [14], this study finds that the similarity between human and chimpanzee is only 94.8%. Comparing Figure 7 with the previous two figures, we can conclude that the octopus (ID 35) is the closest species to the vertebrates among the invertebrate species. The three figures agree on the order of the species categories, but disagree on whether the birds or the fishes are closest to the mammals. Moreover, it is not clear whether the coelacanth (ID 29) is the closest species to the mammals (as in Figure 5) or the chicken (ID 24) is (as in Figure 7). Thus, we have two perspectives: the first is based on the number of accepted proteins in the whole genome, and the second on the similarity of the protein content in the whole genome.
To find the relative relationships between all the tested species and build a phylogenetic tree using the human genome as reference, a distance matrix is constructed as follows:

where Distance_ij is the distance between species i and species j, so the matrix has size 40×40. The value Score_k is the highest score of the k-th human protein when aligned to species i. The length of the vector Score is 16614. Figure 8 shows the phylogenetic tree based on the scores of the universal proteins (82 proteins), while Figure 9 is based on the scores of all proteins. The neighbour-joining method is used for both trees. All the tree branches are consistent with the previous figures, except again for the birds and the fishes. The first tree shows that the fishes, and in particular the coelacanth, are closer to the mammals, while the second tree shows that the birds are closer to the mammals. This contradiction can be understood by considering that the fishes seem to be closer to the mammals from the perspective of common ancestry, whereas the birds seem to be closer to the mammals from the perspective of phenotype.
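As an illustration of how a species-by-species distance matrix can be built from per-species score vectors, the Python sketch below uses the Euclidean distance. This choice is an assumption made only for the example: the distance formula used in the study is not reproduced here, the function name is hypothetical, and the scores are toy values.

```python
import math


def distance_matrix(scores):
    """Pairwise Euclidean distances between equal-length score vectors.

    scores[i][k] is the best score of human protein k against species i.
    Euclidean distance is an assumption, not the study's own formula.
    """
    n = len(scores)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(scores[i], scores[j])))
            dist[i][j] = dist[j][i] = d  # the matrix is symmetric
    return dist


# Toy scores for three species and three proteins (invented values).
toy_scores = [
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0],
]
toy_dist = distance_matrix(toy_scores)
```

With 40 species and 16614 human proteins, the same function would return a 40×40 symmetric matrix that a neighbour-joining implementation could take as input.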

Conclusion
The aim of whole genome alignment is to use an ensemble of related genomes to better understand each individual genome in the set and to discover the core biological functions. Comparison of the proteins encoded in forty complete genomes from ten major phylogenetic lineages made it possible to identify the unique and the universal proteins in the human genome. This study found 330 unique proteins in the human genome; none of the other tested species has these proteins. These uniquely human proteins may drive uniquely human traits, which play an essential role in human dexterity, brain function, reasoning, language, speech, sensory perception and other strong cognitive components. On the other hand, 82 universal proteins are found. A significant number of them have unknown function, but they are likely to play key roles in cellular processes. Hence, there is a need for more intensive studies of these proteins.