ORIGINS OF GENE, GENETIC CODE, PROTEIN AND LIFE
Department of Chemistry, Nara Women's University,
Summary
We have investigated the origins of genes, the genetic code, proteins and life using six indexes (hydropathy, α-helix, β-sheet and β-turn propensities, acidic amino acid content and basic amino acid content) that are necessary for globular proteins to form appropriate three-dimensional structures. Analysis of seven microbial genomes showed that the six indexes are almost independent of the GC content of a gene, even though about half of the amino acid composition of a protein changes along with the GC content. Using these properties, we first concluded that new genes may be produced from nonstop frames on the antisense strands of GC-rich microbial genes (GC-NSF(a)) on the present earth, and from SNS repeating sequences ((SNS)n) similar to GC-NSF(a) on the primitive earth (S denotes G or C, and N denotes any of the four bases). We have further proposed that the universal genetic code used by most organisms on the present earth could be derived from the SNS primitive genetic code. Next, using four basic conditions for globular protein formation (hydropathy and α-helix, β-sheet and β-turn formation), we searched for a code much simpler than the SNS code that can still encode water-soluble globular proteins at a high probability. From the results, we reached a second conclusion: the universal genetic code originated, via the SNS code, from the GNC code, which encodes four amino acids (Gly [G], Ala [A], Asp [D] and Val [V]) and was the most primitive genetic code. Furthermore, based on the GNC primeval genetic code hypothesis, we have proposed the [GADV]-protein world hypothesis on the origin of life, which is quite different from the RNA world hypothesis accepted by many researchers worldwide. We have also proposed a hypothesis on protein production, suggesting that proteins were originally produced by random peptide formation from amino acids restricted to specific amino acid compositions, termed the GNC, SNS and GC-NSF(a) 0th-order structures of proteins. Thus, we have postulated four hypotheses, based mainly on the comprehensive GNC-SNS primitive genetic code hypothesis, which may reasonably explain the origins of genes, the genetic code, proteins and life in the fundamental system of life. We also expect that these four hypotheses will help to clarify the basic properties of extant genes and proteins.
(Key Words: Origin of Gene, Origin of Genetic Code, Origin of Protein, Origin of Life, GNC-SNS Primitive Genetic Code Hypothesis)
1. Introduction
Since The Institute for Genomic Research (TIGR) published the first complete microbial genome sequence, that of Haemophilus influenzae, in 1995, more than 40 microbial genomes have been sequenced and published. In the last several years, the genomic sequences of Saccharomyces cerevisiae, Arabidopsis thaliana, Caenorhabditis elegans and humans, which carry the information these organisms need to live, have also been determined and published. In parallel with the determination of genomic sequences, information on the primary structures (amino acid sequences) of proteins has accumulated rapidly. The tertiary structures of many proteins, and even of complex ribosomal particles, have also been determined. Who would have imagined the present situation about 30 years ago, when I was a university student? In those days, even a few base sequences of DNA could not be determined, since restriction enzymes were yet to be discovered. At that time, research to determine the tertiary structures of proteins, such as lysozyme with only 129 amino acids, had only just started. Despite the rapid progress of biological knowledge, it is still not well understood how genes and proteins were created, or how the base compositions at the codon positions and the average amino acid compositions of proteins were determined. That is, little is known about how the fundamental system of life was created.
We started research on the origin of genes about 10 years ago, based on the following questions. (i) In what kind of field were genes produced on the ancient earth? (ii) In what kind of field on the present earth are new genes created, if new genes can still be produced today? Next, we analyzed the origin of the genetic code. Consequently, we arrived at a novel GNC-SNS genetic code hypothesis. It proposes a possible evolutionary pathway in which the universal genetic code originated from the GNC code, which was followed by the SNS code, where S denotes G or C and N denotes any of the four bases (A, G, T (U) and C) [5-8]. We also suggest a route for the creation of proteins on the primitive earth and for the evolution of ancestral proteins into those used by organisms on the present earth (Ikehara et al., unpublished data). Based on our hypotheses on the origins of genes, the genetic code and proteins, we also present a novel hypothesis that could reasonably explain the origin of life [8, 9].
In this review, we first discuss problems concerning the origins of genes, the genetic code and proteins, following the flow of genetic expression. Next, we discuss the origin of life, which is also part of the fundamental life system and is interrelated with the other three origins. Then, our hypotheses are compared with those of other researchers. The comparison should help readers to understand our novel hypotheses on the four origins. Our idea, which could provide a systematic understanding of the fundamental life system based mainly on the GNC-SNS primitive genetic code hypothesis, is quite different from the ideas of other researchers, in which the four origins have been treated as rather independent problems. In the last section of this review, we also provide some evidence supporting our hypotheses.
2. Origin of Genes
2.1 Our hypothesis on the origin of genes
2.1a GC-NSF(a) hypothesis on creation of genes on the present earth: The GC contents of microbial genes are distributed over an extremely wide range, from about 20 to 75% (Fig. 1). It is generally considered that this wide distribution is produced by GC-mutation pressure and AT-mutation pressure, which act on genes to increase and to decrease the GC content, respectively. As a result of these mutation pressures, about half of the amino acids in any given protein differ from species to species (Fig. 2). However, the indexes that determine the ability of a protein to form secondary and tertiary structures should be difficult to vary. Thus, the basic properties of globular proteins, represented by six indexes (hydrophobicity/hydrophilicity, α-helix, β-sheet and β-turn formabilities, and acidic and basic amino acid contents), should be invariant against changes in the GC content of a gene and in the amino acid composition of a protein. Analyses of the six indexes required for tertiary structure formation showed that all six are indeed almost constant against changes in the GC content of genes encoded by seven bacterial and archaeal genomes (Fig. 3). Results on the dependence of the acidic and basic amino acid contents on GC content are omitted from the figure owing to limited space. This means that the six structural indexes can be used as necessary conditions for judging whether a polypeptide chain can fold into a water-soluble globular structure and exhibit an enzymatic function (Table 1). Using these conditions, we first searched for a field in which new genes could be created on the present earth. For this purpose, we examined the six structural indexes of hypothetical proteins encoded by the five alternative reading frames of extant bacterial and archaeal genes obtained from the data bank.
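The five alternative reading frames of a gene (two shifted frames on the sense strand and three frames on the antisense strand) and its GC content can be illustrated with a minimal Python sketch. The function names and the toy sequence are assumptions introduced here for illustration; they are not taken from the original analysis.

```python
# Illustrative sketch only: helper names and the toy gene are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def gc_content(seq):
    """Fraction of G or C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def reverse_complement(seq):
    """Antisense strand read in the 5' to 3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def codons(seq, offset):
    """Split seq into complete triplets, starting at offset 0, 1 or 2."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def alternative_frames(gene):
    """The five reading frames other than the gene's own frame (sense, offset 0)."""
    sense = gene.upper()
    anti = reverse_complement(sense)
    frames = {f"sense+{k}": codons(sense, k) for k in (1, 2)}
    frames.update({f"antisense+{k}": codons(anti, k) for k in (0, 1, 2)})
    return frames

gene = "ATGGCCGACGTCGGCTAA"  # toy sequence, not from any genome
frames = alternative_frames(gene)
for name, frame in frames.items():
    print(name, frame)
print("GC content:", round(gc_content(gene), 3))
```

Each frame could then be translated and scored with the six structural indexes; the index values themselves are not reproduced here because they are not given in the text.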
The results showed that hypothetical polypeptide chains encoded by antisense sequences of genes with high GC contents (more than 50%) would fold into globular structures similar to those of actual proteins at a high probability (Fig. 4). Moreover, the probability that no stop codon appears in the frame (pNSF) increased abruptly above 60% GC content, owing to unusually biased base compositions at the three codon positions (Fig. 5). Hereafter, we refer to a nonstop frame on the antisense strand of a GC-rich gene as GC-NSF(a). Proteins encoded by GC-NSF(a) should have some flexibility, which favors adaptation to novel substrates, because their glycine contents are slightly larger, and their hydrophobicity indexes slightly smaller, than those of actual proteins. Therefore, we believe that GC-NSF(a), which is frequently found on GC-rich microbial genomes, must be the field in which new genes can be produced on the present earth.
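The dependence of the nonstop-frame probability on base composition can be sketched as follows. This is a simplified model, not the calculation behind Fig. 5: it assumes that the bases at the three codon positions are drawn independently, and the GC-rich compositions used below are illustrative values rather than measured data.

```python
# Simplified model: bases at the three codon positions are assumed independent.
STOP_CODONS = ("TAA", "TAG", "TGA")

def stop_probability(pos1, pos2, pos3):
    """Probability that one codon is a stop codon, given base frequencies per position."""
    return sum(pos1[c[0]] * pos2[c[1]] * pos3[c[2]] for c in STOP_CODONS)

def p_nonstop_frame(pos1, pos2, pos3, n_codons):
    """Probability that no stop codon appears in a frame of n_codons codons."""
    return (1.0 - stop_probability(pos1, pos2, pos3)) ** n_codons

uniform = {b: 0.25 for b in "ACGT"}
# Hypothetical GC-rich compositions (illustrative, not taken from the paper)
gc_rich_1 = {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05}
gc_rich_3 = {"A": 0.05, "C": 0.50, "G": 0.40, "T": 0.05}
# Ideal SNS limit: only G or C at the first and third positions
sns_1 = {"A": 0.0, "C": 0.45, "G": 0.55, "T": 0.0}
sns_3 = {"A": 0.0, "C": 0.55, "G": 0.45, "T": 0.0}

p_uniform = p_nonstop_frame(uniform, uniform, uniform, 300)
p_gc_rich = p_nonstop_frame(gc_rich_1, uniform, gc_rich_3, 300)
p_sns = p_nonstop_frame(sns_1, uniform, sns_3, 300)
print(p_uniform, p_gc_rich, p_sns)
```

Because every stop codon begins with T, the per-codon stop probability falls quickly as T becomes rare at the first codon position, and it vanishes entirely in the ideal SNS limit.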
2.1b (SNS)n hypothesis on the formation of genes on primitive earth: As described above, GC-NSF(a) may have been used as a field for the creation of new genes [3, 4]. The base compositions of hypothetical GC-NSF(a) genes, like those of GC-rich genes, are similar to S, N and S at the first, second and third codon positions, respectively (Fig. 6). Thus, at the GC-rich limit, the base composition of GC-NSF(a) can be approximated as SNS, or [(G/C)N(C/G)]. This implies that bacteria and archaea with GC-rich chromosomes, at the very least, use an approximate form of SNS-repeating sequences (Fig. 6), and that SNS-repeating sequences themselves could serve as functional genes encoding globular proteins at a high probability.
To confirm whether the main chains of hypothetical proteins encoded by (SNS)n can fold into structures similar to those of actual proteins, we generated SNS codon compositions by computer using random numbers. We then selected the hypothetical SNS compositions that satisfy the six structural conditions derived from extant proteins. The results showed that the optimal contents of G and C at the first codon position were around 55% and 45%, respectively. Moreover, formation of globular proteins was optimal when all four bases were present in equal ratios at the second codon position (Fig. 7). Base compositions at the third position were not restricted to a narrow range, owing to the degeneracy of the genetic code at that position. These results mean that proteins encoded by hypothetical (SNS)n genes can satisfy the six conditions for protein structure formation, and that polypeptide chains composed of the 10 amino acids encoded by SNS (L, P, H, Q, R, V, A, D, E, G) could fold into globular structures similar to extant proteins at a high probability when the base compositions are 55% G and 45% C at the first, 25% each of A, T, G and C at the second, and 55% C and 45% G at the third codon position [5, 6]. Furthermore, we examined the secondary structure and hydrophobicity profiles of proteins encoded by computer-generated (SNS)n hypothetical genes. The results gave an appropriate mixture of the three secondary structures (α-helix, β-sheet and β-turn), as in actual proteins (Fig. 8). Hydrophobic and hydrophilic regions were also appropriately mixed in the hydrophobicity/hydrophilicity profiles, as in extant proteins (data not shown). This indicates that SNS repeating sequences, (SNS)n, may have been used as the original genes on primitive earth.
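The random generation of (SNS)n sequences described above can be sketched in Python as follows. The base compositions are those quoted in the text; the function names, the fixed random seed and the gene length of 100 codons are assumptions made for illustration, and the evaluation of the six structural indexes is not reproduced.

```python
import random

# Standard genetic code, built from the conventional TCAG ordering (T stands for U).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Base compositions at the three codon positions, as quoted in the text
POS1 = {"G": 0.55, "C": 0.45}
POS2 = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}
POS3 = {"C": 0.55, "G": 0.45}

def draw(freqs, rng):
    """Draw one base according to the given frequencies."""
    bases = list(freqs)
    return rng.choices(bases, weights=[freqs[b] for b in bases])[0]

def random_sns_gene(n_codons, rng):
    """Random (SNS)n sequence with position-specific base compositions."""
    return "".join(draw(POS1, rng) + draw(POS2, rng) + draw(POS3, rng)
                   for _ in range(n_codons))

def translate(dna):
    """Translate complete codons with the standard genetic code."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
gene = random_sns_gene(100, rng)
protein = translate(gene)
print(gene[:30], "...")
print(protein)
```

Whatever the seed, the translated chain contains only the 10 SNS-encoded amino acids and no stop codon, since every stop codon begins with T.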
2.2 Theories on the origin of genes provided by other workers
2.2a Gene duplication theory: Suppose that a gene in an organism is duplicated for some reason. The organism can maintain its life activities by using one copy just as before the duplication, even if mutations accumulate in the other copy. This means that otherwise forbidden mutations can accumulate in one of the duplicated genes. Based on this consideration, S. Ohno proposed the gene duplication theory, which suggests that novel genes with new functions can be created by accumulation of mutations in one of the duplicated genes (Fig. 9). It is well known that there are a large number of homologous protein families, in which proteins with different amino acid sequences have similar catalytic functions, and proteins with similar amino acid sequences exhibit different catalytic functions. When amino acid sequences belonging to the same protein family are compared with each other, a significant fraction of the amino acids, usually more than 30%, coincide. That is, there are a large number of genes that can be regarded as derivatives of the same ancestral gene produced by gene duplication. Thus, the gene duplication theory is now generally regarded as correct.
2.2b Exon-shuffling theory: The exon-shuffling theory, which is quite different from the gene duplication theory described above, also describes a process for the formation of new genes. The theory attaches importance to the existence of introns in eukaryotic genes. It holds that the original genes were exons encoding small polypeptides of only 15 to 20 amino acids, and that various kinds of genes could be created by combining exons through introns (Fig. 10). Whether the exon-shuffling theory is correct has been debated by many researchers.
Weak Points: The above theories discuss only the processes by which new genes form from previously existing genes. Therefore, they cannot explain how the original genes were created.
3. Origin of the Genetic Code
3.1 Our hypothesis on the origin of the genetic code
3.1a SNS primitive genetic code hypothesis: As described above, we have proposed (SNS)n sequences as the field for the creation of genes when life appeared on the primitive earth. Since genes at that time could code for only a restricted set of amino acids (10 kinds), the primitive genetic code must have encoded the same number of amino acids. Based on this intimate relationship between genes and the genetic code, we have proposed the SNS primitive genetic code hypothesis for the origin of the genetic code (Fig. 11) [5-7].
3.1b GNC primeval genetic code hypothesis: The SNS primitive genetic code is considerably complex, since it consists of 10 amino acids encoded by 16 codons (Fig. 11). Thus, it would have been very difficult to create the SNS code in one stroke at the beginning on the primitive earth. To solve this problem, we searched for a code simpler than the SNS code that can still encode water-soluble globular proteins with appropriate three-dimensional structures at a high probability. For this purpose, four of the six conditions for globular protein formation (hydropathy and α-helix, β-sheet and β-turn formation) were used to determine which codes could encode primitive water-soluble globular proteins. We excluded the acidic and basic amino acid contents from the conditions, because it would be difficult for proteins composed of fewer than 10 kinds of amino acids to contain both acidic and basic amino acids. Moreover, in proteins containing acidic but not basic amino acids, positively charged cations such as metal ions could compensate for the negative charges on the proteins. In proteins containing basic but not acidic amino acids, negatively charged anions such as halide ions could neutralize the positive charges. The search showed that the four amino acids (Gly [G], Ala [A], Asp [D] and Val [V]) encoded by the G-start codons, GNC, and those encoded by its modified form, the GNG code, satisfy the four structural conditions well (Figs. 12 and 13). This means that the 4 amino acids encoded by GNC have the abilities necessary to form secondary and tertiary structures similar to those of presently existing proteins, and that primitive water-soluble globular proteins could be produced from [GADV]-proteins at a high probability.
We have also confirmed that, with the exception of the GNC and GNG codes, no set of 4 amino acids encoded by codons lying in a row (CNG, CNC, ANG, ANC, UNG, UNC) or in a column (NUC, NUG, NCC, NCG, NAC, NAG, NGC, NGG) of the universal genetic code table satisfies the four conditions (Fig. 13).
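The amino acid sets behind the codon patterns named above can be listed with a short Python sketch based on the standard genetic code. The helper names are hypothetical, and the sketch only enumerates which amino acids each pattern encodes; it does not evaluate the four structural conditions, whose index values are not given in the text.

```python
# Standard genetic code in the conventional TCAG ordering (T stands for U).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# N = any of the four bases, S = G or C
WILDCARDS = {"N": "TCAG", "S": "GC"}

def expand(pattern):
    """All codons matching a pattern such as 'GNC' or 'SNS' (U is read as T)."""
    options = [WILDCARDS.get(ch, ch) for ch in pattern.upper().replace("U", "T")]
    return [a + b + c for a in options[0] for b in options[1] for c in options[2]]

def amino_acids(pattern):
    """Sorted one-letter codes encoded by the pattern ('*' marks a stop codon)."""
    return sorted({CODON_TABLE[codon] for codon in expand(pattern)})

patterns = ["GNC", "GNG", "SNS",
            "CNG", "CNC", "ANG", "ANC", "UNG", "UNC",
            "NUC", "NUG", "NCC", "NCG", "NAC", "NAG", "NGC", "NGG"]
for pattern in patterns:
    print(pattern, len(expand(pattern)), "".join(amino_acids(pattern)))
```

The output confirms that GNC encodes exactly G, A, D and V with 4 codons, and that SNS encodes 10 amino acids with 16 codons and no stop codon.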
Three of the 4 amino acids have excellent abilities to form the respective secondary structures ([A]: α-helix, [V]: β-sheet, [G]: β-turn or coil). Moreover, [D] has a functional group (a carboxyl group) that is indispensable for constructing a catalytic center on primeval proteins (Table 2). The four amino acids encoded by the GNC code also include both a hydrophobic amino acid ([V]) and a hydrophilic amino acid ([D]) (Table 2), which are necessary for polypeptide chains to fold into stable globular structures in water. Furthermore, we confirmed that the [GADV]-amino acids are the simplest set that can form appropriate secondary and tertiary structures (Fig. 14). It is also known from Miller's electric discharge experiments that the four amino acids ([G], [A], [D] and [V]) could be easily synthesized on primitive earth. We further investigated whether there are systems of three amino acids that satisfy the four structural conditions for globular protein formation. Although three such sets ([D], Leu and Tyr; [D], Tyr and Met; Glu, Pro and Ile) satisfied the four conditions, the amino acids in each set are scattered across the universal genetic code table. In addition, at least one amino acid in each of the three sets has a structure far more complex than those encoded by the GNC code. These results also suggest that the GNC primitive genetic code was used before the SNS primitive genetic code, and that no genetic code simpler than the GNC code ever existed. Thus, the universal genetic code originated not from a three-amino acid system but from a four-amino acid system, the GNC code encoding [GADV]-proteins. From these considerations, we conclude that the 4 amino acids encoded by the GNC code were used to create globular proteins on primitive earth and, thus, that the GNC code was the earliest genetic code to appear on the earth.
3.2 Hypotheses on the origin of the genetic code provided by other researchers
3.2a Mitochondrial-type primitive genetic code hypothesis: A mitochondrial-type primitive genetic code, composed of 20 amino acids and 62 codons, has also been postulated, chiefly on the grounds that the code is simpler than the universal genetic code and that mitochondria use a minimal set of tRNAs for translation of genetic information.
Weak Points: If the mitochondrial-type genetic code was used as the most primitive genetic code, about 60 codons and 20 amino acids would have been necessary to establish the primitive genetic code in the beginning of the evolutionary process. Such a code would have been too complex to create in one stroke (Fig. 15).
3.2b WWW primitive genetic code hypothesis: The WWW primitive genetic code hypothesis (where W denotes A or U) has been proposed from considerations of the metabolic pathways for nucleotide synthesis and of the RNA world hypothesis on the origin of life (discussed in more detail in a later section of this review). According to the hypothesis, the nucleotides A and U could have been produced on primitive earth earlier than G and C. This is deduced from the facts that the structures of A and U are simpler than those of G and C, and that A and U are synthesized at earlier stages of the metabolic pathways than G and C. Thus, the WWW hypothesis holds that the original genetic code must have consisted of triplets composed only of A and U, or WWW (Fig. 16) [15, 16].
Weak Points: According to the hypothesis, the most primitive amino acids used for protein synthesis would be the following 6 or 7 amino acids: Phe, Leu, Tyr, Ile, Asn and Lys, with or without Met (Fig. 16). However, the structures of these amino acids are far more complex than those of the [GADV]-amino acids, and the WWW code cannot code for any acidic amino acid. Moreover, the WWW code cannot satisfy the four structural conditions, because it contains too many hydrophobic amino acids (Phe, Leu, Ile (and Met)) and only one weak β-turn-forming amino acid (Asn). Therefore, the hypothesis has a serious defect: polypeptide chains synthesized according to the WWW code could not form water-soluble globular proteins.
4. Origin of Proteins
Next, we consider how proteins were created. It is well known that an amino acid sequence is synthesized according to genetic information, or a gene, which is given as a series of triplets (three-nucleotide sequences). The tertiary structures of proteins are formed on the basis of the amino acid sequences specified by the genetic code. Therefore, the origins of genes and the genetic code should be intimately related to the origin of proteins. From these considerations, our hypotheses on the origins of genes and the genetic code can be applied directly to the problem of the origin of proteins, or of the mechanisms for producing proteins. Here, we first introduce the hypotheses of other researchers on the origin of proteins and their weak points, to make our novel hypothesis on the origin of proteins easier to understand.
4.1 Hypotheses on creation of proteins by other researchers
4.1a Amino acid sequence hypothesis: The sequence hypothesis on the origin of proteins, or on the mechanism for producing proteins, is based on the fact that a unique sequence must be synthesized to produce a specified protein with a globular structure. In other words, synthesis of an appropriate amino acid sequence is necessary to produce an active protein. According to this idea, a unique sequence exhibiting a specified function must be selected from all possible amino acid sequences produced when amino acids are arranged one-dimensionally and at random (Fig. 17).
Weak Points: However, even for small proteins with 100 amino acid residues, the sequence diversity is enormous, 20¹⁰⁰ ≈ 10¹³⁰, since natural proteins are composed of 20 kinds of amino acids. This number is much larger than the number of all atoms in the universe, which is estimated as 9 × 10⁷⁸. The sequence hypothesis therefore has a serious defect: it is impossible to select a unique sequence from such an enormous sequence space by examining the sequences one by one to see whether each is useful for a required protein. Another hypothesis, described below, has been proposed to avoid the difficulty of explaining how one specific amino acid sequence was selected as the sequence of an active protein.
4.1b Protein structure hypothesis: This hypothesis gives more importance to the tertiary structure of a protein than to its primary structure, or amino acid sequence, for the formation of an active protein. The hypothesis considers that a protein should exhibit the same enzymatic activity as a specific protein when it forms a tertiary structure similar or homologous to that protein. According to the hypothesis, the ratio of hydrophobic to hydrophilic amino acids and the number of interactions in a protein are important for folding the main chain into a water-soluble globular protein similar to the protein with a specified amino acid sequence. A calculation carried out for ribonuclease as an example, based on a lattice model of proteins, gave a value of 10¹²⁰ for the number of sequences that have polypeptide conformations similar to that of ribonuclease (Fig. 18). Therefore, it is considered that an active enzyme uses, by accident, one amino acid sequence out of a sequence space with this large diversity. The structure hypothesis on the origin of proteins seems to be correct, because a large number of homologous proteins with the same catalytic activity as, or with a tertiary structure similar to, one unique enzyme can be found among presently existing proteins.
Weak Points: However, the presence of peptide bonds in actual proteins is also important, since free rotation is restricted around the planar peptide bond, which contributes to the formation of secondary structures. In addition to hydrophobic and hydrophilic interactions, secondary structure propensities, such as those for α-helix, β-sheet and β-turn formation, should be important in forming tertiary structures from the primary structures of proteins. Therefore, the structure hypothesis appears to oversimplify the protein folding problem, because it considers only hydrophobic and hydrophilic interactions as the dominant folding force, and the lattice model ignores the presence of peptide bonds in proteins. In addition, according to the hypothesis, the amino acid sequences of proteins with the same catalytic activity need not be similar to each other. However, this is inconsistent with the fact that about 30 to 40% of the amino acid sequence is usually conserved among homologous proteins with the same activity and with similar tertiary structures. The existence of conserved regions among homologous proteins clearly indicates that the proteins were produced from one common ancestral protein, and were not selected independently from a sequence space with large diversity, as expected from the structure hypothesis. From these considerations, we maintain that the origin of proteins cannot be explained from the standpoint of the structure hypothesis.
4.2 Our hypothesis on the origin of proteins
Then, how should we think about the field and the evolutionary pathway for the production of proteins? As a matter of course, the process by which proteins were produced should be intimately related to the origins of genes and the genetic code. Therefore, we consider that proteins were not created as suggested by the amino acid sequence hypothesis or by the protein structure hypothesis, but originated from proteins that were created in a field producing globular proteins at an extremely high probability (Fig. 19 (A)). The high probability is assured by the use of a small number of amino acids and of amino acid compositions restricted by the GNC and SNS primitive genetic codes. For example, the diversity of proteins with 100 amino acid residues is calculated as 4¹⁰⁰ ≈ 10⁶⁰ in the case of the GNC code, and as 10¹⁰⁰ in the case of the SNS code. These diversities are extremely small compared with that of extant proteins composed of 20 kinds of amino acids (about 10¹³⁰). The former corresponds to about 10⁻⁷⁰, and the latter to about 10⁻³⁰, of the diversity under the contemporary genetic code. We refer to these novel hypotheses on the origin of proteins as the GNC 0th-order structure hypothesis and the SNS 0th-order structure hypothesis, because the GNC and SNS primitive genetic codes should have been responsible for the production of primitive proteins in their respective ages. Since it may not be easy for the reader to understand the meaning of the 0th-order structure of proteins, I will explain it more concretely using Figure 19 (B), as follows.
The principle that a protein conformation is specified only by its amino acid sequence, which is determined by one-dimensional genetic information, is widely accepted. Therefore, it has generally been considered that amino acid sequences, or primary structures, must be the starting points for producing globular proteins. However, in the course of considering the origins of genes and the genetic code, it became apparent that specified amino acid compositions are far more important than primary structures in creating new proteins. According to our idea, water-soluble globular proteins could be produced at a high probability even by random joining of amino acids, provided that the amino acid composition is specified. To explain this novel idea straightforwardly, I would like to use the term 0th-order structure, since an amino acid composition is a concept at a lower level than the primary structure, or amino acid sequence. We regard the compositions consisting of the 4 kinds of amino acids encoded by the GNC code, and of the 10 kinds of amino acids encoded by the SNS code, as the fundamental 0th-order structures. We refer to these ideas as the GNC 0th-order hypothesis and the SNS 0th-order hypothesis. In other words, the GNC 0th-order structure was utilized at an early stage, and the SNS 0th-order structure at later stages, to create new proteins efficiently on primitive earth. Further, we consider that the amino acid compositions specified by GC-NSF(a) (the GC-NSF(a) 0th-order structure, a modified form of the SNS 0th-order structure) must be used to produce new proteins on the present earth when necessary (Fig. 19).
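The sequence-space sizes quoted above for proteins of 100 residues can be checked with a few lines of Python. The sketch works with base-10 logarithms to keep the numbers readable; the function name is hypothetical, and only the alphabet sizes (4, 10 and 20 amino acids) and the chain length come from the text.

```python
import math

def log10_sequence_space(n_amino_acids, length=100):
    """log10 of the number of distinct sequences of the given length."""
    return length * math.log10(n_amino_acids)

gnc = log10_sequence_space(4)     # GNC code: 4 amino acids
sns = log10_sequence_space(10)    # SNS code: 10 amino acids
full = log10_sequence_space(20)   # universal code: 20 amino acids

print(f"GNC:       10^{gnc:.1f}")
print(f"SNS:       10^{sns:.1f}")
print(f"Universal: 10^{full:.1f}")
print(f"GNC / universal: 10^{gnc - full:.1f}")
print(f"SNS / universal: 10^{sns - full:.1f}")
```

The exponents round to 60, 100 and 130, and the two ratios round to 10⁻⁷⁰ and 10⁻³⁰, in agreement with the figures given in the text.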
As a matter of course, genes, the genetic code and proteins are the most fundamental and important elements of the biological functions of life. As described above, the GNC code should be the genetic code used by the most primitive life on the earth (Fig. 11). In addition, the simplest set of 4 kinds of amino acids that can produce functional globular proteins at a high probability is the [GADV]-amino acids encoded by the GNC code (Figs. 13, 14). Taking these points into consideration, we have also proposed a novel hypothesis on the origin of life [8, 9].
5. Origin of Life
5.1 Our hypothesis on the origin of life
5.1a [GADV]-protein world hypothesis: According to the GNC primitive genetic code hypothesis, the most primitive proteins should have been composed of only the four [GADV]-amino acids. Despite their simple amino acid compositions, [GADV]-proteins have several important properties required to exhibit enzymatic functions, as described below.
(i) The [GADV]-proteins can form water-soluble globular structures at a high probability, judging from the similarities in indexes of hydropathy and secondary structure formation of hypothetical [GADV]-proteins and presently existing proteins (Fig. 12).
(ii) [GADV]-proteins contain not only amino acids necessary to form secondary structures ([A]: α-helix, [V]: β-sheet, [G]: β-turn), but also an amino acid, [D], with a carboxyl group to act as a catalyst (Table 2).
(iii) Guanine is usually contained at the highest level at the first codon position of extant genes (Figs. 20 (A) and (B1)). This suggests that the [GADV]-amino acids encoded by GNC are the most fundamental and important amino acids out of 20 kinds of natural amino acids.
Therefore, it can be deduced that even [GADV]-proteins, which have such simple amino acid compositions, could catalyze the formation of peptide bonds between [GADV]-amino acids at a high probability. If this deduction is correct, [GADV]-proteins could synthesize similar [GADV]-proteins at a high probability even in the absence of genes, because only 4 kinds of amino acids are used in [GADV]-proteins. In addition, the properties of the [GADV]-proteins synthesized should be similar to each other, since [V], with a hydrophobic side chain, and [D], with a carboxyl group, should be located in the interior and on the surface of globular proteins in water, respectively. (Of course, the diversity of [GADV]-proteins with 100 residues is as high as about 10⁶⁰. However, every sequence of 4 amino acids would be expected to appear about once in a protein of only 4⁴ = 256 residues. Therefore, it can be easily imagined that middle-sized [GADV]-proteins composed of 256 amino acids should be similar to each other.) This means that [GADV]-proteins could be pseudo-replicated to produce similar [GADV]-proteins in the absence of genes. Taking these points into consideration, we arrived at the [GADV]-protein world hypothesis on the origin of life (Fig. 21) (Ikehara 2000). It has been widely believed that proteins cannot have been the first materials for the creation of life, because proteins cannot be self-replicated, even though the amino acids in proteins should have been synthesized more easily on the pre-biotic earth than the nucleotides in RNA and DNA. However, our novel hypothesis, based on the characteristics of [GADV]-proteins, would overcome this weak point of the protein world through the pseudo-replication of [GADV]-proteins in the absence of genes. In contrast, we found several serious defects in the RNA world hypothesis that seem impossible to resolve, although the hypothesis is currently supported by many researchers.
The weak points of the RNA world hypothesis are described in the following section.
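The two numbers in the parenthetical estimate above (the size of the sequence space and the 256-residue length) can be checked with a short Python sketch. It assumes, purely for illustration, that the four [GADV]-amino acids occur independently and with equal frequency; the function names are hypothetical and not taken from the original work.

```python
import math

ALPHABET = "GADV"  # the four [GADV]-amino acids


def sequence_space_log10(length, alphabet_size=len(ALPHABET)):
    """Return log10 of the number of distinct sequences of the given length."""
    return length * math.log10(alphabet_size)


def expected_occurrences(protein_length, k, alphabet_size=len(ALPHABET)):
    """Expected count of one specific k-residue sequence in a random protein.

    Assumption: residues are independent and equally frequent.
    """
    windows = protein_length - k + 1  # number of overlapping k-residue windows
    return windows / alphabet_size ** k


print(round(sequence_space_log10(100), 1))      # 60.2, i.e. 4^100 is about 10^60
print(len(ALPHABET) ** 4)                       # 256 distinct 4-residue sequences
print(round(expected_occurrences(256, 4), 2))   # 0.99, i.e. about once per protein
```

Under these assumptions a 256-residue protein contains 253 overlapping 4-residue windows, so each of the 256 possible 4-residue sequences is expected slightly less than once, consistent with the rough estimate in the text.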
5.2 Hypothesis on the origin of life provided by other researchers
5.2a RNA world hypothesis: It is generally considered that replication of genetic material is the most fundamental and important process of life. In organisms on the present earth, the genetic information on DNA is propagated and maintained by proteinaceous replication enzymes. As is well known, DNA has no catalytic function, whereas proteins cannot serve as genetic material. Thus, DNA carrying genetic information cannot be replicated without proteins, and proteins cannot be reproduced without genes. Since the discovery that RNA, which has nucleotide sequences similar to those of DNA, also possesses catalytic activity [18, 19], many researchers have thought that this difficult problem on the origin of life might be solved. That is, based on the fact that RNA has both genetic and catalytic functions, the RNA world hypothesis was proposed, suggesting that RNA was amplified by self-replication and increased its diversity in the RNA world on the primitive earth (Fig. 22) [20, 21].
Weak Points: The RNA world hypothesis has many defects, as described below.
(i) Nucleotides in RNA are far more complex organic compounds than the amino acids in [GADV]-proteins. Thus, it would be more difficult to synthesize the four nucleotides, A, U, G and C, than the four [GADV]-amino acids under pre-biotic conditions, as is apparent from a comparison of the numbers of atoms and isomers of the nucleotides with those of the amino acids.
(ii) Even if we assume that nucleotides could be synthesized under pre-biotic conditions, judging from the number of hydroxyl groups in the nucleotides, it would be more difficult to synthesize RNA by joining nucleotides in the absence of effective catalysts than to form peptides from amino acids.
(iii) Furthermore, even assuming that RNA could be synthesized under pre-biotic conditions, it would be difficult for RNA to self-replicate, because RNA must lack a stable tertiary structure in order to serve as a template, that is, to exhibit the genetic function of its nucleotide sequence. At the same time, RNA must hold a stable tertiary structure in order to exhibit catalytic function. Therefore, it would be impossible for RNA to self-replicate in the usual sense because of this self-contradiction. In fact, despite many studies on RNA self-replication, no experimental results showing that RNA molecules actually self-replicated have been reported so far.
(iv) In addition to the above difficulties, there should be no relationship between the ability of RNA to self-replicate and the genetic function of an RNA sequence for the synthesis of a protein. Therefore, even if RNA did self-replicate on the primitive earth, it is difficult to imagine that self-replicated RNA could simultaneously acquire genetic information for protein synthesis.
Thus, we consider that the RNA world was never realized on the primitive earth [8, 9].
Although many researchers have discussed the origins of genes, the genetic code, proteins and life, they have so far treated these four fundamental problems rather independently. Moreover, we have now recognized that there are serious defects in their hypotheses on the origins of the fundamental life systems, as described above. In contrast, the four origins can be explained comprehensively from the standpoint of the GNC-SNS primitive genetic code hypothesis (Fig. 23). Thus, we firmly believe that our hypotheses on the origins of genes, the genetic code, proteins and life are far more reasonable than the hypotheses provided by other researchers.
Is our four-stage scenario really correct? To confirm this, we further investigated whether some properties of extant genes and proteins can be explained according to our hypotheses. The results are described in the following section.
6. Evidences for our hypotheses on the fundamental life systems
6.1 On the Field for gene creation and the evolutionary directions of original genes and proteins
If our hypotheses on the origin of genes (the GC-NSF(a) hypothesis and the (SNS)n hypothesis) are correct, genes would have been created originally as GC-rich genes, and genes homologous to these ancestors would have been produced during evolution as the GC contents of the genes gradually decreased under AT mutation pressure (Fig. 24). Thus, according to our hypotheses, proteins would have been originally created by expression of ancestor genes with high GC contents, and the original proteins should have evolved into homologous proteins encoded by genes with low GC contents (AT-rich genes). If this is true, the characteristics of the ancestor proteins should be maintained in the regions conserved among homologous proteins. Therefore, the evolutionary directions of genes and proteins could be deduced by examining the conserved and non-conserved amino acids observed among homologous proteins after alignment (Fig. 25). To confirm this, the amino acid sequence of P. aeruginosa gyrase A (GyrA), which is encoded by a GC-rich gene, was compared with homologous GyrA proteins encoded by genes with lower GC contents than the P. aeruginosa gyrA gene. The results showed that, in the P. aeruginosa GyrA protein, the content of SNS-encoded amino acids (SNS-AAs) in conserved regions was about as high as that in non-conserved regions (Fig. 26 (A)). It was also confirmed that the SNS-AA contents in the non-conserved regions of the other GyrA proteins decreased gradually as the GC content of the encoding genes decreased (Fig. 26 (A)).
The SNS-AA contents in conserved and non-conserved regions were also investigated using SpoT/RelA proteins (Fig. 26 (B)), glutamine synthetase (GlnA), the α-subunit of RNA polymerase (RpoA) and 10 other kinds of homologous proteins. Results similar to those for GyrA were obtained in all cases examined (data not shown). The results described above clearly indicate that the water-soluble globular proteins examined, such as GyrA and SpoT/RelA, were generally created as ancestor proteins encoded by GC-rich genes and evolved uni-directionally into proteins encoded by genes with lower GC contents, as expected.
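The SNS-AA content used in this comparison can be sketched in Python as follows. The set of SNS-encoded amino acids is derived here from the standard genetic code by taking every codon whose first and third bases are G or C. The example sequence and the conservation mask are hypothetical, and how conserved positions were actually identified from the alignments is not specified in this section.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
# (first base varies slowest, third base fastest); '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# SNS codons: S = G or C at positions 1 and 3, N = any base at position 2.
SNS_AMINO_ACIDS = {CODON_TABLE[s1 + n + s3] for s1 in "GC" for n in BASES for s3 in "GC"}
# GNC codons: G at position 1, any base at position 2, C at position 3.
GNC_AMINO_ACIDS = {CODON_TABLE["G" + n + "C"] for n in BASES}


def sns_content(residues):
    """Fraction of residues that belong to the SNS-encoded amino acids."""
    residues = list(residues)
    return sum(aa in SNS_AMINO_ACIDS for aa in residues) / len(residues)


def split_by_conservation(sequence, conserved_mask):
    """Split a protein into conserved and non-conserved residues.

    conserved_mask is a hypothetical per-residue list of booleans.
    """
    conserved = [aa for aa, keep in zip(sequence, conserved_mask) if keep]
    variable = [aa for aa, keep in zip(sequence, conserved_mask) if not keep]
    return conserved, variable


print("".join(sorted(SNS_AMINO_ACIDS)))  # ADEGHLPQRV
print("".join(sorted(GNC_AMINO_ACIDS)))  # ADGV

# Toy example (not real data): first four residues marked as conserved.
conserved, variable = split_by_conservation("GADVKKNI", [True] * 4 + [False] * 4)
print(sns_content(conserved), sns_content(variable))  # 1.0 0.0
```

Applying `sns_content` separately to the conserved and non-conserved residues of each homolog, and plotting the values against the GC content of the encoding gene, would give a comparison of the kind shown in Fig. 26.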
6.2 Process of protein formation
Assuming that proteins were originally produced in the field expected by our hypothesis on the origin of proteins, new proteins should have been created by random joining of amino acids within the compositions defined by the SNS 0th-order structure and the GC-NSF(a) 0th-order structure. The SNS and GC-NSF(a) 0th-order structures are the amino acid compositions determined by SNS-repeating sequences ((SNS)n) and by nonstop frames on the antisense strands of GC-rich genes (GC-NSF(a)), respectively (Fig. 19). The GNC 0th-order structure is omitted here only to simplify the discussion. If new proteins were actually produced by random joining of amino acids, the frequency of each pair of neighboring amino acids observed in extant proteins should coincide with the product of the compositions of the two amino acids. The results obtained using H. influenzae genome data clearly indicate that the 400 points, representing all combinations of two neighboring amino acids, are closely distributed around a straight line with slope 1 (Fig. 27). Genome data of H. pylori, E. coli, M. genitalium and B. subtilis gave results similar to those for H. influenzae. These results indicate that proteins were fundamentally created by random joining of amino acids in the amino acid compositions restricted by the SNS and GC-NSF(a) 0th-order structures.
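The comparison described in this section, observed neighboring-pair frequencies against the product of the two single-residue compositions, can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used for Fig. 27: the function names are hypothetical, and the two short sequences stand in for a full proteome.

```python
from collections import Counter


def composition(proteins):
    """Fraction of each amino acid over all proteins."""
    counts = Counter("".join(proteins))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def dipeptide_frequencies(proteins):
    """Fraction of each pair of neighboring amino acids over all proteins."""
    counts = Counter()
    for protein in proteins:
        counts.update(protein[i:i + 2] for i in range(len(protein) - 1))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def expected_vs_observed(proteins):
    """Map each pair to (expected, observed) frequency.

    Expected is the product of the two compositions, which is what
    random joining of amino acids would give.
    """
    comp = composition(proteins)
    observed = dipeptide_frequencies(proteins)
    return {
        a + b: (comp[a] * comp[b], observed.get(a + b, 0.0))
        for a in comp
        for b in comp
    }


# Toy input (not real data); a real analysis would use all proteins of a genome.
pairs = expected_vs_observed(["GADV", "VDAG"])
print(len(pairs))    # 16 pairs for a 4-letter toy alphabet
print(pairs["GA"])   # expected 0.0625, observed 1/6 in this tiny sample
```

With the 20 natural amino acids the dictionary holds 400 pairs. Plotting observed against expected values and checking how closely the points follow a line of slope 1 corresponds to the test described in the text.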
6.3 Simulation of origins and evolutions of genes and proteins with a computer
As explained above, our hypotheses on the origins of genes, the genetic code, proteins and life appear to us to be correct. If so, genes were produced as GC-rich genes and decreased their GC contents as mutations accumulated in their nucleotide sequences. Every protein encoded by a gene produced through such accumulation of mutations should then satisfy the six conditions (hydrophobicity/hydrophilicity, α-helix, β-sheet and β-turn formabilities, and acidic and basic amino acid contents) required to form appropriate tertiary structures. If this reasoning is correct, the changes in base composition at each codon position of genes, as well as in the amino acid composition of proteins, could be simulated on the basis of our hypotheses, which specify the field of gene production and the evolutionary processes of genes and proteins.
We carried out the simulation using a GC-NSF(a), that is, a nonstop frame on the antisense sequence of a GC-rich M. tuberculosis gene, composed of about 1,500 bases (about 500 codons). Before the simulation, we confirmed that the protein encoded by this hypothetical ancestor gene satisfies the six conditions required to form water-soluble globular proteins. Next, mutations were introduced into the nucleotide sequence at a rate of 1% at every base position. This mutation probability corresponds to about 15 base substitutions in the gene, or to about 5 amino acid replacements in the protein. If a simulated protein did not satisfy one or more of the six conditions, or if stop codon(s) appeared in the translation frame, it was removed from the simulation as an inactive gene or a mutated protein, and the introduction of mutations was repeated from the preceding step. These steps were repeated until 1,000 hypothetical proteins satisfying the six conditions for globular protein formation under a given mutation pressure were obtained. The results of the simulations, which were carried out under 9 different mutation pressures, are given in Fig. 28. A straight line obtained by least-squares fitting of the G composition at the first codon position (G1) of 7 extant microbial genomes is shown in Fig. 28 for comparison with the simulated results. Twelve dependencies of base composition on the GC content of a gene were obtained from the simulation, because each of the 4 bases (A, T, G and C) can occur at each of the three codon positions. Although the simulated base changes reproduced well those of actual genes in the 7 microbial genomes as the GC content changed, G1 alone deviated considerably from the straight line (Fig. 28). To understand the cause of this deviation, the simulation was carried out again with the 200 amino acid residues from the N-terminal end of the ancestor protein, which is composed of about 500 amino acids, kept unchanged.
The size of this conserved region was chosen considering the fact that about 40% of amino acids are generally conserved among homologous proteins. As expected, the simulated G1 then also reproduced the change of G1 in extant genes (Fig. 29). In addition, the amino acid compositions of the simulated proteins agreed fairly well with those of extant proteins encoded by microbial genomes, except for arginine, lysine and some other amino acids (Fig. 30).
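The mutation-and-selection loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the program used for Figs. 28 to 30. The `at_bias` parameter is a hypothetical way of expressing mutation pressure, the `accepts` callback is a placeholder for the six structural conditions (whose thresholds are not given in this section), and a rejected candidate is assumed to be retried from the last accepted sequence.

```python
import random

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def mutate(dna, rate, at_bias, rng):
    """Substitute each base with probability `rate`.

    at_bias (assumed parameter) is the relative weight of mutating toward
    A or T; 0.5 means no directional mutation pressure.
    """
    out = []
    for base in dna:
        if rng.random() < rate:
            others = [b for b in BASES if b != base]
            weights = [at_bias if b in "AT" else 1.0 - at_bias for b in others]
            base = rng.choices(others, weights=weights)[0]
        out.append(base)
    return "".join(out)


def has_stop(dna):
    """True if a stop codon appears in the reading frame starting at base 0."""
    return any(dna[i:i + 3] in STOP_CODONS for i in range(0, len(dna) - 2, 3))


def gc_content(dna):
    """Fraction of G and C bases in the sequence."""
    return sum(b in "GC" for b in dna) / len(dna)


def simulate(ancestor, n_accept, rate=0.01, at_bias=0.5,
             accepts=lambda dna: True, seed=0):
    """Collect n_accept successive mutants that pass the filters.

    `accepts` stands in for the six conditions for globular protein
    formation; the default accepts everything and is only a placeholder.
    """
    rng = random.Random(seed)
    current, accepted = ancestor, []
    while len(accepted) < n_accept:
        candidate = mutate(current, rate, at_bias, rng)
        if has_stop(candidate) or not accepts(candidate):
            continue  # discard and mutate the last accepted sequence again
        current = candidate
        accepted.append(candidate)
    return accepted


# Toy ancestor (not the M. tuberculosis sequence): 50 GGC codons.
genes = simulate("GGC" * 50, n_accept=20, at_bias=0.7, seed=1)
print(len(genes), round(gc_content(genes[-1]), 2))
```

At a rate of 1% per base, a 1,500-base gene receives 0.01 x 1,500 = 15 substitutions per step on average, which matches the figure quoted in the text. Tabulating base frequencies at each codon position of the accepted genes against their GC content would give the twelve dependencies mentioned above.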
To confirm that the distributions of amino acid compositions have not been determined simply by the characteristics of the 20 natural amino acids, amino acid compositions were generated by computer from 20 random numbers, independently of any ancestor gene and its evolutionary pathway. When proteins with an amino acid composition specified by the 20 random numbers satisfied the six conditions for globular protein formation, they were selected as functional proteins. The average amino acid compositions of the selected hypothetical proteins are shown as a bar chart in Fig. 31. All amino acids were contained at similar ratios in the selected proteins. This is in marked contrast to the facts that amino acids such as Leu, Ala, Gly, Val, Ser and Ile are usually found at high frequencies in extant proteins, whereas amino acids such as His, Met, Trp and Cys are usually found in smaller amounts than other amino acids. Therefore, these results indicate that the average amino acid compositions of extant proteins have been determined by the fact that genes were originally created as GC-rich genes and have evolved in the direction of lower GC content, not by the characteristics of the natural amino acids (Figs. 29 and 30). At the same time, this shows that proteins have been produced by translation of the genetic information of genes that appeared during the evolutionary process. These results also support our hypotheses on the origins of genes, the genetic code and proteins and on their evolutionary processes.
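The control experiment with random compositions can be sketched in Python along the following lines. The normalization of the 20 random numbers to fractions summing to 1 is an assumption, and `accepts` is a hypothetical placeholder for the six conditions, whose numerical criteria are not given in this section.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 natural amino acids


def random_composition(rng):
    """Draw 20 random numbers and normalize them to fractions summing to 1."""
    raw = [rng.random() for _ in AMINO_ACIDS]
    total = sum(raw)
    return {aa: x / total for aa, x in zip(AMINO_ACIDS, raw)}


def average_selected(n_selected, accepts=lambda comp: True, seed=0):
    """Average composition over n_selected random compositions passing `accepts`.

    `accepts` is a placeholder for the six conditions for globular
    protein formation; the default keeps every composition.
    """
    rng = random.Random(seed)
    kept = []
    while len(kept) < n_selected:
        comp = random_composition(rng)
        if accepts(comp):
            kept.append(comp)
    return {aa: sum(c[aa] for c in kept) / n_selected for aa in AMINO_ACIDS}


average = average_selected(200)
print(round(sum(average.values()), 6))  # 1.0
```

Replacing the default `accepts` with a real implementation of the six conditions and comparing the resulting averages with the compositions of extant proteins would correspond to the comparison shown in Fig. 31.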
7. Conclusion
As discussed in the above sections, we have provided four novel hypotheses on the origins of genes, the genetic code, proteins and life, chiefly by using the six conditions for the formation of globular proteins. We are confident that the four origins of the fundamental life system can be reasonably and comprehensively explained by our hypotheses. In contrast, it is difficult to understand these origins by using the hypotheses provided by other researchers. The main reason why the four problems could not be well interpreted until now, in spite of the many efforts of other researchers, can be attributed to the fact that they have discussed the problems rather independently. Therefore, although the "RNA world hypothesis" is the mainstream view on the origin of life at present, we firmly believe that our [GADV]-protein world hypothesis explains the origin of life far more reasonably. In addition, we hope that reconsidering the fundamental life systems of genes, the genetic code and proteins from the standpoint of our hypotheses will make it possible to interpret various properties of extant genes and proteins, and even the evolutionary pathways of present metabolism. In fact, studies are in progress in our laboratory to solve these problems on the fundamental systems of present life.