“Without a doubt, this is the most important, most wondrous map ever produced by human kind.” […] “Today we are learning the language in which God created life.”
Bill Clinton, 2000 [1]
That is how vividly Bill Clinton put it back in 2000, when the first full sequencing of a human genome was announced. But does the subject live up to such weighty words? What exactly was made known back then? To answer that, it must first be clear what a genome actually is.
The Structure of the Genome
What is the genome, actually? The genome is the entirety of the genetic information that sits (largely) in our cell nuclei. In fact, it sits in all of our cell nuclei, since every cell has the entire genetic information at its disposal, even if, depending on its task in the body, it only reads certain parts. Our genetic makeup is stored in a special molecule: DNA. But the nuclear genome consists of more than just DNA: proteins contribute a considerable proportion of the weight of the structures that are referred to as chromatin or chromosomes.
The whole thing can be imagined as a ribbon twisted many times over: at the smallest level, the DNA is present as the famous double helix. Two strands, two opposing DNA molecules, form a twisted ladder in which each rung consists of two bases facing each other: adenine (A) and thymine (T) form such a rung via two hydrogen bonds, while guanine (G) and cytosine (C) form one via three hydrogen bonds (a picture and a little more explanation can be found here).
Since only these pairings are possible, the base sequence of one strand, e.g. ACGTGCTGACTTTG, is enough to specify a DNA sequence unambiguously; the opposite strand follows from the pairing rules. This double helix is wrapped around very specific protein complexes, the histones. This DNA-histone thread is in turn twisted around itself several times to ultimately form the well-known chromosomes, which were sketched as thread-like structures in the cell nucleus as early as the 1870s, after observation with simple light microscopes.
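The pairing rules can be written down in a few lines of code. The following is only a minimal sketch: the function names are made up for this illustration, and it uses the fact that the two strands of a double helix run in opposite directions, so the opposite strand is usually read "backwards".

```python
# Pairing rules from the text: A pairs with T, G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary_strand(sequence: str) -> str:
    """Return the opposite strand, base by base, in the same left-to-right order."""
    return "".join(PAIRING[base] for base in sequence)


def reverse_complement(sequence: str) -> str:
    """Return the opposite strand as read in its own direction (antiparallel)."""
    return complementary_strand(sequence)[::-1]


example = "ACGTGCTGACTTTG"  # the example sequence from the text
print(complementary_strand(example))
print(reverse_complement(example))
```

Applying either function twice returns the original sequence, which is just another way of saying that one strand fully determines the other.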
Although it was shown as early as 1910 that these strange, thread-like structures, the chromosomes, had to be the carriers of genetic information, it took a long time until the number of human chromosomes, for example, was stated correctly. After it had been believed for years that humans have 48 chromosomes, a Spanish-Swedish research duo was able to show in 1956 that humans probably have only 46, by significantly improving the technique for creating a so-called karyogram [2]. Karyograms of this type (Figure 2 shows an original image from the 1956 publication) are still made today, for example to detect a change in the number or shape of chromosomes.
The Combination of Two Chromosomes
So humans have 46 chromosomes in all of their body cells, and these come in pairs of homologous chromosomes, that is, chromosomes that carry the same genes (note: not necessarily in the same form). We received one chromosome of each homologous pair from our mother and the other from our father. How a certain gene manifests results either from a blend of the maternal and paternal variants or from the fact that one of the two variants is dominant and the other recessive; that is, the effect of one masks that of the other. But not all people have exactly 23 pairs of chromosomes in all cells (which, by the way, are conveniently numbered according to size). As early as 1959 it was discovered that Down’s syndrome, for example, described in 1862, is caused by a changed total number of chromosomes, namely by chromosome 21 being present three times instead of just twice. Today we know many chromosome aberrations that cause certain clinically and neurologically relevant syndromes.
But anomalies can lead to health problems not only on a large scale, but also on a very small scale. Even the absence or replacement of a single base pair can disrupt the sequence of a gene in such a way that no functional protein can be built from it. In the mutation ΔF508, a gene called Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) is missing only three base pairs (the protein produced from it is therefore missing exactly one amino acid). However, this small change causes the protein to be broken down by the cell’s own clean-up system instead of doing its job in the intended place. Unfortunately, the ΔF508 mutation is very common: around one in 30 Caucasians carries it. If a child inherits the mutation from both parents, it develops cystic fibrosis. CFTR is an ion channel that is involved in regulating the salt concentration and therefore also the fluid balance in the mucous membranes. Because the protein is missing, these patients suffer from viscous mucus on all mucous membranes, which makes them extremely susceptible to infections.
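Why three missing base pairs cost exactly one amino acid can be sketched in code. The DNA is read in groups of three bases (codons), each of which stands for one amino acid. The short fragment below is modeled on the region around position 508 of CFTR, but please treat the exact bases as an illustrative assumption; the codon table only contains the five codons needed here, and the function name is made up for this sketch.

```python
# Minimal excerpt of the standard genetic code (only the codons used below).
# One-letter amino acids: I = isoleucine, F = phenylalanine, G = glycine, V = valine.
CODON_TABLE = {"ATC": "I", "ATT": "I", "TTT": "F", "GGT": "G", "GTT": "V"}


def translate(dna: str) -> str:
    """Translate a coding sequence codon by codon into one-letter amino acids."""
    usable = len(dna) - len(dna) % 3  # ignore an incomplete codon at the end
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


# Illustrative fragment (assumption): five codons, the third one encodes F.
normal = "ATCATCTTTGGTGTT"

# Delete three consecutive bases ("CTT") that straddle two neighboring codons.
start = normal.index("CTT")
mutant = normal[:start] + normal[start + 3:]

print(normal, "->", translate(normal))
print(mutant, "->", translate(mutant))
```

Because exactly three bases are removed, the reading frame of everything downstream stays intact. In this sketch the codon left behind at the deletion site still encodes the same amino acid as before, so the only difference in the protein is the missing F.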
Numbers and Facts (are not Everything)
Besides the CFTR gene, there are many other genes. According to a current version of the human genome, there are exactly 20,363 (Ensembl, human genome assembly GRCh38). This number includes all the so-called coding DNA sequences that are first transcribed into RNA within the cells and then translated into protein. This number caused great astonishment. When biologists first thought about a specific number of human genes in the 1960s, they estimated it to be around 2 million. When the complete sequencing of the human genome was tackled around 1990, the most popular estimates were down to 100,000, and by the time the human genome was decoded, i.e. the time from which the opening quotation comes, they had landed at around 25,000 to 30,000. Since then, the number has been revised downwards only slightly, and it is still being adjusted. So today we assume a little more than 20,000 genes, which means humans have about as many genes as the mouse or the microscopic nematode Caenorhabditis elegans, which consists of only about 1000 cells.
So we had to learn that higher complexity (we like to think of ourselves as a more complex form of life than a roundworm) does not necessarily require a higher number of genes. But what does it require, then? It must be remembered that protein-coding sequences make up less than 2% of the total genome, and the rest, over 98% of the genome, long received little attention. In fact, it was often referred to as “junk DNA”. Today we know that a very high proportion of the genome (perhaps up to 50%) is in fact transcribed into RNA, even if this RNA is not read by a ribosome and thus never acts as a blueprint for a protein. Research into this so-called non-coding RNA is still in its early stages. However, it is already becoming apparent that RNA, which we had for so long reduced to its function as a carrier of information between the DNA in the cell nucleus and the protein production machinery in the cytoplasm, can have an unbelievable variety of biological functions. Research on non-coding RNA is booming, and rightly so.
Same Genome – Different Function
In addition, we have known about the regulatory function of many genome segments for decades. If you keep in mind that every cell in the human body has the entire genetic code in its nucleus, but that different body cells perform enormously different tasks, then you have to ask yourself how this diversity is achieved. A beta cell in the pancreas, for example, mainly produces insulin; that is its job, its contribution to the maintenance of the organism. The fiber cells in the lens of the eye, in turn, are characterized by their unique transparency, a property that they owe to the biochemically highly specialized crystallins. Both insulin and crystallins are proteins; they are made according to a blueprint that is stored in the genome in the form of a specific DNA sequence. So how does a cell “know” which of these sequences (genes) it should read in order to produce the corresponding protein, and which not? The answer lies in a regulatory network consisting of signaling pathways, transcription factors and the regulatory genome segments.
Signals constantly rain down on a cell inside (or outside) an organism. These can be atoms and molecules of various classes. Hormones and neurotransmitters, as well as nutrients and environmental toxins, are well-known examples. In addition, there are many more signaling substances that either act on the cells of an organism or are used by the organism itself to orchestrate the various functions of the cells of different organ systems. These molecules are detected by countless different, highly specific receptors on the cell surface, which change their structure in response. This in turn triggers a chain reaction within the cell in which various cell-specific substances react by forming or breaking bonds with one another. Often these signaling pathways then lead to DNA-modifying proteins acting on the genome and changing it chemically. These changes in the chemical structure of the genome mean that certain areas become very accessible to the transcription machinery and are therefore read more often. Such a more accessible area of the genome is called euchromatin. Conversely, certain parts of the genome can also be temporarily or permanently inactivated; they are packed and tangled so tightly that the proteins that read the DNA can no longer bind to these sections. These sections are called heterochromatin.
If you like, you can think of the genome as a kind of book that describes exactly how to assemble, repair, operate and maintain all parts of the human body. But just as hardly any single person can take on every step in the production or operation of a very complex machine, the individual cell cannot and should not take on all functions of the human body. The regulatory sections ensure that certain cells open the book to very specific pages in order to read exactly what is written there: for example, the instructions for building insulin. The cells of the eye do not open the page with the insulin blueprint, but completely different ones.
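The book analogy can be turned into a small toy model. This is only a sketch under heavy simplification: the class and method names are invented for this illustration, the three gene entries merely stand in for the roughly 20,000 real ones, and real regulation is far more gradual than a simple open/closed switch.

```python
from dataclasses import dataclass, field

# The shared "book": every cell carries all entries. Gene names are illustrative.
GENOME = {
    "INS": "insulin",
    "CRYAA": "crystallin",
    "CFTR": "CFTR ion channel",
}


@dataclass
class Cell:
    name: str
    # Genes whose region is currently accessible (a euchromatin-like state).
    accessible: set = field(default_factory=set)

    def receive_signal(self, opens=(), closes=()):
        """Toy stand-in for a signaling pathway that changes accessibility."""
        self.accessible |= set(opens)
        self.accessible -= set(closes)

    def products(self):
        """Only accessible genes are read; everything else stays packed away."""
        return sorted(GENOME[gene] for gene in self.accessible)


beta_cell = Cell("pancreatic beta cell")
lens_cell = Cell("lens fiber cell")

beta_cell.receive_signal(opens=["INS"])
lens_cell.receive_signal(opens=["CRYAA", "INS"])
lens_cell.receive_signal(closes=["INS"])  # the insulin page is closed again

for cell in (beta_cell, lens_cell):
    print(cell.name, "->", cell.products())
```

Both cells share the identical `GENOME` dictionary; what differs is only which entries are accessible after the signals have arrived.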
“Postgenomics”
If you look at these diverse properties of the genome, you can understand why the decoding of the human genome was accompanied by such an incredible amount of media interest. When two independent versions of the human genome were published simultaneously in February 2001 (the Human Genome Project published a genome in the journal Nature [3] and Celera, a company that only entered the race for the human genome in 1998, published the genome of their boss, Craig Venter, in Science [4]), a scientific revolution appeared to have taken place. But soon the impression arose that a kind of bubble, made up of the enormous expectations of participants and observers, was about to burst. Medical research in the age of “postgenomics” did not seem to differ all that much from earlier medical research.
The term “decoding” promises a lot, but what does it really mean in this context? It means that the entire base sequence of the genome (initially of a single person) has been determined. This is a string of the letters A, C, G and T (the abbreviations for the four bases/nucleotides adenine, cytosine, guanine and thymine) that is 3.2 billion characters long. One has to imagine how dull the immediate result of this research looks: ACGTGTACGGTGACGTTACGTCGATTCAGTC, only a hundred million times longer. Without knowing more about the functions of certain sequence sections and their interactions with the other components of the cell under certain conditions, this sequence does not seem to be of much value at first. But this appearance is deceptive.
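The "hundred million times longer" can be checked with a quick back-of-the-envelope calculation. This is only a rough sketch: the genome length is the rounded figure from the text, and the storage estimate assumes a naive encoding with 2 bits per base, which is an assumption made here for illustration.

```python
GENOME_LENGTH = 3_200_000_000  # bases, rounded figure from the text
sample = "ACGTGTACGGTGACGTTACGTCGATTCAGTC"  # the snippet quoted above

# How many times would the snippet have to be repeated to fill a genome?
ratio = GENOME_LENGTH / len(sample)
print(f"snippet length: {len(sample)} bases")
print(f"the genome is about {ratio:,.0f} times longer")

# Assumption: four possible letters fit into 2 bits per base.
megabytes = GENOME_LENGTH * 2 / 8 / 1_000_000
print(f"roughly {megabytes:,.0f} MB at 2 bits per base")
```

The raw sequence of one genome is therefore a file of quite manageable size; the difficulty lies in interpreting it, not in storing it.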
Imagine that a certain gene is very important for a certain function in the human body; so important that this function can no longer be properly fulfilled if the gene is defective. A person who has this defect has a certain disease that medical professionals have probably known for a long time, as it occurs again and again in the population. Now let us assume we know nothing about this gene (in fact, there are still many genes that we do not know much about, and many diseases whose exact causes we do not know either). Without any prior inkling, we now compare a large number of genomes: a few hundred or a few thousand genomes from people who suffer from this disease and at least as many genomes from people who do not. If we assume that individual differences that have nothing to do with the disease average out, then we may discover one or a few sequence variants whose distribution between the two groups differs with high statistical significance. In research, this procedure is referred to as a genome-wide association study, or GWAS for short. In this way we discover variants at very specific locations in the genome that are associated with a disease.
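The core idea of such a comparison can be sketched for a handful of variants. Everything in this example is hypothetical: the variant names and counts are invented, and a simple chi-square test on a 2x2 table is used as one possible way to measure an uneven distribution. Real studies test hundreds of thousands of variants and need much stricter corrections than shown here.

```python
import math


def chi_square_2x2(case_with, case_without, control_with, control_without):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    a, b, c, d = case_with, case_without, control_with, control_without
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0, 1.0  # a row or column is empty: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # For 1 degree of freedom, the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: (cases with variant, cases without,
#                       controls with variant, controls without)
variants = {
    "variant_A": (300, 700, 150, 850),  # clearly more frequent among patients
    "variant_B": (200, 800, 205, 795),  # about equally frequent in both groups
}

results = {name: chi_square_2x2(*counts) for name, counts in variants.items()}
for name, (chi2, p_value) in sorted(results.items(), key=lambda item: item[1][1]):
    print(f"{name}: chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

A variant like the hypothetical `variant_A` would stand out with a tiny p-value, while `variant_B` would disappear in the noise. Note that such a test only shows an association; whether the variant actually causes the disease is a separate question.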
On the one hand, this can give us valuable information when we want to investigate the causes of the disease: these locations may lie either within certain genes or in the regulatory sections described above that control the activity of certain genes, and can therefore point us to those genes. A closer examination of these genes and of their proper and impaired function will then very likely provide information about the disease. On the other hand, these variants can also help us to predict the onset of a disease (with a certain probability), which matters, for example, when the symptoms only show up from a certain age but early detection and therapy are very valuable. Whether and to what extent people, for example those with a certain personal or family medical history, should be genotyped for this reason (the difference between sequencing and genotyping is explained in the article on genetic engineering) is certainly not easy to answer. Possibilities of this kind always have their downsides, and as a society we should take plenty of time to understand, weigh up, evaluate and, if necessary, regulate such developments and their impact both on individual people and on society itself. This is perhaps the most important task to be tackled in the age of “postgenomics”. Because of the relevance of these topics and the care they require, this blog can and will only touch on them in passing.
[1] Speech of Bill Clinton, 2000, as recorded by The New York Times: https://partners.nytimes.com/library/national/science/062700sci-genome-text.html
[2] Tjio, J. H.; Levan, A. (1956): The chromosome number of man. In: Hereditas 42 (1-2), pp. 1–6. DOI: 10.1111/j.1601-5223.1956.tb03010.x.
[3] Lander, E. S.; Linton, L. M.; Birren, B.; Nusbaum, C.; Zody, M. C.; Baldwin, J. et al. (2001): Initial sequencing and analysis of the human genome. In: Nature 409 (6822), pp. 860–921. DOI: 10.1038/35057062.
[4] Venter, J. C.; Adams, M. D.; Myers, E. W.; Li, P. W.; Mural, R. J.; Sutton, G. G. et al. (2001): The sequence of the human genome. In: Science 291 (5507), pp. 1304–1351. DOI: 10.1126/science.1058040.