DNA sequencing – The Methods that Made us


Every living thing depends on the information stored in its genetic material. DNA, the code passed down through generations and geological time, defines an organism. Within individual cells, DNA is transcribed into RNA, which is then translated into hard-working proteins: DNA is the starting block for cellular development and activity!

Furthermore, the genetic information of every being also tells the story of its evolution, its relatives and its adaptation to the environment.

It’s no wonder that scientists have longed to decode the secrets of the DNA sequence ever since its discovery.

The 1940s and 1950s were exciting times for researchers interested in understanding the underlying mechanisms of biological information storage, transfer and inheritance. Avery, MacLeod and McCarty discovered that DNA is the molecule responsible for storing genetic information and making it available to cellular processes. (Before that, proteins had been the prime suspect.) Then, Rosalind Franklin performed the famous X-ray diffraction studies that led to the description of the DNA double helix structure.

Yet, the exact letter code of the DNA remained a mystery until the late 1970s.

Frederick Sanger improved a process developed in 1970 to create the method that would famously come to be named after him: Sanger sequencing. In short, a mixture of regular nucleotides (A, T, C and G) and “toxic”, chain-terminating nucleotides was used in a DNA polymerase reaction. DNA polymerase synthesises new DNA molecules from nucleotides, while the toxic nucleotides sabotage the process, forcing the polymerase to stop whenever one of them is added. By running the reaction for long enough, and running it separately with the toxic equivalent of each of A, T, C and G, the researchers created DNA strands of different lengths, so that every single position of the code was covered by a specific toxic stop. Then, the researchers sorted all strands by length and read off every A, T, G and C along the strand, manually.
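To make the logic of the method a little more tangible, here is a toy Python sketch of the idea described above. It is not a lab protocol and not how any real software works; the function names are made up for illustration, and it assumes an idealised reaction in which synthesis stops at least once at every position.

```python
# Toy model of chain-termination sequencing (illustrative names only).
# `strand` stands for the newly synthesised DNA strand.

def terminated_fragments(strand, base):
    """Fragments from the reaction that contains the 'toxic' version of `base`.

    Assumption: the reaction runs long enough that synthesis stops at least
    once at every position where `base` occurs.
    """
    return [strand[: i + 1] for i, b in enumerate(strand) if b == base]


def read_sequence(strand):
    """Pool the four reactions, sort by length, and read off the last bases."""
    fragments = []
    for base in "ACGT":
        for fragment in terminated_fragments(strand, base):
            # We only need the fragment length and the base that stopped it.
            fragments.append((len(fragment), base))
    fragments.sort()  # shortest fragment first, like bands on a gel
    return "".join(base for _, base in fragments)


print(read_sequence("GATTACA"))  # prints GATTACA
```

Each fragment length occurs exactly once, in exactly one of the four reactions, which is why sorting by length recovers the order of the letters.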

The earliest genomes sequenced with this new method were those of phages, in 1977. Their genomes are comparatively small, but this still meant sequencing over 5000 base pairs using this time-consuming manual system. It took until 1996 for the first eukaryotic genome to be uncovered, that of baker’s yeast (Saccharomyces cerevisiae), this time with the help of dedicated sequencing machines. Sanger sequencing was still around, but now at least some automation was involved.

Some time later, the first nuclear genomes of multicellular organisms were produced: first those of the worm Caenorhabditis elegans and the fruit fly Drosophila melanogaster, and then, in 2000, that of Arabidopsis thaliana. Our favourite lab rat even preceded the human genome!

When we uncovered these genomes, hopes were high that we could soon bend the genome at will, understand every last process, and make the genome the canvas of our wildest dreams. But knowledge of the DNA sequence alone is not enough. Today, 20 years later, we are still uncovering, bit by bit, the underlying complexity of gene function and the place of each gene in a larger network.

To describe and study the actual gene sequences, we need to annotate the genome, a process that involves looking at a vast sea of As, Ts, Cs and Gs and identifying where one stretch of information (a gene) might start and where it might end. The annotation is the crucial detail that determines how useful a genome is: if we don’t know where a gene begins and where it ends, we can’t deduce a lot about its function. A simple analogy is this: imagine you have a book in an unknown language. Sequencing gives us the letter sequence in the book, and annotation provides the spaces, punctuation and paragraph structure. If we have both, we can start to make sense of the sentences.
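As a very rough illustration of what “finding where a gene starts and ends” can mean, the Python sketch below scans for the simplest possible signal: a start codon (ATG) followed, in steps of three letters, by a stop codon. Real annotation pipelines do far more than this (both strands, introns, experimental evidence), so treat the function name and the `min_codons` cutoff as assumptions made up for this example.

```python
# Naive gene-finding sketch: look for stretches that begin with ATG and run,
# three letters at a time, to the first stop codon. Forward strand only.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of candidate open reading frames.

    `end` is exclusive and includes the stop codon. `min_codons` is an
    arbitrary illustrative cutoff on the number of codons before the stop.
    Nested in-frame ATGs are reported as separate, overlapping candidates.
    """
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break  # the first in-frame stop ends this candidate
    return orfs


dna = "CCATGAAATTTTAGGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

In the book analogy, this is like guessing where sentences are by looking only for capital letters and full stops: a useful first pass, but far from a full reading.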

Once we have an annotated genome, we can look for familiar things. In a bacterium, for example, we can look for known nitrogen fixation genes and learn more about its metabolism without actually having to look deep into its molecular pathways. We can also understand the evolutionary relationships of organisms. The letters of DNA in a genome tend to change at a fairly slow pace. By comparing the number (and type) of differences between one species and another, we can approximate how closely related they are, and even estimate how long ago they might have diverged from a common ancestor.
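The comparison described above can be sketched in a few lines of Python. This is a deliberately simplified “molecular clock” estimate: it assumes the two sequences are already aligned, that changes accumulate at a constant rate, and that no position has changed more than once. The function names and the rate used in the example are hypothetical, not measured values.

```python
def count_differences(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def divergence_time(seq_a, seq_b, rate_per_site_per_year):
    """Rough estimate of the years since two lineages split.

    Uses t = d / (2 * rate), where d is the fraction of differing sites.
    The factor 2 reflects that both lineages have been changing since
    they separated.
    """
    d = count_differences(seq_a, seq_b) / len(seq_a)
    return d / (2 * rate_per_site_per_year)


print(count_differences("GATTACA", "GACTATA"))  # prints 2
# Hypothetical rate of 1e-9 changes per site per year, for illustration only.
print(divergence_time("AAAAAAAAAA", "AAAAAAAAAT", 1e-9))
```

Real analyses use statistical models that correct for repeated changes at the same position and for rates that vary between genes and lineages.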

Without DNA sequencing we wouldn’t know what we know today. Whenever we want to break, insert or tag a gene in a plant (or any other organism), we rely on knowledge of the relevant DNA sequences. During my time in the lab, sending a short stretch of DNA for sequencing was just as common as ordering a chemical or buying something from an online retailer. Today, even whole-genome sequencing, which decodes the entire DNA of an unknown organism, is accessible in terms of both price and time required.

Where the first full genomes were the result of countless ~~hours~~ years of work by a team of researchers, we can now sequence a new organism in a much shorter time and at a fraction of the cost.

New methods depart from the trusted old Sanger sequencing. Next-generation sequencers chop the DNA into lots and lots of small fragments and sequence them in parallel. The sequences are read with a high level of accuracy (fidelity), meaning you know exactly what’s in your DNA, but working out where these small fragments fit into a larger genome can be tricky and requires a lot of computing know-how. The newer kid on the block, nanopore sequencing, is conversely able to read very long stretches of DNA, but tends to make mistakes. Both methods have become so cheap that some researchers can afford to use them in parallel, overcoming the downsides of each individual method.
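To give a feel for the puzzle of fitting short fragments back together, here is a minimal Python sketch of a greedy overlap approach: repeatedly merge the two reads that share the longest end-to-start overlap. Real assemblers use far more sophisticated graph-based methods and have to cope with sequencing errors and repeated regions; the function names and the `min_len` threshold here are assumptions for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge error-free reads by always joining the best-overlapping pair."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; leave the pieces separate
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


print(greedy_assemble(["GATTAC", "TTACAG", "ACAGGT"]))  # prints ['GATTACAGGT']
```

With only three clean reads this works nicely. With millions of short reads and a genome full of repeats, many fragments overlap equally well in several places, which is exactly where long reads become so helpful.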

DNA sequencing made almost all of our molecular research possible. No matter whether I work on RNA, proteins or metabolites, knowing the sequence of the underlying genes is crucial to create knockouts or to clone them into other systems.
