Synthetic biology has reached its adolescence
and it is such an exciting time to be a researcher in the field. What started as a dream of molecular biologists in the 1960s has grown into a bustling, interdisciplinary field full of researchers tackling many of the world's biggest problems in health, the environment, and agriculture.
By definition, synthetic biology is the creation and manipulation of new biological components and systems. In a way, it is a field devoted to learning the “language” of biology with the hope that one day we will be able to “read” and “write” fluently enough to use it to tackle real-world problems.
At first glance, this goal to "speak" biology appears futile. Nature is wildly complicated, and oftentimes irrationally inefficient. However, as we continue to develop tools and document components, we build a foundation for future scientists to push further. In the past 60 years, we've come a long way: from biologists discovering and struggling to engineer DNA to chemical engineering students like me building plasmids and designing proteins on a laptop in a cafe.
The onset of the Digital Revolution and the birth of bioinformatics have given researchers access to large databases of annotated parts (DNA and protein), along with computational tools for in silico design and visualization (Benchling, LatchBio, and AlphaFold). These tools represent a modern extension of efforts begun in the 2000s to standardize and abstract biology. Today I'd like to give some perspective on how synthetic biology started and how I use these computational tools to "speak" biology.
Birth of a discipline
D. Ewen Cameron wrote an eloquent timeline of the history of synthetic biology, which I'll briefly summarize. In the 1960s, scientists began to piece together that the cells they were observing under a microscope were internally regulated by a set of biological circuits in the form of genes; circuits that, like electrical networks, could be rationally assembled to program life. This idea paired well with Francis Crick's central dogma of biology (1957): cells contain genetic information encoded in DNA, which is transcribed into RNA and finally translated into functional molecules known as proteins.
The central dogma of biology from the National Cancer Institute.
Over the next few decades, advancements in molecular cloning tools allowed biologists to begin piecing together and "writing" their own custom DNA. Automated DNA sequencing enabled high-throughput "reading" of DNA, letting biologists sequence and document the entire genomes of microorganisms. By the late 1990s, scientists were envisioning the potential of harnessing biology, and the field had developed enough tools to become accessible to a wider audience. It was at this point that a small group of physicists, engineers, and computer scientists made the brave decision to join the effort to speak the language of biology.
Figure from Shixiu Cui’s “Multilayer Genetic Circuits for Dynamic Regulation of Metabolic Pathways”
In the early 2000s, humans spoke their first "sentences" in DNA and the field of synthetic biology was born. These early "sentences" took the form of rationally designed genetic circuits built in E. coli (a bacterium) that function as biological "logic gates," controlling whether select genes are turned "on" or "off." Shortly after these demonstrations, the world's first international synthetic biology conference, SB1.0, was held at MIT in 2004, where scientists from across disciplines sat down to plan how to systematically learn the "language" of biology. It was here that a sophisticated engineering discipline was created and made accessible to researchers across fields. Two core ideas were drafted that would propel synthetic biology forward over the years:
Parts standardization - publicly characterized components that can be used among the community without continually reinventing the wheel.
Abstraction hierarchies - structuring information in the field to allow anyone to design biological systems without requiring rigorous understanding of implementation details.
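As a toy illustration of abstraction, the genetic "logic gates" mentioned earlier can be reasoned about purely as Boolean functions, hiding the promoter- and repressor-level details underneath. This is my own sketch of the idea, not a model of any particular published circuit:

```python
# Toy abstraction of a transcriptional logic gate: at the top of the
# hierarchy a designer only cares about Boolean behavior, not the
# underlying promoter and repressor sequences.
def not_gate(repressor_present: bool) -> bool:
    """A repressor blocks transcription: output is ON only when input is absent."""
    return not repressor_present

def nor_gate(a: bool, b: bool) -> bool:
    """Either of two repressors silences the promoter (NOR is universal)."""
    return not (a or b)

# Truth table for the NOR gate
for a in (False, True):
    for b in (False, True):
        print(a, b, "->", nor_gate(a, b))
```

Because NOR is a universal gate, in principle any logic function could be composed from circuits like this, which is exactly the kind of layered design that abstraction hierarchies enable.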
With this framework in place, biology was no longer a discipline studied by one small corner of academia, but rather by an eclectic group of researchers approaching it from a wide set of new perspectives. This new sense of direction, coupled with the age of information, has allowed the nascent field to explode over the past twenty years. Further advancements in molecular cloning, large-scale DNA synthesis, and abstracted DNA construction software such as Benchling have enabled "everyday" biologists like me to design, order, and build any piece of DNA from our computers. The question is no longer "how can I make this?" but rather "what should I make?"
The goal of learning the language of biology has shifted from understanding the alphabet (DNA) to learning and forming words (proteins). Currently, our collective vocabulary is limited to the finite set of proteins we have characterized from nature, but to truly become fluent and express ourselves we need to expand that collection. This desire to create novel proteins that interact with biology, tackle problems in human health, and catalyze reactions is the inspiration for the subdiscipline I specialize in…
Protein Engineering
Despite our current ability to generate any protein we want, encoded by DNA we have constructed, the secrets of protein folding and function remain elusive. Proteins are functional molecules built from a chain of amino acids. Their functions are vast and are primarily dictated by their shape: how they "fold" in on themselves based on their linear amino acid sequence. Despite decades of advancements and the characterization of a large number of natural proteins, scientists still have not fully solved this protein folding problem.
Amino acid sequence of a protein from Karen Steward at Technology Networks.
To illustrate how large the problem is, imagine a protein that is 200 amino acids long. There are 20 distinct amino acids, so the total number of combinations (and distinct proteins) that can be created is 20^200, roughly 10^260. This is an astronomically large number: the current estimate of the number of atoms in the observable universe is only on the order of ~10^80, and it gets even crazier when you consider that proteins can be tens of thousands of amino acids long. Nature has therefore sampled an infinitesimally small portion of the total possible protein sequence space, despite continual evolution throughout biological history. In other words, this is not a problem we can brute-force by scouring all possible combinations; it requires thoughtful approaches to narrow the sequence space.
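A quick back-of-the-envelope check of these numbers in Python:

```python
import math

AMINO_ACIDS = 20  # number of standard proteinogenic amino acids

def sequence_space(length: int) -> int:
    """Count the distinct amino acid sequences of a given length."""
    return AMINO_ACIDS ** length

# A modest 200-residue protein already has ~10^260 possible sequences,
# dwarfing the ~10^80 atoms estimated in the observable universe.
print(f"20^200 ~ 10^{math.log10(sequence_space(200)):.0f}")
```

Even at one synthesized protein per atom in the universe, we could never enumerate this space, which is why narrowing it computationally matters.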
In the past, protein engineering was low-throughput, rationally guided design that made small changes to native proteins. Sequence space was manually limited by engineering around native proteins with known structures and functions and documenting the effects of engineered mutations. Without any way to predict protein folding a priori, design relied on educated guesswork and random sampling; a scientist's intuition could only be verified after expressing, purifying, and characterizing a protein, often weeks or months of work.
Nowadays, bioinformaticians and computational protein engineers have developed a vast portfolio of computational tools centered around predicting how a given amino acid sequence will fold or bind to other molecules. While these methods are still limited in accuracy and in the scope of problems they can solve, they offer scientists an in silico way to "pre-screen" a large number of protein variants and effectively narrow the sequence space before wet-lab characterization.
Protein folding energy funnel from Thomas Splettstoesser.
Despite the inherent power computational methods could offer rational protein engineers and "everyday" biologists like me, they have historically been plagued by limited accessibility to non-experts. The theoretical foundation is old: in 1962, Christian Anfinsen hypothesized that protein folding is governed by thermodynamics and that proteins fold into the most stable conformation available in their environment.
Early software, such as Rosetta from the Baker Lab, sought to mimic nature's folding by calculating energy functions for each amino acid and its conformations in order to predict the most stable fold. Due to the sheer number of combinations and calculations, these initial methods were so computationally expensive that they could only be run on supercomputers or clusters, which only companies or computational labs could access. Even with access, a biologist had to navigate the intimidating black box of the command line and successfully orchestrate a run requiring a variety of convoluted inputs.
AlphaFold, a protein folding tool for you and me
The advent of cloud computing and machine learning has sparked a paradigm shift in bioinformatics, putting computational power back into the hands of "everyday" biologists like me. In 2018, DeepMind shocked the protein engineering world with its machine learning-based protein structure prediction software, AlphaFold, at the 13th Critical Assessment of protein Structure Prediction (CASP13) competition. This AI-based method differs from "energy-based" prediction methods: it trains a neural network on characterized protein structures, represented as graphs, and uses that knowledge to predict the fold of an input sequence. AlphaFold dominated CASP13, beating out established "gold standard" methods such as Rosetta, and in 2020 the updated AlphaFold2 swept the competition again at CASP14. Since then, DeepMind has published its methods in Nature and released open-source software along with a searchable database of predicted protein structures.
Q8W3K0: A potential plant disease resistance protein predicted by AlphaFold.
What this means is that biologists now have access to a powerful protein structure prediction tool that requires only an amino acid sequence as input. The open-source code can be built and run on your personal machine, or you can submit individual sequences through a Google Colab notebook.
As a protein engineer (with limited computational experience) who rationally designs and tests individual proteins, I was excited to finally loop computational predictions into my workflow. Most of my work consists of creating fusion proteins: a sort of Frankenstein's monster stitched together from characterized protein domains.
I start by scouring the literature and annotated databases for proteins or domains with a desired function, then brainstorm how to insert them into my fusion protein. Generally, this level of engineering requires a lot of reading, a bit of intuition, and a whole lot of luck for the separate domains to retain their function. The tricky part is finding the correct orientation of and spacing between functional domains, as the linear amino acid sequence does not reveal how the protein will fold in on itself.
A “caged” cytokine I designed, predicted on AlphaFold2 using LatchBio, and visualized in PyMol.
This is where AlphaFold comes in!
By compiling a list of fusion proteins with potential orientations and spacings between protein domains, I can feed their sequences to AlphaFold and visualize the predicted 3D structures. This gives me insight in silico into how feasible my designs are before constructing and testing them in the lab. In essence, I can improve my overall throughput by reducing the sequence space in silico, all without knowing how to code.
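The enumeration step can be sketched in a few lines of Python. The domain and linker sequences below are hypothetical placeholders, not my actual designs:

```python
from itertools import permutations

# Hypothetical placeholder sequences; not sequences from real designs.
domains = {
    "binder": "MKTAYIAKQR",    # assumed binding domain
    "cytokine": "APTSSSTKKT",  # assumed cytokine domain
}
# Flexible glycine-serine linkers of two lengths to vary domain spacing.
linkers = ["GGGGS", "GGGGSGGGGS"]

def fusion_variants(domains: dict, linkers: list) -> dict:
    """Enumerate every domain ordering joined by every candidate linker."""
    variants = {}
    for order in permutations(domains):
        for linker in linkers:
            name = "-".join(order) + f"_{len(linker)}aa"
            variants[name] = linker.join(domains[d] for d in order)
    return variants

# Print FASTA-style records, ready to paste into a structure predictor.
for name, seq in fusion_variants(domains, linkers).items():
    print(f">{name}\n{seq}")
```

With two domains and two linkers this yields four candidates; real design rounds with more domains and linkers grow quickly, which is exactly why batch prediction matters.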
What I found, however, was that queuing multiple runs on the AlphaFold Google Colab was not feasible for the number of constructs I wanted to predict, due to dynamic usage limits, the absence of batch runs, and the lack of data storage. I was still gated from using computational methods to their full potential, and it left something to be desired. It was amid these frustrations of trying to run multiple proteins through Google Colab that I found…
LatchBio, a step towards on-demand compute for biologists
LatchBio is a cloud-based biocomputing company that enables researchers to find and queue workflows like AlphaFold in their browser with a no-code interface. With Latch, I was no longer limited to running a couple of proteins at a time; I could queue and run tens of proteins simultaneously without any computational knowledge. This improved throughput makes it feasible to include a computational screen of every protein I design in my workflow.
The ease and accessibility of the platform strike me as a continuation of the spirit of parts standardization and abstraction from synthetic biology's history. I see bioinformatic tools as the next layer of abstraction in "speaking" biology. To put it simply, biology is too large and complex for humans to decipher on their own. While I could not quickly read a string of DNA and tell you what protein it encodes, a computer can almost instantly. I believe the future of biology will rely heavily on machine learning to extract insight from the data we have collected and to guide how we proceed.
What the future holds
Thirty years from now, advancements in machine learning and computing could deliver a "true" solution to the protein folding problem and enable scientists to predict any protein's structure. The ability to rationally design de novo proteins would have huge implications for targeting disease mechanisms and producing enzymes for bioprocessing. However, the limiting factor will be how easily researchers can implement these programs in their own work.
Just as the founders of synthetic biology laid the framework for abstracting and scaling a burgeoning field in the 2000s, I believe open biocomputing companies like Latch are pioneering powerful platforms at a pivotal moment: platforms that standardize workflows, abstract computation, and scale to a broader audience. It is with this foundation that I believe we will continue to learn to "speak" biology, and I am thrilled to be one of the early orators.