Aaron Rosenfeld / 14 posts / 9 comments / feed / comments feed

Protein Folding Simulation

I wrote this article about a year ago for my old site and decided it was worthy of a re-post.

What is the problem?

Given a sequence of amino-acids (sometimes given as DNA bases which are simply converted to codons and then amino-acids), what is the structure of a protein after a time t? In addition to the general protein-protein forces, the solution in which the protein is folding must be considered. Temperature, pH, and electromagnetic properties all affect the final outcome. In general, however, these other forces are held constant so only protein-protein forces need to be calculated.

General Approach

The most intuitive approach is to re-calculate the protein’s shape at every multiple of some interval of time (called Δh in this case) until the the end condition is reached. That is, starting at t=0 and continue incrementing time by Δh until we arrive at t=tf (when the protein is completely folded).

The next obvious question is, “When is the protein done folding?” It turns out, whenever the protein is in its lowest energy configuration, it will stop adjusting its shape and come to a rest. Although I will omit the full derivation of why this is the case, I will give another example of why this is true.

If a ball is released from rest at a height above the ground (really any height above the centerpoint of the earth but for simplicity I will say above the earth) it naturally falls towards the ground. This is because its (potential) energy is relatively high. As it accelerates towards the ground, it loses this potential energy (gaining kinetic energy). Once it hits the ground (ignoring any rebound), both the kinetic and potential energy equals 0. This is the low energy (rest) state of the ball. It will not move on its own. This is the same for proteins. Once it is in its lowest energy state it will not “want” to change its shape.

Calculation Methods

Method One
The first method of simulating the protein’s folding is to simply determine the force on each atom and update its position from that. The math behind this is relatively simple to understand. All that needs to be done is calculate the force of each atom on every other atom. Using Newton’s basic laws of motion, the force acting on the ith amino-acid is given by:

Equation One

Method Two
Although the previous method works, a (somewhat) easier way to do these calculations ignore the forces of each atom and instead uses the potential energy of the entire system. Letting φi be the potential energy of the system at the ith atom (as a function of all n that cause the potential energy), the force on the ith atom is given by:

Equation Two

As a side note, in case you are not familiar with what a gradient is, a full description of it can be found on Wikipedia but as a brief overview, a gradient is a vector field where each vector points in the direction of the largest increase of whatever is being measured (in this case φi). A gradient can be expanded to be the partial derivative in each direction (in this case x, y, and z). Since this needs to be done numerically in the end, the right side the following uses the value of each derivative at the ith atom:

Equation Three

This again goes back to the energy state discussion. Because the gradient of φ yields the direction of increasing energy, going in the negative direction of the gradient leads towards the lowest energy and an eventual end condition for the protein’s folding.

Energy of molecules

In the previous section φ was used to represent the energy of a system. In this section the actual value of the energy of molecule-molecule interaction will be investigated. From over one hundred years research, a number of different forces between atoms have been determined. Hydrogen-bonds, electrostatic interaction, and Van der Walls attraction are the three most accepted. In addition to these forces, the angle between atoms must be considered along with how much the bonds can stretch, and how much the bonds can rotate along their axis (sometimes called their Dihedral Angle Permittivity). The actual mathematics behind these forces are studied in depth by chemists. Luckily, the potential energy each cause can be reduced to nothing more than algebraic expressions. Note also that all include some sort of coefficient (e.g. angle-bending coefficient, Dihedral Angle Permittivity coefficient, etc.) and will be denoted with a C followed by a subscript. To start, the total potential energy on the atoms in question is given by:

Equation Energy

Expanding each term gives:

Equation Energy Expanded

Numerical Method

The math above looks great on paper but when it comes to implementation, computers don’t natively have “gradient”, “derivative”, or “integral” functions. To solve this, Maclaurin (or Taylor) series expansion is generally used. Again I will not go into detail about what a Taylor or Maclaurin series is, Wikipedia has more information, but as an overview, both series represent a function as the infinite sum of its derivatives at a given point. The official notation for a Maclaurin series is:

Equation Four

Note that for all of the functions in this article, this series actually equals the function itself after an infinite number of repetitions, not approximately equals. With this understanding, it is possible to use an integration algorithm such as the Verlet algorithm to propagate an atom’s position vector from t = t - Δh to time t = t + Δh without needing to know the velocity of the atom. All that would be needed is the current position, time, and acceleration which can be determined from the force. Again, the math of such an algorithm will be omitted as it beyond the scope of this article.

Computational Complexity

Possibly the largest hurdle in protein-folding simulation is the sheer amount of time the calculations discussed take to complete. All of the summations above take a great deal of time. The two most intensive display O(n2) performance. This alone, when considering n is rarely under 100 and usually in the tens-of-thousands, takes a very long time to complete. Nonetheless, these calculations are minimal in comparison to the other option: trying every combination of conformation to find the lowest energy level.

A principle known as the Levinthal Paradox shows just how intensive a process like this is. Because there are three basic conformations for each amino-acid (α-helix, β-pleated sheet, and the random coil), there are exactly 3n conformations possible in a given protein with n amino-acids and a computational complexity of O(3n). To put this in perspective, an amino-acid chain with only 100 amino-acids has 3100 conformations. That is just about 5 * 1047 combinations that would have to be searched. Assuming this was run on a home computer (mind you a very, very high performance home computer) which can calculate 1,000 conformations per second, it would take just about 9.8 * 1031 YEARS. So, as it also seems intuitively, doing the calculations in previous sections is vastly more efficient than a random search of all combinations.

A Brief Case Study

To bring all of this article together, I decided to run the algorithm above on a small protein. I eventually settled on Myoglobin for a few reasons. Myoglobin is used by the body for a number of different functions such as a type of marker of muscle injury and oxygen transportation, so it is readily available to be studied. Additionally, it only contains 153 amino-acids making the calculations relatively timely. It also has some bonding characteristics which are favorable for reducing calculation time.

Running the calculations took 3 hours, 17 minutes, 34 seconds on a 21-processor distributed computing platform. Some of this time was spent doing additional calculations to generate additional data for graphs that will be discussed next, however.

For purposes of this article, at each conformation, the software calculated the overall potential energy of the system which was then plotted vs. the position in z and x*y locations of a certain atom (the exact atom was chosen because it had the least number of bonds to minimize calculations). This method of plotting had to be used because otherwise the plot would expand into a 4th dimension; x, y, z, and energy. An animated graph would be perfect for this, but for demonstration purposes truncating the 3 independent variables seemed acceptable. So to allow all of the data to be shown in a single image x and y were multiplied to give us only three variables: x*y, z, and energy:

Plot One

As the plot shows, at different values of x, y, and z the energy of the system fluctuates. As was stated previously, getting from one conformation to another involves using a gradient to determine which direction will lower the potential energy the most. The following shows this gradient (to be completely correct, it shows the magnitude of the gradient since the gradient itself is a vector valued function). The dark areas show the highest magnitude (which is negative since it is a minimum) and light show the lowest. As a side note, there is an overall gradient towards the very strong black dot (very strong vectors) that is faint and hard to see.

Plot Two

As it should, the highest magnitude gradient is in the location of the global minimum. The software, in a sense, moved from the bottom left of this plot (since this atom can be considered the origin of the system) up, right, left, and down in the direction of the gradient until it hit the minimum when the atom was at its lowest energy state. In the end, as confirmed by Wikipedia, the structure looks like:

Protein Image

What to make of it all

I wrote this article to give others a perspective on how daunting the problem of protein folding is. What I have explained is considered the easiest when it comes to the mathematics. There are alternative approaches using energy levels of atoms, more advanced vector-based equations, and more accurate methods of describing the motion of atoms than has been explored here. But, the basic ideas were covered here. I hope everyone found this article interesting and informative. Any feedback is always appreciated.

Citations

For Article

For Case Study

No comments

Leave a comment