Bachelor Thesis

Like-Charge Attraction in DNA
Michael Kuron
October 2, 2013

Supervisor: JP Dr. Axel Arnold
Institute for Computational Physics, University of Stuttgart

Revision v1.0.2

The image on the title page shows a snapshot from a simulation of two DNA molecules, modeled as helices and surrounded by counterions, at a low temperature.

ABSTRACT

Charged polymers such as DNA attract each other in solution under certain circumstances, even though their charges have the same sign. This electrostatic attraction is mediated by the counterions that are required for charge neutrality and that preferentially arrange in the plane between the polymers. Poisson-Boltzmann theory cannot explain this behavior, but it can be observed in simulations and studied experimentally. A common model for easily simulating such effects is to treat the DNA as a line of constant charge density; the interaction of DNA and counterions can then be computed with little effort. To test the validity of this simplification, this thesis replaces the line charges with discrete charges at the same average line charge density. These charges are then arranged in different structures to investigate how the discretization itself, a relative shift of the two polymers, and a helical arrangement of the charges each affect the results.

Since such a system of two parallel polymers is periodic only along the polymer axis, common electrostatics algorithms such as P3M cannot be applied, as they require periodic boundary conditions in all spatial directions. Instead, the MMM1D method is used here. It scales as O(N²) and therefore appears, at first sight, unsuitable for systems with many particles. As part of this thesis, however, the algorithm is ported to a graphics card using the CUDA framework: in the context of GPGPU computing (general-purpose computing on graphics processing units), modern graphics cards can operate as massively parallel vector computers. The molecular dynamics simulation package ESPResSo supports simulations with partially periodic boundary conditions and already contains an implementation of the MMM1D algorithm for conventional CPUs. Using the infrastructure provided by ESPResSo, a CUDA port of MMM1D is implemented that computes up to 40 times faster than the existing CPU version.

When simulating the aforementioned like-charge attraction of DNA, it turns out that the previously used continuous line charges are a very strong simplification. For comparison, simulations are first performed with continuously charged rods as a model for the DNA molecules, surrounded by counterions. In the next step, the rods are replaced by lines with different numbers of discrete fixed charges, keeping the previous line charge density. If the number of counterions is only slightly smaller than the number of charges per DNA molecule, markedly different forces on the DNA result, which can additionally vary by up to an order of magnitude depending on the relative shift of the two strands. These differences are due exclusively to the discretization of the electrostatic interaction partners; the influence of discretizing the Lennard-Jones interaction, which is also present, is minimal. Finally, a simulation is performed in which the charges of the DNA molecules are arranged not linearly but helically, with the helix geometry corresponding to that of natural B-DNA. It turns out that for a phase shift of 180° between the two helices, the result for the continuous rods is reproduced at least at low temperatures, whereas at higher temperatures or smaller phase shifts it deviates strongly from it. Nevertheless, in all cases, independently of the geometry, an attraction between like-charged particles, mediated by the counterions, is observed.

CONTENTS

1 Introduction
2 Theory
  2.1 The MMM1D Method
  2.2 DNA Like-Charge Attraction
    2.2.1 Poisson-Boltzmann Theory
    2.2.2 Strong Coupling Theory
  2.3 Electrostatic Interaction of Charged Rods
3 Implementation
  3.1 General-Purpose Computing on Graphics Processors
    3.1.1 GPGPU Architecture
    3.1.2 GPGPU Interfaces and Frameworks
    3.1.3 CUDA Programming
  3.2 ESPResSo
  3.3 MMM1D on GPU
    3.3.1 Suitability of MMM1D for GPGPU Implementation
    3.3.2 Implementation
    3.3.3 Performance
4 Simulation
  4.1 Rods with Continuous Charges and Lennard-Jones Interactions
  4.2 Lines of Discrete Charged Particles with Lennard-Jones Interactions
  4.3 Lines of Discrete Charged Particles and Cylinders with Continuous Lennard-Jones Interactions
  4.4 Lines of Discrete Charged Particles with Lennard-Jones Interactions and a Phase Shift
  4.5 Helices Consisting of Discrete Charged Particles with Lennard-Jones Interactions
5 Conclusion
A Appendix
  A.1 Implementation of MMM1D on CUDA
  A.2 Simulation Script
Bibliography

1 INTRODUCTION

Charged polymers like DNA attract each other under certain circumstances when they are placed in solution with counterions. This happens despite the polymers' charges having the same sign; the attraction is conveyed by the counterions, which tend to arrange in the plane between the polymers. Poisson-Boltzmann theory does not explain such behavior, but it can be observed in simulation and experiment.

A common model for simulating such effects is viewing a DNA molecule as a rod with constant line charge density. This makes the calculation of interactions between DNA and counterions quite easy. To determine the impact of this simplification, in this thesis the rods will be replaced with discrete charges, maintaining the same average line charge density. These charges will be arranged in different structures to determine how the discretization itself and a relative shift of the two DNA strands affect the force between them. Finally, the charges will be arranged in a helical geometry to bring this simple model as close to real DNA as possible.

As such a system of two parallel polymers only has one axis of symmetry, along the polymer axis, common electrostatics algorithms like P3M cannot be used since they require periodic boundary conditions along all three directions in space. Instead, the MMM1D method shall be used. It scales like O(N²) and is therefore, at first sight, not well suited for systems with large numbers of particles. As part of this thesis, the algorithm will be ported to a graphics card (GPU) using the CUDA¹ framework.

GPUs, whose original purpose was to convert data into pixel colors on a screen, are essentially massively parallel vector computers. Only in recent years have GPU vendors made available interfaces that give easy access to these vast computing resources for custom (non-graphics) calculations, a technique often referred to as GPGPU (general-purpose GPU) computing. The computational science community has been quick to adopt this new technology. Even in the quickly developing computing industry, sudden 30-fold performance jumps (comparing the raw number of floating-point operations per second between high-end GPUs and CPUs) are very unusual. As long as the algorithm to be implemented meets certain criteria (since the underlying hardware architecture is still optimized for the needs of graphics), the actual performance gain may be even bigger, as GPUs also sport very fast access to their integrated memory. While GPGPUs were originally only a byproduct of graphics card development for computer gaming and other 3D graphics applications, more recently special versions for high-performance computing have been made available that e.g. ship with large amounts of integrated memory and allow for fast double-precision floating-point arithmetic, neither of which is needed for pure graphics applications.


The molecular dynamics simulation package ESPResSo² is capable of performing simulations with the required partially periodic boundary conditions and already contains an implementation of the MMM1D algorithm for conventional CPUs. Using the infrastructure provided by ESPResSo, MMM1D is re-implemented using CUDA. This results in a massive performance gain.

In the first part of this thesis (chapter 2), the MMM1D method is explained and the theoretical background of DNA like-charge attraction is summarized. Chapter 3 details the algorithmic requirements for high-performance GPGPU implementations, gives an introduction to ESPResSo and explains the CUDA MMM1D implementation. Chapter 4 contains the results of the different simulations and discusses their implications.

¹ Nvidia CUDA, https://developer.nvidia.com/cuda
² ESPResSo, Extensible Simulation Package for the Research on Soft Matter, http://espressomd.org


2 THEORY

2.1 The MMM1D Method

There are various approaches to the calculation of electrostatic interactions in periodic and partially periodic systems. The trouble here is that the Coulomb potential decays like r⁻¹: the simple assumption that only interactions between closely spaced particles deliver significant contributions, so that the potential could be treated as zero beyond a certain cutoff distance r_c, can induce serious errors, because the estimate for the contribution of the tail beyond that distance does not converge for potentials that decay more slowly than r⁻³ [7, p. 291]. Common algorithms that correctly treat the long-range contribution, e.g. P3M, are not applicable to the systems we are interested in because those algorithms require periodicity in all three spatial directions.

For systems in which the simulation box is periodically replicated in only one direction (here the coordinate system is chosen such that this direction corresponds to the z-axis), the general form of the Coulomb energy is

$$\phi = \frac{1}{2} \sum_{k=-\infty}^{\infty} {\sum_{i,j}}' \frac{q_i q_j}{\left|\mathbf{r}_{ij} + k\lambda_z \hat{e}_z\right|} \tag{1}$$

where r_ij is the distance vector between particles i and j, q_i and q_j are these particles' charges, λ_z is the length of the simulation box along the z-axis, and ê_z is the unit vector along that axis. The prime on the second sum denotes that for k = 0, the contribution for i = j is omitted.

The MMM1D method ([2], [5]) applies a convergence factor e^{-βr_k}, uses the Poisson summation formula (F denotes Fourier transformation), and takes the limit β → 0:

$$\begin{aligned}
\frac{\phi_{ij}}{q_i q_j} &= \sum_{k=-\infty}^{\infty} \frac{1}{r_k}
= \lim_{\beta\to 0} \sum_{k=-\infty}^{\infty} \frac{e^{-\beta r_k}}{r_k}
= \lim_{\beta\to 0} \sum_{p=-\infty}^{\infty} \mathcal{F}\!\left(\frac{e^{-\beta r_k}}{r_k}\right)\!(p) \\
&= \lim_{\beta\to 0} \frac{2}{\lambda_z} \sum_{p=-\infty}^{\infty} K_0\!\left(\sqrt{\beta^2 + \left(\tfrac{2\pi p}{\lambda_z}\right)^2}\;\rho\right) e^{\frac{2\pi i p z}{\lambda_z}} \\
&= \frac{4}{\lambda_z} \sum_{p > 0} K_0\!\left(\frac{2\pi p \rho}{\lambda_z}\right) \cos\!\left(\frac{2\pi p z}{\lambda_z}\right)
+ \lim_{\beta\to 0}\left( -\frac{2}{\lambda_z}\ln(\lambda_z\beta) - \frac{2}{\lambda_z}\ln\!\left(\frac{\rho}{2\lambda_z}\right) - \frac{2\gamma}{\lambda_z} + \mathcal{O}(\beta) \right)
\end{aligned}$$

This sum converges well for large ρ. The (2/λ_z) ln(λ_z β) term cancels with the self energy:

$$\frac{\phi_{\text{self}}}{q_i q_j} = \sum_{k \neq 0} \frac{e^{-\beta|k\lambda_z|}}{|k\lambda_z|} = -\frac{2}{\lambda_z}\ln\!\left(1 - e^{-\beta\lambda_z}\right) = -\frac{2}{\lambda_z}\ln(\beta\lambda_z) + \mathcal{O}(\beta)$$

For a proof of the equivalence of the convergence factor approach with the original sum and for details on the derivation of a different formula that converges well for small ρ, see [2]. This leaves us with the energy contribution of each pair i, j rewritten as

for ρ > 0:

$$\phi_{ij}(\rho, z) = q_i q_j \left( \frac{4}{\lambda_z} \sum_{p > 0} K_0\!\left(\frac{2\pi p \rho}{\lambda_z}\right) \cos\!\left(\frac{2\pi p z}{\lambda_z}\right) - \frac{2}{\lambda_z}\ln\!\left(\frac{\rho}{2\lambda_z}\right) - \frac{2\gamma}{\lambda_z} \right) \tag{2}$$

for ρ < (N_ψ − 1/2)λ_z and (ρ, z) ≠ 0:

$$\begin{aligned}
\phi_{ij}(\rho, z) = q_i q_j \Bigg( &-\frac{1}{\lambda_z} \sum_{n \geq 0} \binom{-\frac{1}{2}}{n} \frac{\psi^{(2n)}\!\left(N_\psi + \frac{z}{\lambda_z}\right) + \psi^{(2n)}\!\left(N_\psi - \frac{z}{\lambda_z}\right)}{(2n)!} \left(\frac{\rho}{\lambda_z}\right)^{2n} \\
&+ \sum_{k=0}^{N_\psi - 1} \left( \frac{1}{r_k} + \frac{1}{r_{-k}} \right) - \frac{1}{r_0} - \frac{2\gamma}{\lambda_z} \Bigg)
\end{aligned} \tag{3}$$

Here, ρ is the magnitude of the projection of r_ij onto the xy-plane and z is the z-component of r_ij. r_k is shorthand for the distance between particle i in the primary box and particle j in the kth image box, i.e. r_k = |r_ij + kλ_z ê_z|. K_n denotes the modified Bessel function of the second kind of order n, while ψ^(n) denotes the polygamma function of order n. N_ψ is a parameter that may be tuned for precision and computational speed.

An important criterion for the choice of an interaction algorithm for simulations is the availability of estimates of the error introduced by truncating the infinite to a finite series. MMM1D offers such estimates. For the far formula (2), [2] gives the upper bound on the pairwise energy error, if the sum is cut off at p = P, as

$$\left|\phi_{ij}(\rho, z) - \phi_{ij,P}(\rho, z)\right| \leq \frac{4 q_i q_j}{\lambda_z}\, K_0\!\left(\frac{2\pi P \rho_{\min}}{\lambda_z}\right) \frac{\lambda_z \exp\!\left(\frac{2\pi\rho_{\min}}{\lambda_z}\right)}{2\pi\rho_{\min}} \tag{4}$$

ρ_min is the distance in the xy-plane above which the far formula is used; below it, the near formula is used. For the near formula, [2] fixes N_ψ = 2, requires ρ_min < λ_z, and uses the Leibniz criterion to abort the summation when the summand gets smaller than the desired error bound.

As usual, the forces are obtained from (2) and (3) by differentiation:

for ρ > 0:

$$\begin{aligned}
\mathbf{F}_{ij}(\rho, z) = {}& q_i q_j \left( \frac{8\pi}{\lambda_z^2} \sum_{p > 0} p\, K_1\!\left(\frac{2\pi p \rho}{\lambda_z}\right) \cos\!\left(\frac{2\pi p z}{\lambda_z}\right) + \frac{2}{\rho\lambda_z} \right) \hat{e}_\rho \\
& + q_i q_j\, \frac{8\pi}{\lambda_z^2} \sum_{p > 0} p\, K_0\!\left(\frac{2\pi p \rho}{\lambda_z}\right) \sin\!\left(\frac{2\pi p z}{\lambda_z}\right) \hat{e}_z
\end{aligned} \tag{5}$$

for ρ < (N_ψ − 1/2)λ_z and (ρ, z) ≠ 0:

$$\begin{aligned}
\mathbf{F}_{ij}(\rho, z) = {}& q_i q_j \Bigg( \frac{1}{\lambda_z^2} \sum_{n \geq 0} \binom{-\frac{1}{2}}{n} \frac{\psi^{(2n)}\!\left(N_\psi + \frac{z}{\lambda_z}\right) + \psi^{(2n)}\!\left(N_\psi - \frac{z}{\lambda_z}\right)}{(2n)!}\, 2n \left(\frac{\rho}{\lambda_z}\right)^{2n-1} \\
& + \sum_{k=0}^{N_\psi - 1} \left( \frac{\rho}{r_k^3} + \frac{\rho}{r_{-k}^3} \right) - \frac{\rho}{r_0^3} \Bigg) \hat{e}_\rho \\
& + q_i q_j \Bigg( \frac{1}{\lambda_z^2} \sum_{n \geq 0} \binom{-\frac{1}{2}}{n} \frac{\psi^{(2n+1)}\!\left(N_\psi + \frac{z}{\lambda_z}\right) - \psi^{(2n+1)}\!\left(N_\psi - \frac{z}{\lambda_z}\right)}{(2n)!} \left(\frac{\rho}{\lambda_z}\right)^{2n} \\
& + \sum_{k=0}^{N_\psi - 1} \left( \frac{z + k\lambda_z}{r_k^3} + \frac{z - k\lambda_z}{r_{-k}^3} \right) - \frac{z}{r_0^3} \Bigg) \hat{e}_z
\end{aligned} \tag{6}$$

ê_ρ is a unit vector along the xy-projection of the vector connecting particles i and j, while ê_z is a unit vector in the z-direction.
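To make the interplay of the far formula (2) and its error bound (4) concrete, the following C++ sketch (our own illustration, not code from ESPResSo; it assumes C++17's std::cyl_bessel_k and the reduced units of the formulas above) evaluates the far-formula pair energy, choosing the cutoff P from the error bound:

#include <cmath>

// Hedged sketch: far-formula pair energy, equation (2), with the cutoff P
// chosen from the error bound (4). Not the ESPResSo implementation.
double mmm1d_far_energy(double qi, double qj, double rho, double z,
                        double lambda_z, double rho_min, double tol)
{
    const double pi = std::acos(-1.0);
    const double gamma_euler = 0.5772156649015329;  // Euler-Mascheroni constant

    // smallest P for which the bound (4) drops below the desired tolerance
    int P = 1;
    while (4.0 * std::fabs(qi * qj) / lambda_z
               * std::cyl_bessel_k(0.0, 2.0 * pi * P * rho_min / lambda_z)
               * lambda_z * std::exp(2.0 * pi * rho_min / lambda_z)
               / (2.0 * pi * rho_min)
           > tol)
        ++P;

    // truncated Bessel sum of equation (2), p = 1 ... P
    double sum = 0.0;
    for (int p = 1; p <= P; ++p)
        sum += std::cyl_bessel_k(0.0, 2.0 * pi * p * rho / lambda_z)
             * std::cos(2.0 * pi * p * z / lambda_z);

    return qi * qj * (4.0 / lambda_z * sum
                      - 2.0 / lambda_z * std::log(rho / (2.0 * lambda_z))
                      - 2.0 * gamma_euler / lambda_z);
}

In the actual implementation the Bessel functions are not called from a library but evaluated from tabulated Taylor coefficients, as described in subsection 3.3.2.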

2.2 DNA Like-Charge Attraction

Since the geometry of real DNA, which will be simulated in chapter 4, is too complex for a complete theoretical treatment, the DNA molecules are treated as stiff charged rods in a solution with oppositely charged counterions ([3]). At high temperatures, such a system can be studied through Poisson-Boltzmann Theory. At low temperatures, one needs to resort to the Strong Coupling Theory devised by A. Moreira and R. Netz ([11]). Intermediate temperatures, however, show a non-trivial temperature dependence and can only be studied in simulations.

A common measure for the strength of the electrostatic interaction is the Bjerrum length l_B, which gives the distance between two particles carrying the elementary charge at which the strength of their electrostatic interaction is equal to the thermal energy:

$$l_B = \frac{e^2}{4\pi\varepsilon_0\varepsilon_r k_B T} \tag{7}$$

Here, e is the elementary charge, k_B is Boltzmann's constant and T is the absolute temperature. ε_0 is the vacuum permittivity and ε_r is the relative permittivity of the medium.
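As a quick numerical check of equation (7) (our own example, not from the thesis; CODATA constants and ε_r ≈ 78 for water are assumed), the Bjerrum length of water at room temperature comes out close to the familiar 0.7 nm:

#include <cmath>
#include <cstdio>

int main()
{
    const double e  = 1.602176634e-19;   // elementary charge (C)
    const double e0 = 8.8541878128e-12;  // vacuum permittivity (F/m)
    const double kB = 1.380649e-23;      // Boltzmann constant (J/K)
    const double er = 78.4;              // relative permittivity of water near 298 K
    const double T  = 298.0;             // absolute temperature (K)
    const double pi = std::acos(-1.0);
    double lB = e * e / (4.0 * pi * e0 * er * kB * T);  // equation (7)
    std::printf("l_B = %.3g nm\n", lB * 1e9);           // approximately 0.71 nm
    return 0;
}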

2.2.1 Poisson-Boltzmann Theory

The central approximation used in Poisson-Boltzmann Theory is that the individual ions do not interact with each other, just with the overall ion density. This leads to the Poisson-Boltzmann equation ([1, sec. 7.1]), which describes the electrostatic potential ψ caused by the interaction of a fixed charge density ρ_f(r) and mobile ions with density ρ_0:

$$\nabla^2 \psi(\mathbf{r}) = 2\pi\Xi\rho_0 \exp\!\left(-\frac{2\pi q \sigma_S e^2\, \psi(\mathbf{r})}{k_B T}\right) - \rho_f(\mathbf{r}) \tag{8}$$

where q is the ion valency and σ_S the surface charge density of the fixed charge distribution in terms of the elementary charge e; either q or σ_S is negative. k_B is the Boltzmann constant, T the absolute temperature, and Ξ = −2πq²l_B²σ_S is the coupling parameter, which determines the strength of the ion-ion coupling.

For counterions around a single charged rod, the solution of the Poisson-Boltzmann equation is quite simple as it becomes effectively one-dimensional ([1, sec. 7.1]). Even for two cylinders (resulting in an effectively two-dimensional problem), an analytical expression for the force acting on the rods can be obtained ([13, 8]). [6] studies the force per unit length acting on the rods as a function of the Bjerrum length l_B when the rods are kept at a fixed axial distance D = 2R + Δ. With τ the line charge density of the rods,

$$\frac{F}{k_B T} = \frac{2 l_B \tau^2}{D} \tag{9}$$

This force is repulsive and, as it is derived from Poisson-Boltzmann theory, is only valid for l_B → 0 (i.e. T → ∞). At l_B = 0, the force vanishes, since l_B = 0 corresponds to a complete lack of electrostatic interaction.


2.2.2 Strong Coupling Theory

While in Poisson-Boltzmann theory the coupling between the ions is zero, Strong Coupling Theory describes the opposite limit. For this purpose, the free energy of the system is expanded in terms of inverse powers of the coupling parameter Ξ as given before ([11]). The strong-coupling limit is taken by discarding all but the leading order. To obtain the free energy of the system in terms of the rod radius R and the rods' axial distance D = 2R + Δ, different limits of both D and the so-called Manning parameter ξ = l_B qτ are taken ([12]). The Manning parameter determines the strength of the ion-rod interaction. For large D (i.e. Δ ≫ R), the leading order of the free energy per unit length becomes

for ξ < 1:

$$F_1 = -2\xi(2 - 3\xi)\ln\tilde{D} \tag{10}$$

for ξ > 1:

$$F_1 = 2\xi^2\ln\tilde{D} \tag{11}$$

where $\tilde{D} = \frac{D\, l_B q\tau}{R}$ and q is the valency of the ions. So for large D and ξ > 2/3, the force is attractive. For small D (i.e. D ≈ 2R), the free energy becomes for ξ ≫ 1:

$$F_1 = 6\xi^2\ln\tilde{D} - 2\xi\ln(\tilde{D} - 2\xi) \tag{12}$$

The force per unit length acting between the rods is then obtained as

$$F = -\frac{1}{k_B T}\frac{\partial F_1}{\partial D} \tag{13}$$

2.3 Electrostatic Interaction of Charged Rods

In section 4.1, we will need to analytically determine the electrostatic forces between two charged rods. To calculate the force acting between a particle of charge q at position r and an infinitely long charged rod of line charge density τ along the z-axis, the line charge density first needs to be converted into a spatial charge density ρ(r) = α(r)δ(r):

$$Q_{\text{rod}} \overset{!}{=} \tau\lambda_z = \int_0^{\lambda_z}\mathrm{d}z \int_0^{2\pi}\mathrm{d}\varphi \int_0^{R} r\,\mathrm{d}r\,\rho(r) = 2\pi\lambda_z \int_0^{R}\rho(r)\,r\,\mathrm{d}r = 2\pi\lambda_z \int_0^{R}\alpha(r)\delta(r)\,r\,\mathrm{d}r$$

This integral is non-zero only if α(r) = a/r:

$$Q_{\text{rod}} = 2\pi\lambda_z \int_0^{R}\frac{a}{r}\,\delta(r)\,r\,\mathrm{d}r = \pi\lambda_z a \quad\Rightarrow\quad a = \frac{\tau}{\pi} \quad\Rightarrow\quad \rho(r) = \frac{\tau}{\pi}\,\frac{\delta(r)}{r}$$

Now the force can be calculated (q and τ are in terms of the elementary charge e):

$$\begin{aligned}
\frac{\mathbf{F}_{\text{rod}}}{k_B T} &= q\,l_B \int \mathrm{d}^3r'\,\rho(r')\,\frac{\mathbf{r}-\mathbf{r}'}{|\mathbf{r}-\mathbf{r}'|^3} \\
&= \frac{q\tau l_B}{\pi} \int_{-\infty}^{\infty}\mathrm{d}z' \int_0^{2\pi}\mathrm{d}\varphi' \int_0^{\infty}\mathrm{d}r'\,\delta(r')\,\frac{\mathbf{r}-\mathbf{r}'}{|\mathbf{r}-\mathbf{r}'|^3} \\
&= \frac{q\tau l_B}{2\pi}\,2\pi \int_{-\infty}^{\infty}\mathrm{d}z'\,\frac{r\,\hat{e}_r + (z-z')\,\hat{e}_z}{\left(r^2 + (z-z')^2\right)^{3/2}} \\
&= \frac{2q\tau l_B}{r}\,\hat{e}_r
\end{aligned} \tag{14}$$

This can again be integrated to get the force acting between two infinitely long charged rods of line charge densities τ and τ' (again in terms of e):

$$\frac{\mathbf{F}_{\text{rod,rod}}}{k_B T} = \int \frac{1}{k_B T}\,\frac{\partial\mathbf{F}_{\text{rod}}}{\partial q}\,\mathrm{d}q = l_B\,\hat{e}_r \int_0^{\lambda_z}\frac{2\tau\tau'}{r}\,\mathrm{d}z = \frac{2\tau\tau' l_B\lambda_z}{r}\,\hat{e}_r \tag{15}$$
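The 2qτl_B/r result of equation (14) can be checked numerically. The sketch below (our own illustration, in reduced units with the force given per k_B T, and with arbitrary test values) integrates the field of a long but finite rod and compares it with the analytic expression:

#include <cmath>
#include <cstdio>

int main()
{
    const double q = 1.0, tau = -0.5, lB = 2.0, r = 3.0;  // arbitrary test values
    // radial force (per k_B T) on a charge q at distance r from a rod of line
    // charge density tau, integrated numerically over a long but finite rod
    double F_num = 0.0;
    const double zmax = 1e4, dz = 0.01;
    for (double zp = -zmax; zp < zmax; zp += dz)
        F_num += q * tau * lB * r / std::pow(r * r + zp * zp, 1.5) * dz;
    double F_ana = 2.0 * q * tau * lB / r;  // equation (14)
    std::printf("numeric %.6f vs analytic %.6f\n", F_num, F_ana);
    return 0;
}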


3 IMPLEMENTATION

3.1 General-Purpose Computing on Graphics Processors

For decades, the main purpose of graphics processing units (GPUs) in computers was to create images, i.e. to determine the color each pixel on the screen should have based on data computed by the central processing unit (CPU). Since different parts of an image often do not depend on each other, such a GPU usually consists of many hundreds or even thousands of processors. Each of these processors is significantly slower than a standard CPU, but their sheer number allows a GPU to perform at least an order of magnitude more floating-point operations per second than a similarly priced CPU.

These vast computing resources have only recently been tapped for applications other than graphics processing; this is called GPGPU (general-purpose GPU) computing. In GPGPU computing, the CPU is still in charge of program execution, while the GPU merely acts as an accelerator onto which the CPU offloads certain operations and program parts.

3.1.1 GPGPU Architecture

While modern GPUs are general-purpose processors, there are some important differences compared to traditional CPUs that need to be taken into account when implementing algorithms on GPGPUs. The connection between GPU and CPU consists of the PCI Express bus, which is quite slow compared to the memory bandwidth available within the GPU. Therefore, GPGPU computation often only makes sense if there is little data to be copied back and forth between CPU and GPU and intensive computations are to be performed on the GPU. Another aspect that qualifies an algorithm for GPGPU implementation is its ability to be parallelized. If every step requires the result of the previous step, i.e. the algorithm is completely serial, it should be run on a CPU with high single-thread performance. If however the algorithm can be broken down into many individual steps that do not depend on each other, it is suitable for the massively parallel architecture of a GPU.

GPGPUs do have facilities for serializing certain operations. For example, thread barriers are available which make sure that all threads in a block of threads have reached a certain point in code execution before allowing the threads to move beyond this point. Another feature offered by modern GPUs is so-called atomic operations, which ensure that competing write operations by multiple threads are executed serially. For example, atomic additions add a number onto a variable stored in memory while making sure that between reading the old value and writing the new value, no other thread writes to the same location. This comes with a significant performance penalty, so atomic operations should be used sparingly unless collisions are unlikely (e.g. if the number of memory locations the threads might be writing to is much larger than the number of threads executed concurrently). In all other cases, schemes like binary sum reductions might be preferable. Here, each of N threads first stores its intermediate result at a location i, matching its thread number, in a common array. Then, the first N/2 threads each add the value from location i + N/2 to their own value. This is repeated with the first N/4, N/8, etc. threads until a single value is obtained.

A special characteristic of the hardware architecture of GPUs needs to be taken into account: the hundreds of processing cores do not operate individually. Instead, a block of several cores has a common control unit, so the same operations run in lockstep on each of these cores, with each core having a different piece of input data. This scheme is called SIMD (single instruction, multiple data) and such processors may be referred to as vector processors. Modern CPUs also contain SIMD units, but their width (the number of scalars they operate on simultaneously) is much smaller, and CPUs still contain high-performance scalar processing units. To make full use of SIMD, it is desirable for the program not to branch into different execution paths based on input data. If branch divergence is inevitable, as much computation as possible should be performed in the common execution path.

Programs running on the GPU not only have access to these powerful computational capabilities, but also have a high-speed connection to what is called global memory, a dedicated set of memory chips on the GPU card. Memory accesses, however, come with a rather high latency. To hide these latencies, the number of threads running simultaneously on a GPU should be significantly larger than the number of threads the hardware is capable of executing in parallel. That way, the GPU can interleave multiple blocks of threads such that one block is executed while the others wait for data from global memory. Even registers come with a limitation: data cannot be read from a register for several cycles after it is written there, although this latency is much shorter than the one for global memory access. Other memory areas accessible to GPGPU programs include constant memory, which allows low-latency access to small amounts of immutable data, and shared memory, which can be used as fast cache memory shared by all threads in one block.
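As an illustration of the binary sum reduction described above, a minimal CUDA kernel (our own sketch, not code from the thesis; it assumes the number of threads per block is a power of two) producing one partial sum per block could look like this:

__global__ void blockSumKernel(const float *in, float *blockSums, int N)
{
    extern __shared__ float partial[];               // one slot per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[tid] = (i < N) ? in[i] : 0.0f;           // each thread loads one value
    __syncthreads();
    // halve the number of active threads in every step
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                             // barrier before the next step
    }
    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];          // one partial sum per block
}

It would be launched with a dynamic shared-memory size of numThreads * sizeof(float); the per-block results can then be combined by a second, much smaller kernel or on the CPU.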

3.1.2 GPGPU Interfaces and Frameworks

There are several interfaces available for accessing GPUs for GPGPU computing. The oldest and most mature one is CUDA by Nvidia, first introduced in 2007. It makes the fullest use of the capabilities of Nvidia GPUs, but it does not support GPUs from any other manufacturer. Programs are usually written in a slightly limited C++ dialect, detailed below. Around 2009, OpenCL was created. It follows a similar programming scheme as CUDA and is an open standard supported by all major GPU vendors and even available on some other kinds of hardware. In 2011, the OpenACC standard was published. It takes a different approach to GPGPU programming by allowing the developer to annotate standard C++ code and automatically distributing work to the GPU based on these annotations. It is also an open standard, but it is still very much in development. We use CUDA here as it is by far the most mature and easy-to-use GPGPU API.

3.1.3 CUDA Programming

At the heart of all CUDA programs are CUDA kernels. A kernel is simply a program that runs on a GPU and is declared using the __global__ keyword. An example that takes the inverse square root (rsqrt) of all elements of an array a and stores the result into b is shown below.

__global__ void rsqrtKernel(float *a, float *b, int N)
{
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x; tid < N;
         tid += blockDim.x * gridDim.x) {
        b[tid] = rsqrt(a[tid]);
    }
}

threadIdx and blockIdx allow each parallel instance of the kernel to determine its individual thread ID. A kernel is called from a normal C program using a slightly modified function call syntax which specifies how many blocks to create, with how many threads each. The number of threads should be an integer multiple of 32, the so-called warp size, i.e. the number of GPU cores that operate under a common control unit. The number of blocks determines how many of these thread blocks should be submitted to the GPU. A large number is favorable so that the GPU can interleave the processing of multiple blocks in order to hide memory access latencies, as mentioned in subsection 3.1.1.

int numThreads = 64;
int numBlocks = N / numThreads;
if (numBlocks > 65535)
    numBlocks = 65535;
rsqrtKernel<<<numBlocks, numThreads>>>(dev_a, dev_b, N);

Since CUDA kernels can only operate on data in global memory, data needs to be sent there before kernel execution and fetched back afterwards. A complete piece of code that uses the GPU to perform rsqrt on an array stored in host memory might look like the following.

void cuda_rsqrt(float *a, float *b, int N)
{
    float *dev_a, *dev_b;
    // launch configuration as determined above
    int numThreads = 64;
    int numBlocks = (N + numThreads - 1) / numThreads;
    if (numBlocks > 65535)
        numBlocks = 65535;
    cudaMalloc((void **) &dev_a, N * sizeof(float));
    cudaMalloc((void **) &dev_b, N * sizeof(float));
    cudaMemcpy(dev_a, a, N * sizeof(float), cudaMemcpyHostToDevice);
    rsqrtKernel<<<numBlocks, numThreads>>>(dev_a, dev_b, N);
    cudaMemcpy(b, dev_b, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
}


This simple example already gives a general impression of what is possible using CUDA. More advanced techniques include the use of streams to enqueue a number of memory copy operations and kernel calls, which allows the execution of host code while a CUDA kernel is running. Similarly, multiple GPUs in a single computer may be used simultaneously. CUDA-capable hardware also provides a mechanism called events which allows measuring the execution time of kernels and memory copies. This can be used to measure performance or to tune computational parameters based on the effect they have on computational speed.
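As a hedged illustration of the events mechanism (our own sketch, not code from the thesis; someKernel and its launch configuration are placeholders), timing a kernel might look as follows:

#include <cuda_runtime.h>

// Hypothetical kernel used only to illustrate timing.
__global__ void someKernel() {}

float timeKernelMs()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);              // marker enqueued before the kernel
    someKernel<<<1, 32>>>();
    cudaEventRecord(stop, 0);               // marker enqueued after the kernel
    cudaEventSynchronize(stop);             // block until the kernel has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}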


3.2 ESPResSo

ESPResSo ([4], [10]) is a general-purpose simulation package for Molecular Dynamics (MD) simulations in soft matter. It is written in the C programming language and features an embedded Tcl script interpreter for easy access to the internal data structures and for straightforward setup of the system to be simulated.

Setting up a simulation is quite simple. First, one needs to set up the cell system, periodic boundary conditions and simulation box size:

cellsystem nsquare
setmd periodic 0 0 1
setmd box_l 10.0 10.0 10.0

Next, a thermostat can be switched on to maintain a canonical (NVT) ensemble:

set temp 4.0
set gamma 1.0
thermostat langevin $temp $gamma

To add particles to the system, a for loop can be used. This option is a major advantage of an embedded scripting engine over plain configuration files. For this example, the particles are placed randomly in the box.

set num_parts 100
for {set i 0} {$i < $num_parts} {incr i} {
    part $i pos [expr rand()*10] [expr rand()*10] \
        [expr rand()*10] q [expr -2.0+rand()*4.0]
}

Now, the desired interaction potential can be set. For this example, MMM1D electrostatics shall be used.

inter coulomb 1.0 mmm1dgpu tune 0.0001

Lastly, the integrator needs to be set up and the integration can be run.

setmd time_step 0.01
setmd skin 0.05
invalidate_system
set time_steps 1000
integrate $time_steps

After this simple simulation has run, the resultant particle positions and velocities can be read out.

for {set i 0} {$i < $num_parts} {incr i} {
    puts "$i [part $i print pos] [part $i print v]"
}


3.3 MMM1D on GPU

ESPResSo already contains an implementation of the MMM1D algorithm. However, even on modern CPUs, it takes several hundred nanoseconds of computation time for each particle pair. Due to the O(N²) scaling of MMM1D, this makes the simulation infeasibly slow for all but the smallest numbers of particles.
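As a rough illustration of this scaling, at the approximately 140 ns per pair measured for the CPU implementation in subsection 3.3.3, a system of N = 10 000 particles already requires N² · 140 ns = 10⁸ · 140 ns ≈ 14 s for every single evaluation of all pairwise forces, i.e. for every time step.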

3.3.1 Suitability of MMM1D for GPGPU Implementation

The MMM1D algorithm meets most of the criteria listed in subsection 3.1.1 that determine whether a program should be ported to GPGPUs. For sufficiently large particle numbers, the overhead of copying data to and from the GPU becomes negligible compared to the time required for the actual computation, as the particle positions, charges and forces are all O(N) in memory footprint, while the computational effort scales like O(N²) because calculations need to be performed on each pair of particles. MMM1D is trivially parallelizable, as the computation for one particle pair does not depend on the result for any other particle pair, so it can be performed in parallel for all particle pairs. The requirement of minimal branch divergence, however, is not met: as there are different formulae for particles spaced closely and far apart, branching into two execution paths is essential to the algorithm.

At the time of this writing, ESPResSo already contained several algorithms ported to GPGPU, including an implementation of lattice-Boltzmann fluid dynamics and P3M electrostatics. Porting the MMM1D algorithm is part of this effort to allow ESPResSo to tap into the vast computing resources available in modern GPUs. As ESPResSo already uses CUDA for its existing GPGPU code, CUDA was also the obvious choice for implementing MMM1D.

3.3.2 Implementation

The translation of the equations from section 2.1 into code is rather straightforward. First, the vector connecting the two particles for which the calculation is performed needs to be determined. Its z-component needs to be folded back into the primary simulation box, i.e. z ∈ [−λ_z/2, λ_z/2]. Next, the distance between the two particles projected onto the xy-plane is determined, as this is the criterion for switching between near and far formula. For the near formula, the first sum is terminated as determined by the desired error bound. The polygamma functions are evaluated using Taylor expansions whose coefficients are determined by the existing CPU implementation of MMM1D in ESPResSo. The second sum can be completely unrolled as ESPResSo uses a fixed N_ψ = 2. For the far formula, the number of required summands in the first sum is predetermined using the error formula (4). The Bessel functions are also evaluated using Taylor expansions, for which the coefficients are included in ESPResSo's source code.

Up to this point, the implementation of the energy formula and the force formula is very similar. They do however differ in the way the final result is determined. For energies, the results of all threads within a block are stored in shared memory and a reduction (as explained in subsection 3.1.1) is performed. That way, each block only stores its intermediate total energy into global memory. Once all blocks have run, a second kernel is launched that sums up all the energies stored in global memory and stores that total back into global memory, leaving only a single number that needs to be retrieved from the GPU. Forces however need to be treated differently, as for N particles, N forces need to be returned and a reduction is not appropriate. Instead, the forces contributed by each pair are stored into global memory. Once this has been done for all pairs, a second kernel is launched that goes over the forces in global memory and adds up all contributions for each particle. The total for each particle is then stored back into global memory. This approach is quite memory-intensive as it needs to store 3N² floating-point numbers for N particles. Alternatively, atomic additions can be used. Unlike normal additions, they check for collisions that occur when multiple threads try to write to the same memory location. As atomic additions can be a significant performance hit when threads often collide in their memory accesses, this is only used if the GPU has little memory or large numbers of particles are used. However, for particle numbers much larger than the number of cores available on the GPU, collisions become quite rare. So if the point at which memory is insufficient falls somewhere near the point where collisions get rare, atomic additions no longer have a significant performance impact.

Another possible approach to reducing the O(N²) to an O(N) memory requirement is a programming pattern called tiling ([15]). Here only a subset of n² out of the N² forces would be calculated at one time and one thread each would calculate the forces acting on one particle, summing them up while looping over the interaction partners. This approach also has the advantage that the input data (positions and charges of this subset of particles) could first be copied to shared memory, as not all input data is used simultaneously. This avoids the latencies of global memory during the actual calculations. Tiling requires more coordination from the host code than a naive N² implementation, as it needs to be made sure that summing up the contributions from each tile occurs without memory access collisions. Tiling was not used in this implementation of MMM1D, but could be added in the future to further improve performance.

The implementation of MMM1D done for this thesis is also capable of utilizing multiple GPUs in the same computer. This is done by having each GPU work on only part of all pairs. To start out with, the line at which the GPUs switch is drawn based on the GPUs' reported clock speeds, which only gives a very rough estimate of GPU performance. For each subsequent kernel run, the time required by each GPU is measured using CUDA events and the dividing line is moved so that each GPU needs the same amount of time to operate on the share of pairs it is assigned. For two identical GPUs, this obviously splits the computation time nearly in half, while in asymmetric multi-GPU systems other effects (e.g. significantly different memory copy bandwidths) might shrink the advantage. However, on all tested multi-GPU systems, the total run time benefitted from using all available GPUs.
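A strongly simplified sketch of the atomic-addition variant is shown below (our own illustration, not the ESPResSo kernel); pair_force is a stand-in that here evaluates a plain Coulomb force where the real code evaluates the MMM1D near or far formula (equations (5) and (6)):

#include <cuda_runtime.h>

// Stand-in for the MMM1D pair interaction: a plain Coulomb force is returned
// here, whereas the real kernel evaluates the near or far formula instead.
__device__ float3 pair_force(float3 ri, float3 rj, float qi, float qj)
{
    float3 d = make_float3(ri.x - rj.x, ri.y - rj.y, ri.z - rj.z);
    float r2 = d.x * d.x + d.y * d.y + d.z * d.z;
    float pref = qi * qj * rsqrtf(r2) / r2;   // q_i q_j / r^3
    return make_float3(pref * d.x, pref * d.y, pref * d.z);
}

// One thread per particle pair; per-particle totals are accumulated with
// atomicAdd so that colliding writes are serialized correctly.
__global__ void pair_forces_atomic(const float3 *pos, const float *q,
                                   float *fx, float *fy, float *fz, int N)
{
    for (long pair = threadIdx.x + (long)blockIdx.x * blockDim.x;
         pair < (long)N * N;
         pair += (long)blockDim.x * gridDim.x) {
        int i = pair / N, j = pair % N;
        if (i == j) continue;
        float3 f = pair_force(pos[i], pos[j], q[i], q[j]);
        atomicAdd(&fx[i], f.x);
        atomicAdd(&fy[i], f.y);
        atomicAdd(&fz[i], f.z);
    }
}

The alternative described above would instead write each pair's contribution to its own slot in global memory and let a second kernel perform the per-particle summation.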
To minimize the effect of branch divergence due to the different formulae for closely and distantly spaced particles, it was tested whether pre-sorting the particle pairs based on which formula would be used could benefit performance. For this purpose, the Thrust library¹ was used. An array of N² ascending integers was created to identify all particle pairs. Then, thrust::partition was used to move all closely spaced particle pairs to the beginning of that array and the others to the end. While this did improve the run time of the force and energy calculation kernels, the sorting itself also took time, making the overall effect only minimal. Additionally, the O(N²) memory requirement is not favorable, and for operation on GPUs with little free memory, alternative code paths that work without pre-sorting would have needed to be provided.

As part of its transition to C++ code, ESPResSo is receiving a new object-oriented forces interface that makes it easy to integrate new forces. At the beginning of the main forces and energies routines, the methods init and run are called on all members of an object of type ForceIterator. At the end of the routines, a method isReady is used to synchronize the new force, which may be running asynchronously. After synchronization, the results are retrieved from the object and added to the results obtained from code that uses ESPResSo's classic interface. Implementing a new force on top of the interface is now as simple as creating a class that provides the necessary methods and extending the Tcl interpreter with a function that adds an instance to the ForceIterator:

#include "forces.hpp"

int tclcommand_inter_coulomb_parse_mmm1dgpu(Tcl_Interp *interp,
                                            int argc, char **argv)
{
    /* ... */
    Mmm1dgpuForce *A = new Mmm1dgpuForce();
    FI.addMethod(A);
    coulomb.method = COULOMB_MMM1D_GPU;
    mpi_bcast_coulomb_params();
}

This function needs to be registered inside tclcommand_inter_parse_coulomb, the parent function for all Coulomb interactions, and, as MMM1D is mutually exclusive with all other Coulomb interactions provided by ESPResSo, it must be registered as a Coulomb method in interaction_data.hpp.

The CUDA implementation of MMM1D was verified against ESPResSo's CPU implementation: both agree within machine precision. In the zone of overlap between the near and far formula, the CUDA implementations of both formulae have also been verified against each other, again agreeing within machine precision.

¹ Thrust 1.5.3, http://thrust.github.io
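A rough sketch of the pre-sorting experiment described above (our own simplification, not the actual code; the predicate uses the xy-distance criterion from this subsection, and the int pair indices assume N² fits into an int):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/partition.h>

struct IsNearPair {
    const float3 *pos;   // particle positions in device memory
    float rho_switch2;   // squared switching radius between near and far formula
    int N;
    __host__ __device__ bool operator()(int pair) const {
        int i = pair / N, j = pair % N;
        float dx = pos[i].x - pos[j].x;
        float dy = pos[i].y - pos[j].y;
        return dx * dx + dy * dy < rho_switch2;  // xy-distance decides the formula
    }
};

void sort_pairs_by_formula(thrust::device_vector<int> &pairs,
                           const float3 *pos, float rho_switch, int N)
{
    pairs.resize((size_t)N * N);
    thrust::sequence(pairs.begin(), pairs.end());   // 0, 1, ..., N*N-1
    IsNearPair pred{pos, rho_switch * rho_switch, N};
    // near-formula pairs move to the front, far-formula pairs to the back
    thrust::partition(pairs.begin(), pairs.end(), pred);
}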

3.3.3 Performance

To determine the performance of the CUDA implementation of MMM1D, particles were randomly placed into a simulation box and the forces acting on all of them were calculated. The radius at which the switch between near and far formula occurs was set to 60% of the box length and the error bound was set to 10⁻⁴. With the same parameters, the existing CPU implementation in ESPResSo took around 140 nanoseconds per particle pair running on all four cores of an Intel Core i7-3820², independently of the number of particles.

The results obtained using an Nvidia GeForce GTX 680³ are shown in Figure 1. It is clearly visible that above 400 particles, the kernel execution time divided by the number of pairs becomes nearly constant. This effect is due to the memory latencies being increasingly hidden as the number of threads becomes much larger than the number of GPU cores. The figure also shows the performance when using double-precision floating-point numbers. The significant difference compared to single precision is due to the small number of double-precision arithmetic units Nvidia GPUs have. The GTX 680's theoretical double-precision performance is 24 times lower⁴ than its single-precision performance. For comparison, a modern CPU's double-precision arithmetic is only 2 times slower than its single-precision arithmetic.

At 200 particles, the total time consumption is still dominated by the overhead of the memory copy operation and the kernel launch. Beyond that, however, this overhead quickly becomes negligible. For small systems (fewer than 100 particles), the GPU implementation is actually slower than the existing CPU implementation. As the CUDA code executes asynchronously with respect to the CPU code, the actual contribution of the CUDA code to the total execution time may be as low as zero as long as the CPU has enough work to do by itself.

As shown above, the implementation of MMM1D on GPGPU outperforms a CPU implementation by a factor of 40 for sufficiently large numbers of particles when comparing a modern high-end GPU against a high-end CPU. Looking at performance per dollar (list prices: 499 USD vs. 294 USD), the GPU still beats the CPU by a factor of 24. In terms of performance per watt of power consumption (per data sheet: 195 W vs. 130 W), the advantage is a factor of 20. At the same time, it needs to be considered that writing highly parallel code for a GPU using CUDA is relatively easy, while writing the same code for a CPU cluster using MPI⁵ would be much more cumbersome. While ESPResSo itself already contains most of the necessary MPI support code, mitigating much of such additional work, stand-alone or from-scratch programs would benefit from the reduced effort of writing CUDA code compared to MPI code. Additionally, as MPI communication adds its own overhead and communication between multiple computers is much more expensive than communication between CPU and GPU inside a single computer, 40 CPUs might not even be enough to beat the single GPU. In contrast, adding multiple GPUs to a single computer (as many as four GPUs might be feasible) would directly scale up the performance (see the previous section).

This implementation was integrated into ESPResSo, taking a pre-release version of ESPResSo 3.2 as the basis. It is expected to be included with an upcoming release of ESPResSo. Performance measurements of the code within ESPResSo have not been taken yet, as its new object-oriented interface still contains some bottlenecks that limit the speed of transferring data between the GPU and the CPU code.

² http://ark.intel.com/de/products/63698
³ http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-680/specifications
⁴ CUDA Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
⁵ Message Passing Interface, a protocol for communication in parallel programs


[Plot: time per particle pair (ns) versus number of particles (10¹ to 10⁵) for single and double precision, total and kernel-only, compared to the ESPResSo CPU implementation.]

Figure 1: Scaling behavior of the MMM1D implementation on CUDA as measured on an Nvidia GeForce GTX 680. For large numbers of particles, memory copy and kernel launch overhead become negligible and the time required per particle pair approaches 3.5 ns / 12.1 ns (single/double precision). The CPU implementation in ESPResSo consumes 140 ns per particle pair on a four-core Intel Core i7-3820.


4 SIMULATION

The simulation script, which is included in appendix A.2, was adapted from one provided by Axel Arnold. It depends on a parallel tempering script which is included with the ESPResSo source code in scripts/parallel_tempering.tcl.

For all of the simulations detailed in the following sections, 96 counterions were placed in the middle between the two model DNA molecules. The counterions carried a charge of q = 5; the DNA had a line charge density of τ = −0.5. Due to charge neutrality, the line charge density of the counterions is twice as large as that of each DNA molecule. This corresponds to a simulation box length of 480. The surface-to-surface distance of the two DNA molecules was set to Δ = 0.25 and their radius was set to either R = 1.0 or R = 3.0. The counterion charge can be scaled out (to q = 1) by changing the line charge density to τ = −0.1 and increasing the prefactor from 8.0 to 200.0. The prefactor compensates for the other changes, as only q², τ² and qτ enter into the forces. As l_B = prefactor/T, the temperatures are kept unchanged and only the Bjerrum lengths increase by a factor of 25.

The counterions are the only mobile objects in the system; everything else maintains a fixed position. To make sure the counterions stay inside the simulation box, a cylindrical wall interacting with them via a repulsive-only Lennard-Jones potential was used. The counterions interact with each other and with the DNA molecules via the same Lennard-Jones potential. As both the counterions and the DNA molecules are charged, their electrostatic interactions are calculated using the MMM1D algorithm detailed in the previous chapters. The switching radius beyond which the far formula is used is set to be larger than the cylindrical wall, so that only the near formula is used. It turns out to be computationally less intensive than the far formula while still providing the required accuracy. The upper error bound for the pairwise forces is set to ΔF_ij = 10⁻⁴.

The simulation uses a method called parallel tempering. This is a Monte Carlo-like scheme with the purpose of improving the sampling of the entire phase space ([7, p. 389]). Multiple systems are set up at different temperatures and the particle trajectories are integrated for a number of time steps. Then, a Metropolis criterion based on the energies of two systems with neighboring temperatures is used to determine the probability with which these two systems swap their temperatures:

$$P_{i\leftrightarrow j} = \min\left\{1,\ \exp\!\left[\left((k_B T_i)^{-1} - (k_B T_j)^{-1}\right)\left(U_i - U_j\right)\right]\right\} \tag{16}$$

If this swap is performed, the velocities of the particles in both systems are rescaled so that the systems now have their respective new temperatures. Systems at high temperatures are able to cross free energy barriers and thus have access to the entire phase space, while systems at low temperatures stay around individual free energy minima and sample this small part of phase space more accurately.

To make sure the swaps are able to occur with sufficient likelihood, it is necessary for the energy distributions of the systems to overlap.
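A minimal sketch of this swap move (our own illustration in C++, not the parallel_tempering.tcl script; k_B = 1 in reduced units and a standard random number generator is assumed):

#include <algorithm>
#include <cmath>
#include <random>

// Decide whether two systems at temperatures Ti and Tj (with potential
// energies Ui and Uj) exchange their temperatures.
bool attempt_swap(double Ti, double Tj, double Ui, double Uj, std::mt19937 &rng)
{
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double delta = (1.0 / Ti - 1.0 / Tj) * (Ui - Uj);  // exponent of equation (16)
    double p = std::min(1.0, std::exp(delta));         // acceptance probability
    return uniform(rng) < p;
}

// On acceptance, velocities are rescaled by sqrt(Tnew/Told) so that each
// system immediately has the kinetic energy matching its new temperature.
double velocity_rescale_factor(double Told, double Tnew)
{
    return std::sqrt(Tnew / Told);
}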

For illustration purposes, a simulation like in section 4.1 was run (at R = 3.0), the total energy of each system was determined repeatedly throughout the simulation, and energy histograms were created for each temperature. These are shown in Figure 2; the overlap between neighboring temperatures is clearly visible. If parallel tempering is run for a sufficiently long time, each system essentially performs a random walk through all temperatures. This is visualized in Figure 3.

At the beginning of each simulation, systems at all specified temperatures are created and the trajectories are integrated for a certain number of time steps. Then the current total energy is determined and parallel tempering swaps are performed per the Metropolis criterion. All this is repeated over and over until the desired number of time steps has been reached. At regular intervals, the net force attracting or repelling the two DNA molecules is written to a file, as is the complete configuration of all particle positions.

The different sections below use different models for the DNA molecules while still having a common line charge density. Each simulation is performed for R = 1.0 and R = 3.0 and samples different numbers of particles per DNA molecule. Regularly during the simulation, the forces acting on the DNA molecules are written out. After the simulation completes, the mean force and the force variance are calculated for each temperature. Negative forces correspond to an attraction between the two DNA molecules and positive forces correspond to repulsion. The forces are divided by the Bjerrum length, as theory (subsection 2.2.2) predicts that for large Bjerrum lengths (small temperatures) the force is proportional to the Bjerrum length. The forces are also normalized by dividing by the simulation box length λ_z.

Visualizations of molecule configurations are created using the software VMD ([9, 16]). All rods and ions are drawn to scale; their radii correspond to half their Lennard-Jones parameter σ.


[Plot: relative probability versus total energy; one histogram for each of the 20 temperatures from T = 0.50 to T = 10.0.]

Figure 2: Parallel Tempering: Energy histograms at all simulated temperatures. The overlap between neighboring temperatures ensures that swaps occur.

[Plot: temperature versus number of attempted swaps for individual systems.]

Figure 3: Parallel Tempering: The individual systems perform a random walk through the temperatures. For clarity, only four of the 20 systems are shown.


4.1 Rods with Continuous Charges and Lennard-Jones Interactions

To compare the results from later simulations back to the rod model used in e.g. [1] and [3], initial simulations were performed using rods. ESPResSo has a feature called constraints that, among other things, allows setting up lines of constant charge density as well as cylinders of a defined radius that interact with particles via a Lennard-Jones potential. Combined, they are used to model the DNA as linear molecules. For each of these constraints, the force acting on it can be queried, which is how the forces we are interested in are obtained. ESPResSo does not currently calculate the interactions between two constraints, so the repulsive electrostatic force between the two charged rods needs to be added by analytic calculation using Equation 15 (a worked example follows below).

The simulation was performed for R = 1.0 and R = 3.0 with parallel tempering for 20 exponentially spaced temperatures. The resulting forces are shown in Figure 5. The results match theory well: for small Bjerrum lengths (large temperatures), repulsion is observed; for large Bjerrum lengths (small temperatures), the force is proportional to the Bjerrum length. Since it will be needed for comparison later, a histogram of the x-component of the distance of the counterions to the respective nearest rod was computed for the lowest temperature and is shown in Figure 6.

For R = 1.0, attraction occurs for all but the smallest Bjerrum lengths. For larger Bjerrum lengths, the counterions arrange themselves exclusively between the two rods and stay close to the middle, mediating attraction. For R = 3.0, repulsion still occurs for significantly larger Bjerrum lengths. This difference is expected, as increasing the rod radius (while keeping the surface-to-surface distance constant) is equivalent to moving same-sized rods closer together. Also, even for large Bjerrum lengths the counterions no longer all align in the gap between the rods: while the majority is centered very sharply in there, some now move to the other side of the rods.

To get a qualitative impression of how the counterions arrange around the DNA molecules, snapshots at different temperatures from the simulation are pictured in Figure 4. At low temperatures, it is clearly visible that the counterions arrange themselves in the rod plane. For R = 1.0, they all stay between the rods, while for R = 3.0 a few move outside, as seen in the histogram in Figure 6. At high temperatures, the counterions leave the rod plane.
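As a worked example of that analytic correction, with the parameters given at the beginning of this chapter (τ = τ' = −0.5, R = 1.0, Δ = 0.25, hence D = 2R + Δ = 2.25), Equation 15 gives a bare rod-rod repulsion per unit length of

$$\frac{F_{\text{rod,rod}}}{k_B T\,\lambda_z} = \frac{2\tau^2 l_B}{D} = \frac{2 \cdot 0.25\, l_B}{2.25} \approx 0.22\, l_B ,$$

which is the repulsive contribution added analytically to the measured constraint forces.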


(a) R = 1.0, l_B = 16.0    (b) R = 3.0, l_B = 16.0    (c) R = 3.0, l_B = 1.10

Figure 4: Snapshots from the simulation using both continuously charged rods and continuous Lennard-Jones interactions.


[Plot: force on rods versus l_B for R = 1.0 and R = 3.0.]

Figure 5: Forces acting on DNA molecules when using both continuously charged rods and continuous Lennard-Jones interactions.

[Plot: relative probability versus x-position (normalized to the rods' axial distance) for R = 1.0 and R = 3.0.]

Figure 6: x-position of the counterions relative to the DNA molecules at T = 0.50 when using both continuously charged rods and continuous Lennard-Jones interactions.


4.2 Lines of Discrete Charged Particles with Lennard-Jones Interactions

For the next simulation, discrete fixed particles placed opposite each other (without a shift along the z-axis) were used in place of the rods. Different numbers of particles were sampled. The results are shown in Figure 8 and Figure 9, each compared against the previous results for the continuous rods.

At R = 1.0, the results for 1.0 and 2.1 particles per counterion per rod are quite remarkable, as they significantly deviate from the forces previously obtained for continuous rods and are much stronger. For 3.1 and more particles, however, the forces are slightly weaker than for the continuous rods while quickly approaching them as the number of particles gets larger. The x-position histograms show that for small numbers of particles (where the forces are stronger), the counterions align slightly off the middle between the two particle rods, while for large numbers of particles they align closer to the rods.

At R = 3.0, only 1.0 particles show an unexpected behavior; the larger particle numbers result in forces slightly stronger than the forces on continuous rods and again quickly approach them as the particle number gets larger. Here, the histograms for larger particle numbers mostly agree with the ones for the continuous rods. Only for 1.0 particles per rod do the counterions align quite differently: while before they were almost exclusively located in the gap between the rods and a few had moved out from between the rods, they now distribute themselves almost evenly among five locations: in the gap between the rods, and symmetrically at the surfaces of both rods. 2.1 particles, however, are already spaced sufficiently closely that the counterions can no longer move into the gaps between two particles.

In section 4.4, Figure 14 will also show the forces at constant Bjerrum length l_B = 16.0 as a function of the number of particles per molecule (only the red points are relevant to the current section). This makes it easy to see the convergence of the forces towards a single value, as well as the striking fluctuations for low numbers of particles.

Figure 7 shows some snapshots of these simulations. At R = 1.0, it is clearly visible that for 1.0 particles per rod, the counterions mostly pair up with opposite particles of the rods, aligning slightly off-center; for 4.2 particles a few tend to be almost in the middle between the rods while the majority enter into the rods to pair up with two particles of a rod. This zig-zag alignment can also be seen in the histograms in Figure 9: for small numbers of particles per rod, few counterions are in the very center; instead they are close to the surface of the rod particles. As the number of particles gets larger, the counterions first move closer to the middle. This happens until the rod particles are spaced closely enough for it to be more favorable for the counterions to align between the rod particles. Further increases in the number of particles result in the counterions getting pushed further out. At R = 3.0, the situation is slightly different: with 1.0 particles per rod, most counterions can be seen on the surface of the rods, which is already no longer the case at 2.1 particles, where the behavior is very similar to the previous simulation with the continuous rods.


(a) 1.0 particles per rod, R = 1.0, l_B = 16.0    (b) 4.2 particles per rod, R = 1.0, l_B = 16.0    (c) 1.0 particles per rod, R = 3.0, l_B = 16.0

Figure 7: Snapshots from the simulation using discrete charged particles with Lennard-Jones interactions.


[Plots: force on rods versus l_B for R = 1.0 (left) and R = 3.0 (right); curves for 1.0, 2.1, 3.1 and 4.2 particles per rod and for continuous rods.]

Figure 8: Forces acting on DNA molecules when using discrete charged particles with Lennard-Jones interactions.

[Plots: relative probability versus x-position (normalized to the axial distance) for R = 1.0 and R = 3.0; curves for 1.0, 2.1, 3.1 and 4.2 particles per rod and for continuous rods.]

Figure 9: x-position of the counterions relative to the DNA molecules at T = 0.50 when using discrete charged particles with Lennard-Jones interactions.


4.3 Lines of Discrete Charged Particles and Cylinders with Continuous Lennard-Jones Interactions

To check whether the differences between the simulations in section 4.1 and section 4.2 are due to the discretization of the Lennard-Jones interaction partners or to the discretization of the electrostatic interaction partners, a simulation was performed in which discrete charged particles without Lennard-Jones interactions were combined with continuous Lennard-Jones cylinders. The resulting forces are shown in Figure 10, compared against the forces from section 4.2. The forces with continuous Lennard-Jones cylinders deviate only slightly from those with Lennard-Jones particles, suggesting that the discretization effect observed in section 4.2 is almost entirely due to the discretization of the electric charges.
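A minimal sketch of how this combination can be expressed in ESPResSo's Tcl interface, following the "ljcyl" branch of the script in appendix A.2: the charged rod particles are given a type (here 3) that has no Lennard-Jones interaction with the counterions of type 2, while a cylinder constraint of type 1 provides the continuous repulsion. The variable names are placeholders:

# Continuous Lennard-Jones repulsion between counterions (type 2) and the
# cylinder constraint (type 1); the constraint has zero radius, so the
# effective rod radius is set via the potential offset rod_rad - lj_sigma.
inter 1 2 lennard-jones $lj_eps $lj_sigma $lj_cut auto [expr $rod_rad - $lj_sigma]
constraint cylinder center $x_rod $y_rod 0 axis 0 0 1 radius 0 dir outside length 1e40 type 1
# The discrete charges themselves are placed as type 3, for which no
# Lennard-Jones interaction with type 2 is declared.
part 0 pos $x_rod $y_rod 0 type 3 q $q fix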

Figure 10: Forces acting on DNA molecules when using discrete charged particles and continuous Lennard-Jones interactions. Panels: R = 1.0 and R = 3.0; axes: force on rods vs. Bjerrum length l_B; curves for 2.1 and 4.2 particles per rod, with and without continuous Lennard-Jones cylinders, compared against the continuous rods.


4.4 Lines of Discrete Charged Particles with Lennard-Jones Interactions and a Phase Shift

To further analyze the effect of charge discretization, which has already been shown to cause a strong tendency of the counterions to align between two rod ions, another simulation was performed in which the two lines of equally spaced particles from section 4.2 were shifted relative to each other by half a particle. The results are shown in Figure 12 and Figure 13, each compared against the previous results for the non-shifted particle rods.

Again, the results for 1.0 and 2.1 particles per rod stand out at R = 1.0. For 1.0 particles, shifting the rods changes the forces by almost an order of magnitude. For 2.1 particles, the forces are still significantly different from those for the non-shifted particle rods, but they are already closer to the ones obtained for continuous rods. For R = 3.0, the effect is significantly weaker: for 1.0 and 2.1 particles per rod, the forces are only slightly larger than for the non-shifted particle rods. The histograms show no major difference compared to the non-shifted particle rods; only for R = 3.0 with 1.0 particles per rod is it noticeable that shifting the rods significantly reduces the number of counterions located right in the middle between the rods.

Figure 14 shows the same data for both the shifted and the non-shifted molecules in a different way: the forces at the largest Bjerrum length, l_B = 16.0, are displayed as a function of the number of particles per rod, which makes it easier to see toward which value they converge (the value for continuous rods).

At high temperatures, the behavior of the counterions looks no different from the previous simulations. At low temperatures, the counterions tend to arrange themselves on the connecting line between the two closest particles of opposite rods. A counterion stays close to a particle of one rod or the other and occasionally switches between the rods; it also occasionally switches between the two connecting lines emerging from one rod particle. Both of these observations suggest a frustrated ground state. At R = 3.0, the counterions are additionally observed to move around the rod. Snapshots are shown in Figure 11.
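A minimal sketch of how the half-particle shift can be applied when placing the second line of charges, following the zshift argument of the helix procedure in appendix A.2 used with zero radius; n_rod, box_l, q, x_rod2 and y_rod2 are placeholder names:

# Place the second line of n_rod charges, shifted along z by half the
# particle spacing relative to the first line.
set dz [expr double($box_l)/$n_rod]            ;# spacing between charges
for {set i 0} {$i < $n_rod} {incr i} {
    set z [expr $i*$dz + 0.5*$dz]              ;# half-particle shift
    part [expr $n_rod + $i] pos $x_rod2 $y_rod2 $z type 1 q $q fix
}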


Figure 11: Snapshots from the simulation using discrete charged particles with Lennard-Jones interactions. The two lines of charges are shifted relative to each other by half a particle. (a) 1.0 particles per rod, R = 1.0, l_B = 16.0; (b) 1.0 particles per rod, R = 3.0, l_B = 16.0; (c) 4.2 particles per rod, R = 1.0, l_B = 16.0.

Figure 14: Forces acting on DNA molecules as a function of the number of particles per molecule. Discrete charges and Lennard-Jones interactions are used and the forces are shown both for lines of charges non-shifted and shifted by half a particle. Panels: R = 1.0 and R = 3.0, each at l_B = 16.0; axes: force on rods vs. particles per rod per counterion; curves for the non-shifted (discrete) and shifted (discrete, shifted) lines.


Figure 12: Forces acting on DNA molecules when using discrete charged particles with Lennard-Jones interactions, comparing results of shifted and non-shifted molecules. Panels: R = 1.0 and R = 3.0; axes: force on rods vs. Bjerrum length l_B; curves for 1.0, 2.1, 3.1, and 4.2 particles per rod, shifted and non-shifted, compared against the continuous rods.

Figure 13: x-position of the counterions relative to the DNA molecules at T = 0.50 when using discrete charged particles with Lennard-Jones interactions, comparing results of shifted and non-shifted molecules. Panels: R = 1.0 and R = 3.0; axes: relative probability vs. x-position normalized to the axis distance; curves for 1.0, 2.1, 3.1, and 4.2 particles per rod, shifted and non-shifted, compared against the continuous rods.


4.5 Helices Consisting of Discrete Charged Particles with Lennard-Jones Interactions

For the final simulations, the particles modeling the DNA were placed in a helical geometry. Of all the models investigated here, this should be the closest to real DNA. The number of particles, the diameter and the pitch (the height of a single turn) were chosen to match experimental measurements of DNA: B-DNA has a pitch of 3.4 nm and a diameter of 2.0 nm, each base pair is doubly charged, and a single base pair has a rise of 0.324 nm, i.e. there are 10.5 base pairs per turn [14]. It must be noted that ESPResSo does not have a native unit system, i.e. relative units are used for everything, so to build a DNA molecule the other dimensions need to be scaled based on the radius that is set. Only simulations with R = 3.0 are performed, as the Lennard-Jones parameter σ = 1.0 needs to be subtracted from R to obtain the effective radius of the helix, which would leave no helix radius at R = 1.0. For R = 3.0, the pitch is set to h = 10.2.

For realistic DNA there need to be 10.5 base pairs per turn, i.e. 21λ_z/h particles are needed in the helix. The fact that each base pair should be made up of two particles is disregarded here for performance reasons and because the previous simulations have shown that for such large numbers of particles the effect of adding more is quite small. λ_z needs to be a multiple of 2h so that the helix is seamlessly replicated by the periodic boundary conditions. To maintain the correct line charge density, this requires the number of particles per rod to be a multiple of 525 and the number of counterions to be the same multiple of 102, i.e. 5.1 particles per rod per counterion. For the comparison data in the diagrams, the previous simulations were rerun at these slightly different parameters, even though this did not yield noticeably different results.

Different phase shifts between the two helices were sampled. The resulting forces on the helices with different phase shifts and the distribution of counterions are shown in Figure 16 and Figure 17. The forces are also shown as a function of the phase shift in Figure 18. Snapshots are shown in Figure 15. Without a phase shift, the counterions tend to move inside the helices (with a slight bias towards the opposite helix) and slightly enter their inner surface by moving into the gaps between two helix particles. This results in an only slightly attractive force. When the two helices are shifted against each other, the attraction gets stronger, and for a phase shift of 180◦ the forces mostly agree with those for rods at large Bjerrum lengths. A number of counterions now also move into the gaps where the two helices are closest. At small Bjerrum lengths, the forces are less dependent on the phase shift than at large Bjerrum lengths; however, they are not even close to recovering those for rods. The helices reach the attractive regime at lower Bjerrum lengths than the rods do.
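A minimal sketch of the helix placement, modeled on the helix procedure from the script in appendix A.2: for R = 3.0 the effective helix radius is rod_rad − lj_sigma = 2.0 and the pitch is h = 3.4·rod_rad = 10.2. The names n_helix, box_l, lambda, helix_radius, helix_pitch, x_center, y_center and phase_shift are placeholders, with phase_shift given in multiples of the particle spacing:

set pi 3.14159265358979
set q [expr $lambda*$box_l/$n_helix]           ;# charge per helix particle
for {set i 0} {$i < $n_helix} {incr i} {
    set phi  [expr 2*$pi*$i*$box_l/($n_helix*$helix_pitch)]
    set posx [expr $x_center + $helix_radius*cos($phi)]
    set posy [expr $y_center + $helix_radius*sin($phi)]
    set posz [expr $box_l*($i + $phase_shift)/$n_helix]
    part $i pos $posx $posy $posz type 1 q $q fix
}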


Figure 15: Snapshots from the simulation using a helix model with discrete particles. (a) No phase shift, l_B = 16.0; (b) 180◦ phase shift, l_B = 16.0; (c) 180◦ phase shift, l_B = 1.10.

Figure 18: Forces acting on DNA molecules as a function of the phase shift at T = 0.50 when using a helix model with discrete particles. For comparison, the forces obtained for continuous rods are also shown. Axes: force on helices vs. phase shift between helices (0◦ to 360◦); curves for helices and rods at l_B = 16.0 and l_B = 1.76.


Figure 16: Forces acting on DNA molecules as a function of the Bjerrum length when using a helix model with discrete particles. Different phase shifts are displayed. Axes: force on helices vs. Bjerrum length l_B; curves for phase shifts of 0◦, 90◦, 120◦, 150◦, and 180◦, compared against rods with R = 3.0.

Figure 17: x-position of the counterions relative to the DNA molecules at T = 0.50 when using a helix model with discrete particles. Axes: relative probability vs. x-position normalized to the helices' axis distance; curves for helices shifted by 0◦ and 180◦, compared against rods.


5 CONCLUSION

The objective of this thesis, re-implementing the MMM1D algorithm on CUDA, was successfully met, and MMM1D turned out to be quite well suited for GPGPUs. Using this new implementation, results from previously published simulations with MMM1D were replicated, and further simulations were performed that would previously have taken impractical amounts of computation time.

The simulations performed show that charge discretization and phase shifts between DNA molecules, modeled as rods, have a significant influence on their attractive properties. Previous works disregarded this effect because it was computationally too expensive to resolve, even though it turns out to be too large to neglect for realistic results. For numbers of particles per rod similar to the number of counterions, forces differing from those for continuously charged rods by a factor of 2.5 were observed, while for larger numbers of particles the forces converged toward the ones for rods. Even larger differences were observed when shifting the two rods of discrete particles relative to each other by half a particle: for small numbers of particles, the forces for shifted and non-shifted rods differed by almost an order of magnitude. The forces on the shifted rods actually turned out to be closer to those on the continuously charged rods than was the case for the unshifted rods.

Curling the discretely charged rods up into helices, thus making the most accurate model of DNA that could be simulated within the limits of time and resources of this thesis, reveals further geometry dependencies and again a strong influence of the phase shift between the two helices. For a phase shift of 180◦, the results for the continuous rods are mostly recovered at large Bjerrum lengths, but for any other phase shift the forces are weaker, albeit still attractive. The forces are even attractive at small Bjerrum lengths at which continuously charged rods already repel each other. However, these observations need to be taken with a grain of salt, as in such more complex geometries finite-size effects might adversely influence the results; further simulations may be necessary.

The new GPGPU implementation of the MMM1D algorithm lays the foundation for further research in 1D-periodic electrostatic systems, as it now allows simulating systems with several thousand particles in reasonable amounts of time.


A APPENDIX

A.1 Implementation of MMM1D on CUDA

Due to its length, the C++ and CUDA code written is not included here in printed form. The standalone version of the CUDA MMM1D implementation is available at http://bitbucket.com/mkuron/mmm1dgpu, and the version integrated into ESPResSo will be available at http://github.com/mkuron/espresso until it is included in the main ESPResSo development tree. The code was developed and tested against the CUDA SDK in versions 5.0 and 5.5. It is known not to work on SDK versions 3.2 and below, as it accesses all GPUs in a multi-GPU system from a single CPU thread. The bulk of the code added to ESPResSo as part of this thesis is located in the files Mmm1dgpuForce.cu and Mmm1dgpuForce.hpp in the src folder of ESPResSo's source tree, while some reusable parts are in atomic.cuh, mmm-common.cuh and specfunc.cuh. Minor changes related to the ESPResSo integration are spread throughout ESPResSo's code.
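From the Tcl level, the GPU implementation is selected in the same way as in the simulation script in appendix A.2. A minimal usage sketch; the values of bjerrum, switch_radius and maxPWerror are placeholders, and the Bessel cutoff of 20 is simply the value used in the script:

setmd periodic 0 0 1                           ;# periodic only along z, as required for MMM1D
cellsystem nsquare                             ;# N-squared cell system, as in appendix A.2
inter coulomb $bjerrum mmm1dgpu $switch_radius 20 $maxPWerror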

A.2 Simulation Script

The simulation script for performing all of the simulations in chapter 4 is shown below. A plain-text version of the script file is also embedded into the electronic PDF version of this thesis; PDF readers that support such attachments include Adobe Reader.

####################
#
# Parameters:
#   rod_rad = 1.0 or 3.0
#   model = "rods" or "parts" or "ljcyl" or "helix"
#   helix_parts = 100...800
#   helix1_shift = 0.0...1.0 (i.e. multiples of the particle distance)
#   rounds = 100...1000
#
# For the helix model, these parameters should be used so the helix is
# correctly continued at the PBC:
#   model = "helix"
#   rod_rad = 3.0
#   helix_parts = 525
#   COU_rod = 51 (instead of 48)
#   helix1_shift = [expr 0.0*10.5] ... [expr 1.0*10.5]
#     (taking into account that there are 10.5 base pairs per turn)
#
# Running:
#   mpirun -np 4 ~/Documents/espresso/Espresso thesis.tcl 1 -nodes 1
#
####################

set rootdir "/home/mkuron/Documents"
source "$rootdir/espresso/scripts/parallel_tempering.tcl"

### types used
# 0  outer limit lj
# 1  rods
# 2  counterions

### electrostatics
set maxPWerror 1e-4

# bjerrum-length / temperature
# 8 keeps them close at the surface
# with the default lj_eps=1
# set prefactor 200.0
set prefactor 8.0

# set of temperatures to use
set nr 20
# temperatures via prefactor / Bjerrum-length
set temp_min [expr $prefactor/16.0]
set temp_max [expr $prefactor/0.8]
for {set i 0} {$i < $nr} {incr i} {
    lappend temps [expr $temp_min*pow(double($temp_max)/$temp_min, double($i)/($nr-1))]
}

### repulsive Lennard-Jones
set lj_eps 1
set lj_sigma 1
set lj_cut [expr 1.12246*$lj_sigma]

### counterions
# set COU_charge 1
set COU_charge 5
set COU_rod 48
set n_part [expr 2*$COU_rod]


# how to distribute the charges initially
set initial_inner [expr (2*$COU_rod*10)/10]

### rod parameters
set ss_dist .25
# set lambda -0.1
set lambda -0.5
set rod_rad 1.0

set model "rods"
set helix_parts 0
set helix_pitch [expr 3.4*$rod_rad]
set helix1_shift 0.0

if { $model == "rods" } { set helix_parts 0 }

### thermostat, friction
set gamma 1

### timing
setmd time_step 0.005
set wrt_time 1000
set equ_steps 50
set measure_steps 5
set rounds 100

### integration parameters
setmd skin 0

### misc
# fixed output directory format
if { $helix1_shift != 0.0 } {
    set output "$rootdir/$model-$helix_parts-shift_$helix1_shift-R_$rod_rad"
} elseif { $helix_parts == 0 } {
    set output "$rootdir/$model-R_$rod_rad"
} else {
    set output "$rootdir/$model-$helix_parts-R_$rod_rad"
}

####################
# Setup System
####################

# parse parameters


set master ""
set nodes -1
set argv [lrange $argv 1 end]
while { $argv != "" } {
    set arg [lindex $argv 0]
    switch -- $arg {
        "-nodes"  { set nodes [lindex $argv 1] }
        "-master" { set master "-connect [lindex $argv 1]" }
        "-subdir" { set output "$output/[lindex $argv 1]" }
        default   { error "unknown parameter $arg" }
    }
    set argv [lrange $argv 2 end]
}
if { $nodes == -1 } { error "please specify the number of processes" }
set load "-load [expr ([llength $temps] + $nodes - 1)/$nodes]"

# master sets up the directory
exec mkdir -p $output
catch { exec cp $argv0 "${output}/" }

####################
# Interaction setup
####################

# just to keep the integrator happy, will be changed later
setmd box_l 100 100 100

# wall
inter 0 2 lennard-jones $lj_eps $lj_sigma $lj_cut auto 0

# rods
if { $model == "helix" } {
    inter 1 2 lennard-jones $lj_eps $lj_sigma $lj_cut auto 0
    # The helix takes care of the radius itself.
} else {
    inter 1 2 lennard-jones $lj_eps $lj_sigma $lj_cut auto [expr $rod_rad - $lj_sigma]
    # For the linear models, the radius is taken care of by the Lennard-Jones potential.
    # The LJ radius ($lj_sigma) needs to be subtracted so that the effective radius is $rod_rad.
    # Note that the cylinder constraints have a radius of zero. If it were
    # [expr $rod_rad - $lj_sigma], we'd need to set the radius in the above line to zero.
}

# ions with each other


inter 2 2 lennard-jones $lj_eps $lj_sigma $lj_cut auto 0

####################
# Initialization
####################

cellsystem nsquare
set tcl_precision 6

set rod_dist [expr 2*$rod_rad + $ss_dist]
set outer_limit [expr 4*$rod_dist]
set box_l [expr -double($COU_rod*$COU_charge)/$lambda]
set box_l_xy [expr 2*$outer_limit]

if { $model == "helix" } {
    if { [expr $helix_parts/21.0] != [expr $helix_parts/21] } {
        error "The number of particles in the helix ($helix_parts) must be a multiple of 21."
    }
    if { [expr $box_l/20.4] != [expr int($box_l/20.4)] } {
        error "The box length ($box_l) must be a multiple of 20.4."
    }
    if { $helix_parts != [expr $box_l*35/34] } {
        error "The number of particles in the helix ($helix_parts) must be 35/34 times the box length ($box_l)."
    }
}

set cxy [expr $box_l_xy/2]
# rod positions
set r1_center [expr $cxy - $rod_dist/2]
set r2_center [expr $cxy + $rod_dist/2]

setmd periodic 0 0 1
setmd box_l $box_l_xy $box_l_xy $box_l

# constraints / rods
####################
proc helix { id xoffset yoffset radius height n_helix lambda type { zshift 0 } } {


    global model
    set pi 3.1415926535897932384626433832795
    global box_l
    set q [expr $lambda*$box_l/$n_helix]
    # make it a rod instead of a helix
    if { $model != "helix" } { set radius 0 }
    for { set i 0 } { $i < $n_helix } { incr i } {
        set posx [expr $xoffset + $radius*cos(2*$pi*1.0*$i/$n_helix*$box_l/$height)]
        set posy [expr $yoffset + $radius*sin(2*$pi*1.0*$i/$n_helix*$box_l/$height)]
        set posz [expr $box_l*$i/$n_helix + $zshift*$box_l/$n_helix]
        part [expr $i + $id*$n_helix] pos $posx $posy $posz type $type q $q fix
    }
}

proc helix_force { n } {
    global helix_parts
    set totx 0.0
    set toty 0.0
    set totz 0.0
    for { set i 0 } { $i < $helix_parts } { incr i } {
        set f [part [expr $n*$helix_parts + $i] print f]
        set totx [expr $totx + [lindex $f 0]]
        set toty [expr $toty + [lindex $f 1]]
        set totz [expr $totz + [lindex $f 2]]
    }
    return "$totx $toty $totz"
}

constraint delete

if { $model == "rods" } {
    constraint rod center $r1_center $cxy lambda $lambda
}
if { $model == "rods" || $model == "ljcyl" } {
    constraint cylinder center $r1_center $cxy 0 axis 0 0 1 radius 0 dir outside length 1e40 type 1
}


if { $model == "rods" } {
    constraint rod center $r2_center $cxy lambda $lambda
}
if { $model == "rods" || $model == "ljcyl" } {
    constraint cylinder center $r2_center $cxy 0 axis 0 0 1 radius 0 dir outside length 1e40 type 1
}
constraint cylinder center $cxy $cxy 0 axis 0 0 1 radius [expr $outer_limit + $lj_sigma] dir inside length 1e40 type 0

####################
# Helpers
####################

proc get_alpha {} {
    global r1_center r2_center
    set inner 0
    set total 0
    for { set p 0 } { $p $r1_center && $posx < $r2_center } { incr inner } incr total }
    return [expr double($inner)/$total]
}

proc writecfg { cfg code count alpha lj_forcex cl_forcex { range "all" } } {
    global box_l model
    puts $cfg "{nextcfg $code $count}"
    if { $model == "rods" } {
        puts $cfg "{rod_forces_lj {[constraint force 1]} {[constraint force 3]}}"
        puts $cfg "{rod_forces_cl {[constraint force 0]} {[constraint force 2]}}"
    } elseif { $model == "ljcyl" } {
        puts $cfg "{rod_forces_lj {[constraint force 0]} {[constraint force 1]}}"
        puts $cfg "{rod_forces_cl {[helix_force 0]} {[helix_force 1]}}"
    } else {
        puts $cfg "{rod_forces_lj {0.0 0.0 0.0} {0.0 0.0 0.0}}"


        puts $cfg "{rod_forces_cl {[helix_force 0]} {[helix_force 1]}}"
    }
    puts $cfg "{net_forces lj $lj_forcex cl $cl_forcex}"
    puts $cfg "{total_energy [analyze energy total]}"
    puts $cfg "{alpha $alpha}"
    blockfile $cfg write variable box_l
    blockfile $cfg write particles { id pos v q f } $range
    flush $cfg
}

proc log_file { code tag } {
    global output
    return [format "${output}/info-$code-%4.2f.log" $tag]
}

proc cfg_file { code tag } {
    global output
    return [format "${output}/cfg-$code-%4.2f" $tag]
}

####################
# Simulation
####################

proc init { id temp } {
    puts "@@ init $id $temp"
    # simulation data
    global wrt_time equ_steps
    global prefactor gamma
    # stored simulations
    global config time
    # for the initialization
    global maxPWerror
    global master
    # set up of particle coordinates
    global box_l outer_limit model
    global COU_rod initial_inner COU_charge helix_parts
    global r1_center r2_center cxy
    global rod_rad lambda
    global rod_dist ss_dist

    puts "[pid] equilibration"
    set log_file [open [log_file "equ" $temp] "w"]


    set cfg_file [open [cfg_file "equ" $temp] "w"]
    puts $log_file "temperature $temp"
    puts $log_file "cylinder length $box_l, radius $outer_limit"
    puts $log_file "rod radius $rod_rad, line charge density $lambda"
    puts $log_file "rod distance $rod_dist, surface dist $ss_dist"

    # particle setup
    ####################
    if { $model != "rods" } {
        # DNA setup
        global helix_pitch helix_parts lj_sigma helix1_shift
        set helix_radius [expr $rod_rad - $lj_sigma]
        # the LJ radius is $lj_sigma which needs to be subtracted from the helix radius
        if { $model != "helix" } { set helix_radius 0 }
        if { $model == "ljcyl" } { set helix_type 3 } else { set helix_type 1 }
        helix 0 $r1_center $cxy $helix_radius $helix_pitch $helix_parts $lambda $helix_type
        helix 1 $r2_center $cxy $helix_radius $helix_pitch $helix_parts $lambda $helix_type $helix1_shift
    }

    set initial_out1 [expr $COU_rod - $initial_inner/2]
    set initial_out2 [expr 2*$COU_rod - $initial_inner - $initial_out1]
    puts "n_i=$initial_inner, n_o1=$initial_out1, n_o2=$initial_out2"

    # inner
    set px $cxy
    set i_s [expr $box_l/$initial_inner]
    for { set i 0 } { $i < $initial_inner } { incr i } {
        part [expr $i + 2*$helix_parts] pos $px $cxy [expr $i*$i_s] type 2 q $COU_charge v 0 0 0
    }
    # outer 1
    if { $initial_out1 > 0 } {
        set px [expr $r1_center - $rod_rad]
        set i_s [expr $box_l/$initial_out1]
        set s $initial_inner


        for { set i 0 } { $i < $initial_out1 } { incr i } {
            part [expr $s + $i + 2*$helix_parts] pos $px $cxy [expr $i*$i_s] type 2 q $COU_charge v 0 0 0
        }
    }
    # outer 2
    if { $initial_out2 > 0 } {
        set px [expr $r2_center + $rod_rad]
        set i_s [expr $box_l/$initial_out2]
        set s [expr $initial_inner + $initial_out1]
        for { set i 0 } { $i < $initial_out2 } { incr i } {
            part [expr $s + $i + 2*$helix_parts] pos $px $cxy [expr $i*$i_s] type 2 q $COU_charge v 0 0 0
        }
    }

    # Equilibration Integration (electrostatics startup)
    ####################
    setmd time 0
    thermostat langevin $temp $gamma
    # well, we only use the near formula, far is too expensive with our
    # topology
    inter coulomb [expr $prefactor/$temp] mmm1dgpu [expr 2*$outer_limit] 20 $maxPWerror
    set coulomb_params [lrange [lindex [inter coulomb] 0] 2 end]
    puts "[pid] coulomb parameters $coulomb_params"

    set wrote_fixed 0
    for { set i 0 } { $i < $equ_steps } { incr i } {
        integrate $wrt_time
        set alpha [get_alpha]
        if { $model == "rods" } {
            set lj_forcex [expr 0.5*([lindex [constraint force 3] 0] - [lindex [constraint force 1] 0])/$box_l]
            set cl_forcex [expr 0.5*([lindex [constraint force 2] 0] - [lindex [constraint force 0] 0])/$box_l]
        } else {
            if { $model == "ljcyl" } {
                set lj_forcex [expr 0.5*([lindex [constraint force 1] 0] - [lindex [constraint force 0] 0])/$box_l]
            } else {
                set lj_forcex 0.0
            }
            set cl_forcex [expr 0.5*([lindex [helix_force 1] 0] - [lindex [helix_force 0] 0])/$box_l]
        }
        if { $wrote_fixed == 1 && $i != [expr $equ_steps - 1] } {
            writecfg $cfg_file equ [setmd time] $alpha $lj_forcex $cl_forcex "[expr 2*$helix_parts]-end"
        } else {
            writecfg $cfg_file equ [setmd time] $alpha $lj_forcex $cl_forcex all
            set wrote_fixed 1
        }
        puts $log_file "[setmd time] equ $alpha [expr [analyze energy total] - [analyze energy kinetic]] $lj_forcex $cl_forcex"
        flush $log_file
    }

    close $log_file
    close $cfg_file
    set config($id) [part]
    set time($id) [setmd time]
    if { $master == "" } {
        parallel_tempering::set_shareddata $coulomb_params
    }
}

proc perform { id temp } {
    puts "@@ perform $id $temp"
    # simulation data
    global wrt_time measure_steps
    global prefactor gamma
    # stored simulations
    global config time
    global box_l model helix_parts n_part


    # Integration
    ####################
    puts "[pid] start integration $temp"
    thermostat langevin $temp $gamma
    eval inter coulomb [expr $prefactor/$temp] $parallel_tempering::shareddata
    # load particles
    foreach p $config($id) { eval part $p }
    setmd time $time($id)
    # reassign velocities according to temperature
    set fac [expr sqrt(1.5*$temp*$n_part/[analyze energy kinetic])]
    puts "[pid] kinetic energy [analyze energy kinetic]"
    for { set p 0 } { $p