
Mach. Learn.: Sci. Technol. 1 (2020) 025003 https://doi.org/10.1088/2632-2153/ab5929

PAPER • OPEN ACCESS

RECEIVED 2 August 2019
REVISED 22 October 2019
ACCEPTED FOR PUBLICATION 19 November 2019
PUBLISHED 18 March 2020

Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

© 2020 The Author(s). Published by IOP Publishing Ltd

A charge density prediction model for hydrocarbons using deep neural networks

Deepak Kamal, Anand Chandrasekaran, Rohit Batra and Rampi Ramprasad
Department of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, United States of America

E-mail: [email protected]

Keywords: electron density, machine learning, hydrocarbons, density functional theory, neural networks

Supplementary material for this article is available online

Abstract
The electronic charge density distribution ρ(r) of a given material is among the most fundamental quantities in quantum simulations, from which many large-scale properties and observables can be calculated. Conventionally, ρ(r) is obtained using Kohn–Sham density functional theory (KS-DFT) based methods. But the high computational cost of KS-DFT renders it intractable for systems involving thousands/millions of atoms. Thus, there have recently been efforts to bypass the expensive KS equations and directly predict ρ(r) using machine learning (ML) based methods. Here, we build upon one such scheme to create a robust and reliable ρ(r) prediction model for a diverse set of hydrocarbons, involving huge chemical and morphological complexity (saturated and unsaturated molecules, cyclo-groups, and amorphous and semi-crystalline polymers). We utilize a grid-based fingerprint to capture the atomic neighborhood around an arbitrary point in space, and map it to the reference ρ(r) obtained from standard DFT calculations at that point. Owing to the grid-based learning, dataset sizes exceed billions of points, which are trained using deep neural networks in conjunction with an incremental learning based approach. The accuracy and transferability of the ML approach is demonstrated not only on a diverse test set, but also on a completely unseen system of polystyrene under different strains. Finally, we note that the general approach adopted here could be easily extended to other material systems, and can be used for quick and accurate determination of ρ(r) for DFT charge density initialization, and for computing dipole or quadrupole moments and other observables for which reliable density functionals are known.

1. Introduction

The electronic charge density distribution ρ(r) of a molecule or a material is a physical observable of great significance. In principle, all ground state properties associated with a material can be accessed with the knowledge of its underlying ρ(r), as per density functional theory (DFT) [1]. An accurate estimate of ρ(r) can provide insights concerning charge redistribution, bond formation, etc, in molecular and materials systems. ρ(r) is also the starting point for a variety of electronic structure simulations aimed at calculating higher level electronic properties like the electrostatic moments associated with molecules, such as dipole and quadrupole moments, electrostatic potentials (recently demonstrated by Fabrizio et al [2]), electrostatic interaction energies, etc. It can also be used to directly compute IR intensities [3] and identify binding sites in host-guest compounds [4–7].

Obtaining the charge density of a given material system is, however, non-trivial. Although many-electron quantum chemistry methods can accurately estimate ρ(r), they are extremely expensive. A modern, less expensive alternative is DFT within the Kohn–Sham (KS) single-particle ansatz [8]. KS-DFT has become popular owing to its attractive cost-accuracy trade-off, and serves as a vital part of modern materials discovery portfolios [9–15]. But KS-DFT is still expensive enough that high-performance computers are necessary to study large swaths of chemical space involving tens to hundreds of thousands of molecules and materials.



The primary bottleneck of KS-DFT comes from the difficulty of self-consistently solving the KS equation (specifically, having to orthogonalize single-particle eigenvectors during the process), which produces the electronic charge density, eigenvalues and one-electron wave functions. It has recently been shown that this bottleneck may be sidestepped using machine learning (ML) techniques [16].

A majority of efforts related to the application of ML in the chemical and materials sciences have focused on learning a particular material property of interest (e.g. band gap, formation energy, etc) using large reference databases of experimental measurements [17–19] or KS-DFT calculations [20–22]. A few of them go a step further and learn more fundamental properties like energies and atomic forces [23–33]. Lately, an increasing number of publications have tried to directly predict the ground state electronic charge density as well [28, 34–40]. This paper is a contribution towards the last class of efforts.

The Hohenberg–Kohn (H–K) theorem guarantees a unique mapping between the charge density distribution and the structure of a material (in terms of the nuclear potential), thereby suggesting a functional relationship between a structural representation and the corresponding charge density. Brockherde et al [37] showed that it is possible to map the total potential energy function of a given molecular structure to the associated charge density using an ML approach that mimics the H–K mapping. Starting with a representation of the charge density in terms of plane wave basis functions, they successfully learned the relationship between the coefficients of these basis functions and the nuclear potentials using kernel ridge regression. Due to the numerical representation of the structure in terms of full-body nuclear potentials and the use of a plane wave basis to represent the charge density, their model however offers limited transferability. Grisafi et al [39] overcame this challenge by expanding the total charge density as a sum of atom-centered non-orthogonal spherical harmonic basis functions and using symmetry-adapted Gaussian process regression to learn the coefficients of these basis functions. But the complex nature of the high dimensional regression problem that attempts to find atom-decomposed basis function coefficients limits the ability of the model to learn from a large dataset.

Here, we build on our previous work [41], which introduced a different approach to access the charge density, given the structure. A schematic of our approach is shown in figure 1(a). The idea is to map the charge density at every grid point to the respective local atomic environment around the grid point. In this work, we demonstrate the accuracy, transferability and scalability of this method by constructing a general charge density model for hydrocarbons (involving nearly 60 chemically distinct molecules and polymers). A summary of the steps involved in the generation of this model is shown in figure 1(b). We start by generating a dataset which includes a large variety of hydrocarbon environments and the corresponding charge densities. The local atomic arrangement is then represented using a novel fingerprinting technique, introduced by Chandrasekaran et al [41]. The relationship between local environment and charge density is then learned using neural networks (NNs). To handle the problem of data explosion inherent to the nature of the approach, we introduce a step-by-step training process that exploits the ability of NNs to learn information and down-selects huge datasets to computationally affordable subsets for efficient model training.

The charge density model thus obtained can readily be used to quickly estimate the electronic charge density of new configurations, and can even serve as a starting point for the self-consistent computations involved in the KS-DFT routines, especially for systems of large sizes.

Figure 1. (a) An illustration of how we cast the problem of solving the KS equations into a learning problem. The 3D arrangement of atoms at each grid point is defined using a fingerprinting scheme. The relationship between the local arrangement of atoms and the charge density at each grid point is learned using NNs. (b) The overall workflow used in this work for model generation and validation.


Looking into the future, the model can also be used to calculate properties like dipole and quadrupole moments of structures. Moreover, it can be utilized in conjunction with classical potentials or ML-based force fields to access properties such as dielectric response.

2. Methodology

2.1. Dataset
In order to systematically capture a large variety of bonding environments prevalent in the hydrocarbon family, we have built a dataset comprising five groups: (i) alkanes, (ii) alkenes, (iii) alkynes, (iv) cyclo-groups, and (v) polymers, as illustrated in figure 2. Such a classification ensures that a variety of possible bonding environments (sp3, sp2, sp1 and aromatic bonds) are well represented in the dataset. Groups (i), (ii) and (iii) contain all possible hydrocarbon molecules with up to 5 carbon atoms. Additionally, the cyclo-group contains benzene and cyclo-alkanes with up to 10 carbon atoms. The polymer class contains crystalline and amorphous polyethylene (PE), and PE with defects involving double bonds, triple bonds and side chains. Information about all the molecules/polymers present in each group is listed in appendix A. To generate non-equilibrium bonding environments, ab initio molecular dynamics (AI-MD) simulations were performed for each case, from which 10 snapshots were randomly sampled. Further, for each snapshot of each molecule, a self-consistent field electronic structure optimization was performed to obtain the corresponding charge density to be utilized for the ML process. The distribution of the generated reference dataset in terms of the number of molecules, snapshots per molecule, and the total number of grid points is presented in table 1.

2.2. DFT and MD details
AI-MD simulations for reference dataset generation, using a microcanonical ensemble, were carried out at 300 K for 0.5 ns with a time step of 1 fs. All charge density calculations in this work were done using the Vienna Ab initio Simulation Package (VASP) [42], employing the Perdew–Burke–Ernzerhof exchange-correlation functional and the projector-augmented wave methodology. A Monkhorst–Pack grid with a density of 0.1 Å−1 was adopted, and a basis set of plane waves with kinetic energies up to 500 eV was used to represent the wavefunctions. Convergence studies were conducted to find the optimum k-points and other geometry-dependent numerical parameters for all calculations.
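For orientation only, the settings above could be driven from ASE roughly as follows. This is a hedged sketch, not a script from the paper: the INCAR tags (IBRION=0 with SMASS=-3 for microcanonical MD, POTIM for the 1 fs step, NSW for the 0.5 ns trajectory, KSPACING as a proxy for the stated k-grid density) reflect our reading of the text, and the molecule and box are placeholders.

```python
# Hedged ASE sketch of the AI-MD settings described in section 2.2.
from ase.build import molecule
from ase.calculators.vasp import Vasp

atoms = molecule('C2H6')          # e.g. ethane; any dataset member works
atoms.set_cell([12.0, 12.0, 12.0])  # periodic box around the molecule
atoms.center()

calc = Vasp(xc='PBE',             # Perdew-Burke-Ernzerhof functional
            encut=500,            # plane-wave cutoff (eV)
            kspacing=0.1,         # k-point density in 1/Angstrom (our reading)
            ibrion=0, smass=-3,   # microcanonical (NVE) molecular dynamics
            potim=1.0,            # time step (fs)
            nsw=500000,           # 0.5 ns of MD at 1 fs/step
            tebeg=300)            # initial temperature (K)
atoms.calc = calc
atoms.get_potential_energy()      # launches the run (requires a VASP binary)
```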

Figure 2. Reference dataset used for learning charge density models. The dataset includes configurations from the (i) alkanes, (ii) alkenes, (iii) alkynes, (iv) cyclo-groups and (v) polymers classes to ensure sufficient chemical diversity.

Table 1. Summary of the reference data set used in this work. The data set is categorized into different classes based on the types of bonding environments.

Group name      # of molecules   # of snapshots   # of grid points
Alkanes         10               100              204 472 320
Alkenes         20               200              568 604 800
Alkynes         11               110              322 884 480
Cyclo-groups    26               260              1 061 442 560
Polymers        6                60               285 286 400


2.3. Fingerprint
Within a standard DFT implementation, the scalar field ρ(r) is calculated for a structure on a 3D grid of discrete points in space. Our objective is to find the relationship between ρ(r) at these grid points and the arrangement of atoms around them. We accomplish this using a fingerprint that numerically represents the atomic arrangement around a grid point [43]. Out of the many powerful structure representations in the literature [32, 38, 39, 43–45], we choose this fingerprint [43] because it systematically represents the radial and angular distributions of the atoms around a point in space. This is crucial when using a grid-based approach wherein a large number of points are to be represented. Further, this representation is also more intuitive relative to the popular bispectrum approach [45]. The said fingerprint definition is based on our previous work, and consists of a hierarchy of scalar (S), vector (V) and tensor (T) components which capture the radial and angular distribution of atoms around a grid point. The scalar component, which captures the radial information of atoms around a grid point g using a predefined set of Gaussian functions (indexed by k) of varying widths σk, is defined as:

$$S_k^{\Omega} = c_k \sum_{i=1}^{N_\Omega} \exp\!\left(-\frac{r_{gi}^2}{2\sigma_k^2}\right) f_c(r_{gi}), \qquad (1)$$

where $r_{gi}$ is the distance between atom i of species Ω and the reference grid point g, and $N_\Omega$ denotes the number of atoms of type Ω. $f_c(r_{gi})$ is the cut-off function, defined as $0.5\,[\cos(\pi r_{gi}/R_c) + 1]$ for $r_{gi} \leq R_c$ and equal to 0 for $r_{gi} > R_c$. The coefficient $c_k$ is the normalization constant given by $1/(\sigma_k\sqrt{2\pi})^{3}$. The overall dimensionality of the scalar component is the product of the number of Gaussians k and the number of element types. Similarly, the vector and tensor components are defined by

$$V_k^{\Omega,\alpha} = c_k \sum_{i=1}^{N_\Omega} \frac{r_{gi}^{\alpha}}{r_{gi}} \exp\!\left(-\frac{r_{gi}^2}{2\sigma_k^2}\right) f_c(r_{gi}), \qquad (2)$$

$$T_k^{\Omega,\alpha\beta} = c_k \sum_{i=1}^{N_\Omega} \frac{r_{gi}^{\alpha} r_{gi}^{\beta}}{r_{gi}} \exp\!\left(-\frac{r_{gi}^2}{2\sigma_k^2}\right) f_c(r_{gi}), \qquad (3)$$

where α and β represent the x, y or z directions. Here the pre-factors $r_{gi}^{\alpha}/r_{gi}$ and $r_{gi}^{\alpha} r_{gi}^{\beta}/r_{gi}$ are added to incorporate the angular distribution of atoms of a specific species at a given radius. Further, to make the vector and tensor components rotationally invariant, the following quantities are defined and used as the fingerprint components:

$$V_k^{\Omega} = \sqrt{(V_k^{\Omega,x})^2 + (V_k^{\Omega,y})^2 + (V_k^{\Omega,z})^2}, \qquad (4)$$

$$T_k^{\Omega} = T_k^{\Omega,xx} + T_k^{\Omega,yy} + T_k^{\Omega,zz}, \qquad (5)$$

$$T_k'^{\,\Omega} = T_k^{\Omega,xx} T_k^{\Omega,yy} + T_k^{\Omega,yy} T_k^{\Omega,zz} + T_k^{\Omega,zz} T_k^{\Omega,xx} - (T_k^{\Omega,xy})^2 - (T_k^{\Omega,yz})^2 - (T_k^{\Omega,zx})^2, \qquad (6)$$

$$\bar{T}_k^{\Omega} = \det\!\left(T_k^{\Omega,\alpha\beta}\right). \qquad (7)$$

Collecting the different pieces, the overall fingerprint consists of the terms $S_k^{\Omega}$, $V_k^{\Omega}$, $T_k^{\Omega}$, $T_k'^{\,\Omega}$ and $\bar{T}_k^{\Omega}$. By construction, each fingerprint component is invariant to system translation, rotation and permutations of atoms, similar to the target property ρ(r). To calculate these fingerprints at a grid point, we use a set of predefined Gaussian functions with their means at the chosen grid point and having a range of standard deviations (σk). A total of 16 Gaussian functions were used in this study, with widths varying from 0.25 to 8 Å on a logarithmic scale, along with a cut-off parameter of Rc = 9 Å.
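To make the construction concrete, the following minimal NumPy sketch evaluates equations (1)–(7) for a single grid point and a single species Ω (in practice the components are computed per species, here C and H, and concatenated). The function name and array conventions are ours, not the authors', and the tensor pre-factor follows equation (3) as reconstructed above.

```python
import numpy as np

def fingerprint_at_point(g, positions, Rc=9.0, n_k=16):
    """Fingerprint of Eqs. (1)-(7) at grid point g (shape (3,)) for one
    species whose atom coordinates are in `positions` (shape (N, 3)).
    Assumes no atom sits exactly on g. Returns 5*n_k invariants."""
    sigmas = np.logspace(np.log10(0.25), np.log10(8.0), n_k)  # widths (A)
    d = positions - g                       # displacement vectors r_gi^alpha
    r = np.linalg.norm(d, axis=1)           # distances r_gi
    d, r = d[r <= Rc], r[r <= Rc]           # apply the cutoff radius
    fc = 0.5 * (np.cos(np.pi * r / Rc) + 1.0)   # cutoff function f_c
    feats = []
    for s in sigmas:
        ck = (1.0 / (np.sqrt(2.0 * np.pi) * s)) ** 3   # normalization c_k
        w = ck * np.exp(-r**2 / (2.0 * s**2)) * fc     # per-atom weight
        S = w.sum()                                    # Eq. (1)
        V = (d / r[:, None] * w[:, None]).sum(axis=0)  # Eq. (2) components
        T = np.einsum('ia,ib,i->ab', d, d, w / r)      # Eq. (3) components
        Vk = np.linalg.norm(V)                         # Eq. (4)
        Tk = np.trace(T)                               # Eq. (5)
        Tp = (T[0, 0]*T[1, 1] + T[1, 1]*T[2, 2] + T[2, 2]*T[0, 0]
              - T[0, 1]**2 - T[1, 2]**2 - T[2, 0]**2)  # Eq. (6)
        Td = np.linalg.det(T)                          # Eq. (7)
        feats.extend([S, Vk, Tk, Tp, Td])
    return np.array(feats)
```

With 16 widths and 5 invariants this yields 80 features per species, i.e. 160 for a C/H system, which is the input dimensionality we assume in the network sketch of section 2.4.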

2.4. ML method
The choice of ML algorithm is another key ingredient in developing any ML based model. Since the problem at hand demands handling large volumes of data, amounting to billions of points, conventional methods like Gaussian process regression are computationally inefficient. Even varieties of sparse Gaussian process regression based methods [46–48] only reduce the training and prediction costs to $\mathcal{O}(m^2 n)$ and $\mathcal{O}(m^2)$, respectively [48], where m is the number of pseudo-inputs and n is the number of data points. NNs have been the 'go to' solution for big data problems in situations where the functional relationships are completely unknown and the dataset size is enormous. They have been widely used in areas like image recognition, targeted advertising and natural language processing for at least a decade [49]. Moreover, within the materials science and chemistry communities, NNs have been used extensively to develop force-fields for a variety of molecular and materials systems [25, 26, 32, 38].

Here, we employ a type of NN called a deep feed-forward neural network (DNN). The architecture is composed of a set of layers of neurons, namely, the input layer, hidden layers and an output layer. In the input layer, each neuron contains a component of the fingerprint vector (see figure 1(a)), and the size of the input layer is decided by the number of components in the fingerprint.


In the hidden layers, each neuron represents a predefined parametrized function acted upon by a weighted linear transform of each of the neurons in the previous layer. The number of neurons in every hidden layer, and the number of hidden layers themselves, are hyperparameters chosen/optimized based on the problem at hand. Here, for convenience, we keep the number of neurons in each layer constant (128 neurons per layer) and vary the number of hidden layers to study the dependence of model performance on depth. The output layer has a single neuron which collects the output from the last hidden layer and calculates the property of interest (ρ(r), in this work) using a weighted linear transform. Using this method, the underlying functional relationship between the fingerprint and the property is learned by finding the correct weights of the linear transforms and the function parameters used in each neuron via an optimization algorithm [50].

In this work, Keras [51] with a TensorFlow backend was used to build the models. The input layer and hidden layers have 128 neurons each, while the output layer has one neuron. The ReLU activation function is used in the hidden layers. A mini-batch training algorithm with random sampling was employed in training all models, the root mean square error (RMSE) was used as the objective function, and the Adam optimizer was used to optimize the weights. NN hyperparameters, such as the number of nodes per layer, the activation function used in the neurons, the choice of optimizer, the form of the loss function, and the stopping criterion for training, were chosen based on a validation error minimization criterion and the time taken to train the NN model [41].
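A minimal Keras sketch of this architecture follows. The input width of 160 (16 Gaussian widths × 5 rotational invariants × 2 element types) is our inference from section 2.3, not a number stated by the authors, and the RMSE loss is written out explicitly since Keras ships RMSE only as a metric.

```python
import tensorflow as tf
from tensorflow import keras

def rmse(y_true, y_pred):
    """Root mean square error, the objective function used for training."""
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))

def build_model(n_features=160, n_hidden=8, width=128):
    """Constant-width feed-forward NN: ReLU hidden layers, linear output."""
    model = keras.Sequential()
    model.add(keras.layers.InputLayer(input_shape=(n_features,)))
    for _ in range(n_hidden):
        model.add(keras.layers.Dense(width, activation='relu'))
    model.add(keras.layers.Dense(1))      # single neuron, linear transform
    model.compile(optimizer='adam', loss=rmse)
    return model

# Mini-batch training with random sampling (shuffle=True is Keras' default);
# X and y would be the fingerprint matrix and reference densities.
# model = build_model()
# model.fit(X, y, batch_size=1024, epochs=10, validation_split=0.1)
```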

2.5. Model training
In our grid-based approach, the charge density at every grid point was used to train our model. To maintain the accuracy of the model, a fine grid spacing of approximately 0.1 Å was used. This results in an enormous amount of training data, of the order of billions of examples (for the whole dataset), which raises many practical challenges in handling the learning process. To reduce the size of the problem, a multi-step process was employed in developing the model. The dataset was first divided into subsets as described in the following section. The models were then progressively improved by training on different subsets in a sequential manner as described below, closely following the idea of incremental learning. We note here that many of the advanced mini-batch diversification techniques [52, 53], which focus on intelligently dividing datasets into smaller subsets to allow NN training at low computational (memory) cost, could also be a solution. They do not down-select or throw away points from datasets, but utilize all points for NN training. We chose to go with down-selecting the data instead, owing to the similarity between examples in the dataset and other practical considerations like the time taken to train the NN.

2.5.1. Dataset division
To start with, the dataset is divided into the training set, the validation set and the test set, as shown in figure 3. The examples used to train the NNs were drawn from the training set, whereas the validation set was used to prevent over-fitting and the test set was used to evaluate model performance. To overcome practical limitations in handling the data, the training dataset is further divided into 3 equal subsets, i.e. Dataset-A, Dataset-B and Dataset-C. In addition to these, a much smaller dataset, Dataset-0, is introduced to initiate the training workflow. For all data subsets, an equal proportion of data is taken from each of the five hydrocarbon groups (see figure 4) to ensure enough chemical diversity, and to facilitate continual improvement of the model during training; a sketch of this stratified split is given below.
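The following helper is a hedged sketch of the division in figure 3, assuming a 60/20/20 train/validation/test split stratified by hydrocarbon group; the function name and array layout are illustrative, not from the paper.

```python
import numpy as np

def split_dataset(X, y, group_labels, rng=np.random.default_rng(0)):
    """Stratified 60/20/20 split; the 60% training share is cut into
    three equal subsets (Datasets A, B, C). `group_labels` marks which
    of the five hydrocarbon groups each grid point belongs to."""
    parts = {k: [] for k in ('A', 'B', 'C', 'val', 'test')}
    for g in np.unique(group_labels):
        idx = rng.permutation(np.where(group_labels == g)[0])
        n = len(idx)
        train = idx[:int(0.6 * n)]
        parts['val'].append(idx[int(0.6 * n):int(0.8 * n)])
        parts['test'].append(idx[int(0.8 * n):])
        for key, sub in zip(('A', 'B', 'C'), np.array_split(train, 3)):
            parts[key].append(sub)
    return {k: np.concatenate(v) for k, v in parts.items()}
```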

2.5.2.Multi-stepmodel trainingSince the hardware limits the number of examples that can be learned during the training, we expose ourmodelto different data subsets (and thus to the entire dataset) in a sequentialmanner. Further, we use the pre-trainedmodels tofind data points where ourmodel performs poorly, and then specifically sample from such high errorpoints to train in a loop-wise fashion. This is one of the possible active samplingmethods, very similar in spirit toadaboost learning [54]. As demonstrated in the figure 4, we start by creating an initial NNmodel, termedModel0, using the relatively smallerDataset-0 (which amounts to about 20million examples). TheModel 0 is next usedtomake predictions onDataset-A, and exclusively find data points for which themodel performance is poor.These high error points are then used to form the ‘revised’Dataset-A, details of which are provided inappendix B. This idea of retaining only the data points with high errors allows us to limit the size of trainingexamples. The next generation ofNNmodel (with the sameNNarchitecture) initialized usingweights ofModel0 is then trained using the ‘revised’Dataset-A. The obtainedNN is then again used to ‘revise’Dataset-B andinitialize weights of the futureNNmodel, with the process repeated until convergence in test errors is obtained.Further, in order to expose ourNNmodel to the entire dataset, the data subsets are revised in a cyclicmannerwith the order A→B→C→A→B→C, as reflected infigure 4. RMSE between theDFT calculated andmodel predicted values was chosen as the errormetric.
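Putting the pieces together, the cycle of figure 4 might be driven as follows. This is a hedged sketch: build_model is the section 2.4 sketch, revise_dataset is sketched in appendix B, and the synthetic arrays merely stand in for the real fingerprint data.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def fake_subset(n, n_features=160):
    """Synthetic stand-in for a fingerprint/charge-density data subset."""
    return rng.normal(size=(n, n_features)), rng.normal(size=n)

X0, y0 = fake_subset(20_000)                        # stand-in for Dataset-0
subsets = [fake_subset(60_000) for _ in range(3)]   # Datasets A, B, C

model = build_model()                               # section 2.4 sketch
model.fit(X0, y0, batch_size=1024, epochs=5, verbose=0)   # Model 0

# Cycle A -> B -> C -> A -> B -> C; each generation warm-starts from the
# previous one (the weights already in `model` are simply trained further).
for X, y in itertools.islice(itertools.cycle(subsets), 6):
    Xr, yr = revise_dataset(model, X, y, keep_size=20_000)  # appendix B
    model.fit(Xr, yr, batch_size=1024, epochs=5, verbose=0)
    # in practice: stop once the test-set RMSE has converged
```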


This multi-step workflow has several key advantages. Initial sampling and subsequent revision of the data subsets reduces the practical difficulties in handling data by reducing the dataset size. This also prevents the models from being biased by the presence of a large number of examples from a small region, which usually results in poor representation of outliers in the model. The initialization of weights from the previous generation of NN models helps to efficiently transfer knowledge learned from the past data, a process termed transfer learning. This allows further improvement of the model performance on new examples, without having to access the past reference data.

Figure 3. Schematic of the dataset division into training, test and validation sets. Each of the test and validation sets contains 20% of the total data. The training set has 60% of the total data and is further divided into subsets named Dataset-A, Dataset-B and Dataset-C. Finally, an initial seed dataset, labeled Dataset-0, which has 5 structures from each group, is used to create the initial ML model.

Figure 4. The multi-step training process adopted to learn charge density models. First, an initial NN model, Model 0, is learned using Dataset-0. Model 0 is then used to sample high error points from Dataset-A to arrive at a 'revised' Dataset-A. Starting from the weights of Model 0, the next generation of NN model is obtained by training on the 'revised' Dataset-A. This next generation of NN is then again used to 'revise' Dataset-B, and initialize the next generation of NN model. This process is repeated until convergence in test error is achieved.


3. Results and discussion

3.1. Effect of model architecture on performance
Past studies [55, 56] indicate a heavy dependence of model performance on the depth of the NN model. Here, to understand the effect of the number of hidden layers on the performance of the model, and to choose the best NN model that can be trained using the prescribed methodology, a set of different NN architectures was explored. Models with minimum cross-validation RMSE were chosen to prevent over-fitting. The RMSE of the test set was used to evaluate these models. To limit the hyperparameter search space, the number of neurons in all hidden layers was kept fixed as described earlier. Figure 5 shows the performance of models made with varying numbers of hidden layers when tested on the test set. To compare the performance of the models more clearly, we report an average test set error computed by taking the mean of the error of the 1000 points with maximum absolute error for each structure (reported as structure-error in figure 5). The model developed using a single hidden layer was found to show significantly higher errors and hence is not reported here. A decrease in structure-error with increasing number of hidden layers is evident from the figure.

Figure 5 indicates that all models are able to systematically learn the information contained in the different datasets following the proposed multi-step methodology. It should be noted that models developed in each cycle with a particular architecture do not have access to the entire dataset, but only to one of Dataset-A, B and C. Information about the previously seen data is only contained in the initial weights from which the model training begins.

The proposed methodology thus provides a way to systematically train models on very large datasets by splitting them into small batches and choosing training points which are unfamiliar to the model. This methodology also enables us to improve a pre-trained model by incorporating new environments of interest without having access to the data used to train the model. On comparing the performance of the best models as a function of increasing depth of the NN, it can be seen that there is no significant effect of model depth on performance. Rather, the performance is more dependent on the number of training cycles. This means that all these NN models are able to capture the underlying functional relationship between our fingerprint and the charge density fairly well.

3.2. Performance of best model
To evaluate the performance of the model on the different classes of structures in the test dataset, the best model (in terms of charge density RMSE) out of all those examined earlier was chosen (hidden layers: 8, cycle: 2). Figure 6 shows parity plots between the charge densities predicted using ML and those calculated using DFT. Owing to the large size of the dataset, only the charge densities of the top 1% least accurate points from each structure in the test set are depicted in figure 6(a).

In order to estimate the accuracy of the prediction and compare it with other works in the literature, the weighted mean absolute percentage error, $\epsilon_\rho(\%)$, was computed; it is defined in equation (8) below.

Figure 5. Average structure-error on the test set as a function of the number of hidden layers in the NNs and the number of training cycles. The structure-error, for a given NN architecture, converges after a couple of training cycles.


$$\epsilon_\rho(\%) = 100 \times \frac{1}{N_e} \sum_i^{n} N_e^i \, \frac{\int \left|\rho_i^{\mathrm{DFT}}(\mathbf{r}) - \rho_i^{\mathrm{ML}}(\mathbf{r})\right| \mathrm{d}\mathbf{r}}{\int \rho_i^{\mathrm{DFT}}(\mathbf{r}) \, \mathrm{d}\mathbf{r}}, \qquad (8)$$

where $\rho_i^{\mathrm{DFT}}(\mathbf{r})$ is the charge density of test structure i calculated using DFT, $\rho_i^{\mathrm{ML}}(\mathbf{r})$ is the charge density of test structure i predicted by the ML model, $N_e^i$ is the number of electrons in structure i, and $N_e$ is the total number of electrons. $\epsilon_\rho(\%)$ was found to be 1.26% for the entire test set. This is very close to the errors reported in other studies [2, 39]. The accuracy of the prediction can be visually observed from figure 6, where the charge density for each class of structures contained in the test set is plotted. $\epsilon_\rho(\%)$ for each subclass of structures is also reported in the figure. The skewness in the parity plots, most pronounced for alkenes and cyclo-groups, suggests that errors are lower where the charge density takes extreme values, that is, close to the atom centers and far away from them. This is intuitive, as at these grid points the charge density is not as heavily dependent on the arrangement of atoms as in the regions between atoms. For the intermediate valence density regions between atoms, the errors are higher, but still close to the parity line.
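For reference, equation (8) reduces to a short sum over structures when the densities live on uniform grids. The helper below is a hypothetical sketch (names and units are ours): each integral becomes a grid sum times the voxel volume.

```python
import numpy as np

def epsilon_rho_percent(rho_dft_list, rho_ml_list, n_electrons, voxel_vol):
    """Weighted mean absolute percentage error of Eq. (8).
    rho_*_list: per-structure density grids (e/A^3); voxel_vol in A^3."""
    total_e = float(np.sum(n_electrons))
    err = 0.0
    for rho_dft, rho_ml, ne in zip(rho_dft_list, rho_ml_list, n_electrons):
        num = np.sum(np.abs(rho_dft - rho_ml)) * voxel_vol   # integral |drho|
        den = np.sum(rho_dft) * voxel_vol                    # integral rho_DFT
        err += ne * num / den                                # weight by N_e^i
    return 100.0 * err / total_e
```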

To assess whether charge density at this level of accuracy is meaningful, we evaluated the energy of the structures in each group using the ML-predicted charge density and compared it with those obtained from self-consistent field (SCF) converged calculations with VASP. We did this by predicting the charge density on the same fast Fourier transform grid obtained from these SCF calculations using our model, and writing the values in a format which can be read in by VASP. Further, we used VASP to evaluate the energy, keeping the charge density constant. Parity plots for this are shown in figure 6(b). For all chosen classes of molecules and polymers, the correlation between the DFT calculated and ML predicted energy values is close to unity. These plots provide additional reassurance of the agreement between DFT and predicted charge densities for all chosen classes of hydrocarbons. It was also found that when the ρ(r) calculated by the final model is used as the initial guess in VASP, an approximately 10% speedup in SCF convergence was observed for a small structure (<100 atoms). This speedup is expected to increase as a function of system size.

3.3. Tests on unseen structures
To test the transferability of the developed model to similar environments, charge density predictions were made on a set of totally unseen structures. For this test, a single chain of polystyrene was created and compressively strained to different degrees along the chain axis. Note that the training set contains PE and benzene as two different entities but does not include the structure of polystyrene itself. Figure 7 shows an illustration of three of these structures, with 0%, 25% and 50% strain. Charge densities were calculated for these structures using a single self-consistent-loop DFT calculation. These charge densities were then compared to those predicted by the ML model. Figure 7(b) shows line plots of the charge density along the axes indicated in figure 7(a). A good agreement between the model predictions and DFT computations can be observed both visually and from the percentage absolute errors for each structure ($\epsilon_{\rho0}(\%)$, $\epsilon_{\rho0.25}(\%)$, $\epsilon_{\rho0.50}(\%)$) in figure 7. For an even closer inspection, a difference plot between the predicted and the calculated charge densities is reported in appendix C (figure C1). However, the energy values computed using the charge density predicted by the ML model differed by a few eV, signaling that the model should be retrained using some instances of polystyrene to reliably estimate energy values.

Figure 6. Comparison of ML predicted (a) charge density and (b) potential energy with respect to DFT computed values. In (a), only the top 0.1% least accurate predictions are shown. In (b), energy predictions for all structures are accurate up to tens of meV.


3.4. Conclusions
In summary, we have successfully demonstrated an ML based approach to build surrogate charge density models (that circumvent a direct quantum mechanical calculation) using reference datasets obtained from first-principles DFT calculations. The robustness and transferability of the scheme was established using the example of a chemically fairly diverse class of materials, i.e. hydrocarbons. A grid-based fingerprint that captures the 3D atomic neighborhood distribution of local environments was used to learn the respective grid-based charge density value, which resulted in a need to handle huge amounts of data (greater than a billion points). The model was trained on this large dataset using a multi-step iterative scheme that systematically learns from small chunks of high error points, similar in spirit to AdaBoost learning [57]. The scheme was shown to produce accurate charge density predictions for new or unseen hydrocarbon systems, and recovered correct energies when SCF DFT calculations were initiated from the model predicted charge densities, thereby providing a pathway to accelerate quantum mechanical simulations through better initialization of charge densities (instead of conventional charge density initializations [58]). Further, we believe that the present scheme could be used to compute other charge density related properties, such as dipole and quadrupole moments, and other properties for which density functionals may be available. Finally, we note that the approach introduced here is general and can be effortlessly repeated for other material systems to make accurate charge density models.

Funding

This work is supported by the Office of Naval Research through N0014-17-1-2656, a Multi-University Research Initiative (MURI) grant. The authors thank XSEDE for the utilization of the Stampede2 cluster via project ID 'DMR080058N'.

Competing interests

The authors declare that they have no competing financial interests.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request. All data used to generate the models (and the train-validation-split details) are available online at https://khazana.gatech.edu.

Figure 7. Polystyrene (a) configurations and (b) corresponding DFT and ML predicted charge densities under different strain conditions. In (a), the red and green lines represent the axes along which the charge density predictions are made.


Appendix A. Dataset

This section lists all distinct chemical species (molecules/polymers) contained in each group in the dataset.

A.1. Alkanes
methane, ethane, propane, butane (2 isomers), 2-methyl-propane, but-1-ene, pentane, 2-methyl-butane, 2-2-dimethyl-propane.

A.2. Alkenes
ethene, propene, propadiene, but-1-ene, but-2-ene, 2-methyl-propene, but-1-2-diene, but-1-3-diene, pent-1-ene, pent-2-ene, 2-methyl-but-1-ene, 2-methyl-but-2-ene, 3-methyl-but-1-ene, penta-1-2-diene, penta-1-4-diene, penta-2-3-diene, penta-1-3-diene, 2-methyl-but-1-3-diene, 3-methyl-but-1-2-diene.

A.3. Alkynes
ethyne, propyne, but-1-yne, but-2-yne, but-3-en-1-yne, pent-1-yne, pent-2-yne, 3-methyl-but-1-yne, penta-3-en-1-yne, penta-4-en-1-yne, penta-4-en-2-yne.

A.4. Cyclo-groups
cyclopropane, cyclopropene, 1-methyl-cyclopropene, 3-meth-3-en-cyclopropene, 3-methyl-cyclopropene, 1-methyl-cyclopropane, cyclobutene, cyclobutadiene, 1-1-dimethyl-cyclopropane, methyl-cyclobutane, cyclopentene, methyl-cyclobutadiene, 1-2-dimethyl-cyclopropane, 1-2-dimethyl-cyclopropene, 1-3-dimethyl-cyclopropene, 1-methyl-cyclobutene, 3-3-dimethyl-propene, 3-meth-en-cyclobut-1-ene, 3-methyl-cyclobutene, cyclopenta-1-2-diene, cyclohexane, benzene, cycloheptane, cyclooctane, cyclononane, cyclodecane.

A.5. Polymers
polyethylene crystalline, polyethylene disordered, polyethylene with double bond defects, polyethylene with triple bond defects, polyethylene with -CH3 side chain defects.

Appendix B. Data revision

Data revision was introduced to address the following:

(i) to ease the practical difficulties of handling a huge amount of data

(ii) to eliminate bias in the model due to domination by examples from the most heavily represented training regions

(iii) to provide a method for improving the model given new examples, without having to access the past reference data

Dataset revision refers to the process of choosing a small subset from a given large dataset. Examples to be included in this subset are chosen based on the performance of the model from the previous iteration (or the initial model), as described in figure 4 in the main text, on the examples of the large dataset. For instance, Model (i−1), the model trained in the previous training cycle, is used to predict the charge densities at each grid point in each structure in training dataset i. Then, the absolute difference between the predicted and reference values is evaluated for each example. These errors can be loosely modeled by a Gaussian distribution with an abnormally sharp peak centered around 0. All points whose error values fall more than half a standard deviation (of the fitted Gaussian) away from 0 are grouped as high error points. 75% of the points from this high error dataset are then replaced with points from the low error regions to form the revised dataset. 75% are chosen from the well-predicted region because it was found that the performance of the model worsens if it is trained on a high percentage of examples from high error regions. These revised datasets are used for training the model within each training cycle. A code sketch of this selection is given below.
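The selection just described might be implemented as follows. This is a hedged sketch of our reading of the revision step (the exact Gaussian fit and bookkeeping are not specified in the text, so the error spread is used as a proxy for the fitted width); all names are hypothetical. It pairs with the training-loop sketch of section 2.5.2.

```python
import numpy as np

def revise_dataset(model, X, y, keep_size, rng=np.random.default_rng(0)):
    """Form a 'revised' subset: ~25% high-error and ~75% low-error points."""
    err = np.abs(model.predict(X, verbose=0).ravel() - y)
    threshold = 0.5 * err.std()             # half a std dev of the errors
    high = np.where(err > threshold)[0]     # poorly predicted points
    low = np.where(err <= threshold)[0]     # well predicted points
    n_high = min(len(high), keep_size // 4)        # 25% high-error share
    n_low = min(len(low), keep_size - n_high)      # 75% low-error share
    idx = np.concatenate([rng.choice(high, size=n_high, replace=False),
                          rng.choice(low, size=n_low, replace=False)])
    rng.shuffle(idx)
    return X[idx], y[idx]
```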

Appendix C. Charge density difference plot for polystyrene

Figure C1. (a) DFT and ML predicted charge densities under different strain conditions as listed in figure 7(b), (c) difference plots of the charge densities corresponding to each charge density plot.


ORCID iDs

Deepak Kamal https://orcid.org/0000-0003-1943-7774
Anand Chandrasekaran https://orcid.org/0000-0002-2794-3717

References

[1] Hohenberg P and Kohn W 1964 Phys. Rev. 136 B864–71
[2] Fabrizio A, Grisafi A, Meyer B, Ceriotti M and Corminboeuf C 2019 Chem. Sci. 10 9424–32
[3] Porezag D and Pederson M R 1996 Phys. Rev. B 54 7830
[4] Buckingham A, Fowler P and Hutson J M 1988 Chem. Rev. 88 963–88
[5] Castleman A Jr and Hobza P 1994 Chem. Rev. 94 1721–2
[6] Brutschy B and Hobza P 2000 Chem. Rev. 100 3861–2
[7] Hobza P and Rezac J 2016 Chem. Rev. 116 4911–2
[8] Kohn W and Sham L J 1965 Phys. Rev. 140 A1133
[9] Becke A D 2014 J. Chem. Phys. 140 18A301
[10] Jones R O 2015 Rev. Mod. Phys. 87 897
[11] Burke K 2012 J. Chem. Phys. 136 150901
[12] Jain A, Shin Y and Persson K A 2016 Nat. Rev. Mater. 1 15004
[13] Mannodi-Kanakkithodi A, Treich G M, Huan T D, Ma R, Tefferi M, Cao Y, Sotzing G A and Ramprasad R 2016 Adv. Mater. 28 6277–91
[14] Batra R, Huan T D, Jones J L, Rossetti G Jr and Ramprasad R 2017 J. Phys. Chem. C 121 4139–45
[15] Chen L, Huan T D and Ramprasad R 2017 Sci. Rep. 7 6128
[16] Brockherde F, Vogt L, Li L, Tuckerman M E, Burke K and Müller K R 2017 Nat. Commun. 8 872
[17] Kim C, Chandrasekaran A, Jha A and Ramprasad R 2019 MRS Commun. 9 860–6
[18] Mannodi-Kanakkithodi A, Chandrasekaran A, Kim C, Huan T D, Pilania G, Botu V and Ramprasad R 2018 Mater. Today 21 785–96
[19] Kim C, Chandrasekaran A, Huan T D, Das D and Ramprasad R 2018 J. Phys. Chem. C 122 17575–85
[20] Mannodi-Kanakkithodi A, Chandrasekaran A, Kim C, Huan T D, Pilania G, Botu V and Ramprasad R 2018 Mater. Today 21 785–96
[21] Balachandran P V, Emery A A, Gubernatis J E, Lookman T, Wolverton C and Zunger A 2018 Phys. Rev. Mater. 2 043802
[22] Kim C, Chandrasekaran A, Doan Huan T, Das D and Ramprasad R 2018 J. Phys. Chem. C 122 17575–85
[23] Botu V and Ramprasad R 2015 Int. J. Quantum Chem. 115 1074–83
[24] Botu V and Ramprasad R 2015 Phys. Rev. B 92 094306
[25] Behler J and Parrinello M 2007 Phys. Rev. Lett. 98 146401
[26] Behler J 2011 J. Chem. Phys. 134 074106



[27] Bartók A P, Payne M C, Kondor R and Csányi G 2010 Phys. Rev. Lett. 104 136403
[28] Schütt K, Kindermans P J, Felix H E S, Chmiela S, Tkatchenko A and Müller K R 2017 SchNet: a continuous-filter convolutional neural network for modeling quantum interactions Advances in Neural Information Processing Systems 30 992–1002
[29] Botu V, Batra R, Chapman J and Ramprasad R 2016 J. Phys. Chem. C 121 511–22
[30] Kolb B, Lentz L C and Kolpak A M 2017 Sci. Rep. 7 1192
[31] Huan T D, Batra R, Chapman J, Krishnan S, Chen L and Ramprasad R 2017 npj Comput. Mater. 3 37
[32] Smith J S, Isayev O and Roitberg A E 2017 Chem. Sci. 8 3192–203
[33] Imbalzano G, Anelli A, Giofré D, Klees S, Behler J and Ceriotti M 2018 J. Chem. Phys. 148 241730
[34] Snyder J C, Rupp M, Hansen K, Müller K R and Burke K 2012 Phys. Rev. Lett. 108 253002
[35] Montavon G, Rupp M, Gobre V, Vazquez-Mayagoitia A, Hansen K, Tkatchenko A, Müller K R and Von Lilienfeld O A 2013 New J. Phys. 15 095003
[36] Schütt K T, Glawe H, Brockherde F, Sanna A, Müller K R and Gross E K U 2014 Phys. Rev. B 89 205118
[37] Brockherde F, Vogt L, Li L, Tuckerman M E, Burke K and Müller K R 2017 Nat. Commun. 8 872
[38] Schütt K T, Sauceda H E, Kindermans P J, Tkatchenko A and Müller K R 2018 J. Chem. Phys. 148 241722
[39] Grisafi A, Fabrizio A, Meyer B, Wilkins D M, Corminboeuf C and Ceriotti M 2018 ACS Cent. Sci. 5 57–64
[40] Fowler A T, Pickard C J and Elliott J A 2019 J. Phys.: Mater. 2 034001
[41] Chandrasekaran A, Kamal D, Batra R, Kim C, Chen L and Ramprasad R 2019 npj Comput. Mater. 5 22
[42] Kresse G and Furthmüller J 1996 Phys. Rev. B 54 11169
[43] Batra R, Tran H D, Kim C, Chapman J, Chen L, Chandrasekaran A and Ramprasad R 2019 J. Phys. Chem. C 123 15859–66
[44] Willatt M J, Musil F and Ceriotti M 2019 J. Chem. Phys. 150 154110
[45] Behler J 2016 J. Chem. Phys. 145 170901
[46] Quiñonero-Candela J and Rasmussen C E 2005 J. Mach. Learn. Res. 6 1939–59
[47] Quiñonero-Candela J et al 2010 J. Mach. Learn. Res. 11 1865–81
[48] Snelson E and Ghahramani Z 2006 Sparse Gaussian processes using pseudo-inputs NIPS'05: Proc. of the 18th Int. Conf. on Neural Information Processing Systems 1257–64
[49] LeCun Y, Bengio Y and Hinton G 2015 Nature 521 436
[50] Hecht-Nielsen R 1992 Theory of the backpropagation neural network Neural Networks for Perception (Amsterdam: Elsevier) pp 65–93
[51] Chollet F et al 2015 Keras (https://keras.io)
[52] Zhang C, Öztireli C, Mandt S and Salvi G 2019 Active mini-batch sampling using repulsive point processes Proc. AAAI Conf. on Artificial Intelligence vol 33, pp 5741–8
[53] Zhang C, Kjellstrom H and Mandt S 2017 arXiv:1705.00607
[54] Hastie T, Rosset S, Zhu J and Zou H 2009 Stat. Interface 2 349–60
[55] He K, Zhang X, Ren S and Sun J 2016 Deep residual learning for image recognition Proc. IEEE Conf. on Computer Vision and Pattern Recognition pp 770–8
[56] Eldan R and Shamir O 2016 The power of depth for feedforward neural networks Conf. on Learning Theory pp 907–40
[57] Schapire R E and Freund Y 2013 Kybernetes 42 164–6
[58] Woods N, Hasnip P and Payne M 2019 J. Phys.: Condens. Matter 31 453001
