Evolving and Nano Data Enabled Machine Intelligence for ...

doi.org/10.26434/chemrxiv.7291205.v1
Evolving and Nano Data Enabled Machine Intelligence for Chemical Reaction Optimization
Daniel Reker, Gonçalo Bernardes, Tiago Rodrigues
Submitted date: 02/11/2018
Posted date: 02/11/2018
Licence: CC BY-NC-ND 4.0
Citation information: Reker, Daniel; Bernardes, Gonçalo; Rodrigues, Tiago (2018): Evolving and Nano Data Enabled Machine Intelligence for Chemical Reaction Optimization. ChemRxiv. Preprint.


Evolving and nano data enabled machine intelligence for chemical reaction optimization

Daniel Reker,1,2 Gonçalo J. L. Bernardes,*,3,4 Tiago Rodrigues*,4

1. Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, Massachusetts, 02139, USA.
2. Division of Gastroenterology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA.
3. Department of Chemistry, University of Cambridge, Lensfield Road, CB2 1EW, Cambridge, UK.
4. Instituto de Medicina Molecular, Faculdade de Medicina da Universidade de Lisboa, Av. Prof. Egas Moniz 1649-028 Lisboa, Portugal.

* Corresponding authors: [email protected]; [email protected]

Competing financial interests: The authors declare no competing financial interests.

Author contributions: D.R. contributed ideas and performed data analyses. G.J.L.B. contributed research tools and analysed data. T.R. designed the study, designed and implemented the LabMate.AI software, and performed chemistry, analytics and data analyses. T.R. and D.R. wrote the manuscript with contributions from G.J.L.B. All authors agreed on depositing the manuscript in ChemRxiv.

Abstract

Optimizing reaction conditions is an essential routine in synthetic chemistry. However, selecting appropriate experiments remains tightly connected to expert chemistry knowledge. Here, to streamline the reaction yield optimization process and disconnect it from chemical intuition, we developed an adaptive machine intelligence to navigate multidimensional reaction conditions’ spaces. Our approach (LabMate.AI) employs an interpretable algorithm and requires only <0.05% of all search space as input data. LabMate.AI optimizes many reaction parameters simultaneously, and uses minimal computational resources and time. We demonstrate how LabMate.AI can identify optimal conditions for a Ugi and a C–N cross-coupling reaction in a more efficient and faster manner than human experts, while affording reactivity insights. Our approach formalizes chemical intuition, and acquires expert chemistry knowledge autonomously, thereby providing an innovative framework towards informed and automated experiment selection. The results support machine learning for hastening experimental design, democratizing synthetic chemistry, and freeing chemists for non-routine tasks.

Chemistry and synthetic methods development is central to successful chemical biology, drug discovery, materials science and engineering research programs1,2. The identification of appropriate synthetic procedures requires expert chemistry knowledge3, but may still lead to suboptimal methods and yields. Moreover, designing experiments towards optimal reaction conditions remains a largely irreproducible and non-deterministic task4. Thus, technologies that leverage deterministic processes while facilitating and streamlining the identification of optimal reaction conditions will assist future discovery chemistry. Ultimately, such technologies may afford chemical matter previously deemed intractable in amounts suitable for downstream studies, while also providing formalized reactivity insights.

Artificial intelligence is reshaping how science is carried out5,6 and its application to automate laboratories is expected to accelerate drug development6-13. Despite being an enabling technology in chemistry14-16, current machine learning implementations rely on massive datasets coupled to complicated algorithms and difficult-to-interpret benchmark statistics. Also, given the perceived need for expert chemistry knowledge to effectively tackle non-routine tasks, algorithms have seen their applicability curtailed by organic chemists3,17. Indeed, these hurdles have kept the optimization of chemical reaction conditions a challenge with no universal and automated solution available and only a few applications reported. For example, reaction feasibility can be predicted by classifiers18, and substrate scoping, i.e. determining which building blocks react under certain fixed conditions, is predictable by leveraging brute-force reaction screening data19,20. Similarly, discovery of new chemical reactions has been automated21, albeit using thousands of data points to teach a machine. These approaches are thus reserved to select researchers with the technical means for conducting and analyzing hundreds to thousands of reactions in parallel, and do not allow optimizing reaction conditions towards maximized product yields. To that end, deep learning using several thousand probability density functions for simulated data pre-training has been employed to optimize only three reaction parameters over 40 iterations15. Likewise, a “black box” algorithm was recently integrated into a flow chemistry apparatus to enable fast feedback loops. With said algorithm, up to four reaction variables were optimized in flow for multiple chemical reactions using 30–60 iterations22. Ideally, computer-driven reaction optimization routines should allow optimizing a larger number of reaction parameters without the need for specific computational or chemical hardware.

Herein, we report the development and application of a self-evolving machine intelligence (LabMate.AI) that models reaction conditions while being agnostic to the identity of the chemical transformation. By running this tool on a personal computer we provide proof-of-concept for the condition optimization of two distinct and pharmacologically relevant chemistries: one well-studied Ugi reaction and a C–N cross coupling that had proven challenging to optimize by the pharmaceutical industry. Our self-evolving machine intelligence formalizes chemical intuition and contrasts with methods built from data and algorithms15,19-21 of limited practicality and accessibility, i.e. it relies on an easily accessible volume of data and a traceable decision tree-based algorithm. Although LabMate.AI uses “nano data” (<0.05% of search space) to navigate the reaction condition space, only an additional 5–10 iterations/reactions were required to efficiently model up to eight condition parameters, design an optimal synthetic method and unveil new insights into a C–N cross coupling. These operational features provide substantial improvements over the current state-of-the-art. Crucially, the algorithm’s performance was significantly superior to that of four human chemists, supporting the power and promise of driverless machine intelligence in future organic synthesis, either as standalone tools or integrated with robots.

Results and discussion

Architecture of a self-evolving machine intelligence for chemistry optimization. Using our software (Fig. 1), we endeavored to counter the abovementioned reproducibility, interpretability, data access, optimization speed and applicability domain limitations, by analyzing and providing a validation for its reaction optimization concept. LabMate.AI uses adaptive random forest models to navigate the search space. Crucially, it requires only ten data points as a starting package, i.e. 100–200 fold less data than previous methods21,23, to build a crude model from which a new experiment is suggested. Neither prior assumptions nor pre-training are required. Considering that each selected reaction is informative, the ideal random forest parameters change dynamically. To adapt, LabMate.AI creates the best machine learning method on its own, i.e. it evolves in an autonomous and stepwise fashion. In doing so, it mimics on-the-fly learning24 by synthetic chemists to efficiently detect patterns in small data, and obtain increasingly better models and predictions. To the best of our knowledge, LabMate.AI pioneers “nano-data”-enabled, self-evolving machine intelligence inspired by chemistry practice for reaction optimization. We also implemented different prioritization strategies to evaluate efficiency and speed in identifying optimized reaction conditions. LabMate.AI was endowed with naïve curiosity to explore the multidimensional space, by selecting reaction conditions least understood by the model (high predictive variance), irrespective of the predicted reaction outcome. In doing so, we recognized that “nano data” could be insufficient for an acceptable understanding of chemical reactivity, hence the selective sampling of informative data. LabMate.AI then adopts a balanced approach, selecting the least confident of the perceived top-10 high-yielding reactions. Conversely, another version of the software was built to obtain high-yielding reactions directly from the information enclosed in “nano data” by using a “greedy”/exploitative approach.

Fig. 1 Workflow for reaction optimization. Reactions are performed with computationally-suggested conditions and subsequently assessed by liquid chromatography–mass spectrometry (LC-MS). The area under the curve for the required m/z peak is used as a proxy for the reaction conversion and its numerical value as the objective target. The reaction conditions are used as features for LabMate.AI training. LabMate.AI uses random forests (RF) to generate models that rationalize the currently available data and suggest a new condition set. The software is autonomous and re-trainable after each iteration; with the additional data, RF hyperparameters are optimized using cross-validation. LabMate.AI runs for 5–10 minutes for a full cycle of re-training and prediction on a single personal computer (e.g. Mac Pro and mid-range MacBook Pro).

Validation of LabMate.AI using Ugi chemistry. As proof-of-concept, we selected a Ugi three-component reaction that affords a privileged structure25 in drug discovery, imidazopyridines (Fig. 2a). Despite Ugi reactions having a high substrate scope and reaction condition tolerance, obtaining good conversions (>50%) is not straightforward26, given the multiple variables that must be optimized simultaneously. The area under the curve for the required product in LC-MS traces was used as a proxy for the reaction conversion and, therefore, as target value for LabMate.AI. As an initial training set, we provided LabMate.AI with only 10 random conditions, representing a minute amount (10/27,000 or ~0.04%) of the vast multidimensional search space, here compressed to 27,000 discrete combinations, which sharply contrasts with both “big data” and a recent active learning21 study. Those 10 random reactions provided a range of conversions, yet only suboptimal conditions (Fig. 2b). This underscores the complexity of finding optimal reaction conditions and that random condition selection is impractical to obtain high yields. Between iterations 1–10 (red) LabMate.AI informatively explored the reactivity space, selecting conditions affording no product and others yielding the imidazopyridine with good conversion. With the generated information, LabMate.AI was then able to optimize chemistry in stepwise fashion. A similar outcome was obtained via an exploitation approach (blue) with the benefit of minimizing synthetic effort, i.e. number of performed reactions, compared to the more explorative software counterpart. With only five trials optimal reaction conditions were achieved. The latter result is surprising as the use of LabMate.AI for out-of-sample predictions, i.e. out of its domain of applicability, led to the swift identification of productive conditions, against our expectations. Moreover, the result suggests that the initially provided small set of random conditions (“nano data”), with varying but low yields, still encodes the blueprints for a successful reaction outcome.

With the collected data in hand, we then analyzed the behavior of both software tools. Using dimensionality reduction to visualize the trajectory of picked conditions in the experimental space, one can conclude that the exploitative LabMate.AI approach exclusively selects conditions from within two islets with identical reaction outcomes, whereas the curiosity-driven selection method probes different regions in feature space, as originally desired (Fig. 2c). It is apparent that the diverging selection strategies impact the fate of the suggested experiments. Nonetheless, equivalent optimal conditions are obtained while implementing conceptually different search strategies, yet in different timeframes. The self-evolution of optimized random forest parameters and predictive model architecture is also contrasting: while the explorative selection method more drastically modifies the model architecture, the exploitative counterpart is essentially conservative, possibly due to a narrower view on the reaction optimization problem (Fig. 2d). Importantly, as a control experiment and to probe the accuracy of the method, we confirmed that conditions predicted to yield no product were indeed experimentally unable to afford the imidazopyridine (cf. Supplementary Information), suggesting a broad applicability domain for the model.

Next, we studied whether simpler prediction methods could have performed equally well. Irrespective of the selection approach, the predictive performance of LabMate.AI is superior to that of linear regression methods, as assessed by different metrics (cf. Supplementary Information), which fully motivates the use of adaptive random forests for chemistry optimization problems. Also, the preference for a balanced, i.e. explorative/exploitative, or exploitative LabMate.AI depends on the goal. Whereas the balanced method can accurately predict reaction outcomes, the exploitative approach is less accurate in its predictions. However, the latter is able to correctly rank-order well-performing reactions and will be preferred for short optimization cycles (cf. Supplementary Information).
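The "nano data" fraction quoted above (10 of 27,000 conditions) can be illustrated with a short sketch. The parameter names and level counts below are hypothetical, chosen only so that the grid contains 27,000 combinations; the actual ranges are those depicted in Fig. 2a.

```python
import itertools
import random

# Hypothetical discretization (an assumption, not the paper's grid):
# six parameters whose level counts multiply to 27,000.
levels = {
    "param_a": range(10),
    "param_b": range(10),
    "param_c": range(10),
    "param_d": range(3),
    "param_e": range(3),
    "param_f": range(3),
}

# Enumerate every discrete combination of reaction conditions.
search_space = list(itertools.product(*levels.values()))

# Draw ten distinct random conditions as the initial training set.
rng = random.Random(0)
initial_set = rng.sample(search_space, 10)

# Fraction of the search space covered by the starting package.
fraction = len(initial_set) / len(search_space)
print(f"{len(search_space)} conditions, {fraction:.2%} sampled initially")
```

Enumerating the full grid is cheap at this size, which is what allows a model to score every untested condition after each iteration.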

Figure 2. Self-evolving machine intelligence optimizes Ugi chemistry. a, Studied chemical reaction. Conditions were selected within the depicted range. b, Ten reactions were performed to afford an initial model. Reaction conditions were selected one at a time, according to different criteria: i) exploration of the reaction condition space (iterations 1–10, red), and exploitation of the most promising region (iterations 11–20, red); ii) full exploitation (iterations 1–10, blue). c, Projection of the multidimensional search space using the t-distributed stochastic neighborhood embedding (t-SNE) learning algorithm. Background depicts conditions’ density within the search space. White dots: random reactions; Blue dots: exploitative approach reactions. Red dots: explorative reactions of the balanced approach. Color gradients mirror the iteration number. d, t-SNE of the optimized hyperparameters for LabMate.AI, which is fully re-trainable, providing updated parameters and models for improved performance. Color gradient shows the unsupervised, self-evolution of LabMate.AI. Model instances are labeled.

We then compared the performance of exploitative LabMate.AI to that of four researchers (an MSc without experience in organic synthesis, and three experienced PhD-level organic chemists) in a double-blind setup. Descriptors were scaled and randomized for Researchers 1–3 to prevent identification of the variables and thereby avoid bringing organic chemistry knowledge into play that could bias the optimization process according to previous experience. For Researcher 4 (PhD level) a fully identified data table was made available, comprising also real-value descriptors, to establish a real-world comparison. Surprisingly, in this narrow test, the software appeared to be more efficient than any of the researchers at optimizing this Ugi reaction over ten iterations (Fig. 3a). Not only are the curves between LabMate.AI and the researchers significantly different (p=0.002, n=4–6, Welch’s t-test), but also the best-performing reaction conditions suggested by LabMate.AI afford a significantly better outcome than the researchers’ counterparts (p=0.001–0.010, n=3–5, Welch’s t-test). Notably, granting additional reactions to Researcher 1 showed no benefits in the identification of optimized conditions. These results support that optimizing reaction conditions is a pattern recognition problem accessible to an automated machine learning platform and that true chemical knowledge may, in some instances, not be strictly necessary. Indeed, Researcher 4, who was provided with the non-normalized and thereby identifiable parameters, did not perform significantly differently from the other subjects. Similarly, Researcher 1, with no in-depth chemistry education, was able to find productive reaction conditions. Altogether, the driverless evolution of machine intelligence was at least as competent as human intuition for the identification of patterns in nano-sized datasets. Despite these highly encouraging results, further evaluations on larger sets of human experts may be required to statistically validate trends.

To rationalize how LabMate.AI navigates the chemical reaction space, we calculated the Euclidean distances between conditions for a given iteration against conditions of its predecessors. The results show that the “condition hop” in the human intuition-driven optimization is statistically indistinguishable from that of the exploitative LabMate.AI approach (p>0.10, n=10, unpaired one-way ANOVA with Dunnett’s test, Fig. 3b), while the explorative strategy changed conditions more drastically. Thus, our data support that, while using nano-sized datasets, informed decisions by learning algorithms may resemble chemical intuition. Also, the data-driven yet chemically-naïve selection by algorithms can provide an important advantage if short optimization cycles are required.
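The “condition hop” analysis can be sketched as follows. This is a minimal illustration under the assumption that each condition is a vector of scaled features; the function name `condition_hops` and the toy trajectory are hypothetical and not taken from the LabMate.AI code.

```python
import math

def condition_hops(conditions):
    """Average Euclidean distance from each condition to all earlier ones."""
    hops = []
    for i in range(1, len(conditions)):
        # Distances between iteration i and every predecessor.
        dists = [math.dist(conditions[i], prev) for prev in conditions[:i]]
        hops.append(sum(dists) / len(dists))
    return hops

# Toy trajectory of three conditions in a two-dimensional feature space.
trajectory = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
hops = condition_hops(trajectory)
print(hops)
```

Larger values indicate a more drastic change of conditions between iterations, which is the behavior reported for the explorative strategy.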

To further tap human intuition, we fitted different learning algorithms to data generated by all researchers (cf. Supplementary Information) to reveal that, in most cases, machine intelligence can interpret the selected reactions. This is true with random forests, from which we extracted feature importance ranks in the selection process (Fig. 3c). Strikingly, the catalyst amount was generally perceived, and confirmed by the researchers, as the most important variable for optimizing the Ugi reaction, whereas the pyridine amount was the least important. While Researchers 1–3 appear to rank each feature differently, there was good agreement between LabMate.AI and Researcher 4 (57% feature rank match). The result is noteworthy, as Researcher 4 had access to real-valued descriptors and feature labels for the optimization process, which suggests that LabMate.AI obtained comparable chemical expertise, albeit from scratch and over only five iterations.
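One plausible way to compare two feature rankings is sketched below. The text does not spell out how the rank match and the Euclidean distances of Fig. 3c were computed, so both functions are assumptions. The two rank vectors are hypothetical, chosen so that four of seven positions agree.

```python
import math

def rank_match(ranks_a, ranks_b):
    """Fraction of features that receive the same rank in both lists."""
    matches = sum(a == b for a, b in zip(ranks_a, ranks_b))
    return matches / len(ranks_a)

def rank_distance(ranks_a, ranks_b):
    """Euclidean distance between two rank vectors."""
    return math.dist(ranks_a, ranks_b)

# Hypothetical importance ranks for seven features (1 = most important).
model_ranks = [1, 2, 3, 4, 5, 6, 7]
chemist_ranks = [1, 2, 3, 4, 6, 7, 5]

match = rank_match(model_ranks, chemist_ranks)
distance = rank_distance(model_ranks, chemist_ranks)
print(f"match: {match:.0%}, distance: {distance:.2f}")
```

Under these assumptions, a lower distance and a higher match both indicate closer agreement between the algorithm and a researcher.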

Figure 3. Active machine intelligence is more efficient at optimizing chemical reactions than human intuition. a, Researchers optimize reaction conditions in stepwise fashion, but never reach the level of optimization achieved by the algorithm. Best-performing reactions (average value ± confidence interval 95%) – LabMate.AI: 100±3%, n=5; Researcher 1: 90±1%, n=3; Researcher 2: 91±2%, n=3; Researcher 3: 90±3%, n=3; Researcher 4: 88±3%. p=0.001 (Researcher 1), p=0.004 (Researcher 2), p=0.010 (Researcher 3), p=0.001 (Researcher 4), Welch’s t-test. Iteration 8 of LabMate.AI was not reproducible and was thus considered an outlier. b, Distribution of the average Euclidean distances calculated between a given reaction and all previous iterations. The exploitative LabMate.AI performs “condition hops” similar to Researchers (p>0.10). The explorative LabMate.AI selects more dissimilar conditions. p<0.0001, n=10, unpaired one-way ANOVA, Dunnett’s test. c, Heatmap of feature importance extracted from random forests fitted to reactions selected by exploitative LabMate.AI/human intuition. Euclidean distances between exploitative LabMate.AI and Researchers 1–4: 6.48 (14% match); 4.24 (43% match); 3.16 (43% match) and 2.45 (57% match), respectively.
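The Welch’s t-tests quoted for Fig. 3a compare two groups of replicates without assuming equal variances. The sketch below computes the test statistic and the Welch–Satterthwaite degrees of freedom with the standard library only; the replicate values are hypothetical, not the measured conversions. Obtaining a p-value additionally requires the t-distribution, for example through SciPy's `scipy.stats.ttest_ind` with `equal_var=False`.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard errors, using the unbiased sample variances.
    se_a = statistics.variance(sample_a) / len(sample_a)
    se_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t, df

# Hypothetical replicate conversions (%), for illustration only.
algorithm_runs = [98, 100, 102, 99, 101]
researcher_runs = [89, 90, 91]

t_stat, dof = welch_t(algorithm_runs, researcher_runs)
print(f"t = {t_stat:.2f}, df = {dof:.2f}")
```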

LabMate.AI affords new insight into a C–N cross-coupling. As a more challenging test, we applied exploitative LabMate.AI to a C–N (Buchwald-Hartwig) cross-coupling (Fig. 4a)27, a relevant transformation identified as an outstanding challenge in drug discovery28. Replicates of the optimal reaction conditions from the literature consistently afforded a mixture of regioisomers, as reported27. These results served as control. Randomly sampling only 0.03% of the reaction condition space (Fig. 4a,b) offered a sparse dataset for LabMate.AI training (Fig. 4c). Conversion rates were on average low for those reaction conditions (60% of control), with two of them yielding almost no product and one affording the required C2–N product with conversion identical to that of the previously published (control) conditions. Using this data, the software was again able to gradually optimize the cross-coupling. At its peak, the algorithm suggested a protocol that provided an improved conversion of 140% relative to the literature control (p=0.0008, Fig. 4b). Moreover, the computer-optimized protocol was more base- and time-economical than the reported conditions, in both cases requiring half the amounts of the control protocol.

As observed for the Ugi chemistry, LabMate.AI evolved with each reaction. The analysis of the feature/parameter importance over the whole iterative experimentation and optimization revealed that the base amount steadily grew as the most important feature for building predictive decision trees. Conversely, reaction time and palladium catalyst amount were less informative in distinguishing good- from poor-yielding reactions. Importantly, the importance of some parameters, such as solvent amount and reaction temperature, changed dynamically during model evolution, increasing after some experiments and decreasing after others. A similar dynamic had been reported23 and highlights the adaptive character of iterative learning. Whether these findings apply to other C–N couplings remains a matter of study.

Intrigued by this result, which opposed our personal understanding of the most important parameters for this type of chemical transformation, we surveyed 38 independent organic chemistry experts from top universities and industry in Europe and the USA, asking them to assign a feature/parameter importance for this reaction based on their chemical intuition. The results clearly show that LabMate.AI has an orthogonal vantage point to all surveyed scientists with regard to the most important feature (amount of Cs2CO3), but is in large agreement with respect to the importance of xantphos, chloropyridine and amine amounts (cf. Supplementary Information), once again suggesting that expert chemical knowledge was acquired by LabMate.AI. Indeed, our algorithm indirectly learned that a high concentration of xantphos is associated with an increased amount of Pd(xantphos)2, which is detrimental for the reaction conversion rate due to its low activity as pre-catalyst and high insolubility in dioxane29.

Pairing of the subjects’ answers to the LabMate.AI feature ranks shows a ≤50% match in 89% of the cases. The observed general disagreement between the surveyed experts also highlights how irreproducible reaction troubleshooting routines can be, and that several routes can be taken towards the same goal. Conversely, LabMate.AI offers a robust solution to make reaction optimization processes reproducible/deterministic while adding its unique chemical creativity and innovation to problem solving. In this particular case, LabMate.AI advocates that higher importance should be given to the base amount while optimizing this C–N coupling, a realization that was ancillary for the surveyed expert chemists – 0% answers including base amount among the top-2 most important features – but allowed improved reaction yields.

Figure 4. LabMate.AI optimizes a C–N cross-coupling reaction. a, Reaction optimized by LabMate.AI. Conditions were selected within the depicted range. Median values correspond to literature conditions. b, Optimization by exploitative LabMate.AI. Horizontal line shows conversion rate for reaction as described in the literature27. Reactions were assessed in relation to the average conversion for the literature protocol (Average value ± confidence interval 95%: 100±6%, n=4). The best suggested reaction affords a conversion rate significantly higher than the optimized literature protocol. Only the major product (C2–N coupling) was taken into account for data analyses. p=0.0008, Welch’s t-test, n=3–4. c, t-Distributed stochastic neighborhood embedding (t-SNE) of reaction condition space, showing the focused selection of reactions by LabMate.AI. Color gradient depicts iteration number, with the darkest color equaling iteration 20. d, Spider plot showing adaptive feature importance as LabMate.AI evolves. Cyan: iteration 1; Purple: iteration 10.

Conclusions

Machine intelligence to enable sustainable and informed synthetic chemistry is of high value and widespread interest. Still, applications leveraging “big data”, cryptic algorithms and descriptors, and a lack of comparison to human performance may hinder its maturation.

Active learning remains underexplored30, despite its usefulness in design of experiments31-35. We23,36 and others21,37,38 had previously employed 5–10% of all available data in active learning applications to obtain proficient, yet relatively inflexible models. In some cases, models were built with 500–1000 data points,21 which is not practical if quality data is not readily accessible. Here, we have shown that a self-evolving method employing “nano data”, simple, yet motivated descriptors and an interpretable algorithm can acquire chemical knowledge to navigate uncharted reaction condition spaces, identify optimized reaction conditions, predict conversions for relevant chemistry and provide reactivity insights. Our autonomous learning approach is agnostic to the identity of the modeled reaction, thus it may be applied to any chemical transformation. Furthermore, it is orthogonal to the big data requirement dogma for successful machine learning deployment, and the need of expert knowledge for chemistry optimization. LabMate.AI can be at least as proficient and inventive as expert human chemists, thus opening new research avenues. This does not refute the importance of true chemical expertise, but shows that automated pattern recognition can afford an alternative path to rapidly identify optimized reaction conditions. We expect that active learning will find broad applicability in accelerating discovery chemistry, democratizing chemical syntheses, eliminating non-informative experiments, minimizing reagent feedstocks and freeing chemists for non-routine tasks.

Methods

LabMate.AI. LabMate.AI uses Random Forest (RF) regressors and exhaustively optimizes hyperparameters [number of trees (100–1000), tree depth (none, 2, 4) and number of features (auto or sqrt)] to build a prediction model that is subjected to 10-fold cross-validation. In total, >600,000 decision trees are screened and a prediction variance is calculated from the final/best RF model. This model is then used to predict a conversion value for all possible reaction conditions that have not yet been tested. Based on these predictions, the next experiment is selected. An exploration approach is taken for the first 10 iterations, to allow model improvement; this is achieved by selecting the conditions whose reaction output prediction has the highest variance. For the following iterations (11–20), an exploitative (greedy) approach is pursued to optimize the target value of the studied chemical reaction. This is carried out through distinct approaches:

If maximum target value (iterations 11–20) ≥ 4 × maximum target value (iterations 1–10):

Select reaction with lowest variance among the predicted top-5 high yielding reactions.

If maximum target value (iterations 11–20) < 4 × maximum target value (iterations 1–10):

Select reaction with the highest variance among the predicted top-10 high yielding reactions.
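The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the LabMate.AI source: the function name `select_next`, its arguments and the tuple layout of `candidates` are hypothetical, and the caller is assumed to supply the two maxima being compared (the text does not spell out how they are tracked).

```python
def select_next(candidates, iteration, best_early, best_late,
                n_explore=10, ratio=4.0):
    """Pick the next reaction condition.

    candidates : list of (condition_id, predicted_mean, predicted_variance)
                 for all untested conditions (hypothetical layout).
    best_early : maximum target value observed in iterations 1-10.
    best_late  : maximum target value observed so far in iterations 11-20,
                 or None if no such experiment has been run yet.
    """
    if not candidates:
        raise ValueError("no untested conditions left")
    if iteration <= n_explore:
        # Exploration: take the most uncertain prediction overall.
        return max(candidates, key=lambda c: c[2])[0]
    # Exploitation: rank by predicted target value, highest first.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if best_late is not None and best_late >= ratio * best_early:
        # Large improvement already seen: lowest variance among the top 5.
        return min(ranked[:5], key=lambda c: c[2])[0]
    # Otherwise keep some exploration: highest variance among the top 10.
    return max(ranked[:10], key=lambda c: c[2])[0]
```

Treating a missing `best_late` as "threshold not reached" is a design choice of this sketch, so that iteration 11 falls back to the more explorative top-10 branch.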

Alternatively, LabMate.AI follows only a greedy approach by selecting the reaction with lowest variance among the predicted top-5 high yielding reactions, i.e. without any explorative component in selection. The LabMate.AI software evolves with each added data point by refining its predictive model through full re-training, which involves hyperparameter selection, model fitting, and updating predictions on all remaining conditions. The LabMate.AI software and data analyses were fully implemented in Python 2.7.10 using the NumPy 1.11.3, Pandas 0.19.2 and Scikit-learn 0.18.1 libraries, and were run (5–10 minutes) on an Apple Mac Pro machine (3.5 GHz 6-core processor, 32 GB RAM).
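One full re-training step, as described above, might look like the following sketch with a current scikit-learn release. It is an illustration under assumptions rather than the authors' code: `fit_and_score` and its arguments are hypothetical names, the default grid is much smaller than the one in the text, `max_features=1.0` stands in for the `auto` setting (which recent scikit-learn versions no longer accept), and the prediction variance is taken across the individual trees of the best forest, which is one plausible reading of "prediction variance".

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


def fit_and_score(X_train, y_train, X_pool, param_grid=None, seed=0):
    """Grid-search a random forest, then predict every untested condition.

    Returns per-condition mean prediction, per-condition variance across
    trees, and the selected hyperparameters.
    """
    if param_grid is None:
        # Reduced illustrative grid; the text scans 100-1000 trees.
        param_grid = {
            "n_estimators": [100, 200],
            "max_depth": [None, 2, 4],
            "max_features": [1.0, "sqrt"],
        }
    search = GridSearchCV(
        RandomForestRegressor(random_state=seed),
        param_grid,
        scoring="neg_mean_absolute_error",  # MAE guides model selection
        cv=min(10, len(y_train)),           # 10-fold cross-validation
    )
    search.fit(X_train, y_train)
    forest = search.best_estimator_
    # One row of predictions per tree, one column per pool condition.
    per_tree = np.stack([tree.predict(X_pool) for tree in forest.estimators_])
    return per_tree.mean(axis=0), per_tree.var(axis=0), search.best_params_
```

The returned means and variances are the two quantities the selection rules operate on; after each new experiment the function would simply be called again on the enlarged training set.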

Miscellaneous machine learning. For control of LabMate.AI performance, linear regression, ridge, elastic net and lasso models were computed with default settings. The selection of best LabMate.AI hyperparameters was guided by the calculation of mean absolute errors (MAE). Complementary metrics, such as the mean squared error (MSE) and coefficient of determination (r2) were calculated in all cases to further scrutinize the models’ quality.
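For reference, the three metrics named above can be written out explicitly. The sketch below uses the standard textbook definitions in plain Python; the function name `regression_metrics` is hypothetical, and the paper itself presumably relied on library implementations.

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, r2) for two equally long sequences of numbers."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n   # mean absolute error
    ss_res = sum(r * r for r in residuals)     # residual sum of squares
    mse = ss_res / n                           # mean squared error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("r2 is undefined when all true values are equal")
    r2 = 1.0 - ss_res / ss_tot                 # coefficient of determination
    return mae, mse, r2
```

For example, true values `[1, 2, 3, 4]` and predictions `[1, 2, 3, 6]` give MAE = 0.5, MSE = 1.0 and r2 = 0.2.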

To map human reasoning, different machine learning algorithms, including RFs, Support Vector Machines (SVMs) and deep feed-forward Neural Networks (NN), were employed. The best combination of hyperparameters [SVM: C = 0.1–1; epsilon = 0.001–0.01; kernel = rbf/poly; NN: alpha = 0.0001/0.001; iterations = 200/500/1000; activation = logistic/tanh; solver = lbfgs/sgd; learning rate = constant/adaptive; hidden layers = 1–3 layers with 1–3 neurons] was exhaustively searched for in every case through 10-fold cross-validation. t-Distributed Stochastic Neighbor Embedding (t-SNE, learning rate = 900) was employed to visualize and monitor the model evolution over time and the reaction condition selection process. All machine learning methods were implemented in Python 2.7.10 using the NumPy 1.11.3, Pandas 0.19.2 and Scikit-learn 0.18.1 libraries. Data were plotted with Matplotlib 1.5.3 or Seaborn 0.7.1 and statistics computed with SciPy 0.18.1.

Chemistry. For each chemistry type, ten random reactions were carried out in parallel and analyzed through LC-MS with the goal of providing initial data for learning. The area under the curve (AUC) for the reaction product was used as proxy for the reaction conversion. This data was made available to LabMate.AI and human researchers. Based on this data, machine learning and human researchers selected one additional set of conditions, the synthesis was performed at these conditions by the same, independent researcher, and the yield was evaluated using the same LC-MS protocol. The outcome for the selected experiment was reported only to the researcher / machine learning method that had selected those conditions. Based on this data, another condition was selected, until a total of 10 additional experiments were performed.
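The experimental protocol above is a sequential active-learning loop, which can be sketched generically as follows. All names here (`active_learning_loop`, `run_experiment`, `propose`) are hypothetical placeholders: `run_experiment` stands for carrying out the synthesis and LC-MS measurement, and `propose` stands for either LabMate.AI or a human researcher choosing the next condition.

```python
import random


def active_learning_loop(pool, run_experiment, propose,
                         n_init=10, n_iter=10, seed=0):
    """Run n_init random experiments, then n_iter sequentially chosen ones.

    pool           : iterable of candidate reaction conditions.
    run_experiment : callable(condition) -> measured target value.
    propose        : callable(tested, untested) -> one condition from untested,
                     where tested is a list of (condition, value) pairs.
    """
    rng = random.Random(seed)
    untested = list(pool)
    rng.shuffle(untested)
    # Initial data: randomly chosen conditions, all measured up front.
    tested = [(c, run_experiment(c)) for c in untested[:n_init]]
    untested = untested[n_init:]
    for _ in range(n_iter):
        # The selector only ever sees outcomes of experiments run so far.
        choice = propose(tested, untested)
        untested.remove(choice)
        tested.append((choice, run_experiment(choice)))
    return tested
```

Keeping the selector behind a single callable mirrors the study design, in which the algorithm and each researcher received feedback only on their own selections.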

Ugi reaction. 2-Aminopyridine (0.1–0.3 mmol) was dissolved in absolute EtOH (0.1–1.5 mL). To the solution, benzaldehyde (0.1–0.3 mmol), perchloric acid (0–10 mol%) and cyclohexyl isocyanide (0.1–0.3 mmol) were added in succession. The mixtures were allowed to react under constant shaking (500 r.p.m.) at a selected temperature (10–80 °C) and for a given period of time (5–60 minutes). After concluding the reaction, the solvent was evaporated in vacuo and the residue re-dissolved in an appropriate volume of HPLC grade acetonitrile for analysis. A target value was obtained by measuring the AUCMS+ for m/z = 292. Crystallization from acetonitrile afforded the required compound as white needles.
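To give a sense of the size of the condition space for the Ugi reaction, the sketch below enumerates a full factorial grid. The discrete levels are an assumption: they are the values that appear in the supplementary tables, not a list stated in the text, and the names `UGI_LEVELS` and `build_grid` are hypothetical.

```python
from itertools import product

# Assumed discrete levels per parameter, read off the supplementary tables.
UGI_LEVELS = {
    "pyridine_mmol": [0.1, 0.2, 0.3],
    "aldehyde_mmol": [0.1, 0.2, 0.3],
    "isocyanide_mmol": [0.1, 0.2, 0.3],
    "temperature_C": [10, 20, 40, 60, 80],
    "solvent_mL": [0.1, 0.25, 0.5, 1.0, 1.5],
    "catalyst_molpct": [0, 1, 2, 3, 4, 5, 7.5, 10],
    "time_min": [5, 10, 15, 30, 60],
}


def build_grid(levels):
    """Return every combination of parameter levels as a list of tuples."""
    names = list(levels)
    return list(product(*(levels[name] for name in names)))


grid = build_grid(UGI_LEVELS)
initial_fraction_pct = 10 / len(grid) * 100  # share covered by 10 reactions
```

Under these assumed levels the grid holds 3 × 3 × 3 × 5 × 5 × 8 × 5 = 27,000 conditions, so ten initial reactions would cover roughly 0.04% of it.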

For control, an MSc level researcher without previous experience in organic synthesis and two experienced PhD level chemists (Researchers 1–3) suggested reaction conditions without prior knowledge of reactions selected by the algorithm (i.e. double-blind setup). For a fair comparison of pattern recognition abilities between LabMate.AI and human counterparts, all three researchers were only provided with a normalized, unlabeled list of descriptors and target values corresponding to the ten random reactions. Normalized reaction parameters were selected, one at a time, with the goal of maximizing the target value. For a real-world comparison, a Principal Investigator researcher (Researcher 4) with deep experience in multi-component chemistry was provided with a fully labeled table and real descriptor values. All four researchers were provided with the outcome of their selected reaction to allow active learning and suggestion of the next best reaction condition. To avoid reaction condition selection bias due to communication, the researchers were not aware of who was involved in the study. All reactions were executed by the same researcher, who was not involved in the reaction selection process.
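The normalization shown to Researchers 1–3 is not specified in detail. One scheme that is consistent with the scaled values in the supplementary tables (for example 0.2, 0.4, 0.6, 0.8, 1 for a five-level parameter) is rank scaling, sketched below as an assumption; the function name `scale_levels` is hypothetical.

```python
def scale_levels(levels):
    """Map each distinct level to rank / number_of_levels, in (0, 1].

    The smallest level maps to 1/n and the largest to 1, so the scaled
    value carries the ordering but not the physical unit.
    """
    ordered = sorted(set(levels))
    n = len(ordered)
    return {level: (rank + 1) / n for rank, level in enumerate(ordered)}


# Example: five temperature levels in degrees Celsius.
temperature_scale = scale_levels([10, 20, 40, 60, 80])
```

With this mapping a three-level parameter becomes 1/3, 2/3 and 1, and replacing the parameter names with generic column labels then hides both units and chemical identity.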

C–N coupling. To a dried microwave vial was added ethyl 2-amino-1,3-oxazole-5-carboxylate (0.5–0.6 mmol), Cs2CO3 (1–4 mol eq.), Pd2(dba)3 (1–5 mol %) and xantphos (5–10 mol %). The vial was capped and purged with argon. 2,4-Dichloropyridine (0.5–0.6 mmol) was added via syringe, and the mixture suspended in degassed 1,4-dioxane (1–4 mL). The reaction mixture was heated under microwaves (300 W) at 140 or 160 °C for a hold time of 30–90 minutes at max stirring. After concluding the reaction, the mixture was diluted with acetonitrile to 6 mL prior to analysis. The crude product was purified via preparative TLC using dichloromethane : methanol (5%) as eluent. The required compound was obtained as an off-white powder and its C4-position regioisomer was isolated as a yellow oil. A target value was obtained by measuring the major AUCMS+ for m/z = 268.

Survey. Organic chemistry experts, including 13 graduate students, 17 postdoctoral researchers and 8 principal investigators from the University of Lisbon (Portugal), University of Cambridge (UK), Oxford Advanced Surfaces (UK), University of Vienna (Austria), University of La Rioja (Spain), Massachusetts Institute of Technology (USA) and H. Lundbeck A/S (Denmark), were asked to rank the feature importance (1: most important; 8: least important) in the Buchwald-Hartwig cross-coupling reaction, without knowledge of the LabMate.AI output (double blind). Responses were voluntary and the anonymity of respondents was ensured. The survey was approved by the iMM and MIT (COUHES protocol 1809514426).

Acknowledgements

D.R. is a Swiss National Science Foundation Fellow (Grants P2EZP3_168827 and P300P2_177833). G.J.L.B. is a Royal Society URF and recipient of an ERC StG (TagIt). T.R. is a Marie-Sklodowska Curie Fellow (Grant 743640). T.R. acknowledges the H2020 (TWINN-2017 ACORN, Grant 807281) and FCT/FEDER (02/SAICT/2017, Grant 28333) for funding. The authors are extremely grateful to C. Oliveira, Dr. J. Seixas, Dr. H. Vila-Real and Dr. N.R. Candeias for suggesting Ugi reaction conditions, and to Prof. R. Langer and Prof. G. Traverso who provided invaluable comments on the research and manuscript. The authors are indebted to Prof. R. Moreira for access to the CEM microwave reactor, and the 13 graduate students, 17 postdoctoral researchers and 8 principal investigators across Austria, Denmark, Portugal, Spain, UK and the USA who took part in the survey. We thank R. Rodrigues for help in producing Figure 1. The survey was approved by the iMM and MIT (COUHES protocol 1809514426).

References

1 MacCoss, M. & Baillie, T. A. Organic chemistry in drug discovery. Science 303, 1810-1813 (2004).

2 Whitesides, G. M. Reinventing chemistry. Angew. Chem. Int. Ed. 54, 3196-3209 (2015).

3 Baran, P. S. Natural product total synthesis: as exciting as ever and here to stay. J. Am. Chem. Soc. 140, 4751-4755 (2018).

4 Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452-454 (2016).

5 Jordan, M. I. & Mitchell, T. M. Machine learning: trends, perspectives, and prospects. Science 349, 255-260 (2015).

6 Chen, H., Engkvist, O., Wang, Y., Olivecrona, M. & Blaschke, T. The rise of deep learning in drug discovery. Drug Discov. Today 23, 1241-1250 (2018).

7 Schneider, G. Automating drug discovery. Nat. Rev. Drug Discov. 17, 97-113 (2018).

8 Lehmann, J. W., Blair, D. J. & Burke, M. D. Toward generalization of iterative small molecule synthesis. Nat. Rev. Chem. 2, 0115 (2018).

9 Trobe, M. & Burke, M. D. The molecular industrial revolution: automated synthesis of small molecules. Angew. Chem. Int. Ed. 57, 4192-4214 (2018).

10 Li, J. et al. Synthesis of many different types of organic small molecules using one automated process. Science 347, 1221-1226 (2015).

11 Roch, L. M. et al. ChemOS: orchestrating autonomous experimentation. Sci. Robot. 3, eaat5559 (2018).

12 Henson, A. B., Gromski, P. S. & Cronin, L. Designing algorithms to aid discovery by chemical robots. ACS Cent. Sci. 4, 793-804 (2018).

13 Chow, S., Liver, S. & Nelson, A. Streamlining bioactive molecular discovery through integration and automation. Nat. Rev. Chem. 2, 174-183 (2018).

14 Gomez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268-276 (2018).

15 Zhou, Z., Li, X. & Zare, R. N. Optimizing chemical reactions with deep reinforcement learning. ACS Cent. Sci. 3, 1337-1344 (2017).

16 Segler, M. H. S., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604-610 (2018).

17 Ley, S. V. The engineering of chemical synthesis: humans and machines working in harmony. Angew. Chem. Int. Ed. 57, 5182-5183 (2018).

18 Raccuglia, P. et al. Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73-76 (2016).

19 Ahneman, D. T., Estrada, J. G., Lin, S., Dreher, S. D. & Doyle, A. G. Predicting reaction performance in C-N cross-coupling using machine learning. Science 360, 186-190 (2018).

20 Nielsen, M. K., Ahneman, D. T., Riera, O. & Doyle, A. G. Deoxyfluorination with sulfonyl fluorides: navigating reaction space with machine learning. J. Am. Chem. Soc. 140, 5004-5008 (2018).

21 Granda, J. M., Donina, L., Dragone, V., Long, D. L. & Cronin, L. Controlling an organic synthesis robot with machine learning to search for new reactivity. Nature 559, 377-381 (2018).

22 Bedard, A. C. et al. Reconfigurable system for automated optimization of diverse chemical reactions. Science 361, 1220-1225 (2018).

23 Reker, D., Schneider, P. & Schneider, G. Multi-objective active machine learning rapidly improves structure–activity models and reveals new protein–protein interaction inhibitors. Chem. Sci. 7, 3919-3927 (2016).

24 Jacobsen, T. L., Jorgensen, M. S. & Hammer, B. On-the-fly machine learning of atomic potential in density functional theory structure optimization. Phys. Rev. Lett. 120, 026102 (2018).

25 Yet, L. Privileged structures in drug discovery: medicinal chemistry and synthesis. (1st Ed.) John Wiley & Sons Inc., Hoboken, NJ, USA (2018).

26 Reutlinger, M., Rodrigues, T., Schneider, P. & Schneider, G. Combining on-chip synthesis of a focused combinatorial library with computational target prediction reveals imidazopyridine GPCR ligands. Angew. Chem. Int. Ed. 53, 582-585 (2014).

27 Noonan, G. M., Dishington, A. P., Pink, J. & Campbell, A. D. Studies on the coupling of substituted 2-amino-1,3-oxazoles with chloro-heterocycles. Tetrahedron Lett. 53, 3038-3043 (2012).

28 Blakemore, D. C. et al. Organic synthesis provides opportunities to transform drug discovery. Nat. Chem. 10, 383-394 (2018).

29 Klingensmith, L. M., Strieter, E. R., Barder, T. E. & Buchwald, S. L. New insights into xantphos/Pd-catalyzed C-N bond forming reactions: a structural and kinetic study. Organometallics 25, 82-91 (2006).

30 Reker, D. & Schneider, G. Active-learning strategies in computer-assisted drug discovery. Drug Discov. Today 20, 458-465 (2015).

31 Dragone, V., Sans, V., Henson, A. B., Granda, J. M. & Cronin, L. An autonomous organic reaction search engine for chemical reactivity. Nat. Commun. 8, 15733 (2017).

32 Duros, V. et al. Human versus robots in the discovery and crystallization of gigantic polyoxometalates. Angew. Chem. Int. Ed. 56, 10815-10820 (2017).

33 Yoshida, M. et al. Using evolutionary algorithms and machine learning to explore sequence space for the discovery of antimicrobial peptides. Chem 4, 533-543 (2018).

34 Häse, F., Roch, L. M., Kreisbeck, C. & Aspuru-Guzik, A. Phoenics: a Bayesian optimizer for chemistry. ACS Cent. Sci. 4, 1134-1145 (2018).

35 Häse, F., Roch, L. M. & Aspuru-Guzik, A. Chimera: enabling hierarchy based multi-objective optimization for self-driving laboratories. Chem. Sci., DOI: 10.1039/C8SC02239A (2018).

36 Reker, D., Schneider, P., Schneider, G. & Brown, J. B. Active learning for computational chemogenomics. Future Med. Chem. 9, 381-402 (2017).

37 Warmuth, M. K. et al. Active learning with support vector machines in the drug discovery process. J. Chem. Inf. Comput. Sci. 43, 667-673 (2003).

38 Ahmadi, M., Vogt, M., Iyer, P., Bajorath, J. & Frohlich, H. Predicting potent compounds via model-based global optimization. J. Chem. Inf. Model. 53, 553-559 (2013).

Supplementary Information

Evolving and nano data enabled machine intelligence for chemical reaction optimization

Correspondence should be addressed to: G.J.L.B.: [email protected] T.R.: [email protected]

Table of contents

1. Experimental section
1.1 Chemistry
1.1.1 General considerations
1.1.2 Synthesis of N-cyclohexyl-2-phenylimidazo[1,2-a]pyridin-3-amine
1.1.3 Synthesis of ethyl 2-((4-chloropyridin-2-yl)amino)oxazole-4-carboxylate
2. Supplementary data
2.1 LabMate.AI architecture
2.2 Ugi chemistry
2.3 C–N coupling
2.4 1H NMR data
2.5 Survey
3. References

1. Experimental section

1.1 Chemistry

1.1.1 General considerations

Building blocks and solvents were purchased from Sigma Aldrich, Alfa Aesar, Fluka or TCI Deutschland and used without further purification. Proton nuclear magnetic resonance (1H NMR) spectra were recorded on a Bruker AVANCE 300 MHz spectrometer. All chemical shifts are quoted on the δ scale, in ppm, using solvent peaks as internal references. Coupling constants (J) are reported in Hz with the following splitting abbreviations: s = singlet, d = doublet, t = triplet, dd = doublet of doublets, td = triplet of doublets, m = multiplet. Reaction output values were measured in a Waters Acquity LC-MS machine using an elution gradient of 5-100% acetonitrile over 12 minutes [H2O (+0.1% formic acid) : acetonitrile (+0.01% formic acid)]. All Ugi reactions were carried out in 1.5 mL Eppendorf tubes in a VWR Thermal Shake Lite. Microwave-assisted syntheses were carried out in a CEM Discover reactor.

1.1.2 Synthesis of N-cyclohexyl-2-phenylimidazo[1,2-a]pyridin-3-amine

1H NMR (300 MHz, Acetone-d6): δ 1.20-1.50 (5H, m, CH2), 1.60-2.00 (5H, m, CH2), 3.09 (1H, m, CH), 3.80 (1H, d, J = 4.8 Hz, NH), 6.95 (1H, td, J = 6.9 and 1.2 Hz, Ar-H), 7.29 (1H, ddd, J = 9.0, 6.6 and 1.2 Hz, Ar-H), 7.40-7.47 (1H, m, Ar-H), 7.50-7.63 (3H, m, Ar-H), 8.27-8.35 (3H, m, Ar-H) ppm.

1.1.3 Synthesis of ethyl 2-((4-chloropyridin-2-yl)amino)oxazole-4-carboxylate

1H NMR (300 MHz, DMSO-d6): δ 1.28 (3H, t, J = 7.2 Hz, CH3), 4.27 (2H, q, J = 7.2 Hz, CH2), 7.13 (1H, dd, J = 5.4 and 1.8 Hz, Ar-H), 8.09 (1H, d, J = 1.8 Hz, Ar-H), 8.25 (1H, d, J = 5.4 Hz, Ar-H), 8.44 (1H, s, Ar-H), 11.35 (1H, br.s, NH) ppm.

2. Supplementary data

2.1 LabMate.AI architecture

Figure S1. Schematics of the LabMate.AI workflow. Grey boxes depict actions performed by an independent Python script. Orange boxes depict actions performed by LabMate.AI. Dashed arrow depicts the manual addition of AUC data in LC-MS traces to the training data. CV: cross-validation.

2.2 Ugi chemistry

Table S1. Reactions selected by a balanced (exploration/exploitation) approach, using LabMate.AI, with the goal of increasing the target value.

| No. | Pyridine (mmol) | Aldehyde (mmol) | Isocyanide (mmol) | Temperature (°C) | Solvent (mL) | Catalyst (mol%) | Time (min) | Predicted^a | Target^b |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.1 | 0.2 | 0.1 | 80 | 1.5 | 0 | 60 | n/a^c | 1.2 |
| 2 | 0.2 | 0.1 | 0.1 | 10 | 0.1 | 4 | 15 | n/a | 2.9 |
| 3 | 0.3 | 0.1 | 0.3 | 10 | 1.5 | 3 | 5 | n/a | 2.2 |
| 4 | 0.3 | 0.2 | 0.1 | 40 | 0.1 | 3 | 10 | n/a | 7.0 |
| 5 | 0.3 | 0.1 | 0.2 | 40 | 0.1 | 2 | 15 | n/a | 4.2 |
| 6 | 0.2 | 0.3 | 0.2 | 20 | 1.0 | 3 | 5 | n/a | 3.1 |
| 7 | 0.1 | 0.1 | 0.2 | 60 | 1.5 | 2 | 5 | n/a | 1.2 |
| 8 | 0.3 | 0.2 | 0.1 | 40 | 0.25 | 1 | 5 | n/a | 1.9 |
| 9 | 0.1 | 0.2 | 0.3 | 60 | 0.5 | 4 | 5 | n/a | 6.7 |
| 10 | 0.2 | 0.3 | 0.1 | 10 | 1.0 | 4 | 5 | n/a | 4.7 |
| 11 | 0.1 | 0.2 | 0.3 | 80 | 1.5 | 4 | 10 | 3.8±5.8 | 13.8 |
| 12 | 0.1 | 0.2 | 0.3 | 80 | 1.5 | 0 | 10 | 7.4±34.3 | 0.5 |
| 13 | 0.1 | 0.2 | 0.3 | 80 | 1.5 | 3 | 10 | 7.1±36.5 | 7.8 |
| 14 | 0.1 | 0.1 | 0.3 | 80 | 1.5 | 4 | 60 | 7.8±29.4 | 21.7 |
| 15 | 0.1 | 0.1 | 0.3 | 80 | 1.5 | 0 | 60 | 8.5±91.2 | 0.3 |
| 16 | 0.1 | 0.1 | 0.2 | 80 | 1.5 | 5 | 60 | 10.6±89.2 | 16.2 |
| 17 | 0.3 | 0.1 | 0.3 | 20 | 1.5 | 4 | 60 | 12.7±80.4 | 4.0 |
| 18 | 0.1 | 0.1 | 0.3 | 80 | 1.5 | 3 | 60 | 9.2±59.4 | 15.4 |
| 19 | 0.1 | 0.1 | 0.3 | 10 | 1.5 | 4 | 60 | 9.2±62.0 | 6.0 |
| 20 | 0.1 | 0.1 | 0.3 | 80 | 1.5 | 2 | 60 | 6.8±54.2 | 10.5 |
| 21 | 0.2 | 0.1 | 0.3 | 80 | 1.5 | 4 | 60 | 17.8±19.1 | 15.0 |
| 22 | 0.1 | 0.1 | 0.3 | 80 | 0.1 | 5 | 60 | 17.7±7.6 | 23.4 |
| 23 | 0.2 | 0.1 | 0.3 | 80 | 0.1 | 5 | 60 | 20.9±13.6 | 25.7 |
| 24 | 0.3 | 0.1 | 0.3 | 80 | 0.1 | 5 | 60 | 23.6±15.9 | 23.6 |
| 25 | 0.2 | 0.1 | 0.3 | 80 | 0.25 | 5 | 60 | 24.0±8.5 | 24.5 |
| 26 | 0.2 | 0.2 | 0.3 | 80 | 0.1 | 5 | 60 | 24.1±8.0 | 24.6 |
| 27 | 0.2 | 0.3 | 0.3 | 80 | 0.1 | 5 | 60 | 24.5±4.6 | 27.1 |
| 28 | 0.2 | 0.3 | 0.3 | 80 | 0.5 | 5 | 60 | 26.0±4.2 | 24.6 |
| 29 | 0.3 | 0.3 | 0.3 | 80 | 0.1 | 5 | 60 | 25.4±3.7 | 28.6 |
| 30 | 0.3 | 0.3 | 0.3 | 80 | 0.1 | 7.5 | 60 | 27.2±2.9 | 30.5 |

^a Reaction outcomes predicted by LabMate.AI.
^b Experimental values as obtained through LC-MS analyses.
^c n/a: not applicable (randomly selected reactions).

Table S2. Reactions selected by a greedy (exploitation) approach, using LabMate.AI, with the goal of increasing the target value.

| No. | Pyridine (mmol) | Aldehyde (mmol) | Isocyanide (mmol) | Temperature (°C) | Solvent (mL) | Catalyst (mol%) | Time (min) | Predicted^a | Target^b |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.1 | 0.2 | 0.1 | 80 | 1.5 | 0 | 60 | n/a^c | 1.2 |
| 2 | 0.2 | 0.1 | 0.1 | 10 | 0.1 | 4 | 15 | n/a | 2.9 |
| 3 | 0.3 | 0.1 | 0.3 | 10 | 1.5 | 3 | 5 | n/a | 2.2 |
| 4 | 0.3 | 0.2 | 0.1 | 40 | 0.1 | 3 | 10 | n/a | 7.0 |
| 5 | 0.3 | 0.1 | 0.2 | 40 | 0.1 | 2 | 15 | n/a | 4.2 |
| 6 | 0.2 | 0.3 | 0.2 | 20 | 1.0 | 3 | 5 | n/a | 3.1 |
| 7 | 0.1 | 0.1 | 0.2 | 60 | 1.5 | 2 | 5 | n/a | 1.2 |
| 8 | 0.3 | 0.2 | 0.1 | 40 | 0.25 | 1 | 5 | n/a | 1.9 |
| 9 | 0.1 | 0.2 | 0.3 | 60 | 0.5 | 4 | 5 | n/a | 6.7 |
| 10 | 0.2 | 0.3 | 0.1 | 10 | 1.0 | 4 | 5 | n/a | 4.7 |
| 11 | 0.3 | 0.3 | 0.3 | 60 | 0.1 | 4 | 10 | 5.2±3.0 | 22.3 |
| 12 | 0.3 | 0.3 | 0.3 | 60 | 0.1 | 5 | 10 | 16.8±62.4 | 23.9 |
| 13 | 0.3 | 0.3 | 0.3 | 60 | 0.1 | 10 | 10 | 21.1±36.4 | 26.2 |
| 14 | 0.3 | 0.3 | 0.3 | 60 | 0.1 | 7.5 | 10 | 23.2±17.8 | 26.9 |
| 15 | 0.3 | 0.3 | 0.3 | 80 | 0.1 | 10 | 10 | 25.8±8.0 | 29.4 |
| 16 | 0.3 | 0.3 | 0.3 | 80 | 0.1 | 7.5 | 10 | 27.4±6.8 | 28.2 |
| 17 | 0.3 | 0.3 | 0.3 | 80 | 0.1 | 7.5 | 15 | 26.8±31.0 | 30.2^d |
| 18 | 0.3 | 0.3 | 0.3 | 80 | 0.1 | 10 | 15 | 29.6±0.9 | 32.0^e |
| 19 | 0.3 | 0.3 | 0.3 | 80 | 0.1 | 10 | 30 | 31.2±1.4 | 27.9 |
| 20 | 0.3 | 0.3 | 0.3 | 80 | 0.1 | 7.5 | 30 | 29.1±2.7 | 29.4 |

^a Reaction outcomes predicted by LabMate.AI.
^b Experimental values as obtained through LC-MS analyses.
^c n/a: not applicable (randomly selected reactions).
^d Reaction outcome was reproducible, as assessed from independent experiments (additional target values obtained: 28.7; 30.9; 30.0; n = 4).
^e Reaction outcome was irreproducible, as assessed from independent experiments (additional target values obtained: 28.4; 28.4; 30.2; 29.3; n = 5).

Table S3. Reactions selected by a greedy (exploitation) approach, using LabMate.AI, with the goal of decreasing the target value.

| No. | Pyridine (mmol) | Aldehyde (mmol) | Isocyanide (mmol) | Temperature (°C) | Solvent (mL) | Catalyst (mol%) | Time (min) | Predicted^a | Target^b |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.2 | 0.1 | 0.2 | 40 | 1.5 | 0 | 5 | 2.1±1.2 | 0.1 |
| 2 | 0.1 | 0.1 | 0.2 | 40 | 1.5 | 1 | 5 | 1.1±1.2 | 1.2 |
| 3 | 0.1 | 0.1 | 0.2 | 40 | 1.5 | 0 | 5 | 0.9±0.4 | 0.2 |

^a Reaction outcomes predicted by LabMate.AI.
^b Experimental values as obtained through LC-MS analyses.

Table S4. Reactions selected by Researcher 1 (11–26) with the goal of increasing the target value. Descriptors were anonymized and scaled.

| No. | Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6 | Column 7 | Predicted^a | Target^b |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.33 | 0.66 | 0.33 | 1 | 1 | 0.125 | 1 | n/a^c | 1.2 |
| 2 | 0.66 | 0.33 | 0.33 | 0.2 | 0.2 | 0.625 | 0.6 | n/a | 2.9 |
| 3 | 1 | 0.33 | 1 | 0.2 | 1 | 0.5 | 0.2 | n/a | 2.2 |
| 4 | 1 | 0.66 | 0.33 | 0.6 | 0.2 | 0.5 | 0.4 | n/a | 7.0 |
| 5 | 1 | 0.33 | 0.66 | 0.6 | 0.2 | 0.25 | 0.6 | n/a | 4.2 |
| 6 | 0.66 | 1 | 0.66 | 0.4 | 0.8 | 0.5 | 0.2 | n/a | 3.1 |
| 7 | 0.33 | 0.33 | 0.66 | 0.8 | 1 | 0.375 | 0.2 | n/a | 1.2 |
| 8 | 1 | 0.66 | 0.33 | 0.6 | 0.4 | 0.25 | 0.2 | n/a | 1.9 |
| 9 | 0.33 | 0.66 | 1 | 0.8 | 0.6 | 0.625 | 0.2 | n/a | 6.7 |
| 10 | 0.66 | 1 | 0.33 | 0.2 | 0.8 | 0.625 | 0.2 | n/a | 4.7 |
| 11 | 1 | 0.66 | 0.33 | 0.8 | 0.6 | 0.625 | 0.4 | n/a | 10.5 |
| 12 | 1 | 0.66 | 0.33 | 1 | 0.6 | 0.625 | 0.4 | n/a | 14.9 |
| 13 | 1 | 0.66 | 0.33 | 1 | 0.6 | 0.875 | 0.4 | n/a | 18.0 |
| 14 | 1 | 0.66 | 0.33 | 1 | 0.6 | 1 | 0.4 | n/a | 18.9 |
| 15 | 1 | 0.66 | 0.33 | 1 | 0.2 | 1 | 0.4 | n/a | 16.8 |
| 16 | 0.66 | 0.66 | 0.33 | 1 | 0.6 | 1 | 0.6 | n/a | 17.7 |
| 17 | 0.66 | 0.66 | 0.33 | 1 | 0.6 | 1 | 0.4 | n/a | 17.9 |
| 18 | 1 | 1 | 0.33 | 1 | 0.6 | 1 | 0.4 | n/a | 20.0 |
| 19 | 1 | 1 | 0.66 | 1 | 0.6 | 1 | 0.4 | n/a | 25.5 |
| 20 | 1 | 1 | 1 | 1 | 0.6 | 1 | 0.4 | n/a | 26.2^d |
| 21 | 1 | 1 | 1 | 1 | 0.8 | 1 | 0.4 | n/a | 25.7 |
| 22 | 1 | 1 | 0.66 | 1 | 0.8 | 1 | 0.4 | n/a | 22.8 |
| 23 | 1 | 1 | 1 | 1 | 0.6 | 0.875 | 0.4 | n/a | 26.0 |
| 24 | 1 | 1 | 1 | 1 | 0.8 | 0.875 | 0.4 | n/a | 23.6 |
| 25 | 1 | 1 | 1 | 1 | 0.6 | 1 | 0.2 | n/a | 26.3 |
| 26 | 0.66 | 1 | 1 | 1 | 0.6 | 1 | 0.2 | n/a | 22.0 |

^a Reaction outcomes predicted by LabMate.AI.
^b Experimental values as obtained through LC-MS analyses.
^c n/a: not applicable (randomly selected reactions or suggested by Researcher 1).
^d Reaction outcome was reproducible, as assessed from independent experiments (additional target values obtained: 26.4; 26.9; n = 3).

Table S5. Reactions selected by Researcher 2 (11–20) with the goal of increasing the target value. Descriptors were anonymized and scaled.

| No. | Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6 | Column 7 | Predicted^a | Target^b |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.33 | 0.66 | 0.33 | 1 | 1 | 0.125 | 1 | n/a^c | 1.2 |
| 2 | 0.66 | 0.33 | 0.33 | 0.2 | 0.2 | 0.625 | 0.6 | n/a | 2.9 |
| 3 | 1 | 0.33 | 1 | 0.2 | 1 | 0.5 | 0.2 | n/a | 2.2 |
| 4 | 1 | 0.66 | 0.33 | 0.6 | 0.2 | 0.5 | 0.4 | n/a | 7.0 |
| 5 | 1 | 0.33 | 0.66 | 0.6 | 0.2 | 0.25 | 0.6 | n/a | 4.2 |
| 6 | 0.66 | 1 | 0.66 | 0.4 | 0.8 | 0.5 | 0.2 | n/a | 3.1 |
| 7 | 0.33 | 0.33 | 0.66 | 0.8 | 1 | 0.375 | 0.2 | n/a | 1.2 |
| 8 | 1 | 0.66 | 0.33 | 0.6 | 0.4 | 0.25 | 0.2 | n/a | 1.9 |
| 9 | 0.33 | 0.66 | 1 | 0.8 | 0.6 | 0.625 | 0.2 | n/a | 6.7 |
| 10 | 0.66 | 1 | 0.33 | 0.2 | 0.8 | 0.625 | 0.2 | n/a | 4.7 |
| 11 | 0.33 | 0.33 | 1 | 0.6 | 0.2 | 0.5 | 0.4 | n/a | 5.0 |
| 12 | 0.33 | 0.66 | 1 | 0.6 | 0.2 | 0.5 | 0.4 | n/a | 11.7 |
| 13 | 1 | 0.66 | 1 | 0.6 | 0.2 | 0.5 | 0.4 | n/a | 12.3 |
| 14 | 0.66 | 0.66 | 1 | 0.6 | 0.2 | 0.5 | 0.4 | n/a | 13.9 |
| 15 | 0.66 | 0.66 | 0.66 | 0.6 | 0.2 | 0.5 | 0.4 | n/a | 13.4 |
| 16 | 0.66 | 1 | 1 | 0.8 | 0.2 | 0.75 | 0.4 | n/a | 20.4 |
| 17 | 0.66 | 1 | 1 | 0.8 | 0.2 | 0.875 | 0.8 | n/a | 26.0^d |
| 18 | 0.66 | 1 | 1 | 0.8 | 0.2 | 1 | 1 | n/a | 24.3 |
| 19 | 0.66 | 1 | 1 | 1 | 0.2 | 0.75 | 0.8 | n/a | 25.0 |
| 20 | 0.66 | 1 | 1 | 0.6 | 0.2 | 0.875 | 0.8 | n/a | 24.9 |

^a Reaction outcomes predicted by LabMate.AI.
^b Experimental values as obtained through LC-MS analyses.
^c n/a: not applicable (randomly selected reactions or suggested by Researcher 2).
^d Reaction outcome was reproducible, as assessed from independent experiments (additional target values obtained: 27.3; 26.8; n = 3).

Table S6. Reactions selected by Researcher 3 (11–20) with the goal of increasing the target value. Descriptors were anonymized and scaled.

| No. | Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6 | Column 7 | Predicted^a | Target^b |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.33 | 0.66 | 0.33 | 1 | 1 | 0.125 | 1 | n/a^c | 1.2 |
| 2 | 0.66 | 0.33 | 0.33 | 0.2 | 0.2 | 0.625 | 0.6 | n/a | 2.9 |
| 3 | 1 | 0.33 | 1 | 0.2 | 1 | 0.5 | 0.2 | n/a | 2.2 |
| 4 | 1 | 0.66 | 0.33 | 0.6 | 0.2 | 0.5 | 0.4 | n/a | 7.0 |
| 5 | 1 | 0.33 | 0.66 | 0.6 | 0.2 | 0.25 | 0.6 | n/a | 4.2 |
| 6 | 0.66 | 1 | 0.66 | 0.4 | 0.8 | 0.5 | 0.2 | n/a | 3.1 |
| 7 | 0.33 | 0.33 | 0.66 | 0.8 | 1 | 0.375 | 0.2 | n/a | 1.2 |
| 8 | 1 | 0.66 | 0.33 | 0.6 | 0.4 | 0.25 | 0.2 | n/a | 1.9 |
| 9 | 0.33 | 0.66 | 1 | 0.8 | 0.6 | 0.625 | 0.2 | n/a | 6.7 |
| 10 | 0.66 | 1 | 0.33 | 0.2 | 0.8 | 0.625 | 0.2 | n/a | 4.7 |
| 11 | 1 | 1 | 0.66 | 0.6 | 0.6 | 0.625 | 0.2 | n/a | 6.8 |
| 12 | 0.66 | 0.66 | 0.66 | 0.6 | 0.4 | 0.625 | 0.2 | n/a | 10.1 |
| 13 | 1 | 0.66 | 0.66 | 0.6 | 0.2 | 0.625 | 0.4 | n/a | 12.0 |
| 14 | 1 | 0.66 | 1 | 0.6 | 0.2 | 1 | 0.2 | n/a | 14.8 |
| 15 | 1 | 0.66 | 1 | 0.6 | 0.8 | 1 | 0.4 | n/a | 11.0 |
| 16 | 1 | 1 | 1 | 0.8 | 0.2 | 1 | 0.2 | n/a | 22.0 |
| 17 | 1 | 1 | 1 | 1 | 0.2 | 0.5 | 0.2 | n/a | 25.8^d |
| 18 | 1 | 1 | 1 | 1 | 0.2 | 0.125 | 0.2 | n/a | 4.6 |
| 19 | 1 | 1 | 1 | 1 | 0.6 | 0.375 | 0.8 | n/a | 25.5 |
| 20 | 1 | 1 | 1 | 1 | 0.4 | 0.625 | 0.4 | n/a | 23.9 |

^a Reaction outcomes predicted by LabMate.AI.
^b Experimental values as obtained through LC-MS analyses.
^c n/a: not applicable (randomly selected reactions or suggested by Researcher 3).
^d Reaction outcome was reproducible, as assessed from independent experiments (additional target values obtained: 26.2; 27.5; n = 3).

Table S7. Reactions selected by Researcher 4 (11–20) with the goal of increasing the target value.

| No. | Pyridine (mmol) | Aldehyde (mmol) | Isocyanide (mmol) | Temperature (°C) | Solvent (mL) | Catalyst (mol%) | Time (min) | Predicted^a | Target^b |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.1 | 0.2 | 0.1 | 80 | 1.5 | 0 | 60 | n/a^c | 1.2 |
| 2 | 0.2 | 0.1 | 0.1 | 10 | 0.1 | 4 | 15 | n/a | 2.9 |
| 3 | 0.3 | 0.1 | 0.3 | 10 | 1.5 | 3 | 5 | n/a | 2.2 |
| 4 | 0.3 | 0.2 | 0.1 | 40 | 0.1 | 3 | 10 | n/a | 7.0 |
| 5 | 0.3 | 0.1 | 0.2 | 40 | 0.1 | 2 | 15 | n/a | 4.2 |
| 6 | 0.2 | 0.3 | 0.2 | 20 | 1.0 | 3 | 5 | n/a | 3.1 |
| 7 | 0.1 | 0.1 | 0.2 | 60 | 1.5 | 2 | 5 | n/a | 1.2 |
| 8 | 0.3 | 0.2 | 0.1 | 40 | 0.25 | 1 | 5 | n/a | 1.9 |
| 9 | 0.1 | 0.2 | 0.3 | 60 | 0.5 | 4 | 5 | n/a | 6.7 |
| 10 | 0.2 | 0.3 | 0.1 | 10 | 1.0 | 4 | 5 | n/a | 4.7 |
| 11 | 0.1 | 0.1 | 0.1 | 60 | 0.5 | 3 | 10 | n/a | 4.5 |
| 12 | 0.1 | 0.1 | 0.1 | 60 | 0.5 | 3 | 15 | n/a | 6.7 |
| 13 | 0.3 | 0.2 | 0.1 | 60 | 0.5 | 3 | 15 | n/a | 9.8 |
| 14 | 0.3 | 0.2 | 0.1 | 60 | 0.5 | 10 | 15 | n/a | 14.1 |
| 15 | 0.2 | 0.3 | 0.1 | 60 | 0.5 | 10 | 15 | n/a | 10.9 |
| 16 | 0.3 | 0.2 | 0.2 | 60 | 0.5 | 10 | 15 | n/a | 17.4 |
| 17 | 0.3 | 0.2 | 0.3 | 60 | 0.5 | 10 | 15 | n/a | 18.1 |
| 18 | 0.3 | 0.3 | 0.3 | 60 | 0.5 | 10 | 15 | n/a | 24.1 |
| 19 | 0.3 | 0.3 | 0.3 | 80 | 0.5 | 10 | 15 | n/a | 26.5^d |
| 20 | 0.3 | 0.3 | 0.3 | 80 | 0.5 | 10 | 30 | n/a | 25.7 |

^a Reaction outcomes predicted by LabMate.AI.
^b Experimental values as obtained through LC-MS analyses.
^c n/a: not applicable (randomly selected reactions or suggested by Researcher 4).
^d Reaction outcome was reproducible, as assessed from independent experiments (additional target values obtained: 24.9; 25.3; 26.9; n = 4).

Table S8. Fit of machine learning models to data generated by Researcher 1.

| Method | MAE^a | MSE^b | r2^c |
|---|---|---|---|
| Random forest | 2.91055 | 1.81058335 | 0.973682568684 |
| Support Vector Machines | 9.23807934846 | 55.1279864618 | 0.198696377442 |
| Feed-forward deep neural network | 2.01608414867 | 0.282085154459 | 0.995899798439 |

^a Mean absolute error. ^b Mean squared error. ^c Coefficient of determination.

Table S9. Fit of machine learning models to data generated by Researcher 2.

| Method | MAE^a | MSE^b | r2^c |
|---|---|---|---|
| Random forest | 3.20625 | 1.58487195 | 0.979274320967 |
| Support Vector Machines | 8.24056996249 | 74.7648992317 | 0.0222848575017 |
| Feed-forward deep neural network | 2.17436328915 | 0.589112468749 | 0.992296061558 |

^a Mean absolute error. ^b Mean squared error. ^c Coefficient of determination.

Table S10. Fit of machine learning models to data generated by Researcher 3.

| Method | MAE^a | MSE^b | r2^c |
|---|---|---|---|
| Random forest | 4.74315 | 4.71462185 | 0.930256660542 |
| Support Vector Machines | 6.29566411564 | 66.0728023979 | 0.0225858969891 |
| Feed-forward deep neural network | 3.0268854045 | 4.2864358285 | 0.936590810767 |

^a Mean absolute error. ^b Mean squared error. ^c Coefficient of determination.

Table S11. Fit of machine learning models to data generated by Researcher 4.

| Method | MAE^a | MSE^b | r2^c |
|---|---|---|---|
| Random forest | 2.8918 | 1.417876816 | 0.979074546527 |
| Support Vector Machines | 6.43123982293 | 59.5117270265 | 0.121707992594 |
| Feed-forward deep neural network | 2.07530305907 | 0.249032635908 | 0.996324701288 |

^a Mean absolute error. ^b Mean squared error. ^c Coefficient of determination.

Figure S2. LabMate.AI provides better predictability than linear regression models. MAE: mean absolute error; MSE: mean squared error; r2: coefficient of determination. For MAE and MSE, low values are desired; for r2, high values are desired.

Figure S3. Retrospective analysis of the predictive power of LabMate.AI. Two random forest models were built, one with data collected in the exploration phase of the balanced approach and one with data collected by the exploitative approach. The reaction outcomes of the held-out data (10 random reactions) were then predicted. The model generated with data from the explorative approach predicts outcomes more accurately (lower mean absolute error), and thus has a wider domain of applicability. The exploitative approach selects experiments that tend to be more similar to one another. This leads to a model trained on less diverse information and, thus, to larger prediction errors for reaction conditions dissimilar to the training set.

2.3 C–N coupling

Table S12. Reactions selected by a greedy (exploitation) approach, using LabMate.AI, with the goal of increasing the reaction conversion.

| No. | Pyridine (mmol) | Amine (mmol) | Solvent (mL) | Catalyst (mol %) | Ligand (mol %) | Base (mol eq.) | Time (min) | Temp. (°C) | Pred^a | Exp^b |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.5 | 0.6 | 1.5 | 4.0 | 8.0 | 1 | 30 | 140 | n/a^c | 3.40 |
| 2 | 0.5 | 0.6 | 2.0 | 4.0 | 5.0 | 4 | 90 | 160 | n/a | 0.54 |
| 3 | 0.5 | 0.6 | 2.5 | 4.0 | 8.5 | 2 | 60 | 140 | n/a | 3.02 |
| 4 | 0.5 | 0.5 | 2.5 | 3.0 | 6.0 | 3 | 30 | 140 | n/a | 2.54 |
| 5 | 0.6 | 0.5 | 2.5 | 2.5 | 8.5 | 3 | 60 | 140 | n/a | 2.36 |
| 6 | 0.6 | 0.5 | 4.0 | 5.0 | 10.0 | 2 | 90 | 160 | n/a | 2.37 |
| 7 | 0.6 | 0.6 | 1.5 | 2.5 | 10.0 | 2 | 90 | 140 | n/a | 2.51 |
| 8 | 0.5 | 0.5 | 1.5 | 4.0 | 7.5 | 3 | 90 | 140 | n/a | 2.34 |
| 9 | 0.5 | 0.5 | 2.5 | 5.0 | 7.0 | 3 | 30 | 160 | n/a | 0.82 |
| 10 | 0.6 | 0.6 | 4.0 | 5.0 | 5.0 | 4 | 60 | 160 | n/a | 2.48 |
| 11 | 0.5 | 0.6 | 1.0 | 4.0 | 9.0 | 1 | 30 | 140 | 3.32±0.69 | 3.00 |
| 12 | 0.5 | 0.6 | 1.0 | 4.0 | 8.5 | 1 | 30 | 140 | 3.09±0.15 | 3.18 |
| 13 | 0.5 | 0.6 | 1.0 | 4.0 | 8.0 | 1 | 30 | 140 | 3.21±0.06 | 3.41 |
| 14 | 0.5 | 0.6 | 1.5 | 4.0 | 8.5 | 1 | 30 | 140 | 3.22±0.06 | 3.86 |
| 15 | 0.5 | 0.6 | 1.5 | 4.0 | 9.0 | 1 | 30 | 140 | 3.39±0.16 | 3.92 |
| 16 | 0.5 | 0.6 | 1.5 | 4.0 | 10.0 | 1 | 30 | 140 | 3.66±0.16 | 3.54 |
| 17 | 0.5 | 0.6 | 1.0 | 4.0 | 10.0 | 1 | 30 | 140 | 3.36±0.04 | 3.45 |
| 18 | 0.5 | 0.6 | 1.5 | 3.0 | 9.0 | 1 | 30 | 140 | 3.61±0.15 | 4.23 |
| 19 | 0.5 | 0.6 | 1.5 | 3.0 | 10.0 | 1 | 30 | 140 | 3.85±0.14 | 3.74 |
| 20 | 0.5 | 0.6 | 1.5 | 3.0 | 8.5 | 1 | 30 | 140 | 3.81±0.12 | 4.88^d |

^a Reaction outcomes predicted by LabMate.AI.
^b Experimental values as obtained through LC-MS analyses and corrected for the number of mmol used.
^c n/a: not applicable (randomly selected reactions).
^d Reaction outcome was reproducible, as assessed from independent experiments (additional values obtained: 4.84; 4.50; n = 3).

Table S13. Metrics for model quality assessment (C–N coupling).

| No. | MAE^a | MSE^b | r2^c |
|---|---|---|---|
| 1 | 0.78277 | 0.14064832799999999 | 0.83156312475973659 |
| 2 | 0.634220909091 | 0.094632450527271622 | 0.86392751872485307 |
| 3 | 0.590489285714 | 0.082436163197279935 | 0.8815033230675644 |
| 4 | 0.55627 | 0.079775172399999678 | 0.88881499903840966 |
| 5 | 0.563660056022 | 0.084796757850163981 | 0.88181626790053458 |
| 6 | 0.549231888889 | 0.10461973781810206 | 0.87813753593281896 |
| 7 | 0.535018712399 | 0.18044753727651133 | 0.78802352966350564 |
| 8 | 0.499889838936 | 0.081544605934477876 | 0.90197367519381944 |
| 9 | 0.505292436067 | 0.082484419083966587 | 0.908418766071447 |
| 10 | 0.476603837624 | 0.078417924361738625 | 0.91232391860947915 |

^a Mean absolute error. ^b Mean squared error. ^c Coefficient of determination.

2.4 1H NMR data

Figure S4. 1H NMR spectrum for the Ugi product (300 MHz, acetone-d6). Purity >95%.

Figure S5. 1H NMR spectrum of the C2–N coupling product (300 MHz, DMSO-d6). Impurity corresponds to the minor product (C4–N coupling). Spectrum is in full agreement with literature data1. Purity: 80%.

Figure S6. 1H NMR spectrum of the C4–N coupling product (300 MHz, DMSO-d6). Spectrum is in full agreement with literature data1. Purity >95%.

2.5 Survey

Figure S7. Heat map of feature importance for the studied C–N coupling as surveyed among graduate students, postdoctoral researchers and principal investigators. First column corresponds to the extracted random forest feature importance in LabMate.AI. Each subsequent column corresponds to one person.

3. References

1 Noonan, G. M., Dishington, A. P., Pink, J. & Campbell, A. D. Studies on the coupling of substituted 2-amino-1,3-oxazoles with chloro-heterocycles. Tetrahedron Lett. 53, 3038-3043 (2012).