{"title": "Efficient Probabilistic Inference in the Quest for Physics Beyond the Standard Model", "book": "Advances in Neural Information Processing Systems", "page_first": 5459, "page_last": 5472, "abstract": "We present a novel probabilistic programming framework that couples directly to existing large-scale simulators through a cross-platform probabilistic execution protocol, which allows general-purpose inference engines to record and control random number draws within simulators in a language-agnostic way. The execution of existing simulators as probabilistic programs enables highly interpretable posterior inference in the structured model defined by the simulator code base. We demonstrate the technique in particle physics, on a scientifically accurate simulation of the tau lepton decay, which is a key ingredient in establishing the properties of the Higgs boson. Inference efficiency is achieved via inference compilation where a deep recurrent neural network is trained to parameterize proposal distributions and control the stochastic simulator in a sequential importance sampling scheme, at a fraction of the computational cost of a Markov chain Monte Carlo baseline.", "full_text": "Ef\ufb01cient Probabilistic Inference in the Quest for\n\nPhysics Beyond the Standard Model\n\nAt\u0131l\u0131m G\u00fcne\u00b8s Baydin,1 Lukas Heinrich,2 Wahid Bhimji,3 Lei Shao,4\n\nSaeid Naderiparizi,5 Andreas Munk,5 Jialin Liu,3 Bradley Gram-Hansen,1 Gilles Louppe6\nLawrence Meadows,4 Philip Torr,1 Victor Lee,4 Prabhat,3 Kyle Cranmer,7 Frank Wood5\n\n1University of Oxford, 2CERN, 3Lawrence Berkeley National Lab, 4Intel Corporation\n\n5University of British Columbia, 6University of Liege, 7New York University\n\nAbstract\n\nWe present a novel probabilistic programming framework that couples directly to\nexisting large-scale simulators through a cross-platform probabilistic execution\nprotocol, which allows general-purpose inference engines to record and control ran-\ndom number 
draws within simulators in a language-agnostic way. The execution of\nexisting simulators as probabilistic programs enables highly interpretable posterior\ninference in the structured model de\ufb01ned by the simulator code base. We demon-\nstrate the technique in particle physics, on a scienti\ufb01cally accurate simulation of\nthe \u03c4 (tau) lepton decay, which is a key ingredient in establishing the properties of\nthe Higgs boson. Inference ef\ufb01ciency is achieved via inference compilation where\na deep recurrent neural network is trained to parameterize proposal distributions\nand control the stochastic simulator in a sequential importance sampling scheme,\nat a fraction of the computational cost of a Markov chain Monte Carlo baseline.\n\n1\n\nIntroduction\n\nComplex simulators are used to express causal generative models of data across a wide segment of the\nscienti\ufb01c community, with applications as diverse as hazard analysis in seismology [49], supernova\nshock waves in astrophysics [36], market movements in economics [73], and blood \ufb02ow in biology\n[72]. In these generative models, complex simulators are composed from low-level mechanistic\ncomponents. These models are typically non-differentiable and lead to intractable likelihoods, which\nrenders many traditional statistical inference algorithms irrelevant and motivates a new class of\nso-called likelihood-free inference algorithms [48].\nThere are two broad strategies for this type of likelihood-free inference problem. In the \ufb01rst, one uses a\nsimulator indirectly to train a surrogate model endowed with a likelihood that can be used in traditional\ninference algorithms, for example approaches based on conditional density estimation [56, 70, 77, 85]\nand density ratio estimation [30, 35]. 
Alternatively, approximate Bayesian computation (ABC)\n[81, 87] refers to a large class of approaches for sampling from the posterior distribution of these\nlikelihood-free models, where the original simulator is used directly as part of the inference engine.\nWhile variational inference [22] algorithms are often used when the posterior is intractable, they are\nnot directly applicable when the likelihood of the data generating process is unknown [84].\nThe class of inference strategies that directly use a simulator avoids the necessity of approximating\nthe generative model. Moreover, using a domain-speci\ufb01c simulator offers a natural pathway for\ninference algorithms to provide interpretable posterior samples. In this work, we take this approach,\nextend previous work in universal probabilistic programming [44, 86] and inference compilation\n[63, 65] to large-scale complex simulators, and demonstrate the ability to execute existing simulator\ncodes under the control of general-purpose inference engines. This is achieved by creating a cross-\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Top left: overall framework where the PPS is controlling the simulator. Bottom left:\nprobabilistic execution of a single trace. Right: LSTM proposals conditioned on an observation.\n\nplatform probabilistic execution protocol (Figure 1, left) through which an inference engine can\ncontrol simulators in a language-agnostic way. We implement a range of general-purpose inference\nengines from the Markov chain Monte Carlo (MCMC) [25] and importance sampling [34] families.\nThe execution framework we develop currently has bindings in C++ and Python, which are languages\nof choice for many large-scale projects in science and industry. 
It can also be used by any other language that supports flatbuffers1 pending the implementation of a lightweight front end.

We demonstrate the technique in a particle physics setting, introducing probabilistic programming as a novel tool to determine the properties of particles at the Large Hadron Collider (LHC) [1, 29] at CERN. This is achieved by coupling our framework with SHERPA2 [42], a state-of-the-art Monte Carlo event generator of high-energy reactions of particles, which is commonly used with Geant43 [5], a toolkit for the simulation of the passage of the resulting particles through detectors. In particular, we perform inference on the details of the decay of a τ (tau) lepton measured by an LHC-like detector by controlling the SHERPA simulation (with minimal modifications to the standard software), extract posterior distributions, and compare to ground truth. To our knowledge, this is the first time that universal probabilistic programming has been applied in this domain and at this scale, controlling a code base of nearly one million lines of code. Our approach is scalable to more complex events and full detector simulators, paving the way to its use in the discovery of new fundamental physics.

2 Particle Physics and Probabilistic Inference

Our work is motivated by applications in particle physics, which studies elementary particles and their interactions using high-energy collisions created in particle accelerators such as the LHC at CERN. In this setting, collision events happen many millions of times per second, creating cascading particle decays recorded by complex detectors instrumented with millions of electronics channels.
These experiments then seek to filter the vast volume of (petabyte-scale) resulting data to make discoveries that shape our understanding of fundamental physics.

The complexity of the underlying physics and of the detectors has, until now, prevented the community from employing likelihood-free inference techniques for individual collision events. However, the community has developed sophisticated simulator packages such as SHERPA [42], Geant4 [5], Pythia8 [79], Herwig++ [16], and MadGraph5 [6] to model physical processes and the interactions of particles with detectors. This is interesting from a probabilistic programming point of view, because these simulators are essentially very accurate generative models implementing the Standard Model of particle physics and the passage of particles through matter (i.e., particle detectors).

1 https://google.github.io/flatbuffers/
2 Simulation of High-Energy Reactions of Particles. https://sherpa.hepforge.org/
3 Geometry and Tracking. https://geant4.web.cern.ch/
These simulators are coded in Turing-complete general-purpose programming languages; performing inference in such a setting requires inference techniques developed for universal probabilistic programming and cannot be handled by more traditional approaches that apply to, for example, finite probabilistic graphical models [58]. Thus we focus on creating an infrastructure for the interpretation of existing simulator packages as probabilistic programs, which lays the groundwork for running inference in scientifically accurate probabilistic models using general-purpose inference algorithms.

The τ Lepton Decay. The specific physics setting we focus on in this paper is the decay of a τ lepton inside an LHC-like detector. This is a real use case in particle physics currently under active study by LHC physicists [2], and it is also of interest due to its importance in establishing the properties of the recently discovered Higgs boson [1, 29] through its decay to τ particles [12, 33, 46, 47]. Once produced, the τ decays to further particles according to certain decay channels. The prior probabilities of these decays, or "branching ratios", are shown in Figure 8 (appendix).

3 Related Work

3.1 Probabilistic Programming

Probabilistic programming languages (PPLs) extend general-purpose programming languages with constructs to do sampling and conditioning of random variables [86]. PPLs decouple model specification from inference: the user implements a model as a regular program in the host programming language, specifying a generative process that produces samples at each execution. In other words, the program produces samples from a joint prior distribution p(x, y) = p(y|x)p(x) that it implicitly defines, where x and y denote latent and observed random variables, respectively.
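The sample/observe view of a program described above can be sketched in a few lines of plain Python. This is an illustrative toy only, not the paper's pyprob API: the names Trace, sample, observe, and the Gaussian "momentum/detector" model are our own inventions.

```python
import math
import random

# A probabilistic program is an ordinary function that calls `sample` for
# latents and `observe` for likelihood terms; running it records a trace whose
# log-probability factorizes as log p(x, y) = log p(x) + log p(y | x).

def normal_logpdf(value, mean, std):
    return -0.5 * ((value - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

class Trace:
    def __init__(self):
        self.latents = {}      # address -> sampled value x
        self.log_prior = 0.0   # accumulates log p(x)
        self.log_lik = 0.0     # accumulates log p(y | x)

    def sample(self, address, mean, std):
        value = random.gauss(mean, std)
        self.latents[address] = value
        self.log_prior += normal_logpdf(value, mean, std)
        return value

    def observe(self, observed, mean, std):
        self.log_lik += normal_logpdf(observed, mean, std)

def model(trace, y_observed):
    # latent "momentum" drawn from the prior; the address labels the draw
    mu = trace.sample("momentum", mean=0.0, std=5.0)
    # noisy detector reading conditioned on the latent
    trace.observe(y_observed, mean=mu, std=1.0)
    return mu

random.seed(0)
t = Trace()
model(t, y_observed=2.5)
print(t.latents, t.log_prior + t.log_lik)  # one sample from the implicit joint p(x, y)
```

Each execution yields one draw from the implicit joint distribution, which is exactly what the inference engines below consume.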
The\nprogram can then be executed using a variety of general-purpose inference engines available in the\nPPL to obtain p(x|y), the posterior distribution of latent variables x conditioned on observed vari-\nables y. Universal PPLs allow the expression of unrestricted probability models in a Turing-complete\nfashion [43, 89, 90], in contrast to languages such as Stan [28, 39] that target the more restricted\nmodel class of probabilistic graphical models [58]. Inference engines available in PPLs range from\nMCMC-based lightweight Metropolis Hastings (LMH) [89] and random-walk Metropolis Hastings\n(RMH) [62] to importance sampling (IS) [11] and sequential Monte Carlo [34]. Modern PPLs such\nas Pyro [20] and Edward2 [32, 82, 83] use gradient-based inference engines including variational\ninference [52, 57] and Hamiltonian Monte Carlo [53, 69] that bene\ufb01t from modern deep learning\nhardware and automatic differentiation [18] features provided by PyTorch [71] and TensorFlow\n[3] libraries. Another way of making use of gradient-based optimization is to combine IS with\ndeep-learning-based proposals trained with data sampled from the probabilistic program, resulting in\nthe inference compilation (IC) algorithm [63] that enables amortized inference [40].\n\n3.2 Data Analysis in Particle Physics\n\nInference for an individual collision event in particle physics is often referred to as reconstruction [61].\nReconstruction algorithms can be seen as a form of structured prediction: from the raw event data they\nproduce a list of candidate particles together with their types and point-estimates for their momenta.\nThe variance of these estimators is characterized by comparison to the ground truth values of the latent\nvariables from simulated events. Bayesian inference on the latent state of an individual collision is\nrare in particle physics, given the complexity of the latent structure of the generative model. 
Until now,\ninference for the latent structure of an individual event has only been possible by accepting a drastic\nsimpli\ufb01cation of the high-\ufb01delity simulators [4, 7\u201310, 15, 23, 27, 37, 38, 45, 59, 66, 67, 78, 80]. In\ncontrast, inference for the fundamental parameters is based on hierarchical models and probed at the\npopulation level. Recently, machine learning techniques have been employed to learn surrogates for\nthe implicit densities de\ufb01ned by the simulators as a strategy for likelihood-free inference [24].\nCurrently particle physics simulators are run in forward mode to produce substantial datasets that\noften exceed the size of datasets from actual collisions within the experiments. These are then reduced\nto considerably lower dimensional datasets of a handful of variables using physics domain knowledge,\nwhich can then be directly compared to collision data. Machine learning and statistical approaches\nfor classi\ufb01cation of particle types or regression of particle properties can be trained on these large\npre-generated datasets produced by the high-\ufb01delity simulators developed over many decades [13, 55].\n\n3\n\n\fThe \ufb01eld is increasingly employing deep learning techniques allowing these algorithms to process\nhigh-dimensional, low-level data [14, 17, 31, 54, 74]. 
However, these approaches do not estimate the posterior of the full latent state nor provide the level of interpretability our probabilistic inference framework enables by directly tying inference results to the latent process encoded by the simulator.

4 Probabilistic Inference in Large-Scale Simulators

In this section we describe the main components of our probabilistic inference framework, including: (1) a novel PyTorch-based [71] PPL and associated inference engines in Python, (2) a probabilistic programming execution protocol that defines a cross-platform interface for connecting models and inference engines implemented in different languages and executed in separate processes, and (3) a lightweight C++ front end allowing execution of models written in C++ under the control of our PPL.

4.1 Designing a PPL for Existing Large-Scale Simulators

A shortcoming of the state-of-the-art PPLs is that they are not designed to directly support existing code bases, requiring one to implement any model from scratch in each specific PPL. This limitation rules out their applicability to a very large body of existing models implemented as domain-specific simulators in many fields across academia and industry. A PPL, by definition, is a programming language with additional constructs for sampling random values from probability distributions and conditioning values of random variables via observations [44, 86]. Domain-specific simulators in particle physics and other fields are commonly stochastic in nature, and thus already exhibit random sampling behavior, albeit generally from simplistic distributions such as the continuous uniform. By interfacing with these simulators at the level of random number sampling (via capturing calls to the random number generator) and introducing a construct for conditioning, we can execute existing stochastic simulators as probabilistic programs.
Our work introduces the necessary framework\nto do so, and makes these simulators, which commonly represent the most accurate models and\nunderstanding in their corresponding \ufb01elds, subject to Bayesian inference using general-purpose\ninference engines. In this setting, a simulator is no longer a black box, as all predictions are directly\ntied into the fully-interpretable structured model implemented by the simulator code base.\nTo realize our framework, we implement a universal PPL called pyprob,4 speci\ufb01cally designed to\nexecute models written not only in Python but also in other languages. Our PPL currently has two\nfamilies of inference engines:5 (1) MCMC of the lightweight Metropolis\u2013Hastings (LMH) [89] and\nrandom-walk Metropolis\u2013Hastings (RMH) [62] varieties, and (2) sequential importance sampling (IS)\n[11, 34] with its regular (i.e., sampling from the prior) and inference compilation (IC) [63] varieties.\nThe IC technique, where a recurrent neural network (NN) is trained in order to provide amortized\ninference to guide (control) a probabilistic program conditioning on observed inputs, forms our main\ninference method for performing ef\ufb01cient inference in large-scale simulators. Because IC training\nand inference uses dynamic recon\ufb01guration of NN modules [63], we base our PPL on PyTorch\n[71], whose automatic differentiation feature with support for dynamic computation graphs [18] has\nbeen crucial in our implementation. 
The LMH and RMH engines we implement are specialized for sampling in the space of execution traces of probabilistic programs, and provide a way of sampling from the true posterior, therefore serving as a baseline, albeit at a high computational cost.

A probabilistic program can be expressed as a sequence of random samples (x_t, a_t, i_t)_{t=1}^{T}, where x_t, a_t, and i_t are respectively the value, address,6 and instance (counter) of a sample, the execution of which describes a joint probability distribution between latent (unobserved) random variables x := (x_t)_{t=1}^{T} and observed random variables y := (y_n)_{n=1}^{N} given by

p(x, y) := \prod_{t=1}^{T} f_{a_t}(x_t \mid x_{1:t-1}) \prod_{n=1}^{N} g_n(y_n \mid x_{\prec n}) \,, \qquad (1)

where f_{a_t}(\cdot \mid x_{1:t-1}) denotes the prior probability distribution of a random variable with address a_t conditional on all preceding values x_{1:t-1}, and g_n(\cdot \mid x_{\prec n}) is the likelihood density given the sample values x_{\prec n} preceding observation y_n.

4 https://github.com/pyprob/pyprob
5 The selection of these families was motivated by working with existing simulators through an execution protocol (Section 4.2) precluding the use of gradient-based inference engines. We plan to extend this protocol in future work to incorporate differentiability.
6 An "address" is a label uniquely identifying each sampling or conditioning event in the execution of the program. In our system it is based on a concatenation of stack frames (Table 1) leading up to the point of each random number draw, and it also includes a suffix identifying the type of the associated probability distribution.
Once a model p(x, y) is expressed as a probabilistic program, we are interested in performing inference in order to get posterior distributions p(x|y) of latent variables x conditioned on observed variables y.

Inference engines of the MCMC family, designed to work in the space of probabilistic execution traces, constitute the gold standard for obtaining samples from the true posterior of a probabilistic program [62, 86, 89]. Given a current sequence of latents x in the trace space, these work by making proposals x' according to a proposal distribution q(x'|x) and deciding whether to move from x to x' based on the Metropolis–Hastings acceptance ratio of the form

\alpha = \min \left\{ 1, \frac{p(x') \, q(x \mid x')}{p(x) \, q(x' \mid x)} \right\} . \qquad (2)

Inference engines in the IS family use a weighted set of samples \{(w_k, x^k)\}_{k=1}^{K} to construct an empirical approximation of the posterior distribution: \hat{p}(x \mid y) = \sum_{k=1}^{K} w_k \delta(x^k - x) / \sum_{j=1}^{K} w_j, where \delta is the Dirac delta function. The importance weight for each execution trace is

w_k = \prod_{n=1}^{N} g_n\big(y_n \mid x^k_{1:\tau_k(n)}\big) \prod_{t=1}^{T^k} \frac{f_{a_t}\big(x^k_t \mid x^k_{1:t-1}\big)}{q_{a_t, i_t}\big(x^k_t \mid x^k_{1:t-1}\big)} \,, \qquad (3)

where q_{a_t,i_t}(\cdot \mid x^k_{1:t-1}) is known as the proposal distribution and may be identical to the prior f_{a_t} (as in regular IS). In the IC technique, we train a recurrent NN to receive the observed values y and return a set of adapted proposals q_{a_t,i_t}(x_t \mid x_{1:t-1}, y) such that the approximate posterior q(x|y) is close to the true posterior p(x|y). This is achieved by minimizing the Kullback–Leibler divergence training objective \mathbb{E}_{p(y)}[D_{\mathrm{KL}}(p(x \mid y) \,\|\, q(x \mid y; \phi))], written as

\mathcal{L}(\phi) := \int_y p(y) \int_x p(x \mid y) \log \frac{p(x \mid y)}{q(x \mid y; \phi)} \, dx \, dy = \mathbb{E}_{p(x,y)}\left[-\log q(x \mid y; \phi)\right] + \mathrm{const.} \,, \qquad (4)

where \phi represents the NN weights.
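The weighting scheme of Eq. (3) can be sketched with a self-normalized importance sampler on a hypothetical one-latent model. All names and numbers below are illustrative, not from the paper: the prior is x ~ Normal(0, 5), the likelihood y ~ Normal(x, 1), and the proposal is a Gaussian chosen by hand in place of a learned IC proposal.

```python
import math
import random

# Self-normalized importance sampling in trace space (cf. Eq. 3): each trace
# draws its latent from a proposal q instead of the prior f, contributing the
# correction factor f(x)/q(x) alongside the likelihood g(y|x) to its weight.

def normal_logpdf(v, mean, std):
    return -0.5 * ((v - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def importance_posterior_mean(y, num_traces, proposal_mean, proposal_std):
    values, log_weights = [], []
    for _ in range(num_traces):
        x = random.gauss(proposal_mean, proposal_std)             # x ~ q
        log_w = (normal_logpdf(x, 0.0, 5.0)                       # prior f(x)
                 - normal_logpdf(x, proposal_mean, proposal_std)  # / proposal q(x)
                 + normal_logpdf(y, x, 1.0))                      # likelihood g(y|x)
        values.append(x)
        log_weights.append(log_w)
    m = max(log_weights)                                          # stabilize exp
    weights = [math.exp(lw - m) for lw in log_weights]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

random.seed(1)
# A proposal concentrated near the truth (what IC learns to output) needs far
# fewer traces than proposing from the broad prior for the same accuracy.
est = importance_posterior_mean(y=3.0, num_traces=5000, proposal_mean=3.0, proposal_std=1.5)
print(est)  # analytic posterior mean here is 3.0 * 25 / 26 ≈ 2.885
```

With the proposal equal to the prior this reduces to regular IS; the whole point of IC is to replace the hand-picked proposal parameters with NN outputs conditioned on y.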
The weights \phi are optimized to minimize this objective by continually drawing training pairs (x, y) \sim p(x, y) from the probabilistic program (the simulator). In IC training, we may designate a subset of all addresses (a_t, i_t) to be "controlled" (learned) by the NN, leaving all remaining addresses to use the prior f_{a_t} as proposal during inference. Expressed in simple terms, taking an observation y (an observed event that we would like to recreate or explain with the simulator) as input, the NN learns to control the random number draws of latents x during the simulator's execution in such a way that makes the observed outcome likely (Figure 1, right).

The NN architecture in IC is based on a stacked LSTM [51] recurrent core that gets executed for as many time steps as the probabilistic trace length. The input to this LSTM in each time step is a concatenation of embeddings of the observation f^{obs}(y), the current address f^{addr}(a_t, i_t), and the previously sampled value f^{smp}_{a_{t-1}, i_{t-1}}(x_{t-1}). f^{obs} is a NN specific to the domain (such as a 3D convolutional NN for volumetric inputs), f^{smp} are feed-forward modules, and f^{addr} are learned address embeddings optimized via backpropagation for each (a_t, i_t) pair encountered in the program execution. The addressing scheme a_t is the main link between semantic locations in the probabilistic program [89] and the inputs to the NN. The address of each sample or observe statement is supplied over the execution protocol (Section 4.2) at runtime by the process hosting and executing the model. The joint proposal distribution of the NN q(x|y) is factorized into per-time-step proposals q_{a_t, i_t}, whose type depends on the type of the prior f_{a_t}. In our experiments in this paper (Section 5) the simulator uses categorical and continuous uniform priors, for which IC uses, respectively, categorical distributions and mixtures of truncated Gaussian distributions as proposals parameterized by the NN.
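A truncated-Gaussian proposal for a latent with a continuous-uniform prior can be sketched as follows. This is a minimal single-component version (the paper uses mixtures), and the class name, the hand-set mean/std (standing in for NN outputs), and the bisection-based inverse CDF are our own simplifications.

```python
import math
import random

# A Gaussian truncated to the prior's support [low, high]: sampling by
# inverse-CDF and a log-density usable in the importance weight of Eq. 3.

def _phi(z):               # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def _inv_phi(p, lo=-10.0, hi=10.0):
    for _ in range(80):    # bisection is accurate enough for a sketch
        mid = 0.5 * (lo + hi)
        if _phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

class TruncatedNormal:
    def __init__(self, mean, std, low, high):
        self.mean, self.std, self.low, self.high = mean, std, low, high
        self.z_lo = _phi((low - mean) / std)
        self.z_hi = _phi((high - mean) / std)

    def sample(self):
        u = random.uniform(self.z_lo, self.z_hi)
        return self.mean + self.std * _inv_phi(u)

    def logpdf(self, x):
        if not (self.low <= x <= self.high):
            return float("-inf")
        z = (x - self.mean) / self.std
        log_norm = math.log(self.std * (self.z_hi - self.z_lo) * math.sqrt(2 * math.pi))
        return -0.5 * z * z - log_norm

random.seed(2)
q = TruncatedNormal(mean=0.8, std=0.2, low=0.0, high=1.0)  # mean/std as if NN-predicted
draws = [q.sample() for _ in range(1000)]
print(min(draws), max(draws))  # all proposals stay inside the prior's support
```

Because the proposal has the same support as the U(0, 1) prior, every proposed trace keeps a finite importance weight.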
The creation\nof IC NNs is automatic, i.e., an open-ended number of NN modules are generated by the PPL\non-the-\ufb02y when a simulator address at is encountered for the \ufb01rst time during training [63]. These\nmodules are reused (either for inference or undergoing further training) when the same address is\nencountered in the lifetime of the same trained NN.\nA common challenge for inference in real-world scienti\ufb01c models, such as those in particle physics,\nis the presence of large dynamic ranges of prior probabilities for various outcomes. For instance,\nsome particle decays are \u223c104 times more probable than others (Figure 8, appendix), and the prior\ndistribution for a particle momentum can be steeply falling. Therefore some cases may be much\n\n5\n\n\fmore likely to be seen by the NN during training relative to others. For this reason, the proposal\nparameters and the quality of the inference would vary signi\ufb01cantly according to the frequency of the\nobservations in the prior. To address this issue, we apply a \u201cprior in\ufb02ation\u201d scheme to automatically\nadjust the measure of the prior distribution during training to generate more instances of these unlikely\noutcomes. This applies only to the training data generation for the IC NN, and the unmodi\ufb01ed original\nmodel prior is used during inference, ensuring that the importance weights (Eq. 3) and therefore the\nempirical posterior are correct under the original model.\n\n4.2 A Cross-Platform Probabilistic Execution Protocol\n\nTo couple our PPL and inference engines with simulators in a language-agnostic way, we introduce\na probabilistic programming execution protocol (PPX)7 that de\ufb01nes a schema for the execution of\nprobabilistic programs. 
The protocol covers language-agnostic de\ufb01nitions of common probability\ndistributions and message pairs covering the call and return values of (1) program entry points (2)\nsample statements, and (3) observe statements (Figure 1, left). The implementation is based on\n\ufb02atbuffers,8 which is an ef\ufb01cient cross-platform serialization library through which we compile the\nprotocol into the of\ufb01cially supported languages C++, C#, Go, Java, JavaScript, PHP, Python, and\nTypeScript, enabling very lightweight PPL front ends in these languages\u2014in the sense of requiring\nonly an implementation to call sample and observe statements over the protocol. We exchange these\n\ufb02atbuffers-encoded messages over ZeroMQ9 [50] sockets, which allow seamless communication\nbetween separate processes in the same machine (using inter-process sockets) or across a network\n(using TCP).\nConnecting any stochastic simulator in a supported language involves only the redirection of calls to\nthe random number generator (RNG) to call the sample method of PPX using the corresponding\nprobability distribution as the argument, which is facilitated when a simulator-wide RNG interface is\nde\ufb01ned in a single code \ufb01le as is the case in SHERPA (Section 4.3). Conditioning is achieved by\neither providing an observed value for any sample at inference time (which means that the sample\nwill be \ufb01xed to the observed value) or adding manual observe statements, similar to Pyro [20].\nBesides its use with our Python PPL, the protocol de\ufb01nes a very \ufb02exible way of coupling any PPL\nsystem to any model so that these two sides can be (1) implemented in different programming\nlanguages and (2) executed in separate processes and on separate machines across networks. 
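The request-reply flow can be sketched schematically as follows. This is not the actual PPX schema: Python dicts stand in for flatbuffers messages, a direct function call stands in for the ZeroMQ socket, and all message and field names are illustrative.

```python
import random

# The simulator side only knows how to emit Sample and Observe requests; the
# inference engine decides what value each Sample statement receives.

class InferenceEngine:
    """Controller side: answers protocol messages from the simulator."""
    def __init__(self, forced_values=None):
        self.trace = []                       # recorded (address, value) pairs
        self.forced = forced_values or {}     # e.g. proposed or fixed values

    def handle(self, request):
        if request["type"] == "SampleRequest":
            addr = request["address"]
            if addr in self.forced:           # controlled address
                value = self.forced[addr]
            else:                             # uncontrolled: draw from the prior
                low, high = request["distribution"]["params"]
                value = random.uniform(low, high)
            self.trace.append((addr, value))
            return {"type": "SampleReply", "value": value}
        if request["type"] == "ObserveRequest":
            return {"type": "ObserveReply"}   # likelihood bookkeeping elided
        raise ValueError("unknown message " + request["type"])

def simulator(send):
    """Model side: an ordinary program whose randomness goes over the protocol."""
    u = send({"type": "SampleRequest", "address": "decay_u",
              "distribution": {"name": "Uniform", "params": (0.0, 1.0)}})["value"]
    channel = 0 if u < 0.65 else 1            # toy two-channel decay
    send({"type": "ObserveRequest", "address": "calorimeter", "value": channel})
    return channel

engine = InferenceEngine(forced_values={"decay_u": 0.9})
channel = simulator(engine.handle)
print(channel, engine.trace)  # the engine steered the draw, so channel == 1
```

Swapping the direct call for a ZeroMQ REQ-REP socket pair is what lets the two sides live in different languages and processes.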
Thus we\npresent this protocol as a probabilistic programming analogue to the Open Neural Network Exchange\n(ONNX)10 project for interoperability between deep learning frameworks, in the sense that PPX is\nan interoperability project between PPLs allowing language-agnostic exchange of existing models\n(simulators). Note that, more than a serialization format, the protocol enables runtime execution of\nprobabilistic models under the control of inference engines in different PPLs. We are releasing this\nprotocol as a separately maintained project, together with the rest of our work in Python and C++.\n\n4.3 Controlling SHERPA\u2019s Simulation of Fundamental Particle Physics\n\nWe demonstrate our framework with SHERPA [42], a Monte Carlo event generator of high-energy\nreactions of particles, which is a state-of-the-art simulator of the Standard Model developed by the\nparticle physics community. SHERPA, like many other large-scale scienti\ufb01c projects, is implemented\nin C++, and therefore we implement a C++ front end for our protocol.11 We couple SHERPA to the\nfront end by a system-wide rerouting of the calls to the RNG, which is made easy by the existence\nof a third-party RNG interface (External_RNG) already present in SHERPA. Through this setup,\nwe can repurpose, with little effort, any stochastic simulation written in SHERPA as a probabilistic\ngenerative model in which we can perform inference.\nRandom number draws in C++ simulators are commonly performed at a lower level than the actual\nprior distribution that is being simulated. This applies to SHERPA where the only samples are from\nthe standard uniform distribution U (0, 1), which subsequently get used for different purposes using\ntransformations or rejection sampling. In our experiments (Section 5) we work with all uniform\nsamples except for a problem-speci\ufb01c single address that we know to be responsible for sampling\nfrom a categorical distribution representing particle decay channels. 
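The gap between low-level uniform draws and the distribution actually being simulated can be illustrated with a toy. The branching ratios and function names below are made up for illustration, not SHERPA's.

```python
import random

# The simulator draws only U(0, 1) and turns one such draw into a decay-channel
# choice by partitioning the unit interval by branching ratios. Recasting that
# single address as an explicit categorical exposes the true prior to the
# inference engine and makes prior inflation trivial to apply.

BRANCHING_RATIOS = [0.65, 0.25, 0.10]  # hypothetical three-channel prior

def channel_from_uniform(u):
    """What the simulator does internally with one raw U(0,1) draw."""
    cumulative = 0.0
    for channel, ratio in enumerate(BRANCHING_RATIOS):
        cumulative += ratio
        if u < cumulative:
            return channel
    return len(BRANCHING_RATIOS) - 1

def inflated_channel_sample():
    """Prior inflation (training-data generation only): channels are drawn
    uniformly so rare decays are seen often by the NN; the unmodified prior
    above is used at inference time, keeping the importance weights correct."""
    return random.randrange(len(BRANCHING_RATIOS))

# The raw uniform address hides a categorical prior with the branching ratios:
random.seed(3)
counts = [0, 0, 0]
for _ in range(20000):
    counts[channel_from_uniform(random.random())] += 1
freqs = [c / 20000 for c in counts]
print(freqs)  # close to [0.65, 0.25, 0.10]
```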
7 https://github.com/pyprob/ppx
8 http://google.github.io/flatbuffers/
9 http://zeromq.org/
10 https://onnx.ai/
11 https://github.com/pyprob/pyprob_cpp

Figure 2: Top histograms: RMH and IC posterior results where a Channel 2 decay event (τ → ντ π−) is the mode of the posterior distribution. Note that the eight variables shown are just a subset of the full latent state of several thousand addresses (Figure 5, appendix). Vertical lines indicate the point sample of the single GT trace supplying the calorimeter observation in each row. Bottom plots: trace joint log-probability, Gelman–Rubin diagnostic, and autocorrelation results belonging to the posterior in the first row.

The modification of this address to use the proper categorical prior allows an effortless application of prior inflation (Section 4.1) to generate training data equally representing each channel.

Rejection sampling [41] sections in the simulator pose a challenge for our approach, as they define execution traces that are a priori unbounded; and since the IC NN has to backpropagate through every sampled value, this makes the training significantly slower. Rejection sampling is key to the application of Monte Carlo methods for evaluating matrix elements [60] and other stages of event generation in particle physics; thus an efficient treatment of this construction is essential. We address this problem by implementing a novel trace evaluation scheme which works by annotating the sample statements within long-running rejection sampling loops with a boolean flag called replace, which, when set true, enables a rejection-sampling-specific behavior for the given sample address. The simplest correct approach is to exclude these replace addresses from IC inference (i.e., proposing for these from the prior) and treat them as regular raw addresses in MCMC.
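The effect of the replace annotation on the recorded trace can be sketched as follows. The TraceRecorder class, the address string, and the toy rejection loop are our own illustrative constructions.

```python
import random

# Inside a rejection sampling loop, a sample marked replace=True overwrites its
# previous value at the same address, so only the accepted draw remains in the
# recorded trace, keeping the trace length bounded.

class TraceRecorder:
    def __init__(self):
        self.trace = {}          # address -> value (replace) or list of values

    def sample(self, address, dist, replace=False):
        value = dist()
        if replace:
            self.trace[address] = value            # keep only the latest draw
        else:
            self.trace.setdefault(address, []).append(value)
        return value

def sample_small_u(rec):
    """Toy rejection sampler: propose from U(0,1) until the draw falls in [0, 0.1)."""
    attempts = 0
    while True:
        attempts += 1
        u = rec.sample("rejection_loop/u", random.random, replace=True)
        if u < 0.1:
            return u, attempts

random.seed(4)
rec = TraceRecorder()
accepted, attempts = sample_small_u(rec)
print(attempts, rec.trace)
# however many proposals were rejected, the trace holds a single entry equal
# to the accepted value
```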
Other approaches include amortization schemes where during IC NN training we only consider the last (thus accepted) instance i_last of any address (a_t, i_t) that falls within a rejection sampling loop. The results presented in this paper use the former simple mode. Efficient handling of rejection sampling in universal PPLs [68], and nested inference in general [75, 76], constitute an active area of research with several alternative approaches currently being formulated with varying degrees of complexity and sample efficiency that are beyond the scope of this paper.

5 Experiments

An important decay of the Higgs boson is to τ leptons, whose subsequent decay products interact in the detector. This constitutes a rich and realistic case to simulate, and directly connects to an important line of current research in particle physics. During simulation, SHERPA stochastically generates a set of particles to which the initial τ lepton will decay (a "decay channel") and samples the momenta of these particles according to a joint density obtained from underlying physical theory.

These particles then interact in the detector leading to observations in the raw sensor
data. While\nGeant4 is typically used to model the interactions in a detector, for our initial studies we implement a\nfast, approximate, stochastic detector simulation for a calorimeter with longitudinal and transverse\nsegmentation (with 20\u00d735\u00d735 voxels). The detector deposits most of the energy for electrons and \u03c00\ninto the \ufb01rst layers and charged hadrons (e.g., \u03c0\u00b1) deeper into the calorimeter with larger \ufb02uctuations.\nFigure 2 presents posterior distributions of a selected subset of random variables in the simulator for\n\ufb01ve different test cases where the mode of the posterior is a channel-2 decay (\u03c4 \u2192 \u03bd\u03c4 \u03c0\u2212). Test cases\nare generated by sampling an execution trace from the simulator prior, giving us a \u201cground truth trace\u201d\n(GT trace), from which we extract the simulated raw 3D calorimeter as a test observation. We run\nour inference engines taking only these calorimeter data as input, giving us posteriors over the entire\nlatent state of the simulator, conditioned on the observed calorimeter using a physically-motivated\nPoisson likelihood. We show RMH (MCMC) and IC inference results, where RMH serves as a\nbaseline as it samples from the true posterior of the model, albeit at great computational cost. For\neach case, we establish the convergence of the RMH posterior to the true posterior by computing\nthe Gelman\u2013Rubin (GR) convergence diagnostic [26, 88] between two MCMC chains conditioned\non the same observation, one starting from the GT trace and one starting from a random trace\nsampled from the prior.12 As an example, in Figure 2 (bottom) we show the joint log-probability, GR\ndiagnostic, and autocorrelation plots of the RMH posterior (with 7.7M traces) belonging to the test\ncase in the \ufb01rst row. 
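The GR diagnostic compares between-chain and within-chain variances for each latent variable; a minimal sketch of the R̂ computation for a single scalar parameter follows (a schematic of the standard formula, not the diagnostic code used in the paper):

```python
import random

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for a scalar parameter.

    `chains` is a list of equal-length sample lists, one per MCMC chain.
    R-hat compares the between-chain variance of the chain means with the
    average within-chain variance; it approaches 1 as the chains converge
    on the same target distribution."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # between-chain variance (of chain means, scaled by chain length)
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    # average within-chain variance
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * w + b / n  # pooled variance estimate
    return (var_plus / w) ** 0.5

rng = random.Random(1)
# two chains sampling the same distribution: R-hat close to 1
same = [[rng.gauss(0.0, 1.0) for _ in range(5000)] for _ in range(2)]
# two chains stuck in different regions: R-hat far above 1
apart = [[rng.gauss(mu, 1.0) for _ in range(5000)] for mu in (0.0, 3.0)]
assert gelman_rubin(same) < 1.05
assert gelman_rubin(apart) > 1.5
```

In practice the diagnostic is computed per address across the two conditioned chains, as in the bottom panels of Figure 2.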
The GR result indicates that the chains converged around 10⁶ iterations, and the autocorrelation result indicates that we need approximately 10⁵ iterations to accumulate each new effectively independent sample from the true posterior. These RMH baseline results incur significant computational cost due to the sequential nature of the sampling and the large number of iterations needed to accumulate statistically independent samples. The example we presented took 115 compute hours on an Intel E5-2695 v2 @ 2.40GHz CPU node.

We present IC posteriors conditioned on the same observations in Figure 2 and plot these together with the corresponding RMH baselines, showing good agreement in all cases. These IC posteriors were obtained in less than 30 minutes in each case, representing a significant speedup compared with the RMH baseline. This is due to three main strengths of IC inference: (1) each trace executed by the IC engine gives us a statistically independent sample from the learned proposal approximating the true posterior (Equation 4) (cf. the autocorrelation time of 10⁵ in RMH); (2) following from this independence, IC inference does not necessitate a burn-in period (cf. 10⁶ iterations to convergence in GR for RMH); and (3) IC inference is embarrassingly parallelizable. These features are the main motivation for incorporating IC in our framework to make inference in large-scale simulators computationally efficient and practicable. The results presented were obtained by running IC inference in parallel on 20 compute nodes of the type used for RMH inference, using an NN with 143,485,048 parameters that was trained for 40 epochs on a training set of 3M traces sampled from the simulator prior; training lasted two days on 32 CPU nodes.
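The correction that turns samples from the learned proposal into posterior estimates can be sketched as self-normalized importance sampling: each proposed trace receives weight proportional to the joint density over its proposal density, and posterior expectations are weight-averaged. The following is a toy one-dimensional illustration of this weighting scheme, not the paper's implementation:

```python
import math
import random

def self_normalized_is(log_joint, log_proposal, propose, f, n, rng):
    """Estimate a posterior expectation E[f(x)] with proposal samples.

    Each draw x from the proposal gets log-weight
    log p(x, y) - log q(x | y); normalizing the weights lets us target
    the posterior while knowing only the unnormalized joint."""
    xs = [propose(rng) for _ in range(n)]
    logw = [log_joint(x) - log_proposal(x) for x in xs]
    mx = max(logw)
    w = [math.exp(lw - mx) for lw in logw]  # stabilized weights
    return sum(wi * f(xi) for wi, xi in zip(w, xs)) / sum(w)

# toy model: x ~ N(0,1), y | x ~ N(x,1), observed y = 2;
# the exact posterior is N(1, 1/2), so E[x | y] = 1
y = 2.0
log_joint = lambda x: -0.5 * x * x - 0.5 * (y - x) ** 2  # up to a constant
log_q = lambda x: -0.5 * (x - 0.5) ** 2   # deliberately offset proposal N(0.5, 1)
propose = lambda rng: rng.gauss(0.5, 1.0)

est = self_normalized_is(log_joint, log_q, propose, lambda x: x,
                         100_000, random.Random(2))
assert abs(est - 1.0) < 0.05  # recovers the posterior mean despite the offset
```

Because every weighted draw is independent, such estimates need no burn-in and parallelize trivially, which is the source of the IC speedups reported above.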
This time cost for NN training needs to be incurred only once for any given simulator setup, resulting in a trained inference NN that enables fast, repeated inference in the model specified by the simulator, a concept referred to as "amortized inference". Details of the 3DCNN–LSTM architecture used are given in Figure 9 (appendix).

In the last test case in Figure 2 we show posteriors corresponding to a calorimeter observation of a Channel 22 event (τ → ν_τ K⁻K⁻K⁺), a type of decay producing calorimeter depositions similar to those of Channel 2 decays and with extremely low probability in the prior (Figure 8, appendix), therefore representing a difficult case to infer. We see the posterior uncertainty in the true (RMH) posterior of this case, where Channel 2 is the mode of the posterior with a small probability mass on Channel 22 among other channels. The IC posterior successfully reproduces this small probability mass on Channel 22, thanks to the "prior inflation" scheme with which we train IC NNs. This leads to a proposal where Channel 22 is the mode, which is later adjusted by importance weighting (Equation 3) to match the true posterior result (Figure 7, appendix). Our results demonstrate the feasibility of Bayesian inference in the whole latent space of this existing simulator defining a potentially unbounded number of addresses, of which we encountered approximately 24k during our experiments (Table 1 and Figure 5, appendix).
To our knowledge, this is the first time a PPL system has been used with a model expressed by an existing state-of-the-art simulator at this scale.

¹² The GR diagnostic compares estimated between-chain and within-chain variances, summarized as the R̂ metric, which approaches unity as the chains converge on the target distribution.

6 Conclusions

We presented the first step in subsuming the vast existing body of scientific simulators, which are causal, generative models that often reflect the most accurate understanding in their respective fields, into a universal probabilistic programming framework. The ability to scale probabilistic inference to large-scale simulators is of fundamental importance to the field of probabilistic programming and the wider modeling community. It is a hard problem requiring innovations in many areas, such as the model–PPL interface, the handling of priors with long tails, amortization of rejection sampling routines [68], addressing schemes, IC network architectures, and distributed training and inference [19], which makes it difficult to cover in depth in a single paper.

Our work allows one to use existing simulator code bases to perform model-based machine learning with interpretability, where the simulator is no longer used as a black box to generate synthetic training data, but as the highly structured generative model that the simulator's code already specifies. Bayesian inference in this setting gives highly interpretable results, where we see the exact locations and processes in the model that are associated with each prediction, and the uncertainty in each prediction. With this novel framework providing a clearly defined interface between domain-specific simulators and probabilistic machine learning techniques, we expect to enable a wide range of applied work straddling machine learning and fields of science and engineering.
In the particle physics setting, our ultimate aim is to run the inference stage of this approach on collision data from real detectors by implementing a full LHC physics analysis together with the full posterior, so that it can be exploited for the discovery of new physics via simulations that contain processes beyond the current Standard Model.

Acknowledgments

We thank the anonymous reviewers for their constructive comments, which helped us improve this paper significantly. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231. This work was partially supported by the NERSC Big Data Center; we acknowledge Intel for their funding support. KC, LH, and GL were supported by the National Science Foundation under award ACI-1450310. Additionally, KC was supported by National Science Foundation award OAC-1836650. BGH is supported by the EPSRC Autonomous Intelligent Machines and Systems grant. AGB and PT are supported by EPSRC/MURI grant EP/N019474/1, and AGB is also supported by Lawrence Berkeley National Lab. FW is supported by DARPA D3M under Cooperative Agreement FA8750-17-2-0093, by Intel under its LBNL NERSC Big Data Center, and by an NSERC Discovery grant.

References

[1] G. Aad, T. Abajyan, B. Abbott, J. Abdallah, S. Abdel Khalek, A. A. Abdelalim, O. Abdinov, R. Aben, B. Abi, M. Abolins, et al. Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B, 716:1–29, Sept. 2012.

[2] G. Aad et al. Reconstruction of hadronic decay products of tau leptons with the ATLAS experiment. Eur. Phys. J., C76(5):295, 2016.

[3] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning.
In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

[4] V. M. Abazov et al. A precision measurement of the mass of the top quark. Nature, 429:638–642, 2004.

[5] J. Allison, K. Amako, J. Apostolakis, P. Arce, M. Asai, T. Aso, E. Bagli, A. Bagulya, S. Banerjee, G. Barrand, B. Beck, A. Bogdanov, D. Brandt, J. Brown, H. Burkhardt, P. Canal, D. Cano-Ott, S. Chauvie, K. Cho, G. Cirrone, G. Cooperman, M. Cortés-Giraldo, G. Cosmo, G. Cuttone, G. Depaola, L. Desorgher, X. Dong, A. Dotti, V. Elvira, G. Folger, Z. Francis, A. Galoyan, L. Garnier, M. Gayer, K. Genser, V. Grichine, S. Guatelli, P. Guèye, P. Gumplinger, A. Howard, I. Hřivnáčová, S. Hwang, S. Incerti, A. Ivanchenko, V. Ivanchenko, F. Jones, S. Jun, P. Kaitaniemi, N. Karakatsanis, M. Karamitros, M. Kelsey, A. Kimura, T. Koi, H. Kurashige, A. Lechner, S. Lee, F. Longo, M. Maire, D. Mancusi, A. Mantero, E. Mendoza, B. Morgan, K. Murakami, T. Nikitina, L. Pandola, P. Paprocki, J. Perl, I. Petrović, M. Pia, W. Pokorski, J. Quesada, M. Raine, M. Reis, A. Ribon, A. R. Fira, F. Romano, G. Russo, G. Santin, T. Sasaki, D. Sawkey, J. Shin, I. Strakovsky, A. Taborda, S. Tanaka, B. Tomé, T. Toshito, H. Tran, P. Truscott, L. Urban, V. Uzhinsky, J. Verbeke, M. Verderi, B. Wendt, H. Wenzel, D. Wright, D. Wright, T. Yamashita, J. Yarba, and H. Yoshida. Recent developments in GEANT4. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 835(Supplement C):186–225, 2016.

[6] J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H.-S. Shao, T. Stelzer, P. Torrielli, and M. Zaro. The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations. Journal of High Energy Physics, 2014(7):79, 2014.

[7] J. Alwall, A. Freitas, and O. Mattelaer. The Matrix Element Method and QCD Radiation. Phys. Rev., D83:074010, 2011.

[8] J. R. Andersen, C. Englert, and M. Spannowsky. Extracting precise Higgs couplings by using the matrix element method. Phys. Rev., D87(1):015019, 2013.

[9] P. Artoisenet, P. de Aquino, F. Maltoni, and O. Mattelaer. Unravelling tth via the Matrix Element Method. Phys. Rev. Lett., 111(9):091802, 2013.

[10] P. Artoisenet and O. Mattelaer. MadWeight: Automatic event reweighting with matrix elements. PoS, CHARGED2008:025, 2008.

[11] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.

[12] A. Askew, P. Jaiswal, T. Okui, H. B. Prosper, and N. Sato. Prospect for measuring the CP phase in the hττ coupling at the LHC. Phys. Rev., D91(7):075014, 2015.

[13] L. Asquith et al. Jet Substructure at the Large Hadron Collider: Experimental Review. 2018.

[14] A. Aurisano, A. Radovic, D. Rocco, A. Himmel, M. Messier, E. Niner, G. Pawloski, F. Psihas, A. Sousa, and P. Vahle. A convolutional neural network neutrino event classifier. Journal of Instrumentation, 11(09):P09001, 2016.

[15] P. Avery et al. Precision studies of the Higgs boson decay channel H → ZZ → 4l with MEKD. Phys. Rev., D87(5):055006, 2013.

[16] M. Bähr, S. Gieseke, M. A. Gigg, D. Grellscheid, K. Hamilton, O. Latunde-Dada, S. Plätzer, P. Richardson, M. H. Seymour, A. Sherstnev, et al. Herwig++ physics and manual. The European Physical Journal C, 58(4):639–707, 2008.

[17] P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.

[18] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research (JMLR), 18(153):1–43, 2018.

[19] A. G. Baydin, L. Shao, W. Bhimji, L. Heinrich, L. F. Meadows, J. Liu, A. Munk, S. Naderiparizi, B. Gram-Hansen, G. Louppe, M. Ma, X. Zhao, P. Torr, V. Lee, K. Cranmer, Prabhat, and F. Wood. Etalumis: Bringing probabilistic programming to scientific simulators at scale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC19), November 17–22, 2019, 2019.

[20] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman. Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research, 2018.

[21] C. M. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[22] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

[23] S. Bolognesi, Y. Gao, A. V. Gritsan, K. Melnikov, M. Schulze, N. V. Tran, and A. Whitbeck. On the spin and parity of a single-produced resonance at the LHC. Phys. Rev., D86:095031, 2012.

[24] J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez. A Guide to Constraining Effective Field Theories with Machine Learning. Phys. Rev., D98(5):052004, 2018.

[25] S. Brooks, A. Gelman, G. Jones, and X.-L. Meng. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.

[26] S. P. Brooks and A. Gelman. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434–455, 1998.

[27] J. M. Campbell, R. K. Ellis, W. T. Giele, and C. Williams. Finding the Higgs boson in decays to Zγ using the matrix element method at Next-to-Leading Order. Phys. Rev., D87(7):073005, 2013.

[28] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, A. Riddell, et al. Stan: A probabilistic programming language. Journal of Statistical Software, 76(i01), 2017.

[29] S. Chatrchyan, V. Khachatryan, A. M. Sirunyan, A. Tumasyan, W. Adam, E. Aguilo, T. Bergauer, M. Dragicevic, J. Erö, C. Fabjan, et al. Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Physics Letters B, 716:30–61, Sept. 2012.

[30] K. Cranmer, J. Pavez, and G. Louppe. Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169, 2015.

[31] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. Schwartzman. Jet-images – deep learning edition. Journal of High Energy Physics, 2016(7):69, 2016.

[32] J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous. TensorFlow Distributions. arXiv preprint arXiv:1711.10604, 2017.

[33] A. Djouadi. The Anatomy of electro-weak symmetry breaking. I: The Higgs boson in the standard model. Phys. Rept., 457:1–216, 2008.

[34] A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12(656-704):3, 2009.

[35] R. Dutta, J. Corander, S. Kaski, and M. U. Gutmann. Likelihood-free inference by penalised logistic regression. arXiv preprint arXiv:1611.10242, 2016.

[36] E. Endeve, C. Y. Cardall, R. D. Budiardja, S. W. Beck, A. Bejnood, R. J. Toedte, A. Mezzacappa, and J. M. Blondin. Turbulent magnetic field amplification from spiral SASI modes: implications for core-collapse supernovae and proto-neutron star magnetization. The Astrophysical Journal, 751(1):26, 2012.

[37] J. S. Gainer, J. Lykken, K. T. Matchev, S. Mrenna, and M. Park. The Matrix Element Method: Past, Present, and Future. In Proceedings, 2013 Community Summer Study on the Future of U.S. Particle Physics: Snowmass on the Mississippi (CSS2013): Minneapolis, MN, USA, July 29–August 6, 2013, 2013.

[38] Y. Gao, A. V. Gritsan, Z. Guo, K. Melnikov, M. Schulze, and N. V. Tran. Spin determination of single-produced resonances at hadron colliders. Phys. Rev., D81:075022, 2010.

[39] A. Gelman, D. Lee, and J. Guo. Stan: A Probabilistic Programming Language for Bayesian Inference and Optimization. Journal of Educational and Behavioral Statistics, 40(5):530–543, 2015.

[40] S. J. Gershman and N. D. Goodman. Amortized inference in probabilistic reasoning. In Proceedings of the 36th Annual Conference of the Cognitive Science Society, 2014.

[41] W. R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, pages 337–348, 1992.

[42] T. Gleisberg, S. Hoeche, F. Krauss, M. Schonherr, S. Schumann, F. Siegert, and J. Winter. Event generation with SHERPA 1.1. Journal of High Energy Physics, 02:007, 2009.

[43] N. Goodman, V. Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum. Church: a language for generative models. arXiv preprint arXiv:1206.3255, 2012.

[44] A. D. Gordon, T. A. Henzinger, A. V. Nori, and S. K. Rajamani. Probabilistic programming. In Proceedings of the Future of Software Engineering, pages 167–181. ACM, 2014.

[45] A. V. Gritsan, R. Röntsch, M. Schulze, and M. Xiao. Constraining anomalous Higgs boson couplings to the heavy flavor fermions using matrix element techniques. Phys. Rev., D94(5):055023, 2016.

[46] B. Grzadkowski and J. F. Gunion. Using decay angle correlations to detect CP violation in the neutral Higgs sector. Phys. Lett., B350:218–224, 1995.

[47] R. Harnik, A. Martin, T. Okui, R. Primulando, and F. Yu. Measuring CP violation in h → τ⁺τ⁻ at colliders. Phys. Rev., D88(7):076009, 2013.

[48] F. Hartig, J. M. Calabrese, B. Reineking, T. Wiegand, and A. Huth. Statistical inference for stochastic simulation models – theory and application. Ecology Letters, 14(8):816–827, 2011.

[49] A. Heinecke, A. Breuer, S. Rettenberger, M. Bader, A.-A. Gabriel, C. Pelties, A. Bode, W. Barth, X.-K. Liao, K. Vaidyanathan, et al. Petascale high order dynamic rupture earthquake simulations on heterogeneous supercomputers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 3–14. IEEE Press, 2014.

[50] P. Hintjens. ZeroMQ: Messaging for Many Applications. O'Reilly Media, Inc., 2013.

[51] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[52] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[53] M. D. Hoffman and A. Gelman. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

[54] B. Hooberman, A. Farbin, G. Khattak, V. Pacela, M. Pierini, J.-R. Vlimant, M. Spiropulu, W. Wei, M. Zhang, and S. Vallecorsa. Calorimetry with Deep Learning: Particle Classification, Energy Regression, and Simulation for High-Energy Physics, 2017. Deep Learning in Physical Sciences (NIPS workshop). https://dl4physicalsciences.github.io/files/nips_dlps_2017_15.pdf.

[55] G. Kasieczka. Boosted Top Tagging Method Overview. In 10th International Workshop on Top Quark Physics (TOP2017), Braga, Portugal, September 17–22, 2017, 2018.

[56] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4743–4751. Curran Associates, Inc., 2016.

[57] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[58] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[59] K. Kondo. Dynamical Likelihood Method for Reconstruction of Events With Missing Momentum. 1: Method and Toy Models. J. Phys. Soc. Jap., 57:4126–4140, 1988.

[60] F. Krauss. Matrix elements and parton showers in hadronic interactions. Journal of High Energy Physics, 2002(08):015, 2002.

[61] W. Lampl, S. Laplace, D. Lelas, P. Loch, H. Ma, S. Menke, S. Rajagopalan, D. Rousseau, S. Snyder, and G. Unal. Calorimeter Clustering Algorithms: Description and Performance. Technical Report ATL-LARG-PUB-2008-002. ATL-COM-LARG-2008-003, CERN, Geneva, Apr 2008.

[62] T. A. Le. Inference for higher order probabilistic programs. Master's thesis, University of Oxford, 2015.

[63] T. A. Le, A. G. Baydin, and F. Wood. Inference compilation and universal probabilistic programming. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 of Proceedings of Machine Learning Research, pages 1338–1348, Fort Lauderdale, FL, USA, 2017. PMLR.

[64] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[65] M. Lezcano Casado, A. G. Baydin, D. Martinez Rubio, T. A. Le, F. Wood, L. Heinrich, G. Louppe, K. Cranmer, W. Bhimji, K. Ng, and Prabhat. Improvements to inference compilation for probabilistic programming in large-scale scientific simulators. In Neural Information Processing Systems (NIPS) 2017 workshop on Deep Learning for Physical Sciences (DLPS), Long Beach, CA, US, December 8, 2017, 2017.

[66] T. Martini and P. Uwer. Extending the Matrix Element Method beyond the Born approximation: Calculating event weights at next-to-leading order accuracy. JHEP, 09:083, 2015.

[67] T. Martini and P. Uwer. The Matrix Element Method at next-to-leading order QCD for hadronic collisions: Single top-quark production at the LHC as an example application. 2017.

[68] S. Naderiparizi, A. Ścibior, A. Munk, M. Ghadiri, A. G. Baydin, B. Gram-Hansen, C. S. de Witt, R. Zinkov, P. H. Torr, T. Rainforth, Y. W. Teh, and F. Wood. Amortized rejection sampling in universal probabilistic programming. arXiv preprint arXiv:1910.09056, 2019.

[69] R. M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2011.

[70] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.

[71] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, Long Beach, CA, US, December 9, 2017, 2017.

[72] P. Perdikaris, L. Grinberg, and G. E. Karniadakis. Multiscale modeling and simulation of brain blood flow. Physics of Fluids, 28(2):021304, 2016.

[73] M. Raberto, S. Cincotti, S. M. Focardi, and M. Marchesi. Agent-based simulation of a financial market. Physica A: Statistical Mechanics and its Applications, 299(1):319–327, 2001. Application of Physics in Economic Modelling.

[74] E. Racah, S. Ko, P. Sadowski, W. Bhimji, C. Tull, S.-Y. Oh, P. Baldi, et al. Revealing fundamental physics from the Daya Bay neutrino experiment using deep neural networks. In Machine Learning and Applications (ICMLA), 2016 15th IEEE International Conference on, pages 892–897. IEEE, 2016.

[75] T. Rainforth. Nesting probabilistic programs. In Conference on Uncertainty in Artificial Intelligence (UAI), 2018.

[76] T. Rainforth, R. Cornish, H. Yang, A. Warrington, and F. Wood. On nesting Monte Carlo estimators. In International Conference on Machine Learning (ICML), 2018.

[77] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

[78] D. Schouten, A. DeAbreu, and B. Stelzer. Accelerated Matrix Element Method with Parallel Computing. Comput. Phys. Commun., 192:54–59, 2015.

[79] T. Sjöstrand, S. Mrenna, and P. Skands. PYTHIA 6.4 physics and manual. Journal of High Energy Physics, 2006(05):026, 2006.

[80] D. E. Soper and M. Spannowsky. Finding physics signals with shower deconstruction. Phys. Rev., D84:074002, 2011.

[81] M. Sunnåker, A. G. Busetto, E. Numminen, J. Corander, M. Foll, and C. Dessimoz. Approximate Bayesian computation. PLoS Computational Biology, 9(1):e1002803, 2013.

[82] D. Tran, M. W. Hoffman, D. Moore, C. Suter, S. Vasudevan, and A. Radul. Simple, distributed, and accelerated probabilistic programming. In Advances in Neural Information Processing Systems, pages 7598–7609, 2018.

[83] D. Tran, A. Kucukelbir, A. B. Dieng, M. Rudolph, D. Liang, and D. M. Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.

[84] D. Tran, R. Ranganath, and D. Blei. Hierarchical implicit models and likelihood-free variational inference. In Advances in Neural Information Processing Systems, pages 5523–5533, 2017.

[85] B. Uria, M.-A. Côté, K. Gregor, I. Murray, and H. Larochelle. Neural autoregressive distribution estimation. Journal of Machine Learning Research, 17(205):1–37, 2016.

[86] J.-W. van de Meent, B. Paige, H. Yang, and F. Wood. An Introduction to Probabilistic Programming. arXiv e-prints, Sep 2018.

[87] R. D. Wilkinson. Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Statistical Applications in Genetics and Molecular Biology, 12(2):129–141.

[88] D. Williams. Probability with Martingales. Cambridge University Press, 1991.

[89] D. Wingate, A. Stuhlmueller, and N. Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 770–778, 2011.

[90] F. Wood, J. W. Meent, and V. Mansinghka. A new approach to probabilistic programming inference. In Artificial Intelligence and Statistics, pages 1024–1032, 2014.