{"title": "Balancing Suspense and Surprise: Timely Decision Making with Endogenous Information Acquisition", "book": "Advances in Neural Information Processing Systems", "page_first": 2910, "page_last": 2918, "abstract": "We develop a Bayesian model for decision-making under time pressure with endogenous information acquisition. In our model, the decision-maker decides when to observe (costly) information by sampling an underlying continuous-time stochastic process (time series) that conveys information about the potential occurrence/non-occurrence of an adverse event which will terminate the decision-making process. In her attempt to predict the occurrence of the adverse event, the decision-maker follows a policy that determines when to acquire information from the time series (continuation), and when to stop acquiring information and make a final prediction (stopping). We show that the optimal policy has a \"rendezvous\" structure, i.e. a structure in which whenever a new information sample is gathered from the time series, the optimal \"date\" for acquiring the next sample becomes computable. The optimal interval between two information samples balances a trade-off between the decision maker\u2019s \"surprise\", i.e. the drift in her posterior belief after observing new information, and \"suspense\", i.e. the probability that the adverse event occurs in the time interval between two information samples. Moreover, we characterize the continuation and stopping regions in the decision-maker\u2019s state-space, and show that they depend not only on the decision-maker\u2019s beliefs, but also on the \"context\", i.e. the current realization of the time series.", "full_text": "Balancing Suspense and Surprise: Timely Decision\nMaking with Endogenous Information Acquisition\n\nAhmed M. 
Alaa\n\nElectrical Engineering Department\n\nUniversity of California, Los Angeles\n\nMihaela van der Schaar\n\nElectrical Engineering Department\n\nUniversity of California, Los Angeles\n\nAbstract\n\nWe develop a Bayesian model for decision-making under time pressure with en-\ndogenous information acquisition.\nIn our model, the decision-maker decides\nwhen to observe (costly) information by sampling an underlying continuous-\ntime stochastic process (time series) that conveys information about the potential\noccurrence/non-occurrence of an adverse event which will terminate the decision-\nmaking process. In her attempt to predict the occurrence of the adverse event, the\ndecision-maker follows a policy that determines when to acquire information from\nthe time series (continuation), and when to stop acquiring information and make\na \ufb01nal prediction (stopping). We show that the optimal policy has a \"rendezvous\"\nstructure, i.e. a structure in which whenever a new information sample is gathered\nfrom the time series, the optimal \"date\" for acquiring the next sample becomes\ncomputable. The optimal interval between two information samples balances a\ntrade-off between the decision maker\u2019s \"surprise\", i.e.\nthe drift in her posterior\nbelief after observing new information, and \"suspense\", i.e. the probability that\nthe adverse event occurs in the time interval between two information samples.\nMoreover, we characterize the continuation and stopping regions in the decision-\nmaker\u2019s state-space, and show that they depend not only on the decision-maker\u2019s\nbeliefs, but also on the \"context\", i.e. the current realization of the time series.\n\n1\n\nIntroduction\n\nThe problem of timely risk assessment and decision-making based on a sequentially observed time\nseries is ubiquitous, with applications in \ufb01nance, medicine, cognitive science and signal processing\n[1-7]. 
A common setting that arises in all these domains is that a decision-maker, provided with\nsequential observations of a time series, needs to decide whether or not an adverse event (e.g. \ufb01nan-\ncial crisis, clinical acuity for ward patients, etc) will take place in the future. The decision-maker\u2019s\nrecognition of a forthcoming adverse event needs to be timely, for that a delayed decision may hin-\nder effective intervention (e.g. delayed admission of clinically acute patients to intensive care units\ncan lead to mortality [5]). In the context of cognitive science, this decision-making task is known\nas the two-alternative forced choice (2AFC) task [15]. Insightful structural solutions for the optimal\nBayesian 2AFC decision-making policies have been derived in [9-16], most of which are inspired\nby the classical work of Wald on sequential probability ratio tests (SPRT) [8].\nIn this paper, we present a Bayesian decision-making model in which a decision-maker adaptively\ndecides when to gather (costly) information from an underlying time series in order to accumulate\nevidence on the occurrence/non-occurrence of an adverse event. The decision-maker operates under\ntime pressure: occurrence of the adverse event terminates the decision-making process. Our abstract\nmodel is motivated and inspired by many practical decision-making tasks such as: constructing tem-\nporal patterns for gathering sensory information in perceptual decision-making [1], scheduling lab\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\ftests for ward patients in order to predict clinical deterioration in a timely manner [3, 5], designing\nbreast cancer screening programs for early tumor detection [7], etc.\nWe characterize the structure of the optimal decision-making policy that prescribes when should\nthe decision-maker acquire new information, and when should she stop acquiring information and\nissue a \ufb01nal prediction. 
We show that the decision-maker's posterior belief process, based on which policies are prescribed, is a supermartingale that reflects the decision-maker's tendency to deny the occurrence of an adverse event in the future as she observes the survival of the time series for longer time periods. Moreover, the information acquisition policy has a "rendezvous" structure: the optimal "date" for acquiring the next information sample can be computed given the current sample. The optimal schedule for gathering information over time balances the information gain (surprise) obtained from acquiring new samples, and the probability of survival for the underlying stochastic process (suspense). Finally, we characterize the continuation and stopping regions in the decision-maker's state-space and show that, unlike previous models, they depend on the time series "context" and not just the decision-maker's beliefs.

Related Works Mathematical models and analyses for perceptual decision-making based on sequential hypothesis testing have been developed in [9-17]. Most of these models use tools from sequential analysis developed by Wald [8] and Shiryaev [21, 22]. In [9, 13, 14], optimal decision-making policies for the 2AFC task were computed by modelling the decision-maker's sensory evidence using diffusion processes [20]. These models assume an infinite time horizon for the decision-making policy, and an exogenous supply of sensory information.

The assumption of an infinite time horizon was relaxed in [10] and [15], where decision-making is assumed to be performed under the pressure of a stochastic deadline; however, these deadlines were considered to be drawn from known distributions that are independent of the hypothesis and the realized sensory evidence, and the assumption of an exogenous information supply was maintained. In practical settings, the deadlines would naturally depend on the realized sensory information (e.g. patients' acuity events are correlated with their physiological information [5]), which induces more complex dynamics in the decision-making process. Context-based decision-making models were introduced in [17], but assuming an exogenous information supply and an infinite time horizon.

The notions of "suspense" and "surprise" in Bayesian decision-making have also been recently introduced in the economics literature (see [18] and the references therein). These models use measures of Bayesian surprise, originally introduced in the context of sensory neuroscience [19], in order to model the explicit preference of a decision-maker for non-instrumental information. The goal there is to design information disclosure policies that are suspense-optimal or surprise-optimal. Unlike our model, such models impose suspense (and/or surprise) as a (behavioral) preference of the decision-maker, and hence these notions do not emerge endogenously by virtue of rational decision-making.

2 Timely Decision Making with Endogenous Information Gathering

Time Series Model The decision-maker has access to a time series X(t) modeled as a continuous-time stochastic process that takes values in R, and is defined over the time domain t ∈ R+, with an underlying filtered probability space (Ω, F, {Ft}t∈R+, P).
The process X(t) is naturally adapted to {Ft}t∈R+, and hence the filtration Ft abstracts the information conveyed in the time series realization up to time t. The decision-maker extracts information from X(t) to guide her actions over time.

We assume that X(t) is a stationary Markov process¹, with a stationary transition kernel Pθ(X(t) ∈ A | Fs) = Pθ(X(t) ∈ A | X(s)), ∀A ⊂ R, ∀s < t ∈ R+, where θ is a realization of a latent Bernoulli random variable Θ ∈ {0, 1} (unobservable by the decision-maker), with P(Θ = 1) = p. The distributional properties of the paths of X(t) are determined by θ, since the realization of θ decides which Markov kernel (Po or P1) generates X(t). If the realization θ is equal to 1, then an adverse event occurs almost surely at a (finite) random time τ, the distribution of which depends on the realization of the path (X(t))0≤t≤τ.

¹Most of the insights distilled from our results would hold for more general dependency structures. However, we keep this assumption to simplify the exposition and maintain the tractability and interpretability of the results.

Figure 1: An exemplary stopped sample path for X^τ(t)|Θ = 1, with an exemplary partition Pt = {0, 0.1, 0.15, 0.325, 0.4, 0.45, 0.475, 0.5, 0.65, 0.7}. The figure marks the continuous path X(t), the partitioned path X(Pt), the adverse event at the stopping time τ, and the information available at t = 0.2: 1) σ(X(0), X(0.1), X(0.15)); 2) S0.2, the survival up to t = 0.2.

The decision-maker's ultimate goal is to sequentially observe X(t), and infer θ before the adverse event happens; inference is obsolete if it is declared after τ.
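The generative model above can be sketched in a few lines of code. The following is a minimal simulation, with assumed ingredients that are not from the paper: Gaussian-increment kernels with drift ±0.05 standing in for P1 and Po, and an exponential-type hazard that increases with X(t) standing in for the context-dependent law of τ.

```python
import math
import random

def simulate_path(p=0.5, dt=0.01, horizon=1.0, seed=0):
    """Simulate one stopped path X^tau(t): a latent theta ~ Bernoulli(p) selects
    the Markov kernel generating X(t); under theta = 1 an adverse event stops
    the process at a random time tau whose hazard increases with X(t)."""
    rng = random.Random(seed)
    theta = 1 if rng.random() < p else 0
    drift = 0.05 if theta == 1 else -0.05      # illustrative kernels P1 and Po
    x, t, path, tau = 0.0, 0.0, [(0.0, 0.0)], float("inf")
    while t < horizon:
        t += dt
        x += drift * dt + 0.1 * rng.gauss(0.0, math.sqrt(dt))  # diffusion step
        path.append((t, x))
        if theta == 1:
            # context-dependent hazard: larger X(t) -> adverse event more likely
            hazard = 0.5 + 5.0 * max(x, 0.0)
            if rng.random() < 1.0 - math.exp(-hazard * dt):
                tau = t                        # adverse event: process stops
                break
    return theta, tau, path
```

Under θ = 0 the loop runs to the horizon and τ stays infinite, matching P(τ = ∞ | Θ = 0) = 1; under θ = 1 the event tends to arrive sooner on paths where X(t) runs high.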
Since Θ is latent, the decision-maker is unaware whether the adverse event will occur or not, i.e. whether her access to X(t) is temporary (τ < ∞ for θ = 1) or permanent (τ = ∞ for θ = 0). In order to model the occurrence of the adverse event, we define τ as an F-stopping time for the process X(t), for which we assume the following:

• The stopping time τ|Θ = 1 is finite almost surely, whereas τ|Θ = 0 is infinite almost surely, i.e. P(τ < ∞ | Θ = 1) = 1, and P(τ = ∞ | Θ = 0) = 1.

• The stopping time τ|Θ = 1 is accessible², with a Markovian dependency on history, i.e. P(τ < t | Fs) = P(τ < t | X(s)), ∀s < t, where P(τ < t | X(s)) is an injective map from R to [0, 1] and P(τ < t | X(s)) is non-decreasing in X(s).

Thus, unlike the stochastic deadline models in [10] and [15], the decision deadline in our model (i.e. occurrence of the adverse event) is context-dependent as it depends on the time series realization (i.e. P(τ < t | X(s)) is not independent of X(t) as in [15]). We use the notation X^τ(t) = X(t ∧ τ), where t ∧ τ = min{t, τ}, to denote the stopped process to which the decision-maker has access. Throughout the paper, the measures Po and P1 assign probability measures to the paths X^τ(t)|Θ = 0 and X^τ(t)|Θ = 1 respectively, and we assume that Po << P1³.

Information The decision-maker can only observe a set of (costly) samples of X^τ(t) rather than the full continuous path. The samples observed by the decision-maker are captured by partitioning X^τ(t) over specific time intervals: we define Pt = {to, t1, . . ., tN(Pt)−1}, with 0 ≤ to < t1 < . . . < tN(Pt)−1 ≤ t, as a size-N(Pt) partition of X^τ(t) over the interval [0, t], where N(Pt) is the total number of samples in the partition Pt. The decision-maker observes the values that X^τ(t) takes at the time instances in Pt; thus the sequence of observations is given by the process X(Pt) = Σ_{i=0}^{N(Pt)−1} X(ti) δti, where δti is the Dirac measure. The space of all partitions over the interval [0, t] is denoted by Pt = [0, t]^N. We denote the probability measures for partitioned paths generated under Θ = 0 and 1 with a partition Pt as P̃o(Pt) and P̃1(Pt) respectively.

Since the decision-maker observes X^τ(t) through the partition Pt, her information at time t is conveyed in the σ-algebra σ(X^τ(Pt)) ⊂ Ft. The stopping event is observable by the decision-maker even if τ ∉ Pτ. We denote the σ-algebra generated by the stopping event as St = σ(1{t≥τ}). Thus, the information that the decision-maker has at time t is expressed by the filtration F̃t = σ(X^τ(Pt)) ∨ St. Hence, any decision-making policy needs to be F̃t-measurable.

Figure 1 depicts a Brownian path (a sample path of a Wiener process, which satisfies all the assumptions of our model)⁴, with an exemplary partition Pt over the time interval [0, 1]. The decision-maker observes the samples in X(Pt) sequentially, and reasons about the realization of the latent variable Θ based on these samples and the process survival, i.e.
at t = 0.2, the decision-maker's information resides in the σ-algebra σ(X(0), X(0.1), X(0.15)) generated by the samples in P0.2 = {0, 0.1, 0.15}, and the σ-algebra generated by the process' survival, S0.2 = σ(1{τ>0.2}).

²Our analyses hold if the stopping time is totally inaccessible.
³The absolute continuity of Po with respect to P1 means that no sample path of X^τ(t)|Θ = 0 should be fully revealing of the realization of Θ.
⁴In Figure 1, the stopping event was simulated as a totally inaccessible first jump of a Poisson process.

Policies and Risks The decision-maker's goal is to come up with a (timely) decision θ̂ ∈ {0, 1} that reflects her prediction of whether the actual realization θ is 0 or 1, before the process X^τ(t) potentially stops at the unknown time τ. The decision-maker follows a policy: a (continuous-time) mapping from the observations gathered up to every time instance t to two types of actions:

• A sensing action δt ∈ {0, 1}: if δt = 1, then the decision-maker decides to observe a new sample from the running process X^τ(t) at time t.

• A continuation/stopping action θ̂t ∈ {∅, 0, 1}: if θ̂t ∈ {0, 1}, then the decision-maker decides to stop gathering samples from X^τ(t), and declares a final decision (estimate) for θ. Whenever θ̂t = ∅, the decision-maker continues observing X^τ(t) and postpones her declaration of the estimate of θ.

A policy π = (πt)t∈R+ is an (F̃t-measurable) mapping rule that maps the information in F̃t to an action tuple πt = (δt, θ̂t) at every time instance t.
We assume that every single observation that the decision-maker draws from X^τ(t) entails a fixed cost; hence the process (δt)t∈R+ has to be a point process under any optimal policy⁵. We denote the space of all such policies by Π.

⁵Note that the cost of observing any local continuous path is infinite, hence any optimal policy must have (δt)t∈R+ being a point process to keep the number of observed samples finite.

A policy π generates the following random quantities as a function of the paths X^τ(t) on the probability space (Ω, F, {Ft}t∈R+, P):

1- A stopping time Tπ: The first time at which the decision-maker declares her estimate for θ, i.e. Tπ = inf{t ∈ R+ : θ̂t ∈ {0, 1}}.

2- A decision (estimate of θ) θ̂π: Given by θ̂π = θ̂_{Tπ∧τ}.

3- A random partition P^π_{Tπ}: A realization of the point process (δt)t∈R+, comprising a finite set of strictly increasing F-stopping times at which the decision-maker decides to sample the path X^τ(t).

A loss function is associated with every realization of the policy π, representing the overall cost incurred when following that policy for a specific path X^τ(t). The loss function is given by

ℓ(π; Θ) ≜ (C1 1{θ̂π=0, θ=1} + Co 1{θ̂π=1, θ=0} + Cd Tπ) 1{Tπ≤τ} + Cr 1{Tπ>τ} + Cs N(P^π_{Tπ∧τ}),    (1)

whose terms are, in order: the type I error, the type II error, the delay, the missed deadline, and the cost of information. That is, C1 is the cost of type I error (failure to anticipate the adverse event), Co is the cost of type II error (falsely predicting that an adverse event will occur), Cd is the cost of the delay in declaring the estimate θ̂π, Cr is the cost incurred when the adverse event occurs before an estimate θ̂π is declared (cost of missing the deadline), and Cs is the cost of every observation sample (cost of information). The risk of each policy π is defined as its expected loss

R(π) ≜ E[ℓ(π; Θ)],    (2)

where the expectation is taken over the paths of X^τ(t). In the next section, we characterize the structure of the optimal policy π* = arg inf_{π∈Π} R(π).

3 Structure of the Optimal Policy

Since the decision-maker's posterior belief at time t, defined as µt = P(Θ = 1 | F̃t), is an important statistic for designing sequential policies [10, 21-22], we start our characterization of π* by investigating the belief process (µt)t∈R+.

3.1 The Posterior Belief Process

Recall that the decision-maker distills information from two types of observations: 1) the realization of the partitioned time series X^τ(Pt) (i.e.
the information in σ(X^τ(Pt))), and 2) the survival of the process up to time t (i.e. the information in St). In the following Theorem, we study the evolution of the decision-maker's beliefs as she integrates these pieces of information over time⁶.

Figure 2: Depiction of exemplary belief paths of different policies under Θ = 1: a policy π1 with partition P^π1, a policy π2 with P^π1 ⊂ P^π2, and a wait-and-watch policy, with the information gain It1(t2 − t1) = µt2 − µt1 marked, a suspense phase (risk bearing), a surprise phase (risk assessment), and the stopping time τ.

Theorem 1 (Information and beliefs). Every posterior belief trajectory (µt)t∈R+ associated with a policy π ∈ Π that creates a partition P^π_t ∈ Pt of X^τ(t) is a càdlàg path given by

µt = 1{t≥τ} + 1{0≤t<τ} · (1 + ((1 − p)/p) · dP̃o(P^π_t)/dP̃1(P^π_t))^(−1),

where dP̃o(P^π_t)/dP̃1(P^π_t) is the Radon–Nikodym derivative⁷ of the measure P̃o(P^π_t) with respect to P̃1(P^π_t), and is given by the following elementary predictable process

dP̃o(P^π_t)/dP̃1(P^π_t) = Σ_{k=1}^{N(P^π_t)−1} [P(X(P^π_t) | Θ = 0)/P(X(P^π_t) | Θ = 1)] · [1/P(τ > t | σ(X(P^π_t)), Θ = 1)] · 1{P^π_t(k) ≤ t ≤ P^π_t(k+1)}

(the first bracket is the likelihood ratio and the second involves the survival probability) for t ≥ P^π_t(1), and p P(τ > t | Θ = 1) for t < P^π_t(1). Moreover, the path (µt)t∈R+ has exactly N(P^π_{Tπ∧τ}) + 1{τ<∞} jumps, at the time indexes in P^π_{t∧τ} ∪ {τ}. □

Theorem 1 says that every belief path is right-continuous with left limits, and has jumps at the time indexes in the partition P^π_t, whereas between each two jumps, the paths (µt)t∈[t1,t2), t1, t2 ∈ P^π_t, are predictable (i.e. they are known ahead of time once we know the magnitudes of the jumps preceding them). This means that the decision-maker obtains "active" information by probing the time series to observe new samples (i.e. the information in σ(X^τ(Pt))), inducing jumps that revive her beliefs, whereas the progression of time without witnessing a stopping event offers the decision-maker "passive" information, distilled just from the costless observation of process survival. Both sources of information manifest themselves in terms of the likelihood ratio and the survival probability in the expression of dP̃o(P^π_t)/dP̃1(P^π_t) above.

In Figure 2, we plot the càdlàg belief paths for policies π1 and π2, where P^π1 ⊂ P^π2 (i.e. policy π1 observes a subset of the samples observed by π2). We also plot the (predictable) belief path of a wait-and-watch policy that observes no samples. We can see that π2, which has more jumps of "active" information, converges faster to the truthful belief over time. Between each two jumps, the belief process exhibits a non-increasing predictable path until fed with a new piece of information. The wait-and-watch policy has its belief drifting away from the prior p = 0.5 towards the wrong belief µt = 0, since it only distills information from the process survival, which favors the hypothesis Θ = 0.
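The two information channels in Theorem 1 — a likelihood ratio from the acquired samples and a survival term from the mere passage of time — can be made concrete in code. A minimal sketch, under assumed ingredients that are not from the paper: sample values distributed N(±0.5, 1) under Θ = 1/0, and an exponential survival model P(τ > t | Θ = 1) = e^(−λt).

```python
import math

def posterior_belief(samples, t, p=0.5, lam=1.0):
    """Posterior mu_t = P(Theta = 1 | samples, survival up to t), following the
    structure of Theorem 1: a likelihood-ratio term from the acquired samples
    ("active" information) and a survival term from the absence of a stopping
    event ("passive" information). Kernels and survival law are illustrative."""
    def norm_pdf(x, mu):
        return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)
    l1 = math.prod(norm_pdf(x, +0.5) for x in samples)  # P(X(P_t) | Theta = 1)
    l0 = math.prod(norm_pdf(x, -0.5) for x in samples)  # P(X(P_t) | Theta = 0)
    survival = math.exp(-lam * t)                       # P(tau > t | Theta = 1)
    ratio = l0 / (l1 * survival)                        # Radon-Nikodym derivative
    return 1.0 / (1.0 + (1.0 - p) / p * ratio)
```

With no samples and t = 0 the function returns the prior p; with no samples and growing t the belief decays toward 0, reproducing the wait-and-watch drift of Figure 2.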
This discussion motivates the introduction of the following key quantities.

Information gain (surprise) It(∆t): The amount of drift in the decision-maker's belief at time t + ∆t with respect to her belief at time t, given the information available up to time t, i.e. It(∆t) = (µt+∆t − µt) | F̃t.

Posterior survival function (suspense) St(∆t): The probability that a process generated with Θ = 1 survives up to time t + ∆t given the information observed up to time t, i.e. St(∆t) = P(τ > t + ∆t | F̃t, Θ = 1). The function St(∆t) is non-increasing in ∆t, i.e. ∂St(∆t)/∂∆t ≤ 0.

That is, the information gain is the amount of "surprise" that the decision-maker experiences in response to a new information sample, expressed in terms of the change in her belief, i.e. the jumps in µt, whereas the survival probability (suspense) is her assessment of the risk of the adverse event taking place in the next ∆t time interval. As we will see in the next subsection, the optimal policy balances the two quantities when scheduling the times to sense X^τ(t).

We conclude our analysis of the process µt by noting that lack of information samples creates bias towards the belief that Θ = 0 (e.g. see the belief path of the wait-and-watch policy in Figure 2). We formally express this behavior in the following Corollary.

Corollary 1 (Leaning towards denial). For every policy π ∈ Π, the posterior belief process µt is a supermartingale with respect to F̃t, where

E[µt+∆t | F̃t] = µt − µt² St(∆t)(1 − St(∆t)) ≤ µt, ∀∆t ∈ R+. □

⁶All proofs are provided in the supplementary material.
⁷Since we impose the condition Po << P1 and fix a partition Pt, the Radon–Nikodym derivative exists.

Thus, unlike classical Bayesian learning models with a belief martingale [18, 21-23], the belief process in our model is a supermartingale that leans toward decreasing over time. The reason for this is that in our model, time conveys information. That is, unlike [10] and [15], where the decision deadline is hypothesis-independent and almost surely occurs in finite time for any path, in our model the occurrence of the adverse event is itself a hypothesis; hence observing the survival of the process is informative and contributes to the evolution of the belief. The informativeness of both the acquired information samples and process survival can be disentangled using Doob decomposition, by writing µt as µt = µ̃t + A(µt, St(∆t)), where µ̃t is a martingale capturing the information gain from the acquired samples, and A(µt, St(∆t)) is a predictable compensator process [23] capturing the information extracted from the process survival.

3.2 The Optimal Policy

The optimal policy π* minimizes the expected risk as defined in (1) and (2) by generating the tuple of random processes (Tπ, θ̂π, P^π_t) in response to the paths of X^τ(t) on (Ω, F, {Ft}t∈R+, P) in a way that "shapes" a belief process µt that maximizes informativeness, maintains timeliness and controls cost.
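Before characterizing π*, the objective being minimized can be made concrete. A sketch of the loss in (1) for one realized trajectory, with illustrative cost values (the paper leaves C1, Co, Cd, Cr, Cs abstract):

```python
def policy_loss(theta_hat, T, tau, n_samples, theta,
                C1=10.0, Co=5.0, Cd=0.1, Cr=8.0, Cs=0.5):
    """Loss l(pi; Theta) of Eq. (1) for one realized trajectory: type I/II
    error and delay costs apply when the decision beats the deadline
    (T <= tau); otherwise the deadline-miss cost Cr applies; every acquired
    sample costs Cs. The numeric cost values are assumptions, not the paper's."""
    if T <= tau:
        loss = (C1 * (theta_hat == 0 and theta == 1)   # type I error
                + Co * (theta_hat == 1 and theta == 0) # type II error
                + Cd * T)                              # delay
    else:
        loss = Cr                                      # deadline missed
    return loss + Cs * n_samples                       # cost of information
```

The risk R(π) of (2) is then simply a Monte Carlo average of this loss over simulated paths of X^τ(t).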
In the following, we introduce the notion of a "rendezvous policy"; then, in Theorem 2, we show that the optimal policy π* complies with this definition.

Rendezvous policies We say that a policy π is a rendezvous policy if the random partition P^π_{Tπ} constructed by the sequence of sensing actions (δ^π_t)t∈[0,Tπ] is a point process with predictable jumps, where for every two consecutive jumps at times t and t′, with t′ > t and t, t′ ∈ P^π_{Tπ}, we have that t′ is F̃t-measurable.

That is, a rendezvous policy is a policy that constructs a sensing schedule (δ^π_t)t∈[0,Tπ] such that every time t′ at which the decision-maker acquires information is actually computable using the information available up to time t, the previous time instance at which information was gathered. Hence, the decision-maker can decide the next "date" at which she will gather information directly after she senses a new information sample. This structure is a natural consequence of the information structure in Theorem 1: since the belief paths between every two jumps are predictable, they convey no "actionable" information, i.e. if the decision-maker were to respond to a predictable belief path, say by sensing or making a stopping decision, then she should have taken that decision right before the predictable path starts, which leaves her better off by saving the delay cost Cd. We denote the space of all rendezvous policies by Πr. In the following Theorem, we establish that the rendezvous structure is optimal.

Theorem 2 (Rendezvous). The optimal policy π* is a rendezvous policy (π* ∈ Πr). □

A direct implication of Theorem 2 is that the time variable can now be viewed as a state variable, whereas the problem is virtually solved in "discrete time", since the decision-maker effectively jumps from one time instance to another in a discrete manner. Hence, we alter the definition of the action δt from an indicator variable that indicates sensing the time series at time t to a "rendezvous action" that takes real values and specifies the time after which the decision-maker will sense a new sample, i.e. if δt = ∆t, then the decision-maker gathers the new sample at t + ∆t. This transformation restricts our policy design problem to the space of rendezvous policies Πr, which we know from Theorem 2 contains the optimal policy (i.e. π* = arg inf_{π∈Πr} R(π)). Having established the result in Theorem 2, in the following Theorem we characterize the optimal policy π* in terms of the random process (Tπ*, θ̂π*, P^π*_t) using discrete-time Bellman optimality conditions [24].

Theorem 3 (The optimal policy). The optimal policy π* is a sequence of actions (θ̂^π*_t, δ^π*_t)t∈R+, resulting in a random process (θ̂π*, Tπ*, P^π*_{Tπ*}) with the following properties:

(Continuation and stopping)

1. The process (t, µt, X̄(P^π*_t))t∈R+ is a Markov sufficient statistic for the distribution of (θ̂π*, Tπ*, P^π*_{Tπ*}), where X̄(P^π*_t) is the most recent sample in the partition P^π*_t, i.e. X̄(P^π*_t) = X(t*), t* = max P^π*_t.

2. The policy π* recommends continuation, i.e. θ̂^π*_t = ∅, as long as the belief µt ∈ C(t, X̄(P^π*_t)), where C(t, X̄(P^π*_t)) is a time- and context-dependent continuation set with the following properties: C(t′, X) ⊂ C(t, X), ∀t′ > t, and C(t, X′) ⊂ C(t, X), ∀X′ > X.

(Rendezvous and decisions)

1. Whenever µt ∈ C(t, X̄(P^π*_t)) and t ∈ P^π*_{Tπ*}, the rendezvous δ^π*_t is set as follows:

δ^π*_t = arg inf_{δ∈R+} f(E[It(δ)], St(δ)),

where f(E[It(δ)], St(δ)) is decreasing in E[It(δ)] and St(δ).

2. Whenever µt ∉ C(t, X̄(P^π*_t)), a decision θ̂^π*_t = θ̂π* ∈ {0, 1} is issued, and is based on a belief threshold as follows: θ̂π* = 1{µt ≥ C1/(Co+C1)}. The stopping time is given by Tπ* = inf{t ∈ R+ : µt ∉ C(t, X̄(P^π*_t))}. □

Theorem 3 establishes the structure of the optimal policy and its prescribed actions in the decision-maker's state-space. The first part of the Theorem says that in order to generate the random tuple (Tπ*, θ̂π*, P^π*_t) optimally, we only need to keep track of the realization of the process (t, µt, X̄(Pt))t∈R+ at every time instance. That is, an optimal policy maps the current belief, the current time, and the most recently observed realization of the time series to an action tuple (θ̂^π_t, δ^π_t), i.e. a decision on whether to stop and declare an estimate for θ or sense a new sample.
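The rendezvous step in the first part of Theorem 3 can be sketched numerically. The paper only states that f is decreasing in both arguments, so the concrete choice f(I, S) = −I·S below, like the Gaussian sample model and the exponential suspense term, is an assumption for illustration: the expected surprise grows with the waiting time δ while the suspense shrinks, and the minimizer of f picks a trade-off between them.

```python
import math
import random

def next_rendezvous(mu, lam=1.0, sigma=1.0, n_mc=2000, seed=0):
    """Choose the waiting time delta minimizing f(E[I_t(delta)], S_t(delta)),
    with the assumed form f(I, S) = -I * S (decreasing in both, as Theorem 3
    requires). A sample after a wait of delta is modelled as N(+-delta/2,
    sigma^2 * delta) under Theta = 1/0; suspense is S_t(delta) = exp(-lam *
    delta). All of these ingredients are illustrative choices."""
    rng = random.Random(seed)

    def norm_pdf(x, m, s):
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

    def expected_surprise(delta):
        # Monte Carlo estimate of E[|mu_{t+delta} - mu_t|] given the belief mu
        s = sigma * math.sqrt(delta)
        total = 0.0
        for _ in range(n_mc):
            theta = 1 if rng.random() < mu else 0
            x = rng.gauss(delta / 2.0 if theta else -delta / 2.0, s)
            l1 = norm_pdf(x, +delta / 2.0, s)
            l0 = norm_pdf(x, -delta / 2.0, s)
            mu_new = mu * l1 / (mu * l1 + (1.0 - mu) * l0)
            total += abs(mu_new - mu)
        return total / n_mc

    grid = [0.1 * k for k in range(1, 21)]   # candidate waiting times
    return min(grid, key=lambda d: -expected_surprise(d) * math.exp(-lam * d))
```

Waiting very briefly yields almost no surprise; waiting very long forfeits suspense (survival becomes unlikely), so the objective is minimized at an intermediate rendezvous date.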
Hence, the process (t, µt, X̄(Pt))_{t∈R+} represents the "state" of the decision-maker, and the decision-maker's actions can partially influence the state through the belief process, i.e. a decision on when to acquire the next sample affects the distributional properties of the posterior belief. The remaining state variables t and X(t) are beyond the decision-maker's control.
We note that, unlike the previous models in [9-16], with the exception of [17], a policy in our model is context-dependent. That is, since the state is (t, µt, X̄(P^π_t)) and not just the time-belief tuple (t, µt), a policy π can recommend different actions for the same belief and at the same time, but for a different context. This is because, while µt captures what the decision-maker has learned from the history, X̄(P^π_t) captures her foresight into the future: it can be that the belief µt is not decisive (e.g. µt ≈ p) while the context is "risky" (i.e. X̄(P^π_t) is large), which means that a potential forthcoming adverse event is likely to happen in the near future; hence the decision-maker would be more eager to make a stopping decision and declare an estimate θ̂^π.
This is manifested through the dependence of the continuation set C(t, X̄(P^π_t)) on both time and context; the continuation set is monotonically decreasing in time due to the deadline pressure, and is also monotonically decreasing in X̄(P^π_t) due to the dependence of the deadline on the time series realization.

[Figure 3 plots the belief µt and the time series X(t) along two sample paths; at the same time t̄ and belief µ̄, policy π continues sampling on one path and stops to declare θ̂^π = 1 on the other.]

Figure 3: Context-dependence of the policy π.

The context dependence of the optimal policy is depicted in Figure 3, where we show two exemplary trajectories for the decision-maker's state together with the actions recommended by a policy π for the same time and belief but a different context: a stopping action is recommended when X(t) is large, since this corresponds to a low survival probability, whereas for the same belief and time a continuation action can be recommended when X(t) is low, since it is safer to keep observing the process while the survival probability is high. Such a prescription specifies optimal decision-making in context-driven settings such as clinical decision-making in critical care environments [3-5], where a combination of a patient's length of hospital stay (i.e. t), clinical risk score (i.e. µt) and current physiological test measurements (i.e. X̄(P^π_t)) determines the decision on whether or not a patient should be admitted to an intensive care unit.
The second part of Theorem 3 says that whenever the optimal policy decides to stop gathering information and issue a conclusive decision, it imposes a threshold on the posterior belief, based on which it issues the estimate θ̂^π∗; the threshold is C1/(Co + C1), and hence weights the estimates by their respective risks.
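The cost-weighting behind this threshold can be made concrete with a small sketch. The assignment of Co and C1 to the two error types below is an illustrative assumption (the paper fixes their meaning in its problem setup); the point is that the threshold rule of Theorem 3 coincides with picking the estimate of lower expected cost.

```python
# Sketch of the stopping decision in the second part of Theorem 3: a belief
# threshold weighted by the misclassification costs. Assumed semantics: C1 is
# charged when hat_theta = 1 is declared wrongly, Co when hat_theta = 0 is
# declared wrongly; this reading is an illustrative assumption.

def declare(mu: float, c_o: float, c_1: float) -> int:
    """Threshold rule of Theorem 3: hat_theta = 1{mu >= C1 / (Co + C1)}."""
    return int(mu >= c_1 / (c_o + c_1))

def declare_by_expected_cost(mu: float, c_o: float, c_1: float) -> int:
    """Equivalent expected-cost comparison: declaring 1 risks C1 * (1 - mu),
    declaring 0 risks Co * mu; pick the cheaper estimate."""
    return int(c_o * mu >= c_1 * (1.0 - mu))

# The two rules agree on a belief grid (both reduce to mu >= C1/(Co + C1)).
assert all(declare(m / 100, 2.0, 1.0) == declare_by_expected_cost(m / 100, 2.0, 1.0)
           for m in range(101))

# With a costly miss (Co = 2) and a cheaper false alarm (C1 = 1), the
# threshold drops to 1/3, so the policy declares hat_theta = 1 on weaker beliefs.
print(declare(0.40, 2.0, 1.0), declare(0.30, 2.0, 1.0))  # -> 1 0
```

Raising C1 relative to Co raises the threshold, so the policy demands stronger evidence before declaring θ̂ = 1, which is the risk-weighting described above.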
When the policy favors continuation, it issues a rendezvous action, i.e. the next time instance at which information will be gathered. This rendezvous balances surprise and suspense: the decision-maker prefers maximizing surprise in order to draw the maximum informativeness from the costly sample it will acquire; this is captured in terms of the expected information gain E[It(δ)]. Maximizing surprise, however, may increase suspense, i.e. the probability of process termination, which is controlled by the survival function St(δ); hence harvesting the maximum informativeness can entail a survival risk when Cr is high. Therefore, the optimal policy selects a rendezvous δ^π∗_t that optimizes a combination of the survival risk, captured by the cost Cr and the survival function St(δ), and the value of information, captured by the costs Co, C1 and the expected information gain E[It(δ)].

4 Conclusions

We developed a model for decision-making with endogenous information acquisition under time pressure, where a decision-maker needs to issue a conclusive decision before an adverse event (potentially) takes place. We have shown that the optimal policy has a "rendezvous" structure, i.e. the optimal policy sets a "date" for gathering a new sample whenever the current information sample is observed. The optimal policy selects the time between two information samples such that it balances the information gain (surprise) with the survival probability (suspense). Moreover, we characterized the optimal policy's continuation and stopping conditions, and showed that they depend on the context and not just on beliefs.
Our model can help in understanding the nature of optimal decision-making in settings where timely risk assessment and information gathering are essential.

5 Acknowledgments

This work was supported by the ONR and the NSF (Grant number: ECCS 1462245).

References

[1] Balci, F., Freestone, D., Simen, P., de Souza, L., Cohen, J. D., & Holmes, P. (2011) Optimal temporal risk assessment, Frontiers in Integrative Neuroscience, 5(56), 1-15.

[2] Banerjee, T. & Veeravalli, V. V. (2012) Data-efficient quickest change detection with on-off observation control, Sequential Analysis, 31(1), 40-77.

[3] Wiens, J., Horvitz, E., & Guttag, J. V. (2012) Patient risk stratification for hospital-associated c. diff as a time-series classification task, In Advances in Neural Information Processing Systems, pp. 467-475.

[4] Schulam, P., & Saria, S. (2015) A Framework for Individualizing Predictions of Disease Trajectories by Exploiting Multi-resolution Structure, In Advances in Neural Information Processing Systems, pp. 748-756.

[5] Chalfin, D. B., Trzeciak, S., Likourezos, A., Baumann, B. M., Dellinger, R. P., & DELAY-ED study group. (2007) Impact of delayed transfer of critically ill patients from the emergency department to the intensive care unit, Critical Care Medicine, 35(6), pp. 1477-1483.

[6] Bortfeld, T., Ramakrishnan, J., Tsitsiklis, J. N., & Unkelbach, J. (2015) Optimization of radiation therapy fractionation schedules in the presence of tumor repopulation, INFORMS Journal on Computing, 27(4), pp. 788-803.

[7] Shapiro, S., et al. (1998) Breast cancer screening programmes in 22 countries: current policies, administration and guidelines, International Journal of Epidemiology, 27(5), pp. 735-742.

[8] Wald, A. (1973) Sequential Analysis, Courier Corporation.

[9] Khalvati, K., & Rao, R. P.
(2015) A Bayesian Framework for Modeling Confidence in Perceptual Decision Making, In Advances in Neural Information Processing Systems, pp. 2404-2412.

[10] Dayanik, S., & Yu, A. J. (2013) Reward-Rate Maximization in Sequential Identification under a Stochastic Deadline, SIAM Journal on Control and Optimization, 51(4), pp. 2922-2948.

[11] Zhang, S., & Yu, A. J. (2013) Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting, In Advances in Neural Information Processing Systems, pp. 2607-2615.

[12] Shenoy, P., & Yu, A. J. (2012) Strategic impatience in Go/NoGo versus forced-choice decision-making, In Advances in Neural Information Processing Systems, pp. 2123-2131.

[13] Drugowitsch, J., Moreno-Bote, R., & Pouget, A. (2014) Optimal decision-making with time-varying evidence reliability, In Advances in Neural Information Processing Systems, pp. 748-756.

[14] Yu, A. J., Dayan, P., & Cohen, J. D. (2009) Dynamics of attentional selection under conflict: toward a rational Bayesian account, Journal of Experimental Psychology: Human Perception and Performance, 35(3), 700.

[15] Frazier, P. & Yu, A. J. (2007) Sequential hypothesis testing under stochastic deadlines, In Advances in Neural Information Processing Systems, pp. 465-472.

[16] Drugowitsch, J., Moreno-Bote, R., Churchland, A. K., Shadlen, M. N., & Pouget, A. (2012) The cost of accumulating evidence in perceptual decision making, The Journal of Neuroscience, 32(11), 3612-3628.

[17] Shvartsman, M., Srivastava, V., & Cohen, J. D. (2015) A Theory of Decision Making Under Dynamic Context, In Advances in Neural Information Processing Systems, pp. 2476-2484.

[18] Ely, J., Frankel, A., & Kamenica, E. (2015) Suspense and surprise, Journal of Political Economy, 123(1), pp. 215-260.

[19] Itti, L., & Baldi, P. (2005) Bayesian Surprise Attracts Human Attention, In Advances in Neural Information Processing Systems, pp.
547-554.

[20] Bogacz, R., Brown, E., Moehlis, J., Holmes, P. J., & Cohen, J. D. (2006) The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced-choice tasks, Psychological Review, 113(4), pp. 700-765.

[21] Peskir, G., & Shiryaev, A. (2006) Optimal Stopping and Free-Boundary Problems, Birkhäuser Basel.

[22] Shiryaev, A. N. (2007) Optimal Stopping Rules (Vol. 8), Springer Science & Business Media.

[23] Shreve, S. E. (2004) Stochastic Calculus for Finance II: Continuous-Time Models (Vol. 11), Springer Science & Business Media.

[24] Bertsekas, D. P., & Shreve, S. E. (1978) Stochastic Optimal Control: The Discrete Time Case (Vol. 23), New York: Academic Press.