
Hierarchical models are seldom discussed in econometrics textbooks. They are, however, essential tools in epidemiology, psychology, and education research. A large part of the reason for that is historical and down to path dependence. In panel data analysis, dummies at the lowest level of analysis have long been used to implement “fixed effects”, thus allowing for clean identification and preventing estimated effects from being contaminated by selection effects. This is an important goal, albeit one that can easily be achieved using alternative implementations that provide more flexibility, as we will see before too long.

Hierarchies are extremely common. And while ignoring such hierarchies where they are relevant may result in biased estimators, imposing hierarchical structure where it is not relevant is (almost) costless: the estimator will simply collapse to its OLS equivalent. Economists thus ignore hierarchies at their own risk. Below, I will give a number of examples of cases that are best thought of as hierarchies. Classical cases are, for instance, measurement error models or meta-analysis. Far from being marginal or specialist topics, this general structure can be used to tackle a large variety of practical problems, such as the concerns about multiple testing that are often voiced in classical statistics.

Interestingly, the opposite concern of insufficient power in single studies can be addressed by exactly the same means, showing how both share a common origin (see discussion of the canonical 8 schools example). We will also see how hierarchical analysis is essential when trying to capture a complex sampling setup. I will provide an extended example based on panel data analysis. I will also show how stratified sampling setups ought to be reflected in the data analysis, and how explicitly modelling the data collection procedures can actually affect our conclusions (or rather, how ignoring such structure may lead to loss of information and biased inferences).

To see the value of hierarchical modelling in nonlinear estimation, imagine the following situation. You have collected choice data from a large set of individuals, each of whom has made a number of choices, e.g. between lotteries. How should you analyze these data? Obvious responses that have been given to this question in the literature are: 1) we can assume that all choices come from a representative agent, and run an aggregate estimation; or 2) we treat each subject as a separate unit, and estimate the model for each of these units. We have seen both approaches in the last chapter. The problem is that aggregate estimation may ignore heterogeneity between agents, while individual-level estimation runs the risk of overfitting sparse, noisy data.

A more principled approach may thus be to first assess the variance in responses for each agent, possibly after conditioning on a model (i.e., obtain the residual variance after fitting a given model). This corresponds to an individual-level estimate. Then compare the within-subject variation to the variation in parameters across subjects. Little variation across subjects may then indicate the legitimacy of aggregate estimation, or vice versa. However, why not let the relative variances determine the weight each observation gets? That is exactly what hierarchical analysis does. It thus constitutes a principled way of compromising between aggregate and individual data patterns, which are taken into account in proportion to their relative precision (the inverse of the variance).
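This precision-weighting logic can be sketched numerically. The following snippet is a minimal illustration (the numbers are hypothetical, and the function is not part of any package): a noisy individual-level estimate is pulled strongly towards the aggregate, while a precisely estimated one barely moves.

```python
# A minimal sketch (hypothetical numbers, not from the dataset): the
# hierarchical compromise weighs each estimate by its precision, i.e.
# the inverse of its variance.
def precision_weighted(individual, var_ind, aggregate, var_agg):
    w_ind = 1 / var_ind    # precision of the individual-level estimate
    w_agg = 1 / var_agg    # precision of the aggregate estimate
    return (w_ind * individual + w_agg * aggregate) / (w_ind + w_agg)

# a noisy individual estimate is pulled strongly towards the aggregate:
noisy = precision_weighted(0.9, 0.40, 0.5, 0.10)      # -> 0.58
# a precisely estimated individual barely moves:
precise = precision_weighted(0.9, 0.01, 0.5, 0.10)    # -> about 0.86
```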

Yet a different intuition can be had if we think of the aggregate estimation as the prior. We discussed in the previous chapter that defining a prior can be beneficial for individual-level estimation. One issue, however, concerned what prior we could reasonably adopt without “unduly” influencing the results. One answer may be to obtain an aggregate estimate across all individuals, and to then use this as a prior for the individual-level estimates. As we will see shortly, the hierarchical model does something very close to that, albeit taking into account the relative noisiness of different observations.

I will start by illustrating the workings of the random effects models using a simple linear regression. Here is the code for an aggregate regression model as a reminder:

```
data{
int<lower=0> N;
vector[N] ce;
vector[N] p;
}parameters{
real alpha;
real beta;
real<lower=0> sigma;
}model{
ce ~ normal(alpha + beta * p , sigma) ; }
```

The code below implements a linear regression identical to the one seen in chapter 2 on my 30-country risk preference dataset. For simplicity, we will use only gains, and only lotteries with a lower outcome of 0. Let us regress the normalized certainty equivalent on the probability of winning the prize.

```
d30 <- read.csv(file="data/data_30countries.csv")
d30 <- d30 %>% filter(risk==1 & gain==1 & low==0) %>%
       mutate(ce_norm = equivalent/high)

# creates data to send to Stan
stanD <- list(N = nrow(d30),
              ce = d30$ce_norm,
              p = d30$probability )

# compile the Stan programme:
sr <- cmdstan_model("stancode/agg_normal.stan")

# run the programme:
nm <- sr$sample(
  data = stanD,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 0,
  init = 0,
  show_messages = TRUE,
  diagnostics = c("divergences", "treedepth", "ebfmi")
)

print(nm, c("alpha","beta","sigma"), digits = 3)
nm$save_object(file="stanoutput/agg_30c.Rds")
```

We have now obtained the intercept as well as the slope (the coefficient of probability) and the noise term. The slope has an immediate interpretation as probabilistic sensitivity (or we could use \(1-\beta\) as a measure of *insensitivity*). The intercept, on the other hand, has no natural interpretation. The good news is that we can easily use the two parameters to obtain other indices. In fact, in the Bayesian context we can work directly with the posterior to carry along all the parameter uncertainty, so that we do not lose any information, as we would in traditional analysis. Below, I show this for the insensitivity parameter, and for a parameter capturing pessimism, which is defined as \(m = 1 - \beta - 2\alpha\).
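To illustrate what “working directly with the posterior” means, here is a minimal sketch. The draws below are simulated stand-ins (real draws would come from the fitted Stan model and would also preserve the correlation between \(\alpha\) and \(\beta\), which matters for the uncertainty of derived indices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for posterior draws (in the real analysis these come from the
# fitted Stan model; for illustration I draw them independently here):
alpha = rng.normal(0.142, 0.003, size=4000)
beta  = rng.normal(0.704, 0.006, size=4000)

# Derived indices computed draw by draw, so that parameter uncertainty
# propagates into the derived quantities:
insensitivity = 1 - beta
pessimism     = 1 - beta - 2 * alpha   # m = 1 - beta - 2*alpha

# posterior mean and sd of the derived pessimism index:
print(pessimism.mean(), pessimism.std())
```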

```
nm <- readRDS(file="stanoutput/agg_30c.Rds")
print(nm, c("alpha","beta","sigma"), digits = 3)
```

```
Inference for Stan model: agg_normal.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
alpha 0.142 0 0.003 0.135 0.14 0.142 0.144 0.148 1673 1.000
beta 0.704 0 0.006 0.692 0.70 0.704 0.708 0.716 1658 1.000
sigma 0.210 0 0.001 0.208 0.21 0.210 0.211 0.212 1948 1.002
Samples were drawn using NUTS(diag_e) at Mon Jun 13 20:59:31 2022.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
```

Treating all 3000 subjects across 30 countries as identical (or at least: as drawn from one common population) may appear a stretch, yet this is the standard assumption made in aggregate models. Of course, one could go to the opposite extreme, and estimate the same regression subject by subject. The risk, in that case, is that we may over-fit noisy data, given the relatively few observation points at the individual level and the typically high levels of noise inherent in data on tradeoffs under risk.

To start, however, let us consider a rather obvious point given the context of the data—the subjects in the 30 countries are clearly drawn from different subject pools. Once we accept that logic, we should aim to capture it in our econometric model. One way of doing that is to allow the data to follow a hierarchical structure. That is, we can model the parameters as country-specific, while also being drawn from one overarching population.

To do that, let us start from a model in which only the intercept can vary country by country, so that we will write \(\alpha_c\), where the subscript reminds us that the intercept is now country-specific. At the same time, we will model the intercept as being randomly drawn from a larger sample, i.e. the sample of all countries, so that \(\alpha_c \sim \mathcal{N}(\widehat{\alpha},\tau)\), where \(\widehat{\alpha}\) is the aggregate intercept. The Stan model used to estimate this looks as follows:

```
data{
int<lower=0> N; //size of dataset
int<lower=0> Nc; //number of countries
vector[N] ce; // normalized CE, ce/x
vector[N] p; // probability of winning
array[N] int ctry; //unique and sequential integer identifying country
}parameters{
real alpha_hat;
vector[Nc] alpha;
real beta;
real<lower=0> sigma;
real<lower=0> tau;
}model{
// no prior for aggregate parameters (empirical Bayes)
alpha ~ normal(alpha_hat, tau); //prior with endogenous aggregate parameters
ce ~ normal( alpha[ctry] + beta * p , sigma ) ; //regression model
}
```

In this particular instance, I have programmed the aggregate parameter estimate as a prior (while not specifying any priors on the aggregate-level parameters for the time being). This way of writing the programme is not a coincidence, and drives home the point that the aggregate-level estimates in hierarchical models *serve as an endogenously-estimated prior for the lower-level estimates*. That is, we can exploit the power of the global dataset to obtain the best possible estimate of a prior, and then let this aggregate prior inform the individual-level estimates, which will often be based on much less information (we will see an example of this shortly).

```
# creates data to send to Stan
stanD <- list(N = nrow(d30),
              ce = d30$ce_norm,
              p = d30$probability,
              Nc = length(unique(d30$country)),
              ctry = d30$country)

# compile the Stan programme:
sr <- cmdstan_model("stancode/ri_30countries.stan")

# run the programme:
nm <- sr$sample(
  data = stanD,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 200,
  init = 0,
  show_messages = TRUE,
  diagnostics = c("divergences", "treedepth", "ebfmi")
)

print(nm, c("alpha_hat","beta","sigma"), digits = 3)
nm$save_object(file="stanoutput/ri_30c.Rds")
```

```
ri <- readRDS("stanoutput/ri_30c.Rds")
print(ri, c("alpha_hat","tau","sigma","alpha"), digits = 3)
```

```
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
alpha_hat 0.143 0.143 0.007 0.007 0.131 0.155 1.000 3200 3264
tau 0.036 0.036 0.005 0.005 0.029 0.046 1.001 6875 2628
sigma 0.207 0.207 0.001 0.001 0.206 0.209 1.001 6618 2856
alpha[1] 0.108 0.107 0.008 0.009 0.094 0.121 1.000 4521 2707
alpha[2] 0.128 0.128 0.007 0.007 0.116 0.140 1.000 4007 2874
alpha[3] 0.158 0.158 0.008 0.008 0.145 0.171 1.001 4446 2518
alpha[4] 0.145 0.145 0.008 0.008 0.133 0.158 1.000 4794 3195
alpha[5] 0.106 0.106 0.007 0.007 0.094 0.118 1.000 3942 2847
alpha[6] 0.129 0.129 0.005 0.005 0.120 0.138 1.001 2987 3288
alpha[7] 0.130 0.130 0.006 0.006 0.120 0.141 1.000 3899 3212
# showing 10 of 33 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)
```

```
p <- as_draws_df(ri)
zz <- ri$summary(variables = c("alpha"), "mean" , "sd")
z <- zz %>% separate(variable, c("[","]")) %>%
     rename(var = `[`) %>% rename(nid = `]`) %>%
     pivot_wider(names_from = var, values_from = c(mean,sd))

ggplot() +
  geom_abline(intercept = z$mean_alpha, slope = mean(p$beta), color="cornflowerblue") +
  geom_abline(intercept = mean(p$alpha_hat), slope = mean(p$beta), color="navy", linewidth=1) +
  geom_abline(intercept = 0, slope = 1, color="red") +
  xlim(0,1) + ylim(0,1) + labs(x="probability", y="probability weighting")
```

The mean intercept turns out to be the same as before, although this is not necessarily the case, given that the weight for different observations can shift due to the hierarchical structure. The variance is smaller, since we now capture more of the heterogeneity. And indeed, we find significant variation in intercepts between countries, as shown by the value of \(\tau\). The model thus results in a series of parallel lines with different elevations, as shown in Figure 4.1. In terms of interpretation, we thus find very different levels of optimism across the 30 countries.

Once we have allowed for intercepts to differ between countries, however, there is really no reason to force the slope to be the same. We can thus modify the model to allow for a hierarchical slope parameter in addition to the random intercepts, yielding what is sometimes called a *random intercepts, random slopes* model. (A note on terminology: I use “random intercepts” and “hierarchical intercepts” interchangeably; the term “random effects” is best avoided if one wants to publish results in economics, since it triggers memories among economists of panel data models that “one should never use”.) The Stan code implementing this looks as follows:

```
data{
int<lower=0> N;
int<lower=0> Nc;
vector[N] ce;
vector[N] p;
array[N] int ctry;
}parameters{
real alpha_hat;
real beta_hat;
vector[Nc] alpha;
vector[Nc] beta;
real<lower=0> sigma;
real<lower=0> tau_a;
real<lower=0> tau_b;
}model{
alpha ~ normal(alpha_hat, tau_a); //random intercepts
beta ~ normal(beta_hat, tau_b); //random slopes
for ( i in 1:N )
ce[i] ~ normal( alpha[ctry[i]] + beta[ctry[i]] * p[i] , sigma ) ; }
```

The code looks much like the one before, with one hierarchical parameter added. However, I now loop through the observations using the `for ( i in 1:N )` loop. This avoids issues of dimensionality when multiplying the parameter vector `beta` (30x1) with the data vector \(p\) (Nx1). We can now estimate the programme like before. As a note of caution, this model may run for a little while, depending on the clock speed of your processor (we could of course parallelize the model to speed it up, but let us take one step at a time).
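The indexing logic of the loop can be illustrated outside of Stan. A small numpy sketch (hypothetical data; note that Python indexes from 0 where Stan indexes from 1) shows that looping over observations is equivalent to first expanding the country-level vectors to observation length:

```python
import numpy as np

rng = np.random.default_rng(0)
Nc, N = 3, 8                           # hypothetical sizes
alpha = rng.normal(0.14, 0.05, Nc)     # country-level intercepts
beta  = rng.normal(0.70, 0.10, Nc)     # country-level slopes
ctry  = rng.integers(0, Nc, N)         # country of each observation (0-based here)
p     = rng.uniform(0, 1, N)           # probabilities

# Loop version, mirroring the Stan code:
mu_loop = np.array([alpha[ctry[i]] + beta[ctry[i]] * p[i] for i in range(N)])

# Indexing with ctry expands the length-Nc vectors to length N, making
# the elementwise product well-defined:
mu_vec = alpha[ctry] + beta[ctry] * p

assert np.allclose(mu_loop, mu_vec)
```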

```
# creates data to send to Stan
stanD <- list(N = nrow(d30),
              ce = d30$ce_norm,
              p = d30$probability,
              Nc = length(unique(d30$country)),
              ctry = d30$country)

# compile the Stan programme:
sr <- cmdstan_model("stancode/rirs_30countries.stan")

# run the programme:
nm <- sr$sample(
  data = stanD,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 200,
  init = 0,
  show_messages = TRUE,
  diagnostics = c("divergences", "treedepth", "ebfmi")
)

print(nm, c("alpha_hat","beta_hat","tau_a","tau_b","sigma","alpha","beta"), digits = 3)
nm$save_object(file="stanoutput/rirs_30c.Rds")
```

```
rirs <- readRDS("stanoutput/rirs_30c.Rds")
print(rirs, c("alpha_hat","beta_hat","tau_a","tau_b","sigma","alpha","beta"), digits = 3)
```

```
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
alpha_hat 0.134 0.134 0.015 0.015 0.109 0.159 1.000 6808 2995
beta_hat 0.718 0.718 0.024 0.025 0.678 0.758 1.001 6521 2939
tau_a 0.081 0.080 0.012 0.011 0.064 0.102 1.001 7768 3153
tau_b 0.128 0.127 0.019 0.018 0.102 0.163 1.001 7043 3126
sigma 0.205 0.205 0.001 0.001 0.203 0.206 1.003 5920 2485
alpha[1] 0.076 0.076 0.020 0.021 0.043 0.110 1.000 7736 3209
alpha[2] 0.085 0.085 0.017 0.016 0.057 0.112 1.005 7977 2554
alpha[3] 0.120 0.120 0.018 0.018 0.091 0.148 1.004 7742 2915
alpha[4] 0.187 0.187 0.018 0.018 0.158 0.216 1.001 7542 2900
alpha[5] 0.124 0.124 0.016 0.016 0.097 0.150 1.001 8002 3205
# showing 10 of 65 rows (change via 'max_rows' argument or 'cmdstanr_max_rows' option)
```

```
p <- as_draws_df(rirs)
zz <- rirs$summary(variables = c("alpha","beta"), "mean" , "sd")
z <- zz %>% separate(variable, c("[","]")) %>%
     rename(var = `[`) %>% rename(nid = `]`) %>%
     pivot_wider(names_from = var, values_from = c(mean,sd))

ggplot() +
  geom_abline(intercept = z$mean_alpha, slope = z$mean_beta, color="cornflowerblue") +
  geom_abline(intercept = mean(p$alpha_hat), slope = mean(p$beta_hat), color="navy", linewidth=1) +
  geom_abline(intercept = 0, slope = 1, color="red") +
  xlim(0,1) + ylim(0,1) + labs(x="probability", y="probability weighting")
```

Several things are noteworthy in the results shown in Figure 4.2. We now capture considerable heterogeneity in both the intercept and the slope. Indeed, providing more flexibility for the slope has *increased* the variation in intercepts. At the same time, the average intercept has now changed, too, and is somewhat lower than before. The lines no longer run parallel to each other.

It seems natural to ask whether any difference exists between estimating the global hierarchical model, or estimating the model country by country, or indeed using a fixed effects estimator, which amounts to the same thing (can you see how to implement such a model starting from the code above?). Below, I compare the country-level estimates obtained from the two approaches.

```
fe <- readRDS("stanoutput/fifs_30c.Rds")
re <- readRDS("stanoutput/rirs_30c.Rds")

p <- as_draws_df(re)

zz <- re$summary(variables = c("alpha","beta"), "mean" , "sd")
zr <- zz %>% separate(variable, c("[","]")) %>%
      rename(var = `[`) %>% rename(nid = `]`) %>%
      pivot_wider(names_from = var, values_from = c(mean,sd))

zz <- fe$summary(variables = c("alpha","beta"), "mean" , "sd")
zf <- zz %>% separate(variable, c("[","]")) %>%
      rename(var = `[`) %>% rename(nid = `]`) %>%
      pivot_wider(names_from = var, values_from = c(mean,sd))

ggplot() +
  geom_point(aes(x=zf$mean_alpha, y=zf$mean_beta, color="Fixed effects estimates"), shape=1, size=3) +
  geom_point(aes(x=zr$mean_alpha, y=zr$mean_beta, color="Bayesian hierarchical estimates"), shape=2, size=3) +
  scale_colour_manual("Legend", values = c("darkgoldenrod","steelblue3","#999999")) +
  geom_vline(xintercept=mean(p$alpha_hat), color="gray", lty="dashed", lwd=1) +
  geom_hline(yintercept=mean(p$beta_hat), color="gray", lty="dashed", lwd=1) +
  labs(x="intercept", y="sensitivity") +
  theme(legend.position = c(0.75, 0.85))
```

Some striking patterns become apparent in Figure 4.3. For one, the intercept and slope parameters show a strong, negative correlation—a point to which we will return shortly. More interestingly for the question at hand, the estimates from the hierarchical model are not the same as the ones from the fixed effects model. The difference is indeed systematic, with the hierarchical estimates being less dispersed about the mean values, indicated by the vertical and horizontal lines. This is shrinkage at work. That is, estimates that fall far from the mean will be shrunk or pooled towards their mean estimate, with the shrinkage increasing in the noisiness of the data. This process follows the exact same equation as the regression to the mean observed previously, and once again illustrates the influence of the prior, except that the latter is now endogenously estimated from the aggregate data.
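This shrinkage logic can be sketched for a single country intercept. The sketch below uses rounded aggregate values from the random-intercepts output earlier (alpha_hat 0.143, tau 0.036, sigma 0.207); the raw country mean of 0.25 and the observation counts are hypothetical:

```python
# Shrinkage sketch for one country intercept. alpha_hat, tau and sigma are
# (rounded) aggregate estimates from the output above; the raw country
# mean of 0.25 and the observation counts are hypothetical.
def shrunk_mean(ybar_c, n_c, sigma, alpha_hat, tau):
    prec_data  = n_c / sigma**2    # precision of the country's own mean
    prec_prior = 1 / tau**2        # precision of the endogenous prior
    return (prec_data * ybar_c + prec_prior * alpha_hat) / (prec_data + prec_prior)

alpha_hat, tau, sigma = 0.143, 0.036, 0.207
many = shrunk_mean(0.25, 500, sigma, alpha_hat, tau)  # lots of data: little shrinkage
few  = shrunk_mean(0.25, 10, sigma, alpha_hat, tau)   # little data: pulled towards alpha_hat
```

With many observations the country's own mean dominates; with few, the estimate is pulled most of the way towards the aggregate mean.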

We are now ready to apply the methods learned above to the estimation of nonlinear models. In principle, this application constitutes a straightforward extension of what we have already seen. In practice, however, the estimation of multiple parameters at once—which are often highly constrained to boot—may make the posterior difficult for Stan to explore. Luckily, the programmes can be written using transformed parameters in such a way as to make them much more efficient.

Let us start from a simple EU model, in which only the utility curvature parameter is hierarchical.

```
data {
int<lower=1> N;
int<lower=1> Nid;
array[N] int id;
array[N] real ce;
array[N] real high;
array[N] real low;
array[N] real p;
}parameters {
vector<lower=0>[Nid] rho;
real<lower=0> rho_hat;
real<lower=0> tau;
real<lower=0> sigma;
}model {
vector[N] uh;
vector[N] ul;
vector[N] pv;
rho_hat ~ normal(1,2); // hyperprior for the mean of utility curvature
tau ~ cauchy(0,5); // hyperprior of variance of rho
sigma ~ cauchy(0,5); // hyperprior for (aggregate) residual variance

for (n in 1:Nid)
rho[n] ~ normal(rho_hat , tau); // hierarchical model

for (i in 1:N){
uh[i] = high[i]^rho[id[i]];
ul[i] = low[i]^rho[id[i]];
pv[i] = p[i] * uh[i] + (1-p[i]) * ul[i];
ce[i] ~ normal( pv[i]^(1/rho[id[i]]) , sigma * high[i] );
} }
```

```
d <- read_csv("data/data_RR_ch06.csv")
dg <- d %>% filter(z1 > 0) %>%
      mutate(high = z1) %>%
      mutate(low = z2) %>%
      mutate(prob = p1) %>%
      group_by(id) %>%
      mutate(id = cur_group_id()) %>%
      ungroup()

pt <- cmdstan_model("stancode/hm_EU.stan")

stanD <- list(N = nrow(dg),
              Nid = max(dg$id),
              id = dg$id,
              high = dg$high,
              low = dg$low,
              p = dg$prob,
              ce = dg$ce)

hm <- pt$sample(
  data = stanD,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 200,
  show_messages = FALSE,
  diagnostics = c("divergences", "treedepth", "ebfmi")
)

hm$save_object(file="stanoutput/rr_ch06.Rds")
```

```
hm <- readRDS("stanoutput/rr_ch06.Rds")

# print aggregate parameters
hm$print(c("rho_hat","tau","sigma"), digits = 3,
         "mean", SE = "sd", "rhat",
         ~quantile(., probs = c(0.025, 0.975)), max_rows = 20)
```

```
variable mean SE rhat 2.5% 97.5%
rho_hat 1.014 0.039 1.000 0.943 1.094
tau 0.358 0.033 1.022 0.299 0.429
sigma 0.134 0.002 1.004 0.130 0.137
```

```
# recover individual-level estimates
zz <- hm$summary(variables = c("rho"), "mean" , "sd" )
z <- zz %>% separate(variable, c("[","]")) %>%
     rename(var = `[`) %>% rename(id = `]`) %>%
     pivot_wider(names_from = var, values_from = c(mean,sd))

# plot distribution
ggplot(z, aes(x=mean_rho)) +
  geom_line(stat="density", color="steelblue3", lwd=1) +
  geom_vline(xintercept = 1, color="red", lty="dashed") +
  xlab("power utility parameter across subjects")
```

The model above converges readily (it takes about 30 seconds on my laptop). No divergences are produced, and the R-hat statistics look reasonable. It also seems like we are getting a reasonable distribution of parameters, as shown in Figure 4.4. While in practice you will want to look at the standard deviations of the parameters to make sure that they are sufficiently tightly estimated, we will skip this here. Instead, we will consider an alternative way of programming the same model.

The issue is that the model above does not always scale well to multiple parameters. Chances are that it will already become problematic with 2 parameters. Once you have 3 or more hierarchical parameters, a simple generalization of the model used above will rarely fare well. The model may also behave less well when the parameter to be estimated is highly constrained and many observations fall close to the constraint. The solution in such cases is to reparameterize the model to help Stan explore the posterior distribution. Here, I show how to use a parameter rescaled to a standard normal distribution for the one-parameter model above, before moving on to more complex models.

```
data {
int<lower=1> N;
int<lower=1> Nid;
array[N] int id;
array[N] real ce;
array[N] real high;
array[N] real low;
array[N] real p;
}parameters {
real rho_hat;
real<lower=0> tau;
real<lower=0> sigma;
array[Nid] real rho_z; //parameter on standard-normal scale
}transformed parameters{
array[Nid] real pars;
vector<lower=0>[Nid] rho;
for (n in 1:Nid){
pars[n] = rho_hat + tau * rho_z[n]; // mean plus var times rescaled par.
rho[n] = exp( pars[n] ); //derive parameter on original scale to use in model
}
}model {
vector[N] uh;
vector[N] ul;
vector[N] pv;

rho_hat ~ normal(1,2);
tau ~ cauchy(0,5);
sigma ~ cauchy(0,5);

for (n in 1:Nid)
rho_z[n] ~ std_normal( ); //prior for rescaled parameters

for (i in 1:N){
uh[i] = high[i]^rho[id[i]];
ul[i] = low[i]^rho[id[i]];
pv[i] = p[i] * uh[i] + (1-p[i]) * ul[i];
ce[i] ~ normal( pv[i]^(1/rho[id[i]]) , sigma * high[i] );
} }
```

You can see the re-parameterized model above. The data block has remained unchanged. The parameter block now contains a new variable `rho_z` instead of `rho`. Two things about this are worth noting. First, I now write this as an array instead of a vector. This is not hugely important here, but will make the code more efficient down the road. The second—and indeed crucial—point is that, unlike `rho` above, `rho_z` is no longer constrained to be positive (this is done for illustrative purposes, since power utility parameters do not have to be positive, but this substantive issue is beyond the scope of this tutorial).

The reason for this becomes apparent in the newly added `transformed parameters` block. Here `rho` is derived at the end of the block as the exponential of a variable I have called `pars`, since it contains the model parameters, albeit on a transformed scale. This variable is made up of the mean `rho_hat`, which is also no longer constrained to be positive (see also the prior in the model block). The reason for this is simply that it is now the mean of the `rho_z` parameters, which are pre-multiplied by the parameter variance `tau`. The `rho_z` parameters are estimated on a standard normal scale, which makes them straightforward for Stan to explore. The parameters of main interest, contained in the vector `rho`, are now transformed parameters, and are given no prior of their own. While this way of writing the model may seem backwards at first, it can yield substantial improvements in estimation speed, and for more complex models, may well make the difference between a model that converges and one that does not.
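The transformation itself is easy to verify numerically. A sketch with illustrative hyperparameter values (not estimates from the model) confirms that sampling on the standard-normal scale and transforming yields the same log-normal distribution for the parameter as sampling the log-parameter directly:

```python
import numpy as np

rng = np.random.default_rng(1)
rho_hat, tau = 0.0, 0.35   # illustrative hyperparameter values (log scale)

# Non-centered: sample on the standard-normal scale, then shift, scale,
# and exponentiate (mirroring the transformed parameters block):
z = rng.normal(0, 1, 100_000)
rho_nc = np.exp(rho_hat + tau * z)

# Centered equivalent: sample the log-parameter directly:
rho_c = np.exp(rng.normal(rho_hat, tau, 100_000))

# Both imply the same log-normal distribution for rho, with median
# exp(rho_hat); the exp() enforces positivity without a hard constraint.
```

The two constructions describe the same model; they differ only in the geometry the sampler has to traverse, which is what makes the non-centered version easier to explore.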

```
d <- read_csv("data/data_RR_ch06.csv")

dg <- d %>% filter(z1 > 0) %>%
      mutate(high = z1) %>%
      mutate(low = z2) %>%
      mutate(prob = p1) %>%
      group_by(id) %>%
      mutate(id = cur_group_id()) %>%
      ungroup()

pt <- cmdstan_model("stancode/hm_EU_resc.stan")

stanD <- list(N = nrow(dg),
              Nid = max(dg$id),
              id = dg$id,
              high = dg$high,
              low = dg$low,
              p = dg$prob,
              ce = dg$ce)

hm <- pt$sample(
  data = stanD,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 500,
  show_messages = FALSE,
  diagnostics = c("divergences", "treedepth", "ebfmi")
)

hm$save_object(file="stanoutput/rr_ch06_resc.Rds")
```

```
hm <- readRDS("stanoutput/rr_ch06_resc.Rds")

# print aggregate parameters
hm$print(c("rho_hat","tau","sigma"), digits = 3,
         "mean", SE = "sd", "rhat",
         ~quantile(., probs = c(0.025, 0.975)), max_rows = 20)
```

```
variable mean SE rhat 2.5% 97.5%
rho_hat 1.014 0.039 1.000 0.943 1.094
tau 0.358 0.033 1.022 0.299 0.429
sigma 0.134 0.002 1.004 0.130 0.137
```

```
# recover individual-level estimates
<- hm$summary(variables = c("rho"), "mean" , "sd" )
zz <- zz %>% separate(variable,c("[","]")) %>%
z rename( var = `[`) %>% rename(id = `]`) %>%
pivot_wider(names_from = var , values_from = c(mean,sd))
# plot distribution
ggplot(z,aes(x=mean_rho)) +
geom_line(stat="density",color="steelblue3",lwd=1) +
geom_vline(xintercept = 1, color="red", lty="dashed") +
xlab("power utility parameter across subjects")
```

Let us go ahead and estimate the model now. Again, there are no divergences, and the R-hats on the aggregate level parameters look good (note that some of these parameters are different, but that is because I have taken the shortcut of using the same name for quantities that are in part expressed on different scales). The distribution of the individual-level means looks much like before. The model now converges in some 20 seconds on my laptop, instead of the 34 seconds of the previous model. While this improvement may not seem much, it amounts to close to a 40% speed gain. Once we are dealing with models that take hours or days to converge, that can make a real difference. The distribution of individual-level parameters in Figure 4.5 looks *almost* exactly the same as before. The slight shift to the left of the distribution derives from the fact that the aggregate mean is no longer constrained to be positive, thus providing a more accurate representation of its distribution.

With the decentralized model above in hand, it is now straightforward to build a model with multiple hierarchical parameters. Let us again use the same data, and let us build a PT model, restricted to gains only for simplicity (adding additional parameters for losses is straightforward). First, we will want to change the model to include a probability weighting function. Making the error term hierarchical, too, yields 4 hierarchical parameters.
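The weighting function used in the Stan model below is the two-parameter form \(w(p) = \delta p^{\gamma} / (\delta p^{\gamma} + (1-p)^{\gamma})\), where \(\gamma\) governs likelihood-sensitivity and \(\delta\) elevation. A quick sketch of its behaviour (the parameter values are illustrative, not estimates):

```python
import numpy as np

# The two-parameter weighting function used in the model below:
# w(p) = delta * p^gamma / (delta * p^gamma + (1 - p)^gamma)
def pw(p, gamma, delta):
    num = delta * p**gamma
    return num / (num + (1 - p)**gamma)

p = np.array([0.1, 0.5, 0.9])
w_linear = pw(p, 1.0, 1.0)   # gamma = delta = 1 recovers w(p) = p
w_insens = pw(p, 0.5, 0.8)   # gamma < 1: small p overweighted, large p underweighted
```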

Next, we get to work on the transformed parameters block. We change the designation of `pars` from an array of real numbers to an array of vectors. This will allow us to loop through vectors of the 4 parameters in the subsequent code. Do not forget to actually create the vectors of parameters. In place of `rho_hat`, we now use a 4-dimensional vector called `means`, since the dimensionality of the different elements in the loop needs to line up for the code to work (don’t forget to give it priors, keeping in mind that they are expressed on a transformed scale). For the same reason, `tau` now becomes a vector of parameter variances. The `diag_matrix` function maps this vector into a matrix, i.e. it creates a matrix \(\tau \times \text{I}\), where \(\text{I}\) is an identity matrix with the same number of rows and columns as there are hierarchical parameters in the model. What used to be `rho_z` becomes an array of vectors, which we simply call `Z`. Note that it needs to be indexed by the subject identifier `n` in the loop, which extracts one vector at a time, thus ensuring again that all expressions have the same dimension. We then obtain the original parameters of interest as usual, by defining them based on the transformed parameter line, and applying the appropriate constraints.

```
data {
int<lower=1> N;
int<lower=1> Nid;
array[N] int id;
array[N] real ce;
array[N] real high;
array[N] real low;
array[N] real p;
}
parameters {
vector[4] means;
vector<lower=0>[4] tau;
array[Nid] vector[4] Z;
}
transformed parameters{
array[Nid] vector[4] pars;
vector<lower=0>[Nid] rho;
vector<lower=0>[Nid] gamma;
vector<lower=0>[Nid] delta;
vector<lower=0>[Nid] sigma;
for (n in 1:Nid){
pars[n] = means + diag_matrix(tau) * Z[n];
rho[n] = exp( pars[n,1] );
gamma[n] = exp( pars[n,2] );
delta[n] = exp( pars[n,3] );
sigma[n] = exp( pars[n,4] );
}
}
model {
vector[N] uh;
vector[N] ul;
vector[N] pw;
vector[N] pv;
means[1] ~ normal(0,1);
means[2] ~ normal(0,1);
means[3] ~ normal(0,1);
means[4] ~ normal(0,1);
tau ~ exponential(5);
for (n in 1:Nid)
Z[n] ~ std_normal( );
for (i in 1:N){
uh[i] = high[i]^rho[id[i]];
ul[i] = low[i]^rho[id[i]];
pw[i] = (delta[id[i]] * p[i]^gamma[id[i]])/
(delta[id[i]] * p[i]^gamma[id[i]] + (1 - p[i])^gamma[id[i]]);
pv[i] = pw[i] * uh[i] + (1-pw[i]) * ul[i];
ce[i] ~ normal( pv[i]^(1/rho[id[i]]) , sigma[id[i]] * (high[i] - low[i]) );
}
}
```
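To see that the `diag_matrix(tau) * Z[n]` construction amounts to elementwise scaling, consider this numpy sketch (the `means` and `tau` values are rounded from the printed estimates; the draw `z` is hypothetical):

```python
import numpy as np

means = np.array([-0.07, -0.77, -0.17, -2.36])  # rounded hyper-means
tau   = np.array([0.08, 0.42, 0.29, 0.52])      # rounded scales
z     = np.array([0.5, -1.0, 2.0, 0.3])         # hypothetical draw for one subject

# diag_matrix(tau) * Z[n] is a matrix-vector product with a diagonal
# matrix -- the same thing as elementwise multiplication by tau:
pars_matrix = means + np.diag(tau) @ z
pars_elem   = means + tau * z
assert np.allclose(pars_matrix, pars_elem)
```

Writing it as a matrix product keeps the Stan code in a form that generalizes directly to a full covariance matrix, a point the chapter returns to when discussing correlated parameters.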

```
d <- read_csv("data/data_RR_ch06.csv")

dg <- d %>% filter(z1 > 0) %>%
      mutate(high = z1) %>%
      mutate(low = z2) %>%
      mutate(prob = p1) %>%
      group_by(id) %>%
      mutate(id = cur_group_id()) %>%
      ungroup()

pt <- cmdstan_model("stancode/hm_RDU_var.stan")

stanD <- list(N = nrow(dg),
              Nid = max(dg$id),
              id = dg$id,
              high = dg$high,
              low = dg$low,
              p = dg$prob,
              ce = dg$ce)

hm <- pt$sample(
  data = stanD,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  refresh = 200,
  adapt_delta = 0.99,
  max_treedepth = 15,
  show_messages = FALSE,
  diagnostics = c("divergences", "treedepth", "ebfmi")
) # 257.0 seconds

hm$save_object(file="stanoutput/RR_ch6_RDU_var.Rds")
```

We are now ready to estimate the model. To do this, it will be wise to lower the stepsize to small values using the `adapt_delta` option—a value of 0.99 is typically recommended. At the same time, let us allow for a larger value of `max_treedepth` (the default is 10), which means that the algorithm may explore for longer. Both these tunings are made necessary by the highly nonlinear nature of the model, which creates parameter regions that are tricky to explore. Even so, such models may on occasion produce a small number of divergences. This is not unusual in highly nonlinear models such as this one. You would nevertheless be well advised to check the convergence statistics in depth, especially if this is a new model that you have never used before.

```
hm <- readRDS("stanoutput/RR_ch6_RDU_var.Rds")

# print aggregate parameters
hm$print(c("means","tau"), digits = 3,
         "mean", SE = "sd", "rhat",
         ~quantile(., probs = c(0.025, 0.975)), max_rows = 20)
```

```
variable mean SE rhat 2.5% 97.5%
means[1] -0.068 0.020 1.002 -0.109 -0.031
means[2] -0.773 0.041 1.002 -0.852 -0.694
means[3] -0.171 0.032 1.001 -0.235 -0.108
means[4] -2.364 0.050 1.003 -2.462 -2.264
tau[1] 0.083 0.035 1.012 0.011 0.149
tau[2] 0.418 0.034 1.001 0.358 0.489
tau[3] 0.290 0.025 1.001 0.243 0.341
tau[4] 0.523 0.040 1.002 0.451 0.608
```

```
# recover individual-level estimates
zz <- hm$summary(variables = c("rho","gamma","delta","sigma"), "mean", "sd")
z <- zz %>% separate(variable, c("[","]")) %>%
     rename(var = `[`) %>% rename(id = `]`) %>%
     pivot_wider(names_from = var, values_from = c(mean, sd))
# plot distribution
ggplot(z,aes(x=mean_gamma)) +
geom_line(stat="density",color="steelblue3",lwd=1) +
geom_vline(xintercept = 1, color="red", lty="dashed") +
xlab("likelihood-insensitivity across subjects")
```

Figure 4.6 shows the distribution of the means of the individual-level likelihood-sensitivity parameters. The means are somewhat difficult to interpret, since they are on a transformed scale (and in any case, we may want to examine the means or medians of the individual-level parameters instead). The standard deviations of the individual-level estimates seem suitably small, indicating fairly good identification of the parameters, although there is also considerable heterogeneity from subject to subject.

The model above works perfectly fine for estimating the desired PT parameters. It does, however, treat the various model parameters as independent from each other. From a Bayesian point of view, if there is some correlation between the parameters, then we should be able to improve our estimates by explicitly modelling the covariation structure between them. This is due, as usual, to the posterior uncertainty about the true parameters. To illustrate, we can use the fixed-effects (individual-level) estimates on the data of Bruhin et al. discussed in chapter 2. I display them again here below for easy access. A remarkable feature of the data is that likelihood-sensitivity and the residual error are strongly negatively correlated (with subject nr. 9 constituting a noteworthy outlier). Such correlations indeed also occur between other model parameters—(**vieider_decisions_2021?**) provides a theoretical account of what may be behind this.

`knitr::include_graphics('figures/scatter_RR.png')`

We will now try to leverage this additional information by estimating a model including a covariance matrix between all the different model parameters. The change to the model needed to achieve this—displayed below—is fairly trivial. We simply replace the `diag_matrix` containing the parameter variance vector with `diag_pre_multiply`, which combines the parameter variance vector with a Cholesky-decomposed correlation matrix. Of course, we give that matrix an appropriate prior, following best practices as recommended by the Stan community. For good measure we also define a transformed parameter `Rho`, which recovers the correlation parameters on the original scale (this could of course also be done post-estimation in R, or we could specify the same line of code in the generated quantities block, since it is not used in the estimation).

```
data {
int<lower=1> N;
int<lower=1> Nid;
array[N] int id;
array[N] real ce;
array[N] real high;
array[N] real low;
array[N] real p;
}
parameters {
vector[4] means;
vector<lower=0>[4] tau;
array[Nid] vector[4] Z;
cholesky_factor_corr[4] L_omega;
}
transformed parameters{
matrix[4,4] Rho = L_omega * L_omega';
array[Nid] vector[4] pars;
vector<lower=0>[Nid] rho;
vector<lower=0>[Nid] gamma;
vector<lower=0>[Nid] delta;
vector<lower=0>[Nid] sigma;
for (n in 1:Nid){
pars[n] = means + diag_pre_multiply(tau, L_omega) * Z[n];
rho[n] = exp( pars[n,1] );
gamma[n] = exp( pars[n,2] );
delta[n] = exp( pars[n,3] );
sigma[n] = exp( pars[n,4] );
}
}
model {
vector[N] uh;
vector[N] ul;
vector[N] pw;
vector[N] pv;
means[1] ~ normal(0,1);
means[2] ~ normal(0,1);
means[3] ~ normal(0,1);
means[4] ~ normal(0,1);
tau ~ exponential(5);
L_omega ~ lkj_corr_cholesky(4);
for (n in 1:Nid)
Z[n] ~ std_normal( );
for (i in 1:N){
uh[i] = high[i]^rho[id[i]];
ul[i] = low[i]^rho[id[i]];
pw[i] = (delta[id[i]] * p[i]^gamma[id[i]])/
(delta[id[i]] * p[i]^gamma[id[i]] + (1 - p[i])^gamma[id[i]]);
pv[i] = pw[i] * uh[i] + (1-pw[i]) * ul[i];
ce[i] ~ normal( pv[i]^(1/rho[id[i]]) , sigma[id[i]] * (high[i] - low[i]) );
}
}
```

```
d <- read_csv("data/data_RR_ch06.csv")

dg <- d %>% filter(z1 > 0) %>%
      mutate(high = z1) %>%
      mutate(low = z2) %>%
      mutate(prob = p1) %>%
      group_by(id) %>%
      mutate(id = cur_group_id()) %>%
      ungroup()

pt <- cmdstan_model("stancode/hm_RDU.stan")

stanD <- list(N = nrow(dg),
              Nid = max(dg$id),
              id = dg$id,
              high = dg$high,
              low = dg$low,
              p = dg$prob,
              ce = dg$ce)

hm <- pt$sample(
      data = stanD,
      seed = 123,
      chains = 4,
      parallel_chains = 4,
      refresh = 200,
      adapt_delta = 0.99,
      max_treedepth = 15,
      show_messages = FALSE,
      diagnostics = c("divergences", "treedepth", "ebfmi")
) # 841.9 seconds

hm$save_object(file = "stanoutput/RR_ch6_RDU.Rds")
```

Figure 4.7 plots the likelihood-sensitivity parameter from the model with variances only against the one including the covariance matrix. Most of the parameters are virtually the same. This is reassuring, inasmuch as we would not expect the covariance matrix to fundamentally affect our inferences. However, the estimates falling into the fourth quartile of the residual variance—i.e. the observations with the largest errors—can be seen to be adjusted downwards. This happens because large error variance is associated with low likelihood-sensitivity, which is exactly the effect that emerges most strongly from the correlation matrix. Using this information, the model thus corrects downwards those observations indicating strong likelihood-sensitivity with little confidence, to maximize predictive performance.

```
hc <- readRDS(file = "stanoutput/RR_ch6_RDU.Rds")
hv <- readRDS(file = "stanoutput/RR_ch6_RDU_var.Rds")

zz <- hc$summary(variables = c("rho","gamma","delta","sigma"), "mean", "sd")
zc <- zz %>% separate(variable, c("[","]")) %>%
      rename(var = `[`) %>% rename(id = `]`) %>%
      pivot_wider(names_from = var, values_from = c(mean, sd)) %>%
      rename(gamma_c = mean_gamma, rho_c = mean_rho, delta_c = mean_delta, sigma_c = mean_sigma)

zz <- hv$summary(variables = c("rho","gamma","delta","sigma"), "mean", "sd")
zv <- zz %>% separate(variable, c("[","]")) %>%
      rename(var = `[`) %>% rename(id = `]`) %>%
      pivot_wider(names_from = var, values_from = c(mean, sd)) %>%
      rename(gamma_v = mean_gamma, rho_v = mean_rho, delta_v = mean_delta, sigma_v = mean_sigma)

z <- left_join(zc, zv, by = "id")
z <- z %>% mutate(qs = ntile(sigma_v, 4))
ggplot(z,aes(x=gamma_v,y=gamma_c,color=as.factor(qs))) +
geom_point() +
geom_abline(intercept=0,slope=1,color="red") +
#geom_vline(xintercept=mean(z$gamma_v)) +
geom_label_repel(aes(label = id),
box.padding = 0.35,
point.padding = 0.5,
size = 1.5,
segment.color = 'grey50') +
scale_colour_manual("Error quartile", values = c("black","darkgoldenrod","#999999","steelblue3")) +
theme(legend.position = c(0.25, 0.8))
```

A particularly strong adjustment can be seen for subject 9. Figure 4.8 shows three probability weighting functions fit to the datapoints for subject 9 in the Zurich 2006 data of Bruhin et al.—a function based on individual-level estimation (fixed effects); a function based on the hierarchical model with variance only; and a function based on the hierarchical model including the covariance matrix. The fixed-effects model seems to provide the best fit to the nonparametric data. However, this fit is only slightly better than that of the hierarchical estimate with variance only, which changes our inference from the subject being oversensitive to probabilities to being slightly insensitive. Indeed, a closer look reveals that the hierarchical estimate—even though it is (slightly) inverse S-shaped instead of S-shaped—may not be statistically different from the fixed-effects estimate, which carries very low confidence, as shown by the wide variation in the posterior draws indicated by the light grey lines (made up of 200 randomly drawn rows of the posterior).

```
# load fixed effects estimate
hf <- readRDS(file = "stanoutput/ind_ZCH06_fixed.RDS")

zz <- hf$summary(variables = c("rho","gamma","delta","sigma"), "mean", "sd")
zf <- zz %>% separate(variable, c("[","]")) %>%
      rename(var = `[`) %>% rename(id = `]`) %>%
      pivot_wider(names_from = var, values_from = c(mean, sd)) %>%
      rename(rho = mean_rho, gamma = mean_gamma, delta = mean_delta, sigma = mean_sigma)

# load hierarchical estimate with covar
hc <- readRDS(file = "stanoutput/RR_CH6_RDU.Rds")

zz <- hc$summary(variables = c("rho","gamma","delta","sigma"), "mean", "sd")
zc <- zz %>% separate(variable, c("[","]")) %>%
      rename(var = `[`) %>% rename(id = `]`) %>%
      pivot_wider(names_from = var, values_from = c(mean, sd)) %>%
      rename(rho = mean_rho, gamma = mean_gamma, delta = mean_delta, sigma = mean_sigma)

# load hierarchical estimate with var only
hv <- readRDS(file = "stanoutput/RR_CH6_RDU_var.Rds")

zzv <- hv$summary(variables = c("rho","gamma","delta","sigma"), "mean", "sd")
zv <- zzv %>% separate(variable, c("[","]")) %>%
      rename(var = `[`) %>% rename(id = `]`) %>%
      pivot_wider(names_from = var, values_from = c(mean, sd)) %>%
      rename(rho = mean_rho, gamma = mean_gamma, delta = mean_delta, sigma = mean_sigma)

d <- read_csv("data/data_rr_ch06.csv")

dg9 <- d %>% filter(z1 > 0) %>%
       mutate(high = z1) %>%
       mutate(low = z2) %>%
       mutate(prob = p1) %>%
       group_by(id) %>%
       mutate(id = cur_group_id()) %>%
       ungroup() %>%
       filter(id == 9) %>%
       mutate(dw = (ce - low)/(high - low)) %>%
       group_by(prob, high, low) %>%
       mutate(mean_dw = mean(dw)) %>%
       mutate(median_dw = median(dw)) %>%
       ungroup()

###
d9 <- as_draws_df(hf)
g9 <- d9$`gamma[9]`
de9 <- d9$`delta[9]`
r9 <- as.data.frame(cbind(g9, de9))

r9 <- r9 %>% sample_n(size = 200) %>%
      mutate(row = row_number())

fe9 <- function(x) (zf$delta[id=9] * x^zf$gamma[id=9] / (zf$delta[id=9] * x^zf$gamma[id=9] + (1-x)^zf$gamma[id=9]))
hm9 <- function(x) (zc$delta[id=9] * x^zc$gamma[id=9] / (zc$delta[id=9] * x^zc$gamma[id=9] + (1-x)^zc$gamma[id=9]))
hm9v <- function(x) (zv$delta[id=9] * x^zv$gamma[id=9] / (zv$delta[id=9] * x^zv$gamma[id=9] + (1-x)^zv$gamma[id=9]))

ggplot() +
  purrr::pmap(r9, function(de9, g9, row) {
    stat_function(fun = function(x) (de9*x^g9)/(de9*x^g9 + (1-x)^g9),
                  color = "grey", linewidth = 0.25)
  }) +
  stat_function(fun = fe9, aes(color = "fixed effect"), linewidth = 1) + xlim(0,1) +
  stat_function(fun = hm9, aes(color = "hierarchical covar."), linewidth = 1) +
  stat_function(fun = hm9v, aes(color = "hierarchical"), linewidth = 1) +
  geom_point(aes(x = dg9$prob, y = dg9$mean_dw), size = 3, color = "chartreuse4", shape = 10) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
  scale_colour_manual("Legend", values = c("steelblue3","darkgoldenrod","chocolate4")) +
  xlab("probability") + ylab("decision weights") + theme(legend.position = c(0.8, 0.2))
```

Most importantly, however, the hierarchical function reflects the fact that this particular observation seems unlikely, *given the distribution of parameters in the population*. Since the estimate is furthermore very noisy, it is shrunk towards the population mean. This follows standard Bayesian procedures, whereby noisy outliers are discounted and drawn towards the prior, which in this case is endogenously estimated from the aggregate data (see BOX for a short discussion). Such shrinkage serves to optimize *predictive* performance, and is based on the probability distributions gathered from the environment.

An example of hierarchical aggregation

Take a parameter \(\gamma\) that is distributed \(\gamma_i \sim \mathcal{N}(\widehat{\gamma}_i,\sigma_i^2)\), where \(\widehat{\gamma}_i\) indicates the true parameter mean to be estimated for subject \(i\), whereas \(\gamma_i\) is the maximum likelihood estimator (individual fixed effect). The two may not coincide, for instance because of large sampling variation in small samples. Assume further that the prior distribution of this parameter in the population is also normal, so that \(\widehat{\gamma} \sim \mathcal{N}(\mu,\tau^2)\). The Bayesian posterior will then itself follow a normal distribution, with mean \(\widehat{\gamma}_i = \frac{\tau^2}{\sigma_i^2 + \tau^2}\gamma_i + \frac{\sigma_i^2}{\sigma_i^2 + \tau^2}\mu\) and variance \(\left(\frac{1}{\sigma_i^2} + \frac{1}{\tau^2}\right)^{-1} = \frac{\sigma_i^2\tau^2}{\sigma_i^2 + \tau^2}\). That is, the posterior mean will be a convex combination of the fixed-effect estimator and the prior mean. The “shrinkage weight” \(\frac{\sigma_i^2}{\sigma_i^2 + \tau^2} = 1 - \frac{\tau^2}{\sigma_i^2 + \tau^2}\) attached to the prior mean is a function of the variance of the individual parameter and of the prior variance. The larger the variance of the individual parameter, the lower the weight attributed to it, and the larger the weight attributed to the prior mean. The greater the population variance \(\tau^2\), the less weight is attributed to the prior mean, capturing the intuition that outliers from more homogeneous populations are discounted more. Notice that these equations are directly applicable to our calculations for subject 9. We have \(\gamma_9 = 2.08\), \(\sigma_9^2 = 0.36\), \(\mu = 0.45\), and \(\tau^2 = 0.16\). We thus obtain \(\widehat{\gamma}_9 = \frac{0.16}{0.52} \times 2.08 + \frac{0.36}{0.52} \times 0.45 \approx 0.952\). The value estimated from the hierarchical model is very similar at 0.934. Note, however, that including a covariance matrix further lowers this value to 0.73. This is due to the more complex structure implemented in the estimated model, which takes into account the strongly negative correlation between likelihood-sensitivity and noise.
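The arithmetic in the box can be verified directly. Below is a quick check in base R, plugging in the subject-9 values from the text:

```r
# normal-normal shrinkage for subject 9 (values taken from the text)
gamma9 <- 2.08   # individual (fixed-effects) estimate
sigma2 <- 0.36   # its sampling variance
mu     <- 0.45   # population (prior) mean
tau2   <- 0.16   # population (prior) variance

w <- tau2 / (sigma2 + tau2)              # weight on the individual estimate
post_mean <- w * gamma9 + (1 - w) * mu   # convex combination: 0.9515...
post_var  <- 1 / (1 / sigma2 + 1 / tau2) # posterior variance

round(post_mean, 3)  # 0.952, close to the hierarchical estimate of 0.934
```

The simple two-level formula thus comes very close to the full hierarchical estimate; the remaining gap to the 0.73 obtained with a covariance matrix reflects information the formula ignores.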

This is particularly apparent from the function estimated from the hierarchical model including a covariance matrix. The latter indeed seems to fall even farther from the fixed effects estimator, and is now arguably statistically distinct from it at conventional levels taken to indicate significance. Nevertheless, it appears the most likely function in the context of the entirety of the data, incorporating not only the likelihood of any given parameter value based on the parameter distribution in the population, but also the correlation structure between the different parameters in the model. Note that these large differences emerge because the outlier is *noisy*—perfectly identified outliers will not be shrunk.

Covariance matrices can thus be a great way of improving the information content in hierarchical models. Nevertheless, some caution is called for. In particular, there can be circumstances where a covariance matrix does more harm than good. This will for instance occur in data with large, low-noise outliers. Such outliers may then distort the parametric correlations in the matrix, in extreme cases even in the opposite direction of the nonparametric correlations observed between estimated parameters. It is thus always a good idea to check the correlations in both ways (i.e., to compare the parametric estimation in the matrix to rank-correlations of the posterior means). While some quantitative differences are expected, wildly differing correlation coefficients that may even go in opposite directions are a sign of trouble afoot.
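The danger posed by a single low-noise outlier can be illustrated with a small, entirely artificial example in base R: one extreme concordant point is enough to flip the parametric (Pearson) correlation, while the rank (Spearman) correlation remains negative.

```r
# hypothetical data: 50 points with a perfectly negative relation...
x <- seq(-2, 2, length.out = 50)
y <- -0.5 * x

# ...plus one extreme, perfectly identified "outlier" far outside the range
x <- c(x, 8)
y <- c(y, 8)

cor(x, y, method = "pearson")    # pulled positive by the single outlier
cor(x, y, method = "spearman")   # rank correlation remains strongly negative
```

A discrepancy of this kind between the parametric correlation matrix and the rank correlations of the posterior means is exactly the warning sign described above.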

This is the end as of 4 March 2024. Stand by for additional sections to be added soon: - complex hierarchies - cross-classification - meta-analysis - measurement error models - regression in hierarchical models - hierarchical versus sandwich error clustering