1  Introduction

Ferdinand M. Vieider

Why Bayesian statistics? The relative merits of Bayesian and frequentist statistics have been debated for a long time, without any clear outcome. In these notes, I will thus focus on the practical advantages of conducting Bayesian data analysis. Four such advantages stand out: 1) ease of interpretation; 2) the possibility to accept the null hypothesis; 3) estimation and manipulation of complex models; and 4) systematic accumulation of knowledge. I will provide a brief discussion of each in this introduction. Together, I believe these practical advantages constitute a powerful argument for the use of Bayesian statistics as the researcher’s standard tool for data analysis.

Given these advantages, one may ask why Bayesian statistics are not used more commonly. The answer appears to be path dependency. Bayesian models, and especially the hierarchical, multilevel models where Bayesian statistics has its main strengths, require intensive computations that in the past would have been prohibitively costly. As a result, Bayesian analysis was not commonly taught as a “standard tool” of data analysis. Advances in computing power, however, have made general Bayesian approaches a viable alternative to classical statistics. The software package Stan provides a fast and flexible platform on which such models can be implemented.

1.1 Probabilistic interpretation of parameters

In classical statistics, model parameters are treated as fixed, certain quantities that are unknown to the statistician and need to be discovered. The data, on the other hand, are conceived of as a random sample from a much larger (in principle infinitely large) population. This makes it possible to maintain the fiction that, as long as we can sample enough data, estimators of the fixed but unknown parameters will be unbiased. A p-value of 0.05 is then interpreted against this repeated-sampling benchmark: in repeated data samples from the same population, only 1 out of 20 samples is expected to show such an extreme value by chance. That is, if one were to sample from the population 20 times, on average the data should show such an extreme value once if the null hypothesis holds true.
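As a minimal illustration of this repeated-sampling logic (a sketch of my own, not taken from these notes), the following simulation draws many samples from a population in which the null hypothesis is true and counts how often a z-test comes out “significant” at the 5% level; the share settles at roughly 1 in 20.

```python
import numpy as np

rng = np.random.default_rng(42)

# Repeated sampling from a population in which the null (mean = 0) is true.
n_samples, n = 10_000, 50
crit = 1.96  # two-sided 5% critical value for a z-test

exceed = 0
for _ in range(n_samples):
    x = rng.normal(loc=0.0, scale=1.0, size=n)  # one sample from the null population
    z = x.mean() / (1.0 / np.sqrt(n))           # standardised mean (population sd known to be 1)
    exceed += abs(z) > crit

print(f"share of 'significant' samples under the null: {exceed / n_samples:.3f}")
# roughly 0.05, i.e. about 1 sample in 20
```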

Bayesian analysis turns this interpretation on its head. The data at hand are considered given (i.e., they are a datum, a given), and may well be unique. The parameters, on the other hand, are both unknown to the statistician and inherently uncertain. This allows us to reason directly in terms of the uncertainty surrounding the parameter estimates. In Bayesian statistics, a statement such as “there is a 95% chance that the parameter falls below 1” is actually correct, whereas no such simple statement can be made in classical statistics. Intuitively, we may thus all be Bayesians.
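To make this concrete, here is a minimal sketch of how such a probability statement is read off a posterior. The draws below are simulated stand-ins; in practice they would be extracted from a fitted model (for instance a Stan fit), but the calculation is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for posterior draws of a parameter theta; in practice these
# would come from a fitted model rather than being simulated directly.
theta_draws = rng.normal(loc=0.9, scale=0.08, size=4000)

# The probabilistic statement is simply the share of draws below 1:
p_below_1 = np.mean(theta_draws < 1.0)
print(f"P(theta < 1 | data) is approximately {p_below_1:.3f}")
```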

1.2 Accepting the null

Hypothesis testing in classical statistics invariably takes the form of setting up and testing a point null hypothesis (e.g., the effect is equal to 0), which can only ever be rejected. This may be fine in most circumstances, even though it creates issues with large samples, since any point null hypothesis will eventually be rejected once the sample is large enough. In some circumstances, however, it may be necessary to be able to accept the null hypothesis. For instance, when we test two models making different predictions about a parameter (the parameter is equal to a certain value vs. the parameter is smaller than that value) against each other, taking the point prediction as the null hypothesis will invariably bias the analysis against the model making that prediction.

Bayesian analysis can be coupled with so-called regions of practical equivalence around a null hypothesis (I will simply refer to these regions as acceptance regions). This means that instead of establishing a point null hypothesis (e.g., the parameter is 1), we establish a region of values considered practically equivalent to the null value. For instance, we could say that we reject the null hypothesis if at least 95% of the posterior probability mass falls outside the region \([0.97,1.03]\), and that we accept the null hypothesis if at least 95% of the posterior mass falls within that region (we could of course also pick another, equally arbitrary, probability level, such as 97.5%). As we will see, such a setup has many practical advantages, including when deciding between nested models, or when discriminating between several sub-cases of one more general model.
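The decision rule can be written down in a few lines. The sketch below is purely illustrative (the function name and the stand-in posterior draws are mine, not from the notes): it accepts the null when at least 95% of the posterior mass falls inside the acceptance region, rejects it when at least 95% falls outside, and otherwise declares the data inconclusive.

```python
import numpy as np

def rope_decision(draws, low, high, level=0.95):
    """Decision rule based on an acceptance region [low, high].

    Accept the null if at least `level` of the posterior mass lies inside
    the region; reject it if at least `level` lies outside; otherwise the
    data remain inconclusive.
    """
    inside = np.mean((draws >= low) & (draws <= high))
    if inside >= level:
        return "accept null"
    if (1.0 - inside) >= level:
        return "reject null"
    return "inconclusive"

# Example with stand-in posterior draws for a parameter whose null value is 1:
rng = np.random.default_rng(1)
draws = rng.normal(loc=1.005, scale=0.01, size=4000)
print(rope_decision(draws, 0.97, 1.03))  # "accept null" for these draws
```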

This aspect of Bayesian statistics also has substantive implications for best practices in experimental studies, and for potential pre-registrations of experimental designs. Notably, if one establishes an acceptance region for one’s hypotheses, sticking to a fixed number of observations in data collection is no longer meaningful. That is because, as long as the data do not give a clear verdict one way or another, the optimal course of action will always be to continue data collection until such a clear verdict is reached. This is legitimate because the method is no longer inherently biased against the null hypothesis: the null can now be accepted as well as rejected. That being said, transparent procedures would then require the pre-registration of the acceptance regions, since there may otherwise be a risk of ex post adjustment of such regions. In most practical cases this risk is mitigated by natural constraints on these regions that emerge from the scale of the data, an issue that is, however, beyond the scope of these technical notes.

1.3 Estimating complex models

Perhaps the greatest advantage of Bayesian analysis is the ease with which complex models with thousands of parameters can be estimated. In principle, we can even estimate models with more (nominal) parameters than data points. The parameters can furthermore be manipulated and combined post estimation: because the posterior is obtained by simulation, all the information about the confidence we have in our inferences is carried along even as we multiply, divide, subtract, or otherwise manipulate parameters. The greatest pay-off of this ease in handling complexity, however, is the possibility to estimate hierarchical or multilevel models.
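As a small illustration of such post-estimation manipulation (using simulated stand-in draws rather than output from an actual fitted model), any transformation applied draw by draw automatically propagates the posterior uncertainty to the derived quantity:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in posterior draws for two parameters; in practice these would be
# extracted from a fitted model.
alpha = rng.normal(2.0, 0.3, size=4000)
beta = rng.normal(0.5, 0.1, size=4000)

# A derived quantity of interest, computed draw by draw, inherits the full
# posterior uncertainty of its components.
ratio = alpha / beta
lo, hi = np.percentile(ratio, [2.5, 97.5])
print(f"posterior mean of alpha/beta: {ratio.mean():.2f}, 95% interval: [{lo:.2f}, {hi:.2f}]")
```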

Many types of data present a natural structure that is often ignored in standard analysis. Students participating in experiments are often recruited into sessions, and treatments are allocated at the session level. In field settings, we usually follow stratified sampling procedures, e.g. we select regions, from which we select provinces, from which we select households, and this is often done to maximize representativeness along dimensions relevant for the study. In meta-analysis, study-level effects are collected together with their standard errors; such studies may stem from a handful of research groups, implement different measurement methods, and so on. Neglecting such structure can yield severe biases in the estimation results. It also throws away valuable information, e.g. on the correlation structure of observations within any given group. Hierarchical models explicitly capture these structures in the statistical model, painting a much richer picture of the results.

One advantage of hierarchical modelling is that even fairly complex models with multiple parameters can be estimated from sparse data. The reason is that aggregate parameter estimates serve as endogenous priors for the lower-level estimates. By pooling information across similar units of analysis, Bayesian models extract the most from thin data, yielding a compromise between aggregate and individual-level estimation that is endogenously determined by the relative confidence we have in these quantities. We will thus spend some time building up the foundations of such models, and illustrating their power with a number of examples.
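For the simplest normal-normal case with known variances, this compromise takes a familiar precision-weighted form (a standard textbook result, stated here only for intuition):

\[
\hat{\theta}_j \;=\; \frac{\dfrac{n_j}{\sigma^2}\,\bar{y}_j \;+\; \dfrac{1}{\tau^2}\,\mu}{\dfrac{n_j}{\sigma^2} \;+\; \dfrac{1}{\tau^2}},
\]

where \(\bar{y}_j\) is the mean of the \(n_j\) observations in group \(j\), \(\sigma^2\) the within-group variance, and \(\mu\) and \(\tau^2\) the population-level mean and variance. Groups with few observations (small \(n_j\)) are thus shrunk more strongly towards the aggregate estimate, exactly the endogenous compromise described above.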

1.4 Accumulation of knowledge

Finally, Bayesian statistics provide a principled way of accumulating knowledge across studies. It seems natural to aggregate past results into probability distributions that summarize them, and to then use these distributions as priors for future estimations. This also means that Bayesian analysis is extremely well suited to contexts in which we want to implement sequential learning, e.g. taking one data point at a time and updating our parameter estimates with it.
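A minimal sketch of such sequential updating, for the textbook case of a conjugate Beta-Binomial model (chosen only because the updating stays analytic; nothing here is specific to these notes):

```python
import numpy as np

rng = np.random.default_rng(3)

# Sequential, one-observation-at-a-time updating of a success probability:
# yesterday's posterior becomes today's prior.
a, b = 1.0, 1.0                  # flat Beta(1, 1) prior
data = rng.binomial(1, 0.7, 20)  # simulated 0/1 observations

for y in data:
    a += y        # a success updates the first shape parameter
    b += 1 - y    # a failure updates the second
    print(f"posterior mean after this observation: {a / (a + b):.3f}")
```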

1.5 Further resources

These notes are not a general-purpose tutorial on Bayesian statistics. An excellent introduction that is light on technicalities is provided by McElreath (2016). A complete formal treatment can be found in Gelman et al. (2014).