__ Summary and Contributions__: This paper develops a Bayesian calibration model to estimate fairness outcomes in arbitrary models, with unlabelled data.

__ Strengths__: - Clean and simple approach
- Nice and clear empirical results
- Addresses the case of unlabelled data

__ Weaknesses__: - The novelty is somewhat minor, given existing work on Bayesian modelling of fairness
- The paper focuses only on estimation and not on how to obtain fair policies
- The process seems limited to fairness settings where Bayesian calibration is applicable

__ Correctness__: Nothing significant

__ Clarity__: Yes

__ Relation to Prior Work__: Foulds et al. (2019), "Bayesian Modeling of Intersectional Fairness: The Variance of Bias", should be discussed.

__ Reproducibility__: Yes

__ Additional Feedback__: This paper proposes to use Bayesian estimates of fairness metrics. It combines this with Bayesian calibration models (one for each protected attribute value in this particular case) in order to use unlabelled data. In light of existing work (Foulds et al 2019) on Bayesian modelling of fairness, the contribution is rather minor and is limited to the case where we have unlabelled data. The approach the authors use, as it is based on calibration, seems limited to rather specific notions of fairness where Bayesian calibration can be usefully applied.
Although in l.64 the definition of calibration is correct, in l. 105-107 you write that $s_j = P_M(y_j = 1 | s_j)$. Since $j$ is a specific example, there should not be any randomness here. Calibration is a property of the classifier with respect to a data distribution, not a specific example.
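To spell out the distinction, in notation I am assuming here rather than taking from the paper: let $S$ and $Y$ be the score and label of an example drawn at random from the data distribution (of a given group). Calibration is then the distributional statement

$$
P(Y = 1 \mid S = s) = s \quad \text{for all } s \in [0, 1],
$$

which is presumably what l.64 intends. For a fixed example $j$, the label $y_j$ is a fixed (possibly unobserved) value, so writing $s_j = P_M(y_j = 1 \mid s_j)$ only makes sense if $P_M$ is read as the model's belief about an unobserved label, and that reading should be stated explicitly.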
It would also be nice to discuss how you would extend this idea to other fairness metrics such as (the unfortunately named) calibration and balance.
The paper would be significantly strengthened by generalising the approach to other settings where unlabelled data is available, and by using the Bayesian estimates to obtain fair(er) policies.
I appreciate the authors' response. Since the method is applicable to essentially any fairness metric for classification, and since they have clarified the relation to prior work, I am revising my score. I wouldn't mind the paper being accepted: the significance of a paper is a somewhat subjective criterion, there is definitely nothing wrong with it, and it is a valuable contribution in itself.

__ Summary and Contributions__: This paper proposes a Bayesian approach to reducing uncertainty in group fairness metrics in contexts where the amount of unlabeled examples greatly exceeds the number of labeled examples. The method uses the labeled data to infer the posterior densities of the parameters of per-group calibration curves, then marginalizes over these parameters to estimate the probabilities of different labeling outcomes for the unlabeled examples, before finally combining these into a posterior estimate of the desired fairness metric.
The approach is well validated and shows significant performance gains over the frequentist and beta-binomial Bayesian estimates in the case of few labeled examples.
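To make the pipeline summarized above concrete, here is a minimal sketch of my reading of it. It is not the authors' implementation. Everything specific is an assumption made for illustration: the logistic form of the calibration curve, the Gaussian priors on its parameters `a` and `b`, the use of importance resampling in place of the paper's inference procedure, the choice of accuracy gap as the fairness metric, and all function names.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def calibrated(score, a, b):
    """Assumed calibration curve: P(y=1 | score) = sigmoid(a * logit(score) + b)."""
    return sigmoid(a * logit(score) + b)


def log_likelihood(labeled, a, b):
    """Bernoulli log-likelihood of labeled (score, label) pairs under the curve."""
    total = 0.0
    for score, label in labeled:
        p = min(max(calibrated(score, a, b), 1e-9), 1.0 - 1e-9)
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total


def posterior_curve_samples(labeled, rng, n_prior=2000, n_post=200):
    """Importance resampling: draw (a, b) from an assumed prior, reweight by likelihood."""
    draws = [(rng.gauss(1.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_prior)]
    log_w = [log_likelihood(labeled, a, b) for a, b in draws]
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]
    return rng.choices(draws, weights=weights, k=n_post)


def expected_accuracy(unlabeled_scores, a, b):
    """Expected accuracy on unlabeled scores under one curve draw (threshold 0.5)."""
    total = 0.0
    for s in unlabeled_scores:
        p = calibrated(s, a, b)  # probability that the unobserved label is 1
        total += p if s >= 0.5 else 1.0 - p
    return total / len(unlabeled_scores)


def accuracy_gap_samples(groups, rng=None):
    """Posterior samples of accuracy(group 0) - accuracy(group 1).

    `groups` is a list of two (labeled_pairs, unlabeled_scores) tuples.
    """
    rng = rng or random.Random(0)
    per_group = []
    for labeled, unlabeled in groups:
        curves = posterior_curve_samples(labeled, rng)
        per_group.append([expected_accuracy(unlabeled, a, b) for a, b in curves])
    return [x - y for x, y in zip(per_group[0], per_group[1])]
```

Calling `accuracy_gap_samples([(labeled_0, unlabeled_0), (labeled_1, unlabeled_1)])` returns a list of posterior draws of the gap, from which a mean and a credible interval can be read off. For brevity the sketch scores only the unlabeled examples; a fuller version would also count the observed outcomes on the labeled ones.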

__ Strengths__: The problem of uncertainty in fairness metrics is under-studied, despite being of extreme importance in many low-data situations where fairness is a concern (for example, predicting rare diseases or behaviors). This paper presents a nice simple approach to quantifying and reducing the uncertainty in fairness metrics, and is applicable to many use-cases and metrics.

__ Weaknesses__: This paper acknowledges some limitations that I don't find consequential enough to prevent publication, for example:
- the possibility that the CIs for their method might be overconfident if there are many labeled examples.
- the challenge of balancing the bias-variance tradeoff for this method.

__ Correctness__: Yes, the empirical methodology appears sound and well-justified.

__ Clarity__: Yes, the paper is exceptionally clear and well written.

__ Relation to Prior Work__: Yes, prior work on Bayesian fairness and data augmentation approaches is covered appropriately.

__ Reproducibility__: Yes

__ Additional Feedback__: I'm not sure it's appropriate to use "Eskimo" in Fig. 2. My understanding is that this is not the name the people themselves prefer.
UPDATE: The authors' feedback has not changed my evaluation of this good paper.

__ Summary and Contributions__: The paper addresses a timely topic.

__ Strengths__: The empirical results seem promising.

__ Weaknesses__: The hyperparameter selection (lines 152-153) needs more explanation. The chosen values seem informative, and it would be useful to see a proper sensitivity analysis. It would also be good to compare against related work (the non-hierarchical variant) to assess the contribution of the proposed variant.
Eq. 1 is central to the paper, and introducing it with 'For example,' undermines the contribution. What alternatives could be used, and why was this particular form chosen?
More theoretical results would be useful to establish the relevance of the proposed approach.
Posterior computations contain no novelties or contributions.
---
Thanks for the rebuttal. Based on the feedback I am revising my score.

__ Correctness__: The method requires more explanation.

__ Clarity__: The paper is well written.

__ Relation to Prior Work__: The related work should be better separated from the contributions of the paper.

__ Reproducibility__: Yes

__ Additional Feedback__: