Multiclass Fairness Metrics

Venturalitica includes 7 multiclass fairness metrics for evaluating AI systems with more than two output classes (e.g., credit risk grades A/B/C/D, sentiment categories, content moderation labels). These extend traditional binary fairness concepts to multi-class settings.

Use these metrics when your model produces 3+ classes. Binary metrics like disparate_impact or demographic_parity_diff only work with 2-class outputs. Multiclass metrics aggregate fairness across all class labels.

Common scenarios:

  • Credit risk grading (A, B, C, D, E)
  • Job recommendation categories
  • Medical diagnosis classification
  • Content moderation labels

multiclass_demographic_parity

What it measures: Maximum disparity in prediction rates across protected groups, aggregated over all classes using one-vs-rest decomposition.

Formula: For each class c, compute P(Y_hat=c | A=a) for each group a. The disparity for class c is max(rates) - min(rates). Return the maximum disparity across all classes.

Ideal value: 0.0 (all groups receive each class at equal rates).

Registry key: multiclass_demographic_parity

Required inputs: target, prediction, dimension

Example policy control:

- control-id: mc-demographic-parity
  description: "Multi-class demographic parity < 0.15"
  props:
    - name: metric_key
      value: multiclass_demographic_parity
    - name: threshold
      value: "0.15"
    - name: operator
      value: lt
    - name: "input:target"
      value: target
    - name: "input:prediction"
      value: prediction
    - name: "input:dimension"
      value: gender
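
To make the one-vs-rest computation concrete, here is a minimal sketch in pandas. It illustrates the formula above and is not the library's internal implementation:

import pandas as pd

def max_prediction_rate_disparity(y_pred, dimension):
    # One-vs-rest demographic parity: for each class, compare the rate at
    # which every protected group receives that prediction, then take the
    # largest gap found over all classes.
    y_pred = pd.Series(y_pred).reset_index(drop=True)
    groups = pd.Series(dimension).reset_index(drop=True)
    disparities = []
    for c in y_pred.unique():
        rates = (y_pred == c).groupby(groups).mean()  # P(Y_hat = c | A = a)
        disparities.append(rates.max() - rates.min())
    return max(disparities)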

multiclass_equal_opportunity

What it measures: Maximum disparity in true positive rates (TPR) across protected groups, using one-vs-rest decomposition. Ensures each group has an equal chance of being correctly classified for each class.

Formula: For each class c, compute TPR per group: P(Y_hat=c | Y=c, A=a). Disparity = max(TPRs) - min(TPRs). Return maximum disparity across classes.

Ideal value: 0.0 (equal recall for all groups in every class).

Registry key: multiclass_equal_opportunity

Required inputs: target, prediction, dimension
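
A similar sketch for the TPR version, again as an illustration of the formula rather than the exact implementation (groups with no true instances of a class simply drop out of that class's comparison here):

import pandas as pd

def max_tpr_disparity(y_true, y_pred, dimension):
    # One-vs-rest equal opportunity: restrict to rows whose true label is
    # class c, then compare how often each group is correctly predicted as c.
    y_true = pd.Series(y_true).reset_index(drop=True)
    y_pred = pd.Series(y_pred).reset_index(drop=True)
    groups = pd.Series(dimension).reset_index(drop=True)
    disparities = []
    for c in y_true.unique():
        is_c = y_true == c
        tprs = (y_pred[is_c] == c).groupby(groups[is_c]).mean()  # P(Y_hat = c | Y = c, A = a)
        disparities.append(tprs.max() - tprs.min())
    return max(disparities)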


multiclass_confusion_metrics

What it measures: Per-class precision/recall and per-group accuracy. Returns a dictionary (not a scalar), useful for detailed diagnostics rather than policy thresholds.

Return type: Dict with keys per_class_metrics (precision/recall per class) and per_group_performance (accuracy per group).

Registry key: multiclass_confusion_metrics

Required inputs: target, prediction, dimension
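
Since this metric returns diagnostics rather than a scalar, here is a rough sketch of how comparable numbers could be assembled with scikit-learn and pandas. The key names mirror the description above, but the library's exact output shape may differ:

import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

def confusion_diagnostics(y_true, y_pred, dimension):
    labels = sorted(pd.Series(y_true).unique())
    precision, recall, _, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, zero_division=0
    )
    per_class = {c: {"precision": p, "recall": r}
                 for c, p, r in zip(labels, precision, recall)}
    # Accuracy within each protected group
    correct = pd.Series(np.asarray(y_true) == np.asarray(y_pred))
    per_group = correct.groupby(pd.Series(dimension).reset_index(drop=True)).mean()
    return {"per_class_metrics": per_class,
            "per_group_performance": per_group.to_dict()}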


weighted_demographic_parity_multiclass

What it measures: Demographic parity with configurable aggregation strategy across classes.

Strategies (set via strategy parameter):

| Strategy        | Description                                                  |
| --------------- | ------------------------------------------------------------ |
| macro (default) | Maximum disparity across all classes                          |
| micro           | Maximum disparity using normalized prediction distributions   |
| one-vs-rest     | Same as macro but explicit one-vs-rest decomposition          |
| weighted        | Disparities weighted by class prevalence                      |

Formula (macro): Same as multiclass_demographic_parity, but with strategy control.

Ideal value: 0.0

Registry key: weighted_demographic_parity_multiclass

Required inputs: target (unused but validated), prediction, dimension

Minimum samples: 30
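
A sketch of how the macro and weighted strategies might differ. The exact aggregation used by the library (for example, weighted sum versus weighted maximum) is an assumption here, so treat this as intuition only:

import pandas as pd

def demographic_parity_by_strategy(y_pred, dimension, strategy="macro"):
    y_pred = pd.Series(y_pred).reset_index(drop=True)
    groups = pd.Series(dimension).reset_index(drop=True)
    prevalence = y_pred.value_counts(normalize=True)  # share of predictions per class
    disparities = {}
    for c in y_pred.unique():
        rates = (y_pred == c).groupby(groups).mean()
        disparities[c] = rates.max() - rates.min()
    if strategy == "weighted":
        # each class contributes in proportion to how often it is predicted
        return sum(prevalence[c] * d for c, d in disparities.items())
    # macro / one-vs-rest: the worst class dominates
    return max(disparities.values())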


macro_equal_opportunity_multiclass

What it measures: Macro-averaged equal opportunity. Computes TPR disparity for each class (one-vs-rest), then returns the maximum.

Formula: For each class c, binarize as y_true_c = (y == c). Compute TPR per group. Disparity = max(TPRs) - min(TPRs). Return max(disparities).

Ideal value: 0.0

Registry key: macro_equal_opportunity_multiclass

Required inputs: target, prediction, dimension

Minimum samples: 30
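
Wiring this metric into a policy follows the same pattern as the demographic-parity control shown earlier; the control-id and 0.10 threshold below are purely illustrative:

- control-id: mc-equal-opportunity
  description: "Macro equal opportunity disparity < 0.10"
  props:
    - name: metric_key
      value: macro_equal_opportunity_multiclass
    - name: threshold
      value: "0.10"
    - name: operator
      value: lt
    - name: "input:target"
      value: target
    - name: "input:prediction"
      value: prediction
    - name: "input:dimension"
      value: gender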


micro_equalized_odds_multiclass

What it measures: Combined true positive rate and false positive rate disparity across groups, micro-averaged over all classes. In the micro-averaged multiclass setting these reduce to overall accuracy and overall error rate, so the metric checks whether the model's accuracy and error rate are equitable across protected groups.

Formula: For each group, compute overall accuracy and error rate. Return (max_accuracy - min_accuracy) + (max_error_rate - min_error_rate).

Ideal value: 0.0 (no accuracy/error disparity between groups).

Registry key: micro_equalized_odds_multiclass

Required inputs: target, prediction, dimension

Minimum samples: 30
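
A minimal sketch of the formula above; note that because error rate is 1 - accuracy, the two gaps are equal and the result is simply twice the accuracy gap:

import numpy as np
import pandas as pd

def accuracy_error_disparity(y_true, y_pred, dimension):
    correct = pd.Series(np.asarray(y_true) == np.asarray(y_pred))
    groups = pd.Series(dimension).reset_index(drop=True)
    accuracy = correct.groupby(groups).mean()   # per-group accuracy
    error = 1.0 - accuracy                      # per-group error rate
    return (accuracy.max() - accuracy.min()) + (error.max() - error.min())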


predictive_parity_multiclass

What it measures: Precision disparity across protected groups for each class. Ensures that when the model predicts a class, it is equally accurate for all groups.

Strategies: macro (default), weighted

Formula (macro): For each class c, compute precision per group: P(Y=c | Y_hat=c, A=a). Disparity = max(precisions) - min(precisions). Return max across classes.

Ideal value: 0.0

Registry key: predictive_parity_multiclass

Required inputs: target, prediction, dimension
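
The computation mirrors the equal-opportunity sketch above, but the conditioning flips: restrict to rows where the model predicted class c and check how often that prediction was correct for each group (illustrative only, macro strategy shown):

import pandas as pd

def max_precision_disparity(y_true, y_pred, dimension):
    y_true = pd.Series(y_true).reset_index(drop=True)
    y_pred = pd.Series(y_pred).reset_index(drop=True)
    groups = pd.Series(dimension).reset_index(drop=True)
    disparities = []
    for c in y_pred.unique():
        predicted_c = y_pred == c
        # P(Y = c | Y_hat = c, A = a): precision per group
        precisions = (y_true[predicted_c] == c).groupby(groups[predicted_c]).mean()
        disparities.append(precisions.max() - precisions.min())
    return max(disparities)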


The table below summarizes all seven metrics:

| Registry Key | What It Checks | Ideal | Strategies |
| --- | --- | --- | --- |
| multiclass_demographic_parity | Prediction rate parity (OVR) | 0.0 | max, macro aggregation |
| multiclass_equal_opportunity | TPR parity (OVR) | 0.0 | — |
| multiclass_confusion_metrics | Per-class/group diagnostics | Dict | — |
| weighted_demographic_parity_multiclass | Prediction rate parity | 0.0 | macro, micro, one-vs-rest, weighted |
| macro_equal_opportunity_multiclass | TPR parity (macro) | 0.0 | — |
| micro_equalized_odds_multiclass | Accuracy + error parity | 0.0 | — |
| predictive_parity_multiclass | Precision parity | 0.0 | macro, weighted |

For a combined view, use the calc_multiclass_fairness_report() function in Python:

from venturalitica.metrics import calc_multiclass_fairness_report
report = calc_multiclass_fairness_report(
y_true=df["target"],
y_pred=df["prediction"],
protected_attr=df["gender"]
)
# Returns dict with:
# - weighted_demographic_parity_macro
# - macro_equal_opportunity
# - micro_equalized_odds
# - predictive_parity_macro
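
The report can then be checked against a policy threshold directly. The example below assumes the returned dict maps the metric names listed above to scalar disparities, and the 0.15 threshold is illustrative, not a library default:

THRESHOLD = 0.15  # illustrative policy threshold
violations = {name: value for name, value in report.items()
              if isinstance(value, (int, float)) and value >= THRESHOLD}
if violations:
    print("Disparities above threshold:", violations)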

For intersectional fairness (e.g., gender x age), pass multiple attributes:

from venturalitica.assurance.fairness.multiclass_reporting import calc_intersectional_metrics
results = calc_intersectional_metrics(
y_true=df["target"],
y_pred=df["prediction"],
protected_attrs={
"gender": df["gender"],
"age_group": df["age_group"]
}
)
# Returns:
# - intersectional_disparity: max - min accuracy across slices
# - worst_slice: e.g., "female x elderly"
# - best_slice: e.g., "male x young"
# - slice_details: accuracy per intersection

  • Minimum samples: Most multiclass metrics require >= 30 samples and raise ValueError otherwise.
  • Minimum groups: At least 2 protected groups required.
  • Minimum classes: At least 2 classes required (though for 2-class problems, prefer the simpler binary metrics).
  • Optional dependency: Some metrics use Fairlearn internally. Install with pip install fairlearn if needed.