This lesson is being piloted (Beta version)

Systematics and Statistics

Overview

Teaching: 90 min
Exercises: 0 min
Questions
  • How do we perform statistical inference?

  • What tools do we use?

  • How do we visualize and interpret our results?

Objectives
  • Learn how to construct statistical models.

  • Learn how to interpret these statistical models.

Introduction

It’s beyond the scope of this tutorial to cover a lot of statistical background needed here. You may find some references at the end of this episode and further references found within. We will cover hopefully the bare-minimum needed to give you an idea of what’s going on here and to give an introduction to some useful tools.

A statistical model \(f(\vec{x} \vert \vec{\phi})\) describes the probability of the data \(\vec{x}\) given model paramters \(\vec{\phi}\).

HistFactory is a tool to construct probabilty distribution functions from template histograms, constructing a likelihood function. In this exercise we will be using HistFactory via pyhf, a python implementation of this tool. In addition, we will be using the cabinetry package, which is a python library for constructing and implementing HistFactory models.

The HistFactory template model can expressed like this:

What’s here?

This is a lot to take in but we’ll press on regardless with implementation.

Essentially:

Making models and fitting

It should be noted that everything in this part is contained in the use_cabinetry.py script. This file can be found in your copy of the lesson repository

Let’s start by importing the necessary modules, most importantly cabinetry:

import logging
import cabinetry

Systematics

In an analysis such as this one there are many systematics to consider, both experimental and theoretical (as encapsulated in the MC). For the former these include trigger and selection efficiencies, jet energy scale and resolution, b-tagging and misidentification efficiencies, and integrated luminosity. The latter can include uncertainties due to choice of hadronizer, choice of generator, QCD scale choices, and the parton shower scale. This isn’t a complete list of systematics and here will only cover a few of these (more on this later).

Histograms

Let’s quickly inspect our histograms in the histograms.root file by running ROOT and opening up a TBrowser.

Try this if you have ROOT installed and have gone through the ROOT pre-exercise. If not, skip this part and try it later. We’ll show the output below anyway.

root histograms.root
   ------------------------------------------------------------------
  | Welcome to ROOT 6.26/00                        https://root.cern |
  | (c) 1995-2021, The ROOT Team; conception: R. Brun, F. Rademakers |
  | Built for linuxx8664gcc on Jun 14 2022, 14:46:00                 |
  | From tag , 3 March 2022                                          |
  | With c++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0                   |
  | Try '.help', '.demo', '.license', '.credits', '.quit'/'.q'       |
   ------------------------------------------------------------------

root [0] 
Attaching file histograms.root as _file0...
(TFile *) 0x563baef326f0
root [1] TBrowser b
(TBrowser &) Name: Browser Title: ROOT Object Browser

Click on one of the histogram titles on the left ofr 4j1b to view it:

Click on another for 4j2b:

Recall that our observable for the 4j1b control region is the scalar sum of jet transverse momentum, \(H_{T}\) and our observable for the 4j2b signal region is the mass of b-jet system \(m_{b_{jj}}\).

Cabinetry workspace

A statistical model can be define in a declarative way using cabinetry, capturing the \(\mathrm{region \otimes sample \otimes systematic}\) structure.

General settings General:, list of phase space regions such as signal and control regions Regions:, list of samples (MC and data) Samples:, list of systematic uncertainties Systematics:, and a list of normalization factors NormFactors:.

Let’s have a look at each of the parts of the the configuration file:

General

General:
  Measurement: "CMS_ttbar"
  POI: "ttbar_norm"
  HistogramFolder: "histograms/"
  InputPath: "histograms.root:{RegionPath}_{SamplePath}{VariationPath}"
  VariationPath: ""

Regions

Regions:
  - Name: "4j1b CR"
    RegionPath: "4j1b"

  - Name: "4j2b SR"
    RegionPath: "4j2b"

Samples

Samples:
  - Name: "Pseudodata"
    SamplePath: "pseudodata"
    Data: True

  - Name: "ttbar"
    SamplePath: "ttbar"

  - Name: "W+jets"
    SamplePath: "wjets"

  - Name: "single top, t-channel"
    SamplePath: "single_top_t_chan"

  - Name: "single atop, t-channel"
    SamplePath: "single_atop_t_chan"
    
  - Name: "tW"
    SamplePath: "single_top_tW"

Systematics

Systematics:
  - Name: "ME variation"
    Type: "NormPlusShape"
    Up:
      VariationPath: "_ME_var"
    Down:
      Symmetrize: True
    Samples: "ttbar"

  - Name: "PS variation"
    Type: "NormPlusShape"
    Up:
      VariationPath: "_PS_var"
    Down:
      Symmetrize: True
    Samples: "ttbar"

Here we specify which systematics we want to take into account. In addition to the W+jets scale variations, b-tagging variations, and jet energy scale and resolution (shown in the full file) we show here for the ttbar samples _ME_var (what do the result look like if we choose another generator?) and _PS_var (what do the results look like if we use a different hadronizer?).

NormFactors

NormFactors:
  - Name: "ttbar_norm"
    Samples: "ttbar"
    Nominal: 1.0
    Bounds: [0, 10]

Running cabinetry and results

Let’s load the cabinetry configuration file and combine the histograms into a pyhf workspace which we will save:

config = cabinetry.configuration.load("cabinetry_config.yml")
cabinetry.templates.collect(config)
ws = cabinetry.workspace.build(config)
cabinetry.workspace.save(ws, "workspace.json")

pyhf can be run on the command line to inspect the workspace:

pyhf inspect workspace | head -n 20

which outputs the following:

                 Summary       
            ------------------  
               channels  2
                samples  5
             parameters  14
              modifiers  14

               channels  nbins
             ----------  -----
                4j1b CR   11  
                4j2b SR   11  

                samples
             ----------
                 W+jets
 single atop, t-channel
  single top, t-channel
                     tW
                  ttbar

Now we perform our maximum likelihood fit

model, data = cabinetry.model_utils.model_and_data(ws)
fit_results = cabinetry.fit.fit(model, data)

and visualize the pulls of parameters in the fit:


pull_fig = cabinetry.visualize.pulls(
    fit_results, exclude="ttbar_norm", close_figure=True, save_figure=True
)

Note that the figures produced by running the script or your commands are to be found in the figures/ directory.

poi_index = model.config.poi_index
print(f"\nfit result for ttbar_norm: {fit_results.bestfit[poi_index]:.3f} +/- {fit_results.uncertainty[poi_index]:.3f}")

What does the model look like before and after the fit? We can visualize each with the following code:

model_prediction = cabinetry.model_utils.prediction(model)
figs = cabinetry.visualize.data_mc(model_prediction, data, close_figure=True)

model_prediction_postfit = cabinetry.model_utils.prediction(model, fit_results=fit_results)
figs = cabinetry.visualize.data_mc(model_prediction_postfit, data, close_figure=True)

We can see that there is very good post-fit agreement:

References and further reading

Acknowledgements

Thanks to the authors of the AGC using CMS data on which much of the episode was based.

Key Points

  • We construct statistical models to estimate systematics in our analysis.

  • cabinetry and pyhf are the tools we use here.