Introduction to Hierarchical Modeling

2020-08-13 10:24

Introduction

It is not uncommon to find samples in our datasets that are not completely independent. Samples often form clusters or groups within which some properties are shared. This requires special attention while modeling, for two major reasons. First, the independence of samples is an important assumption for several statistical procedures, such as maximum likelihood estimation. Second, we want our model to capture how the influence of the predictors varies across groups, which is called the contextual effect.

A commonly used example, which I find sufficient to understand the scenario, is the performance of students in a school. In a school with students in multiple classes, the academic performance of a student is influenced by their individual capabilities (called fixed effects) and the class they are a part of (called random effects). Maybe the teacher assigned to a particular class teaches better than the others, or a higher proportion of intelligent students in a class creates a competitive environment in which students perform better.

One approach is to build a separate model for each class, referred to as no pooling (an unpooled model). But such an approach might not always produce reliable results. For example, the model for a class with very few students can be very misleading. At the other extreme, a single model fitted to all students, referred to as complete pooling, might not fit the data sufficiently because it ignores the differences between classes. We want a middle ground between these extremes: partial pooling. This brings us to Bayesian hierarchical modeling, also known as multilevel modeling.
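To make the two extremes concrete, here is a minimal sketch that fits one regression line to all the data and one line per class. The data is synthetic and made up for illustration (the class names, sizes and coefficients are my assumptions, not the dataset used later in this post), and `ols_fit` is a small helper written for this sketch.

```python
import random

def ols_fit(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic classes (assumed values): (intercept, slope) and number of students
random.seed(0)
true_params = {"A": (40.0, 3.0), "B": (55.0, 1.0), "C": (45.0, -2.0)}
sizes = {"A": 40, "B": 25, "C": 3}

data = {}
for name, (a, b) in true_params.items():
    xs = [random.uniform(0, 5) for _ in range(sizes[name])]
    ys = [a + b * x + random.gauss(0, 5) for x in xs]
    data[name] = (xs, ys)

# One model for everything: class membership is ignored
all_x = [x for xs, _ in data.values() for x in xs]
all_y = [y for _, ys in data.values() for y in ys]
single_fit = ols_fit(all_x, all_y)

# One independent model per class: nothing is shared between classes
per_class_fits = {name: ols_fit(xs, ys) for name, (xs, ys) in data.items()}

print("single model:", single_fit)
for name, params in per_class_fits.items():
    print(f"class {name} (n={sizes[name]}):", params)
```

With only three students, the line for class C depends heavily on the noise in those three points, while the single model cannot represent the different slopes at all.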

In this method, parameters are nested within one another at different levels of groups. Roughly, it gives us a weighted average of the unpooled and pooled model estimates. Hierarchical modeling is one of the most powerful, yet simple, techniques in Bayesian inference and possibly in statistical modeling. In this post, I will introduce the idea with a practical example. Note that this post does not cover the fundamentals of Bayesian analysis. The source code for the example is available as a notebook on GitHub.
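The "weighted average" intuition can be sketched for the simplest case: estimating a group mean when both the within-group standard deviation and the between-group standard deviation are assumed known. Under that assumption, the estimate for a group is a precision-weighted average of the group's own mean and the overall mean. The function name and all numbers below are hypothetical; the full model in this post estimates these scales instead of fixing them.

```python
def partial_pool(group_mean, n, grand_mean, sigma_y, sigma_group):
    """Precision-weighted average of a group's own mean and the overall mean.

    sigma_y:     within-group standard deviation (assumed known here)
    sigma_group: standard deviation of group means around the overall mean
                 (assumed known here)
    """
    w_group = n / sigma_y ** 2        # precision of the group's own mean
    w_global = 1 / sigma_group ** 2   # precision of the group-level distribution
    return (w_group * group_mean + w_global * grand_mean) / (w_group + w_global)

# A group whose raw mean is 60 while the overall mean is 50 (made-up numbers)
for n in (2, 20, 200):
    estimate = partial_pool(group_mean=60.0, n=n, grand_mean=50.0,
                            sigma_y=10.0, sigma_group=5.0)
    print(f"n={n:>3}: partially pooled estimate = {estimate:.2f}")
```

Small groups are pulled strongly toward the overall mean, while large groups stay close to their own mean.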

Data

The dataset used for illustration is similar to the students example above, except that we are trying to predict the math scores of students from different schools. We use homework completion as the predictor. You can find the original data here and its CSV version here.

First look

Plotting the math scores of the students against homework, along with a single OLS regression fit on all the data (complete pooling), gives us this:

The data samples with the pooled regression fit

Visualizing the data at the school level reveals some interesting patterns. We also plot the unpooled regression lines, fit separately on each school, together with the pooled regression fit for reference. For simplicity, we use OLS regression.

The data samples with the unpooled regression fits for each group

The plot shows the variation of the relationship across groups. We also notice that the estimates are highly influenced by the few data points (possible outliers) with high homework completions in some of the groups.

Hierarchical model

We will construct our Bayesian hierarchical model using PyMC3. We place hyperpriors on the group-level parameters so that the model can share information about the student-level relationship across groups.

The model can be represented as yᵢ = α_{j[i]} + β_{j[i]}·xᵢ + εᵢ, where j[i] denotes the school of student i,

or in probabilistic notation as y ∼ N(αⱼ + βⱼx, ε), where ε here denotes the standard deviation of the error.

For this model, we will use a random slope β and intercept α. This means that they will vary with each group instead of a constant slope and intercept for the entire data. The graphical representation of the probabilistic model is shown below.
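Written out in full, the hierarchy looks roughly as follows. This is my reading of the priors in the PyMC3 snippet further down, with the second argument of each normal distribution taken as a standard deviation (matching the `sigma=` keyword in the code) and j[i] denoting the school of student i.

```latex
\begin{align*}
y_i &\sim \mathcal{N}\!\left(\alpha_{j[i]} + \beta_{j[i]}\, x_i,\ \varepsilon\right) \\
\alpha_j &\sim \mathcal{N}(\mu_a, \sigma_a), & \beta_j &\sim \mathcal{N}(\mu_b, \sigma_b) \\
\mu_a &\sim \mathcal{N}(40, 50), & \sigma_a &\sim \text{HalfNormal}(50) \\
\mu_b &\sim \mathcal{N}(0, 10), & \sigma_b &\sim \text{HalfNormal}(5) \\
\varepsilon &\sim \text{HalfCauchy}(5)
\end{align*}
```

The school-level parameters share the hyperparameters on the second to fourth lines, which is what lets the schools inform one another.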

Graph representation of the hierarchical model used in this example

While I choose my priors here by eyeballing the general distribution of the samples, using uninformative priors will lead to similar results. The code snippet below defines the PyMC3 model used.

import pymc3 as pm

# `school` (integer school index of each student), `homework`, `math` and
# `n_schools` are assumed to have been prepared from the dataset beforehand.
with pm.Model() as model:
    # Hyperpriors
    mu_a = pm.Normal('mu_a', mu=40, sigma=50)
    sigma_a = pm.HalfNormal('sigma_a', 50)

    mu_b = pm.Normal('mu_b', mu=0, sigma=10)
    sigma_b = pm.HalfNormal('sigma_b', 5)

    # Intercept (one per school)
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, shape=n_schools)

    # Slope (one per school)
    b = pm.Normal('b', mu=mu_b, sigma=sigma_b, shape=n_schools)

    # Model error
    eps = pm.HalfCauchy('eps', 5)

    # Model
    y_hat = a[school] + b[school] * homework

    # Likelihood
    y_like = pm.Normal('y_like', mu=y_hat, sigma=eps, observed=math)
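The model indexes the per-school parameters with `school`, so each school ID has to be mapped to a contiguous integer index first. The snippet above does not show that step; a minimal sketch is given below. The records are made up (only the school IDs are borrowed from this post), and the variable names simply mirror the ones the model expects.

```python
# Hypothetical raw records: (school ID, homework, math score); values are made up
records = [
    (7472, 1, 45), (7472, 0, 38), (7930, 2, 57),
    (62821, 5, 68), (7930, 1, 49), (62821, 3, 60),
]

# Map every distinct school ID to an index 0..n_schools-1
school_ids = sorted({sid for sid, _, _ in records})
school_index = {sid: i for i, sid in enumerate(school_ids)}

n_schools = len(school_ids)
school = [school_index[sid] for sid, _, _ in records]   # index per student
homework = [hw for _, hw, _ in records]                 # predictor
math = [score for _, _, score in records]               # response (shadows the math module)

print(n_schools, school)
```

In practice these lists would likely be converted to NumPy arrays (an integer array for `school`) before being passed to the model.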

We will use the NUTS sampler for drawing samples from the posterior distribution.

with model:
    # pm.sample assigns NUTS automatically to continuous variables,
    # so no explicit step object is needed
    trace = pm.sample(2000, tune=1000)

The trace plot and the posterior distributions of the slope and intercept corresponding to each school are visualized below.

Trace plot for the hierarchical model

Posterior distribution of the intercept and slope for each group
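Besides eyeballing the trace plot, one common numerical check is the Gelman-Rubin statistic, which compares the variance between chains with the variance within chains. The following is a plain-Python sketch of the classic (non-split) version, written for illustration only; it is not what this post's notebook runs, and the chains below are simulated rather than taken from the model.

```python
import math
import random
from statistics import mean, variance

def gelman_rubin(chains):
    """Classic potential scale reduction factor for equally long chains."""
    m = len(chains)        # number of chains
    n = len(chains[0])     # draws per chain
    chain_means = [mean(c) for c in chains]
    grand_mean = mean(chain_means)
    between = n / (m - 1) * sum((cm - grand_mean) ** 2 for cm in chain_means)
    within = mean([variance(c) for c in chains])
    pooled_var = (n - 1) / n * within + between / n
    return math.sqrt(pooled_var / within)

random.seed(42)
# Simulated chains (assumption): two that agree, two that sit in different places
mixed = [[random.gauss(3.0, 1.0) for _ in range(500)] for _ in range(2)]
stuck = [[random.gauss(loc, 1.0) for _ in range(500)] for loc in (0.0, 5.0)]

print("agreeing chains   :", round(gelman_rubin(mixed), 3))
print("disagreeing chains:", round(gelman_rubin(stuck), 3))
```

Values close to 1 suggest the chains agree with each other; values well above 1 suggest they have not converged to the same distribution.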

We can see the variation in the estimates of our coefficients across different schools. We can also clearly read off the uncertainty associated with our estimates from the distributions. The posterior predictive regression (gray) lines below, sampled from the posterior distribution of the estimates for each group, give a better picture of the model with respect to the data.

Posterior predictive fits of the hierarchical model
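The gray lines can be produced by evaluating one regression line per posterior draw of the intercept and slope of a school. The sketch below shows the idea for a single school; the draws are simulated with `random.gauss` purely as stand-ins (in practice they would come from the sampler's output), and the grid and interval bounds are my own choices.

```python
import random

random.seed(1)
# Hypothetical posterior draws of (intercept, slope) for one school
draws = [(random.gauss(45.0, 2.0), random.gauss(3.0, 0.5)) for _ in range(200)]

# Homework values at which every line is evaluated
x_grid = [0, 1, 2, 3, 4, 5, 6, 7]

# One regression line per posterior draw
lines = [[a + b * x for x in x_grid] for a, b in draws]

def band_at(position, lower=0.05, upper=0.95):
    """Empirical interval of the line heights at one grid position."""
    heights = sorted(line[position] for line in lines)
    return heights[int(lower * len(heights))], heights[int(upper * len(heights))]

for position in (0, len(x_grid) - 1):
    low, high = band_at(position)
    print(f"x={x_grid[position]}: {low:.1f} to {high:.1f}")
```

The spread of the lines at each homework value is a direct picture of the uncertainty in the fitted relationship for that school.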

Note the generally higher uncertainty around groups that show a negative slope. The model finds a compromise between sensitivity to noise at the group level and the global estimates at the student level (apparent in IDs 7472, 7930, 25456, 25642). This implies that we must be a little more wary of decisions derived from the model for these groups. We also observe that with more data and less deviation, the Bayesian model converges to the OLS model of the group (ID 62821), as expected. We can also check the student-level relationship by plotting the regression line from mu_a and mu_b (which I omit here).

Cross-validation against the other models can be used to assess the benefit of the hierarchical modeling approach. It can be performed at two levels:

  1. Hold out students within a group and evaluate the predictions for them.
  2. Hold out an entire group and evaluate the predictions for it. Note that this is not possible with separate per-school models, since no model exists for the held-out school.
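The two hold-out schemes above could be sketched as index splits like the following. The function names, the 25% hold-out fraction and the group sizes are assumptions made for this illustration; only the school IDs are borrowed from this post.

```python
import random

def within_group_holdout(groups, frac=0.25, seed=0):
    """Level 1: hold out a fraction of the rows inside every group."""
    rng = random.Random(seed)
    rows_by_group = {}
    for row, group in enumerate(groups):
        rows_by_group.setdefault(group, []).append(row)
    test = []
    for rows in rows_by_group.values():
        k = max(1, round(frac * len(rows)))   # at least one row per group
        test.extend(rng.sample(rows, k))
    held_out = set(test)
    train = [row for row in range(len(groups)) if row not in held_out]
    return train, sorted(test)

def leave_group_out(groups, held_out_group):
    """Level 2: hold out every row that belongs to one group."""
    train = [row for row, g in enumerate(groups) if g != held_out_group]
    test = [row for row, g in enumerate(groups) if g == held_out_group]
    return train, test

# School ID of each student (group sizes are made up)
groups = [7472] * 4 + [7930] * 8 + [62821] * 12

train_1, test_1 = within_group_holdout(groups)
train_2, test_2 = leave_group_out(groups, held_out_group=7930)
print(len(train_1), len(test_1), len(train_2), len(test_2))
```

In the second scheme, a hierarchical model can still predict for the unseen school by drawing its intercept and slope from the group-level distribution.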

I do not perform validation here, as the frequentist and Bayesian models used here don't make for a fair (or easy) comparison. But CV can be performed by replacing the OLS regression with Bayesian linear regression and comparing the Root Mean Squared Deviation (RMSD) of the models.
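For reference, the RMSD of a set of predictions is the square root of the mean squared difference between the observed and predicted values. A minimal helper might look like this (the function name and the example scores are mine):

```python
import math

def rmsd(observed, predicted):
    """Root mean squared deviation between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up math scores and predictions for held-out students
print(rmsd([52, 61, 47], [50, 64, 45]))
```

The model with the lower RMSD on the held-out students fits them better on average.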

Conclusion

Bayesian hierarchical modeling can produce robust models for naturally clustered data. It often allows us to build simple and interpretable models, as opposed to techniques like ensembling or neural networks that are commonly used for such complex data. Partial pooling also helps guard against overfitting despite the increase in the number of parameters in the model. This post is merely an introduction to hierarchical modeling; its inherent simplicity allows us to implement variations of the model specific to our data (e.g., adding sub-groups or using more group-level predictors) and to conduct different types of analysis (e.g., finding correlations among levels).

Resources

A. Gelman et al., Bayesian Data Analysis (2013), Chapter 5, CRC Press

Thank you for reading! I would appreciate any feedback on the post.

You can connect with me on Linkedin and follow me on GitHub.

Original article: https://towardsdatascience.com/introduction-to-hierarchical-modeling-a5c7b2ebb1ca
