尺度变换与方差计算

数据尺度变换下均值与方差计算结果差异。
统计
方差
作者

胡华平

发布于

2024年4月27日

尺度变换

考虑如下的数据尺度变换

  • 给定变量\(X_i\)以及权重\(f_i\)

  • 分别给出尺度因子\(w\)\(v\)(大于0),也即\(X^\ast_i= w*X_i\) ,以及\(f_i^\ast = v*f_i\)

现在考虑尺度变换前后,均值和方差计算的关系。

(1)变换前:

\[ \begin{aligned} \overline{X} &= \frac{\sum{(X_i\cdot f_i)}}{\sum{f_i}} \\ S^2_{X} &= \frac{\sum{\left((X_i - \overline{X})^2f_i\right)}}{(\sum{f_i}) - 1} \end{aligned} \]

(2)变换后:

\[ \begin{aligned} \overline{X}^{\ast} &= \frac{\sum{(X^{\ast}_i\cdot f^{\ast}_i)}}{\sum{f^{\ast}_i}} = \frac{\sum{(wX_i\cdot vf_i)}}{\sum{vf_i}} = w\cdot \frac{\sum{(X_i\cdot f_i)}}{\sum{f_i}} = w \overline{X}\\ S^2_{X^{\ast}} &= \frac{\sum{\left((X^{\ast}_i - \overline{X}^{\ast})^2f^{\ast}_i\right)}}{(\sum{f^{\ast}_i}) - 1} = \frac{\sum{\left((wX_i - w\overline{X})^2vf_i\right)}}{(\sum{vf_i}) - 1} \\ &= \frac{w^2v \cdot \sum{\left((X_i - \overline{X})^2f_i\right)}}{(\sum{vf_i}) - 1} \\ & = \frac{w^2 \cdot \sum{\left((X_i - \overline{X})^2f_i\right)}}{(\sum{f_i}) - 1/v} \quad \text{ (if } v \approx 1\text{ )} \\ & \approx w^2 \cdot S^2_X \end{aligned} \]

注记

尺度变换下,方差计算可能会有数值近似差异。理论上,方差公式有两种形式:

\[ \begin{aligned} S^2_{X} &= \frac{\sum{\left((X_i - \overline{X})^2f_i\right)}}{(\sum{f_i}) - 1} && \text{(calcuation)} \end{aligned} \tag{1}\]

\[ \begin{aligned} S^2_{X} &= \frac{\sum{\left((X_i - \overline{X})^2f_i\right)}}{\sum{f_i}} && \text{(theory)} \end{aligned} \tag{2}\]

  • 如果使用 式 2 ,那么因子变换的方差结果严格等于\(S^2_{X^{\ast}} = w^2 \cdot S^2_X\)

  • 如果使用 式 1 ,那么因子变换的方差结果可能会与上述情况非常不同\(S^2_{X^{\ast}} \approx w^2 \cdot S^2_X\)。这取决于\(\sum{f_i}\)\(1/v\)的大小关系。如果\(\sum{f_i} \ggg 1/v\),则理论方差和数值方差基本接近;否则,理论方差和数值方差的差异会比较明显。

R示例说明

数据准备

Show the code
library(tidyverse)
library(magrittr)
#library(openxlsx)
#library(here)
library(janitor)
library(Hmisc)
library(scales)
Show the code
w <- 10
v <- 1/10
set.seed(1254)
X0 <- sample(10:20, 5, replace = FALSE)
X1 <- w*X0

f0 <- c(30, 20, 10, 20, 20) 
f1 <- v*f0

df <- data.frame(
  X0 = X0, f0 = f0,
  X1 = X1, f1 = f1
)

手动计算结果

Show the code
# calculate step-by step
source(here::here("Rscript/calc-var.R"))

result_a <- calc_weighted_var(
  dt = df, x = "X0", w = "f0")

result_b <- calc_weighted_var(
  dt = df, x = "X1", w = "f1")
Show the code
bind_rows(
  result_a$tbl %>% 
    mutate(trans = "before",
           w = 1,
           v = 1) %>% select(trans, w, v, everything()),
  result_b$tbl %>% 
    mutate(trans = "after",
           w = w,
           v = v) %>% select(trans, w, v, everything())
  
) %>%
  knitr::kable(caption = "计算表", align = "c")
计算表
trans w v x f xf x_dm x_dm_sqr x_dm_sqr_f
before 1 1.0 10 30 300 -5.3 28.09 842.7
before 1 1.0 19 20 380 3.7 13.69 273.8
before 1 1.0 13 10 130 -2.3 5.29 52.9
before 1 1.0 20 20 400 4.7 22.09 441.8
before 1 1.0 16 20 320 0.7 0.49 9.8
before 1 1.0 合计 100 1530 1.5 69.65 1621.0
after 10 0.1 100 3 300 -53.0 2809.00 8427.0
after 10 0.1 190 2 380 37.0 1369.00 2738.0
after 10 0.1 130 1 130 -23.0 529.00 529.0
after 10 0.1 200 2 400 47.0 2209.00 4418.0
after 10 0.1 160 2 320 7.0 49.00 98.0
after 10 0.1 合计 10 1530 15.0 6965.00 16210.0

我们来验证两类方差公式的计算结果:

Show the code
var1 <- result_a$sum$x_dm_sqr_f /(result_a$sum$f - 1)
var1_alt <- result_a$sum$x_dm_sqr_f /(result_a$sum$f)
se1 <- sqrt(var1)

var2 <- result_b$sum$x_dm_sqr_f /(result_b$sum$f - 1)
var2_alt <- result_b$sum$x_dm_sqr_f /(result_b$sum$f)
se2 <- sqrt(var2)

(1)采用经典方差计算公式 (式 1

Show the code
list(w=w, var1 = var1, var2 = var2)
$w
[1] 10

$var1
[1] 16.37374

$var2
[1] 1801.111
Show the code
identical(var2, w^2*var1)
[1] FALSE
Show the code
number(var2, 0.000000001)
[1] "1 801.111111111"
Show the code
number(w^2*var1, 0.000000001)
[1] "1 637.373737374"

(2)采用其他理论方差计算公式 (式 2

Show the code
list(w=w, var1_alt = var1_alt, var2_alt = var2_alt)
$w
[1] 10

$var1_alt
[1] 16.21

$var2_alt
[1] 1621
Show the code
identical(var2_alt, w^2*var1_alt)
[1] TRUE
Show the code
number(var2_alt, 0.000000001)
[1] "1 621.000000000"
Show the code
number(w^2*var1_alt, 0.000000001)
[1] "1 621.000000000"

计算机结果

实际上,计算机程序默认采用经典方差计算公式 (式 1

Show the code
# ==== quick calculation : weighted====
avr1 <- weighted.mean(df$X0, df$f0)
avr2 <- weighted.mean(df$X1, df$f1)

v1 <- Hmisc::wtd.var(x = df$X0, weights = df$f0)
sd1 <- sqrt(v1)

v2 <- Hmisc::wtd.var(x = df$X1, weights = df$f1)
sd2 <- sqrt(v2)
Show the code
w
[1] 10
Show the code
identical(v2, w^2*v1)
[1] FALSE
Show the code
number(v2, 0.000000001)
[1] "1 801.111111111"
Show the code
number(w^2*v1, 0.000000001)
[1] "1 637.373737374"