# Chi-squared test, Poisson distribution, type I error overestimated - well-suited test for discrete distributions?

Cross Validated Asked by slava-kohut on December 25, 2020

UPDATE: I edited my original question to make it as clear as possible.

My goal is to find a reliable goodness-of-fit test for Poisson-distributed samples. There are a few discussions on this site about goodness-of-fit tests for discrete distributions such as the Poisson distribution (for example, here and here). I have created a simulation to understand what happens to the type I error rate of the chi-squared test. I am working with a sum of independent Poisson-distributed variables (which is itself Poisson-distributed, with a rate equal to the sum of the individual rates):

set.seed(123)

n <- 100000
alpha <- 0.05 # significance level
n_sim <- 1000 # number of simulations (the results below are for 1000 runs)
res_chi2 <- vector(mode = "list", length = n_sim)
res_ks <- vector(mode = "list", length = n_sim)

lambda_i <- 10^sample(-10:-2, 100, replace = TRUE) # 100 Poisson-distributed variables
total_lambda <- sum(lambda_i) # the random variable of interest is a sum of Poisson-distributed variables

for (i in seq_len(n_sim)) {
  set.seed(i)

  # observed frequencies
  # generate a sample by aggregating the event counts of the subsamples
  my_sample <- rowSums(sapply(lambda_i, function(x) rpois(n, x)))
  sample_freq <- table(my_sample)

  # expected frequencies, calculated using the probability mass function
  # of the aggregate Poisson distribution
  theor_freq <- dpois(as.numeric(names(sample_freq)), total_lambda) * n
  # add the expected count of all unobserved values to the last bin,
  # so that the expected frequencies sum to n (the sample size)
  theor_freq[length(theor_freq)] <- theor_freq[length(theor_freq)] + n - sum(theor_freq)

  # test statistic, the first formula at
  # https://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm
  test_statistic <- sum((theor_freq - sample_freq)^2 / theor_freq)
  # no estimated parameters, so df = number of categories - 1
  p_value <- 1 - pchisq(test_statistic, df = length(theor_freq) - 1)
  # TRUE if the null is not rejected
  res_chi2[[i]] <- p_value >= alpha
}

sum_passed_chi2 <- Reduce(`+`, res_chi2) # number of simulations in which the null was not rejected

# 1000 simulations
> 1000 - sum_passed_chi2
[1] 92
# the null was rejected 92 times


The estimated type I error rate of the chi-squared test is about 9% (92 rejections in 1000 simulations), well above the 5% significance level. Why is it inflated? Can I assume that a well-suited goodness-of-fit test will give a rejection rate of approximately 5% under the null? How do I implement or design a proper goodness-of-fit test to check whether a sample follows a Poisson distribution with known parameters?
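One direction to explore, offered as a sketch rather than a definitive answer: the chi-squared approximation is known to be unreliable when some bins have small expected counts, and in the simulation above the bins are also taken from the observed sample. The hypothetical helper below (`pois_gof_fixed_bins`, its `min_expected` threshold of 5 taken from the usual rule of thumb, and the illustrative `lambda = 0.12` are all assumptions, not part of the original code) fixes the bins from the null distribution alone and pools the upper tail into one bin.

```r
# Hypothetical helper: chi-squared goodness-of-fit test for Poisson(lambda)
# with bins fixed in advance from the null distribution, not from the data.
# Assumes lambda is small, so that only the upper tail needs pooling.
pois_gof_fixed_bins <- function(x, lambda, min_expected = 5) {
  n <- length(x)
  # Increase k while the pooled tail {k + 1, k + 2, ...} would still have an
  # expected count of at least min_expected; the final bin is then {>= k}.
  k <- 0
  while (n * ppois(k, lambda, lower.tail = FALSE) >= min_expected) {
    k <- k + 1
  }
  if (k == 0) stop("sample too small to form at least two bins")
  # Null probabilities for the bins 0, 1, ..., k - 1 and the pooled bin {>= k}
  probs <- c(dpois(0:(k - 1), lambda),
             ppois(k - 1, lambda, lower.tail = FALSE))
  expected <- n * probs
  if (any(expected < min_expected)) {
    warning("some lower bins have small expected counts; consider pooling them")
  }
  # Observed counts in the same bins; factor() keeps empty bins as zeros
  observed <- as.vector(table(factor(pmin(x, k), levels = 0:k)))
  statistic <- sum((observed - expected)^2 / expected)
  # (k + 1) bins and no estimated parameters, so df = k
  p_value <- pchisq(statistic, df = k, lower.tail = FALSE)
  list(statistic = statistic, df = k, p_value = p_value, expected = expected)
}

# Estimate the rejection rate under the null for an illustrative lambda
set.seed(1)
lambda <- 0.12
rejected <- replicate(
  500,
  pois_gof_fixed_bins(rpois(10000, lambda), lambda)$p_value < 0.05
)
mean(rejected)
```

If the inflated rejection rate comes from sparse, data-dependent tail bins, this estimate should land much closer to the nominal 5%; if it does not, the cause lies elsewhere.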

UPDATE 2: I also ran a simulation in which each sample is drawn directly from a single Poisson distribution, i.e.:

my_sample <- rpois(n, total_lambda)


In this case, the type I error rate is 8%.
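An alternative worth trying is a Monte Carlo p-value, which does not rely on the asymptotic chi-squared distribution of the test statistic. Below is a minimal sketch using base R's `chisq.test()` with `simulate.p.value = TRUE`. The helper name `pois_gof_mc`, the cutoff `k`, and the illustrative `lambda = 0.12` are assumptions made for this example, not part of the original code.

```r
# Hypothetical helper: goodness-of-fit test for Poisson(lambda) with a
# Monte Carlo p-value. Bins are 0, 1, ..., k - 1 and the pooled tail {>= k};
# k is fixed by the caller rather than chosen from the data.
pois_gof_mc <- function(x, lambda, k = 3, B = 2000) {
  # Null probabilities of the k + 1 bins (they sum to 1)
  probs <- c(dpois(0:(k - 1), lambda),
             ppois(k - 1, lambda, lower.tail = FALSE))
  # Observed counts in the same bins; factor() keeps empty bins as zeros
  observed <- table(factor(pmin(x, k), levels = 0:k))
  # The p-value is estimated from B samples simulated under the null
  chisq.test(observed, p = probs, simulate.p.value = TRUE, B = B)
}

set.seed(1)
x <- rpois(10000, 0.12)
fit <- pois_gof_mc(x, lambda = 0.12)
fit$p.value
```

The simulated p-value costs more to compute than the asymptotic one, so wrapping this in a loop over many simulated samples will be noticeably slower.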
