TransWikia.com

How to find the 95% confidence interval when there are outliers?

Cross Validated Asked by Sadisha on December 8, 2021

I know how to find the 95% confidence interval for normally distributed data. But how do I find it when there are outliers?

Question: The health-care costs, in thousands of dollars, for 20 males aged 75 or over are shown in the data set below. Calculate the 95% confidence interval for the mean health-care cost for such individuals.

Data set = [8.5, 8.0, 16.0, 12.0, 2.5, 515.0, 5.0, 15.0, 13.0, 2.0, 950.0, 15.0, 9.0, 6.0, 12.0, 5.5, 19.5, 7.5, 37.5, 12.5]

2 Answers

The bootstrap might be one way to do this. In Python:

from sklearn.utils import resample
import numpy as np

x = np.array([8.5, 8.0, 16.0, 12.0, 2.5, 515.0, 5.0, 15.0, 13.0, 2.0, 950.0, 15.0, 9.0, 6.0, 12.0, 5.5, 19.5, 7.5, 37.5, 12.5])

# Mean of each of 10,000 resamples of x, drawn with replacement
xb = np.array([resample(x).mean() for _ in range(10000)])

# Percentile interval: 2.5% and 97.5% quantiles of the resampled means
low, high = np.quantile(xb, [0.025, 0.975])

This yields a 95% bootstrap CI of about (9.95, 200.72). The exact endpoints vary slightly from run to run because no random seed is set.

However, I think there is something driving the higher costs. Because your data are from older patients, I imagine some patients have more co-morbidities than others, which may lead to complications and hence higher costs. In the absence of additional information, or strong assumptions about the data-generating process, I think this is going to be the best you can do.
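For readers without scikit-learn, the same percentile bootstrap can be sketched with the Python standard library alone. This is only an illustration: the helper name `percentile_bootstrap_ci`, the seed, and the simple order-statistic quantile rule are assumptions made here, so the endpoints will differ slightly from the interval quoted above.

```python
import random
import statistics

costs = [8.5, 8.0, 16.0, 12.0, 2.5, 515.0, 5.0, 15.0, 13.0, 2.0,
         950.0, 15.0, 9.0, 6.0, 12.0, 5.5, 19.5, 7.5, 37.5, 12.5]


def percentile_bootstrap_ci(data, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (hypothetical helper)."""
    rng = random.Random(seed)  # arbitrary seed, only for reproducibility
    n = len(data)
    # Mean of each resample of size n, drawn with replacement.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot)
    )
    # Plain order statistics; np.quantile interpolates between them instead.
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    return means[lo_idx], means[hi_idx]


low, high = percentile_bootstrap_ci(costs)
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```

The lower endpoint sits near 10 because roughly 12% of resamples (0.9 to the 20th power) contain neither of the two large values, so the lowest 2.5% of resampled means come entirely from those resamples.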

Answered by Demetri Pananos on December 8, 2021

Quick preliminary results from R:

x=c(8.5, 8.0, 16.0, 12.0, 2.5, 515.0, 5.0, 15.0, 13.0, 2.0, 
    950.0, 15.0, 9.0, 6.0, 12.0, 5.5, 19.5, 7.5, 37.5, 12.5)

summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   7.125  12.000  83.575  15.250 950.000 

(1) t interval for the mean. This assumes normal data, which seems a bad assumption here. Note that the lower limit is negative, which is impossible for a cost.

t.test(x)
...
95 percent confidence interval:
  -25.47337 192.62337
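For comparison, the t interval is the sample mean plus or minus t times s/sqrt(n). A minimal Python sketch of that arithmetic follows; the critical value for 19 degrees of freedom is hard-coded from standard t tables as an assumption, since the standard library has no t quantile function.

```python
import math
import statistics

costs = [8.5, 8.0, 16.0, 12.0, 2.5, 515.0, 5.0, 15.0, 13.0, 2.0,
         950.0, 15.0, 9.0, 6.0, 12.0, 5.5, 19.5, 7.5, 37.5, 12.5]

n = len(costs)
mean = statistics.fmean(costs)
# Standard error of the mean; statistics.stdev uses the n - 1 denominator.
se = statistics.stdev(costs) / math.sqrt(n)

# Assumption: 0.975 quantile of Student's t with n - 1 = 19 degrees of
# freedom, taken from standard tables rather than computed here.
T_CRIT_19 = 2.093024

low, high = mean - T_CRIT_19 * se, mean + T_CRIT_19 * se
print(f"mean = {mean:.3f}, 95% t interval = ({low:.3f}, {high:.3f})")
```

The endpoints should agree with the t.test output above up to rounding.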

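(2) Nonparametric Wilcoxon CI for the population median. Strictly, wilcox.test reports an interval for the pseudomedian, which equals the median only if the distribution is symmetric. It may also be slightly inaccurate because of ties in your data.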

wilcox.test(x, conf.int=T)
...
95 percent confidence interval:
  8.500058 21.750047

(3) 95% nonparametric bootstrap quantile CI for the population mean: approximately $(10, 200).$

set.seed(2020)
a.re=replicate(10^4, mean(sample(x, rep=T)))
quantile(a.re, c(.025,.975))
    2.5%    97.5% 
  9.9750 200.5269 

Leave comments/questions as appropriate. More later.

Answered by BruceET on December 8, 2021
