TransWikia.com

How to choose resample size when drawing without replacement?

Cross Validated Asked on February 17, 2021

Say I have some second-order statistic $m(x)$, where $x$ is a data vector of length $n$. Let's also assume that the limiting distribution of $x$ is gaussian-ish but generally unknown, so that the assumptions that enable one to derive the usual error analysis expressions do not hold. In this case, if I want an estimate of the uncertainty in the measurement of $m$, I will have to simulate it, using a bootstrap or something similar. So, I generate 1,000 unique realizations of $x$, denoted $x_i$ ($1 \leq i \leq 1000$), and use the distribution of all $m(x_i)$ to get an idea of the error in $m$.

Now, since $m$ is second-order, it is preferable to draw without replacement when generating the 1,000 resamples of $x$. This is all fine, and the bootstrapping routines I’ve implemented work well. Here is my problem:

  • I have to choose a size of the resample
  • If I choose that size to be $n$, then all $x_i$ will be identical since I’m sampling without replacement
  • So, the resample size must obviously be smaller than the size of $x$

Problem is, if I choose the size of each $x_i$ to be high, say $0.9n$, then my error is going to be very small. If I choose the size to be small, $0.1n$, then the error can blow up. So, I can effectively make the error in $m$ whatever I’d like, which obviously isn’t right…
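The dependence on resample size described above can be reproduced with a small sketch. Everything here is an assumption made for illustration: the data are synthetic Gaussian draws, the sample variance stands in for the second-order statistic $m$, and `subsample_spread` is a hypothetical helper name, not part of my actual routines.

```python
import random
import statistics


def subsample_spread(x, k, m, n_resamples=1000, seed=0):
    """Standard deviation of m over n_resamples size-k draws without replacement."""
    rng = random.Random(seed)
    # random.Random.sample draws k distinct elements, i.e. without replacement
    values = [m(rng.sample(x, k)) for _ in range(n_resamples)]
    return statistics.stdev(values)


# Synthetic stand-in for the data vector x (assumption: n = 100, standard normal)
data_rng = random.Random(42)
x = [data_rng.gauss(0.0, 1.0) for _ in range(100)]
n = len(x)

# Spread of m(x_i) for resample sizes 0.1n, 0.5n and 0.9n
spreads = {
    frac: subsample_spread(x, round(frac * n), statistics.variance)
    for frac in (0.1, 0.5, 0.9)
}
for frac, spread in spreads.items():
    print(f"k = {frac:.1f} n -> spread of m = {spread:.4f}")
```

On this toy data the spread shrinks steadily as $k$ approaches $n$, which is exactly the behaviour that makes the choice of resample size look arbitrary.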

What do I do at this point, while maintaining integrity?!

One Answer

I'm posting my own answer here because I think it is probably correct, but please feel free to contribute ideas.

@MossMurderer brought an interesting result to my attention in the comments on the original question: $\binom{n}{k}$ is maximized at $k = n/2$ (for odd $n$, at the two integers on either side of $n/2$). This means that if I choose my resample size $k$ to be $n/2$, I will be maximizing the number of possible unique draws $x_i$.
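As a quick sanity check of that claim, here is a minimal sketch that counts the possible subsets for every $k$ using the standard library's `math.comb`. The value $n = 20$ is an arbitrary choice for illustration.

```python
from math import comb

n = 20  # arbitrary illustrative data length

# Number of distinct size-k subsets of n items, for every k from 0 to n
counts = [comb(n, k) for k in range(n + 1)]

# The k with the most possible distinct draws
k_best = max(range(n + 1), key=counts.__getitem__)
print(k_best, counts[k_best])  # prints: 10 184756
```

For odd $n$ the two middle values of $k$ tie, so either one gives the largest number of possible draws.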

This is desirable for a very simple reason: to perform the bootstrap, I compute the statistic $m$ for each realization $x_i$. Let's call the results of that calculation a vector $M = [m(x_1), \ldots, m(x_{1000})]$. The estimated error in the statistic $m$ is then something like $\sigma(M)$. Of course, the standard deviation $\sigma$ is only a safe summary if you can assume that your limiting distribution is gaussian.

If we ensure that we maximize the number of unique $x_i$ by setting $k = n/2$, then we have sampled the $M$ space as well as we possibly can, and therefore $M$ will be as close to gaussian as is possible. In this way, the measure $\sigma(M)$ will be most reliable when the length $k$ of each $x_i$ is $n/2$.
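A rough sketch of the procedure I have in mind is below. It is only an illustration under stated assumptions: `resample_error` is a hypothetical name, the data are a tiny synthetic vector ($n = 12$, chosen so that the number of possible subsets is small enough to see repeats), and the sample variance stands in for $m$. It returns $\sigma(M)$ together with the number of distinct resamples actually drawn.

```python
import random
import statistics
from math import comb


def resample_error(x, m, k=None, n_resamples=1000, seed=0):
    """Return (sigma(M), number of distinct resamples) for size-k draws
    without replacement; k defaults to n // 2."""
    rng = random.Random(seed)
    n = len(x)
    k = n // 2 if k is None else k
    seen = set()  # index sets drawn so far, used to count unique resamples
    M = []
    for _ in range(n_resamples):
        idx = rng.sample(range(n), k)  # k distinct indices
        seen.add(frozenset(idx))
        M.append(m([x[i] for i in idx]))
    return statistics.stdev(M), len(seen)


# Tiny synthetic data vector (assumption, for illustration only)
data_rng = random.Random(1)
x = [data_rng.gauss(0.0, 1.0) for _ in range(12)]

sigma_half, unique_half = resample_error(x, statistics.variance)
sigma_big, unique_big = resample_error(x, statistics.variance, k=11)

print(f"k = 6:  sigma(M) = {sigma_half:.4f}, distinct resamples = {unique_half}")
print(f"k = 11: sigma(M) = {sigma_big:.4f}, distinct resamples = {unique_big}")
```

With $k = 11$ there are only $\binom{12}{11} = 12$ possible resamples, so the 1,000 draws are mostly repeats, whereas $k = 6$ allows up to $\binom{12}{6} = 924$ distinct ones. The sketch only shows the counting argument; it does not by itself establish that $\sigma(M)$ at $k = n/2$ is a calibrated error estimate.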

I apologize if this is difficult to read - please edit if you feel you are more eloquent than I.

Answered by Anonymous on February 17, 2021
