# Convert a pdf into a conditional pdf such that mean increases and std dev falls

Data Science Asked by claudius on October 30, 2020

Let the success metric (for a business use case I am working on) be a continuous random variable S.
The mean of the PDF of S indicates the chance of success: the higher the mean, the greater the chance of success. The standard deviation of the PDF of S indicates risk: the lower the standard deviation, the lower the risk of failure.

I have data, call it X, that affects S. Let X also be modelled as a set of random variables.

P(S|X) changes with X.
The problem: I want to pick values of X such that P(S|X) has a higher mean and a lower standard deviation than P(S).

To illustrate, I have taken a one-dimensional X.

[Figure: scatter plot of X (horizontal) against Y (vertical)]

P(S|X) changes for different values of X. For X between 4500 and 10225, the mean of S is 3.889 and the standard deviation is 0.041, compared with a mean of 3.7 and a standard deviation of 0.112 when X is unconstrained.

Given S and a set of Xs, I want to pick a range of X values such that the resulting distribution P(S|X) has a higher mean and a lower standard deviation. Is there a standard technique for this?

I also don't want to condition on X so tightly that too few samples remain to generalise. I want to avoid cases like the leftmost side of the tree, where the number of samples is 1.

Just run an optimizer to search for the range of X that satisfies the criteria you're looking for. Here's a simple demo in R:

    set.seed(123)
    mu_x_true <- 1e4
    mu_y_true <- 3.75
    n <- 1e2

    x <- rpois(n, mu_x_true)
    y <- rnorm(n, sqrt(mu_y_true))^2

    plot(x, y)

    # conditions:
    #   E[Y|X] > E[Y]
    #   sd(Y|X) < sd(Y)

    mu_y_emp <- mean(y)
    sd_y_emp <- sd(y)

    # par = c(lower, upper): bounds of the interval on x
    objective <- function(par, alpha = 0.5) {
      if (par[1] > par[2]) par <- rev(par)
      ix <- which((par[1] < x) & (x < par[2]))
      k <- length(ix)
      if (k < 2) return(1e12)  # sd() needs at least two points
      mu_yx <- mean(y[ix])
      sd_yx <- sd(y[ix])

      # smaller is better: higher conditional mean, lower conditional sd
      alpha * (mu_y_emp - mu_yx) + (1 - alpha) * (sd_yx - sd_y_emp)
    }

    init <- mean(x) + c(-sd(x), sd(x))
    test <- optim(par = init, fn = objective)

    # select the points inside the optimized interval
    bounds <- sort(test$par)
    ix <- which((bounds[1] < x) & (x < bounds[2]))

    mean(y[ix]) > mean(y)
    # TRUE

    sd(y) > sd(y[ix])
    # TRUE
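
The demo above does not enforce the minimum sample size the question asks for, and its objective is piecewise constant in the bounds, so a general-purpose optimizer may stall on flat regions. As a rough sketch of one alternative (not part of the original answer), the search can be run exhaustively over quantile-based cut points, discarding any interval with fewer than `min_n` points. The function name `find_interval`, its parameters, and the synthetic data are all assumptions made for illustration.

```r
# Hypothetical helper: exhaustive search over interval bounds taken from
# quantiles of x. Intervals with fewer than min_n points are skipped.
find_interval <- function(x, y, min_n = 30, alpha = 0.5, n_cuts = 21) {
  min_n <- max(min_n, 2)  # sd() needs at least two points
  cuts <- unique(quantile(x, probs = seq(0, 1, length.out = n_cuts),
                          names = FALSE))
  mu_all <- mean(y)
  sd_all <- sd(y)
  best <- NULL
  for (i in seq_len(length(cuts) - 1)) {
    for (j in (i + 1):length(cuts)) {
      ix <- x >= cuts[i] & x <= cuts[j]
      if (sum(ix) < min_n) next                        # too few samples
      mu_sub <- mean(y[ix])
      sd_sub <- sd(y[ix])
      if (mu_sub <= mu_all || sd_sub >= sd_all) next   # both conditions required
      # same weighted score as the demo objective: smaller is better
      score <- alpha * (mu_all - mu_sub) + (1 - alpha) * (sd_sub - sd_all)
      if (is.null(best) || score < best$score) {
        best <- list(lower = cuts[i], upper = cuts[j], n = sum(ix),
                     mean = mu_sub, sd = sd_sub, score = score)
      }
    }
  }
  best  # NULL if no interval satisfies the constraints
}

# Assumed synthetic data: y is higher and less noisy for larger x
set.seed(1)
x <- runif(500, 0, 100)
y <- 3.5 + 0.005 * x + rnorm(500, sd = 0.2 - 0.0015 * x)

res <- find_interval(x, y, min_n = 30)
str(res)
```

The grid is coarse by design: with `n_cuts = 21` there are at most 210 candidate intervals, so the search is cheap, and raising `min_n` trades a potentially better score for a subset that is more likely to generalise. With several features, the same loop would have to run per feature or be replaced by a tree-based method.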


Answered by David Marx on October 30, 2020
