TransWikia.com

Which probability distribution will you use to model outliers?

Data Science Asked on January 13, 2021

I was asked this question in a recent interview for a Data Scientist position: which probability distribution would you use to model outliers?

I told him that outliers are like rare events, which can be modelled by a Poisson distribution. I’m pretty sure I’m wrong, and the interviewer seemed to think the same, but I don’t know the correct answer.

Please advise.

One Answer

I think the answer is the Gaussian distribution. This is a well-known approach in anomaly detection: you fit a Gaussian distribution to your feature (estimating its mean and variance from the data), and any sample whose probability density under the fitted distribution falls below a chosen threshold is labeled as an outlier.
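As a minimal sketch of that idea for a single feature, using only the Python standard library: the function names, the training values, and the threshold `epsilon=0.05` are made up for illustration, and in practice the threshold would be tuned on labeled or held-out data.

```python
from statistics import NormalDist


def fit_gaussian(training_values):
    """Estimate the mean and standard deviation of one feature."""
    return NormalDist.from_samples(training_values)


def flag_outliers(model, samples, epsilon):
    """Return True for each sample whose density is below epsilon."""
    return [model.pdf(x) < epsilon for x in samples]


# Hypothetical training data, assumed to contain mostly normal behaviour.
training = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 10.0, 9.9]
model = fit_gaussian(training)

# New observations to score against the fitted distribution.
new_samples = [10.05, 12.5]
flags = flag_outliers(model, new_samples, epsilon=0.05)

for x, is_outlier in zip(new_samples, flags):
    label = "outlier" if is_outlier else "normal"
    print(f"x={x}: density={model.pdf(x):.4g} -> {label}")
```

Note that the threshold is applied to a density, not a probability, so its useful range depends on the scale of the feature.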

Quoting from the paper Modeling Outlier Score Distributions:

Many existing unsupervised outlier detection algorithms calculate some kind of score per data object which serves as a measure of the degree of outlier. Scores are used in ranking data points such that the top n points are considered as outliers. For example, the statistical-based approach proposed in [4], uses a Gaussian mixture model to represent normal behavior and each datum is given a score on the basis of changes in the model. A high score indicates a high possibility of being an outlier.
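The "score, then take the top n" ranking described in the quote could be sketched as follows. This is not the method from the cited paper: as a simplifying assumption, a single Gaussian stands in for the mixture model, and the score is the negative log density of each point. The function names and data are hypothetical.

```python
import math
from statistics import NormalDist


def outlier_scores(values):
    """Score each value by its negative log density under a fitted Gaussian.

    Higher scores mean the value is less likely under the fitted model.
    """
    model = NormalDist.from_samples(values)
    mu, sigma = model.mean, model.stdev
    log_norm = math.log(sigma * math.sqrt(2 * math.pi))
    return [0.5 * ((x - mu) / sigma) ** 2 + log_norm for x in values]


def top_n_outliers(values, n):
    """Return the n values with the highest outlier scores, highest first."""
    scores = outlier_scores(values)
    ranked = sorted(zip(scores, values), key=lambda pair: pair[0], reverse=True)
    return [value for _, value in ranked[:n]]


# Hypothetical data with two obviously unusual readings.
data = [10.1, 9.8, 10.3, 9.9, 25.0, 10.0, 10.2, -3.0, 10.1]
print(top_n_outliers(data, 2))
```

Because the outliers themselves are included when fitting the mean and standard deviation, they inflate the estimated spread; more robust estimates would be needed on heavily contaminated data.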

Example of usage - 1.

Example of usage - 2

Correct answer by Shahriyar Mammadli on January 13, 2021
