TransWikia.com

How to Integrate/map Population Data with/over Sample Data

Data Science Asked by Asif Khawaja on July 18, 2021

Problem Definition:
Our organization conducts various surveys and a census in our country. The basic difference between the two is that a census targets the complete population, whereas a survey targets only a sample (subset) drawn from that population. The organization now wants to integrate the survey and census results, i.e. to map the survey data onto the census data. It currently uses some statistical approaches for this, but these techniques have many issues. I would like to solve this problem using state-of-the-art approaches from data science, machine learning, or deep learning.

Consider that the population (census) data set contains 220 million rows, whereas the sample (survey) data set contains approximately 40 million rows.

My question to data scientists with a sound background in statistics is: how can this be done?
I need step-by-step guidance to achieve this goal. Please recommend algorithms suited to this task, and suggest some resources for reading about and understanding the problem.

Example to Illustrate the Problem:

Let the census population be P, with attributes A.
Similarly, let the survey population (a sample taken from P) be S, with attributes B,
i.e. S⊆P and A⊆B.
For example:

Census Data (Table 1)

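| Block Code | HouseHoldID | PersonID | Sex | Marital Status | Age |
|---|---|---|---|---|---|
| 1 | 1 | 1 | Male | Married | 25 |
| 1 | 1 | 2 | Female | Un-Married | 30 |
| 1 | 1 | 3 | Male | Married | 22 |
| 1 | 2 | 4 | Male | Married | 40 |
| 1 | 2 | 5 | Male | Un-Married | 30 |
| 1 | 3 | 6 | Male | Un-Married | 17 |
| 2 | 4 | 7 | Female | Married | 50 |
| 3 | 5 | 8 | Female | Married | 52 |
| 3 | 5 | 9 | Female | Married | 45 |
| 4 | 6 | 10 | Female | Un-Married | 45 |
| 4 | 7 | 11 | Female | Un-Married | 42 |
| 5 | 8 | 12 | Male | Married | 36 |
| 5 | 9 | 13 | Female | Married | 33 |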

Survey Data (Table 2)

| Block Code | HouseHoldID | PersonID | Sex | Marital Status | Age | Employment Status | Education Level |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | Male | Married | 25 | Employed | Graduate |
| 1 | 2 | 4 | Male | Married | 40 | Employed | No Schooling |
| 1 | 2 | 5 | Male | Un-Married | 30 | Un-Employed | Primary |
| 3 | 3 | 6 | Male | Un-Married | 17 | Un-Employed | Middle |
| 3 | 4 | 7 | Female | Married | 50 | Employed | Middle |

As per our policy, we divide the country into provinces, provinces into districts, districts into Tehsils, and so on down to the lowest level, known as BLOCKS. Each block is composed of at least 500 households (families), and each family consists of members.
Table 1 above shows census data collected from 5 different blocks, whereas the survey data in Table 2 is collected from only two different blocks. Here you can see that the survey data is a subset of the census data. Similarly, many blocks covered in the census may not be visited in the survey.
Furthermore, the attributes in the census data are a subset of the attributes in the survey data (the survey covers fewer people but records more detail about each).
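To make the two subset relations concrete, here is a minimal sketch in plain Python that loads the toy rows from Table 1 and Table 2 and checks which attributes and blocks the survey is missing. The shortened column names and variable names are my own, chosen only for illustration.

```python
# Toy copies of Table 1 (census) and Table 2 (survey).
# Column names are shortened, hypothetical labels for the table headers.
CENSUS_COLUMNS = ("block", "household", "person", "sex", "marital", "age")
SURVEY_COLUMNS = CENSUS_COLUMNS + ("employment", "education")

census_rows = [
    (1, 1, 1, "Male", "Married", 25),
    (1, 1, 2, "Female", "Un-Married", 30),
    (1, 1, 3, "Male", "Married", 22),
    (1, 2, 4, "Male", "Married", 40),
    (1, 2, 5, "Male", "Un-Married", 30),
    (1, 3, 6, "Male", "Un-Married", 17),
    (2, 4, 7, "Female", "Married", 50),
    (3, 5, 8, "Female", "Married", 52),
    (3, 5, 9, "Female", "Married", 45),
    (4, 6, 10, "Female", "Un-Married", 45),
    (4, 7, 11, "Female", "Un-Married", 42),
    (5, 8, 12, "Male", "Married", 36),
    (5, 9, 13, "Female", "Married", 33),
]

survey_rows = [
    (1, 1, 1, "Male", "Married", 25, "Employed", "Graduate"),
    (1, 2, 4, "Male", "Married", 40, "Employed", "No Schooling"),
    (1, 2, 5, "Male", "Un-Married", 30, "Un-Employed", "Primary"),
    (3, 3, 6, "Male", "Un-Married", 17, "Un-Employed", "Middle"),
    (3, 4, 7, "Female", "Married", 50, "Employed", "Middle"),
]

# A ⊆ B: every census attribute also appears in the survey.
shared = [c for c in SURVEY_COLUMNS if c in CENSUS_COLUMNS]
survey_only = [c for c in SURVEY_COLUMNS if c not in CENSUS_COLUMNS]

# Blocks that appear in the census but were never visited by the survey.
census_blocks = {row[0] for row in census_rows}
survey_blocks = {row[0] for row in survey_rows}
unvisited_blocks = sorted(census_blocks - survey_blocks)

print("Shared attributes:", shared)
print("Survey-only attributes:", survey_only)
print("Blocks not visited in the survey:", unvisited_blocks)
```

The survey-only attributes are the ones that would have to be estimated for the blocks in `unvisited_blocks`.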

Now suppose I have to map the survey data onto the census data, e.g. I want to see the percentage of unemployed members in the block with code 2 (a block that was not visited in the survey). How would I do that?
We may need this kind of mapping at block level or at a higher level.
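To show the kind of output I am after, here is a rough sketch of one simple idea, a cell-based synthetic estimate: compute the unemployment rate in the survey for each (sex, marital status) cell, apply those rates to every census person, and average per block. This is only an illustration under strong assumptions (the shared attributes explain employment, and age is ignored), not the method our organization uses. The function names `cell_rates` and `estimate_block_rates` are hypothetical, and the toy tables are far too small for the numbers to mean anything.

```python
from collections import defaultdict

# Survey rows from Table 2, reduced to (sex, marital_status, employment_status).
survey = [
    ("Male", "Married", "Employed"),
    ("Male", "Married", "Employed"),
    ("Male", "Un-Married", "Un-Employed"),
    ("Male", "Un-Married", "Un-Employed"),
    ("Female", "Married", "Employed"),
]

# Census rows from Table 1, reduced to (block_code, sex, marital_status).
census = [
    (1, "Male", "Married"),
    (1, "Female", "Un-Married"),
    (1, "Male", "Married"),
    (1, "Male", "Married"),
    (1, "Male", "Un-Married"),
    (1, "Male", "Un-Married"),
    (2, "Female", "Married"),
    (3, "Female", "Married"),
    (3, "Female", "Married"),
    (4, "Female", "Un-Married"),
    (4, "Female", "Un-Married"),
    (5, "Male", "Married"),
    (5, "Female", "Married"),
]


def cell_rates(survey_rows):
    """Return per-cell unemployment rates and the overall survey rate."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [unemployed, total]
    for sex, marital, employment in survey_rows:
        cell = counts[(sex, marital)]
        cell[0] += employment == "Un-Employed"
        cell[1] += 1
    unemployed = sum(c[0] for c in counts.values())
    total = sum(c[1] for c in counts.values())
    rates = {cell: u / n for cell, (u, n) in counts.items()}
    return rates, unemployed / total


def estimate_block_rates(census_rows, rates, fallback):
    """Average the cell rate over all census persons in each block.

    Cells never seen in the survey fall back to the overall survey rate.
    """
    totals = defaultdict(lambda: [0.0, 0])  # block -> [sum of rates, persons]
    for block, sex, marital in census_rows:
        totals[block][0] += rates.get((sex, marital), fallback)
        totals[block][1] += 1
    return {block: s / n for block, (s, n) in totals.items()}


rates, overall = cell_rates(survey)
block_estimates = estimate_block_rates(census, rates, overall)
for block in sorted(block_estimates):
    print(f"Block {block}: estimated unemployed share = {block_estimates[block]:.0%}")
```

Because the estimate is computed per census person, the same per-person values could be averaged over districts or provinces instead of blocks. What I do not know is which models are appropriate in place of the simple cell rates at the scale of the real data.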
