TransWikia.com

Appropriate statistical test

Data Science Asked on June 1, 2021

I am working on a project where I have Twitter user profiles and their tweets. The users are divided into two groups (g1 and g2) based on their number of followers. Each user in g1 was then matched with one user from g2 based on profile and activity, using nearest-neighbor matching (not propensity scores). Now I want to run some statistical tests, for example on how differently the sentiment of the tweets changes in the two groups before and after some event. So I took, say, the tweets posted within 7 days before and after a reference date and estimated the mean sentiment score of all tweets posted by each user in each group. The sample sizes of the two groups differ (even though the users were matched), since not everyone posted tweets within the date range. I want to do a t-test to see whether people in g1 have a larger positive change in sentiment than people in g2 after the reference date. I have the following questions:

  • I am doing an independent-samples t-test, treating each user in each group as independent. For each user in both groups, I take the difference between the before and after mean sentiment scores, then test whether the change in sentiment differs significantly between the groups. Is this appropriate, or should I do a matched-pairs test? I have gone through other posts here but did not find a definitive answer.
  • For users who did not have any tweets in the time range, is it okay to assign a difference in means of zero, or should I exclude them from the samples?
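
To make the two options in the first bullet concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the array names `change_g1` and `change_g2`, the "after minus before" sign convention, the simulated effect sizes, and the choice of Welch's unequal-variance test for the independent case. It only shows how each test would be set up with SciPy, not which one is right for the real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-user change in mean sentiment (assumed: after minus before).
# Index i in both arrays refers to the same matched pair (hypothetical layout).
n_pairs = 200
pair_effect = rng.normal(0.0, 0.3, n_pairs)  # component shared by matched users
change_g1 = 0.10 + pair_effect + rng.normal(0.0, 0.2, n_pairs)
change_g2 = 0.00 + pair_effect + rng.normal(0.0, 0.2, n_pairs)

# NaN marks a user with no tweets in the window (made-up 15% rate).
change_g1[rng.random(n_pairs) < 0.15] = np.nan
change_g2[rng.random(n_pairs) < 0.15] = np.nan

# Option A: independent samples (Welch's t-test), ignoring the matching.
g1 = change_g1[~np.isnan(change_g1)]
g2 = change_g2[~np.isnan(change_g2)]
welch = stats.ttest_ind(g1, g2, equal_var=False, alternative="greater")

# Option B: paired t-test, restricted to pairs where both users tweeted.
both = ~np.isnan(change_g1) & ~np.isnan(change_g2)
paired = stats.ttest_rel(change_g1[both], change_g2[both], alternative="greater")

print(f"Welch : n1={g1.size}, n2={g2.size}, t={welch.statistic:.3f}, p={welch.pvalue:.4f}")
print(f"Paired: n={both.sum()}, t={paired.statistic:.3f}, p={paired.pvalue:.4f}")
```

The paired version uses fewer users, because a pair is dropped whenever either member is missing, but it removes whatever variation the matched users share. The independent version keeps every user with data but treats matched users as unrelated.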

Thanks in advance. Cheers!

One Answer

Regarding your second question:

For users who did not have any tweets in the time range, is it okay to assign a difference in means of zero, or should I exclude them from the samples?

You essentially have missing data in this case. How you deal with it depends on the model you are using and whether it is robust to missing data. If the model can ignore values with $\mu = 0, \sigma = 0$, then try that. Otherwise, you might want to leave those users out, as you suggest, or perhaps impute them with their previous known values. If you are using something like an ARIMA model, for example, it keeps track of a moving average; in that case, zero values will have an undesired impact (assuming zeros are not common in general).
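
As a rough illustration of why zero-filling is not neutral, the sketch below compares excluding the missing users with assigning them a change of zero. The numbers are invented and `changes` is a hypothetical list of per-user differences in mean sentiment, with `None` standing for a user who posted nothing in the window.

```python
from statistics import mean, stdev

# Hypothetical per-user changes in mean sentiment; None = no tweets in window.
changes = [0.20, None, 0.10, 0.30, None, 0.25, 0.15, None]

# Option 1: exclude users with no tweets (treat them as missing data).
observed = [c for c in changes if c is not None]

# Option 2: assign zero, i.e. assert that silent users had "no change".
zero_filled = [0.0 if c is None else c for c in changes]

print(f"excluded   : n={len(observed)}, mean={mean(observed):.3f}, sd={stdev(observed):.3f}")
print(f"zero-filled: n={len(zero_filled)}, mean={mean(zero_filled):.3f}, sd={stdev(zero_filled):.3f}")
```

With these made-up values the zero-filled mean is pulled toward zero (about 0.125 versus 0.2), and the sample size grows without any new information being added. Whether that bias matters depends on how many users are missing and whether missingness differs between the groups.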


I'm not sure I understand what you are asking in your first question. What have you tried already? Have you got some results?

Answered by n1k31t4 on June 1, 2021
