Why do horizontal lines in plt.plot(feature, '.') mean that the data have been properly shuffled?

Data Science Asked on March 31, 2021

I am following a MOOC, and in a lecture about visualisation in exploratory data analysis the lecturer claims that, when plotting the row indexes against the feature values, horizontal lines at particular feature values mean that the data have been properly shuffled. I can’t see why.

  1. Shouldn’t an index have only one value in the feature axis?
  2. One horizontal line should mean that the feature values for all indexes have been uniformized, not randomized?

[Image: plot of row indexes against feature values]

By contrast, in the following lecture, the lecturer claims that the absence of vertical lines shows that the data have not been properly shuffled:

[Image: plot from the following lecture]

I think I get it: if the data had been shuffled, I would have seen clear lines. But how can I be sure there aren’t more classes hidden in these subs?

One Answer

  1. Shouldn't an index have only one value in the feature axis?

Yes, that's correct. In the example graph this is not visible because there are too many row indexes (50000). As a consequence it's impossible to distinguish a particular index from its neighbors, but if the X axis were stretched far enough one would see a single feature value for every index.
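
To make this concrete, here is a minimal sketch. The data are synthetic (a hypothetical stand-in for the lecture's feature, which I don't have), and the names `feature` and `plot_full_and_zoomed` are my own. It plots all 50000 rows next to a zoom on the first 50 indexes, where the single value per index becomes visible.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Assumption: synthetic feature with 50000 rows and values in [0, 1)
feature = rng.random(50_000)

# plt.plot(feature, '.') implicitly uses the row index 0..n-1 as the X axis,
# so every index gets exactly one point
x_used = np.arange(len(feature))


def plot_full_and_zoomed(values, zoom=50):
    """Plot every row, then only the first `zoom` rows, side by side."""
    fig, (ax_full, ax_zoom) = plt.subplots(1, 2, figsize=(10, 4))
    ax_full.plot(values, '.', markersize=1)
    ax_full.set_title(f"all {len(values)} rows")
    ax_zoom.plot(values[:zoom], '.')
    ax_zoom.set_title(f"first {zoom} rows")
    for ax in (ax_full, ax_zoom):
        ax.set_xlabel("row index")
        ax.set_ylabel("feature value")
    return fig
```

Calling `plot_full_and_zoomed(feature)` followed by `plt.show()` should display a dense cloud on the left and clearly separated points, one per index, on the right.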

  2. One horizontal line should mean that the feature values for all indexes have been uniformized, not randomized?

I think there could be two different confusions here:

  • A horizontal line means that a single feature value is distributed uniformly across the indexes, which is equivalent to saying that the indexes are random for this feature value. In other words, the chance that this feature value appears at a particular index is the same as at any other index. This is what the author means: the order (indexes) is random for any feature value.
  • The values of the feature have not been uniformized. This can be seen from the fact that, vertically, the density of the points differs between the middle (say 0.4-0.6) and the extremes (say 0-0.2 and 0.8-1). Of course this would be more visible with a standard histogram, which would show a kind of peak in the middle but with two high bars at the extremes for 0 and 1 (the continuous lines for these two feature values show that they appear much more frequently).
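
The first point can be illustrated with a small sketch on synthetic data (an assumption: I use a made-up feature taking the values 0, 0.1, ..., 1, not the lecture's data). The helper `index_span` is hypothetical; it measures how much of the index range is covered by the rows holding a given value. In shuffled data a value spreads over nearly the whole range, which is what draws a horizontal line; in data sorted by value it sits in one contiguous block.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)  # arbitrary seed
n_rows = 50_000

# Assumption: a discrete synthetic feature on the grid 0, 0.1, ..., 1.0
shuffled = rng.integers(0, 11, size=n_rows) / 10  # already in random order
ordered = np.sort(shuffled)                       # same values, not shuffled


def index_span(feature, value):
    """Fraction of the index range spanned by rows where feature == value."""
    idx = np.flatnonzero(feature == value)
    return (idx.max() - idx.min() + 1) / len(feature)


def plot_both(shuffled_values, ordered_values):
    """Shuffled data gives horizontal lines, sorted data gives a staircase."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
    axes[0].plot(shuffled_values, '.', markersize=1)
    axes[0].set_title("shuffled")
    axes[1].plot(ordered_values, '.', markersize=1)
    axes[1].set_title("sorted by value")
    for ax in axes:
        ax.set_xlabel("row index")
    axes[0].set_ylabel("feature value")
    return fig
```

With these data, `index_span(shuffled, 0.5)` is close to 1, whereas `index_span(ordered, 0.5)` is roughly 1/11, the share of rows holding that value.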

One may also note from this graph that the values seem to follow some kind of underlying discrete distribution: very clearly for the values 0 and 1, but also judging from the white horizontal lines, which show that some values rarely occur in the data.
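
A histogram or a table of value counts makes this easier to check than the index plot. The sketch below is only an illustration on synthetic data: I assume a feature with extra mass at exactly 0 and 1 and a bell-shaped middle rounded to two decimals, which only loosely imitates the graph in the question. The names `feature`, `levels`, `counts` and `plot_histogram` are my own.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)  # arbitrary seed
n_rows = 50_000

# Assumption: 10% of rows are exactly 0, 10% exactly 1, the rest are
# bell-shaped around 0.5 and rounded to two decimals (a discrete grid)
middle = np.clip(rng.normal(0.5, 0.15, size=n_rows), 0, 1).round(2)
kind = rng.choice(3, size=n_rows, p=[0.1, 0.1, 0.8])
feature = np.where(kind == 0, 0.0, np.where(kind == 1, 1.0, middle))

# Distinct values and how often each one occurs
levels, counts = np.unique(feature, return_counts=True)
top_two = levels[np.argsort(counts)[-2:]]  # the two most frequent values


def plot_histogram(values, bins=50):
    """Histogram of the feature: a peak in the middle, tall bars at 0 and 1."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(values, bins=bins)
    ax.set_xlabel("feature value")
    ax.set_ylabel("number of rows")
    return fig
```

Sorting `levels` by `counts` also shows which values are rare, which is what the white horizontal lines in the index plot correspond to.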

Correct answer by Erwan on March 31, 2021
