Histograms are the best way to visualize the distribution of data points

 

Data distribution plots help visualize how quantitative data points are spread over the range of their values. The distribution of quantitative data can be shown in various ways, such as box plots, violin plots, histograms, and scatter plots with artificially introduced random deviations that depict the density of the data points. However, I think the relatively simple-looking histogram is the best way to visualize a data distribution, because it shows exactly how many data points are present in a given range of values.


Figure: Data distribution of MSRP of a category of car. (Data taken from Kaggle)

 

Box plots are perhaps the worst, because they only depict a few summary values: the minimum, maximum, 25th percentile, median, and 75th percentile. We can’t really know how the data points are distributed within these percentile ranges.
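To make this concrete, here is a minimal sketch using only the Python standard library. The two small data sets are made up for illustration (they are not the MSRP data from the figure), and `five_number_summary` is a helper name I introduce here. It assumes the "inclusive" quartile definition from `statistics.quantiles`; other quartile conventions may give slightly different numbers.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot draws."""
    # method="inclusive" treats the data as the whole population,
    # so the minimum and maximum are the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

# Hypothetical data: evenly spread values vs. values piled up at 10 and 30.
evenly_spread = [0, 5, 10, 15, 20, 25, 30, 35, 40]
clustered = [0, 10, 10, 10, 20, 30, 30, 30, 40]

print(five_number_summary(evenly_spread))
print(five_number_summary(clustered))
```

Both calls return the same five numbers, so the two box plots would look identical even though the points are arranged very differently.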

The violin plot fares a bit better in giving us an idea of how many data points are near a value. However, the smoothing of the width of the violin curve gives us a false impression that data is present where there isn’t any. In the figure we can see the same data points plotted in various forms. The violin plot’s smoothing effect causes the violin to never depict the absence of data in a range.
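The smoothing effect can be sketched in a few lines. Violin plots are usually drawn from a kernel density estimate, so the example below hand-rolls a Gaussian kernel density estimate in plain Python. The data, the bandwidth of 4, and the function name `gaussian_kde_at` are all assumptions chosen for illustration; plotting libraries pick their own bandwidth, and a narrower one would make the gap more visible, though the estimate never reaches exactly zero in theory.

```python
import math

def gaussian_kde_at(x, values, bandwidth):
    """Gaussian kernel density estimate of `values`, evaluated at point x."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

# Hypothetical data with an empty stretch between 3 and 20.
data = [1, 2, 3, 20, 21, 22]

# Density in the middle of the gap, where there are no data points at all.
gap_density = gaussian_kde_at(11.5, data, bandwidth=4)
print(gap_density > 0)  # the smoothed curve still has width here
```

The estimate is positive in the middle of the gap, which is why the violin keeps some width even where no points exist.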

The scatter plot with random deviations clearly shows that certain ranges contain no data points, but it fails to show us how many data points there are in a range of values.

The histogram solves all these problems. By adjusting the number of bins, we can see exactly how many data points fall in each range of values and thus visualize the exact data distribution.
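As a rough sketch of what a histogram computes, the function below counts points in equal-width bins using plain Python. The name `histogram_counts` and the sample data are made up for illustration; a plotting library would normally do this binning for you, possibly with different edge conventions.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    if width == 0:
        # All values are identical: put everything in the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value belongs to the last bin, not to a bin past the end.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical data with an empty stretch between 3 and 20.
data = [1, 2, 3, 20, 21, 22]

print(histogram_counts(data, 7))  # [3, 0, 0, 0, 0, 0, 3]
print(histogram_counts(data, 2))  # [3, 3]
```

With seven bins the empty range shows up as explicit zero counts, while two bins hide it. This is why the number of bins needs adjusting before reading the distribution off the plot.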
