Day 14: Getting started with Statistics

If you're doing statistics on vast swathes of data, you could use PDL!

Santa's Naughty and Nice list has over a billion names and the Elf Data Analytics section of the workshop produces a display of trends for the January retrospective. If there is an increase in naughtiness, is it because nice children are starting to forget their manners or are naughty children using rude words, taking up smoking and creating merge conflicts?

You would think this discussion goes on using the social media tailor-made for cold weather climates ... Mastodon!

The Basics

PDL gives you basic statistics out of the box

Yes, you'll want the mean or the average and the median and ... yada, yada, yada. Yes, there are a lot of aliases available as well as slightly different versions of a function so that you get exactly what you want. Assume that PDL has the lot. The fun is in trying to find just where it is in the documentation. Oh, Santa! For Christmas, I want a customizable cheat-sheet for stats functions with only the aliases that I use hyperlinked to the documentation! I've been a good little Elf!

Usually you want more than a single statistic. More useful are the functions that return a bunch of stats together. minmax will give you both the minimum and maximum values of the ndarray. To get a measure of the distribution of your data, you want the stats function, which gives you the mean, the population RMS deviation, the median, minimum, maximum, average absolute deviation and the RMS deviation (the square root of the variance), all in one call.
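Here is a minimal sketch of those two calls as a standalone script rather than a pdl shell session. The eight sample values are made up for illustration and the variable names are my own; the order of the seven return values is the one listed above.

```perl
use strict;
use warnings;
use PDL;

# Hypothetical sample data, chosen so the numbers come out round
my $data = pdl(2, 4, 4, 4, 5, 5, 7, 9);

# minmax returns the smallest and largest value in one call
my ($lo, $hi) = $data->minmax;
print "min=$lo max=$hi\n";

# stats returns seven values, in this order
my ($mean, $prms, $median, $min, $max, $adev, $rms) = $data->stats;
print "mean=$mean median=$median adev=$adev\n";
print "prms=$prms (N-1 divisor), rms=$rms (N divisor)\n";
```

For this data the mean is 5 and the RMS deviation is exactly 2, which makes it easy to check the other values by hand.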

Standard deviation

On the subject of the difference between the RMS deviation (which divides by N) and the population RMS deviation (which divides by N-1): according to the authors of Numerical Recipes, if the difference between N and N-1 ever matters to you in your variance calculation, then you are probably up to no good (commentary on Equation 14.1.2).
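To see how quickly that difference fades, here is a small sketch. Since the two deviations differ only in the divisor, their ratio should be sqrt(N/(N-1)). The `%ratio` hash and the choice of `sequence` as test data are just for illustration.

```perl
use strict;
use warnings;
use PDL;

my %ratio;
for my $n (5, 1_000_000) {
    my $x = sequence($n);    # 0, 1, ..., N-1
    # stats returns (mean, prms, median, min, max, adev, rms)
    my (undef, $prms, undef, undef, undef, undef, $rms) = $x->stats;
    $ratio{$n} = $prms / $rms;    # should equal sqrt(N / (N-1))
    print "N=$n: prms/rms = $ratio{$n}\n";
}
```

With five values the two deviations differ by about 12%; with a million values you would struggle to spot the difference, never mind a billion names.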

X-over

But what if you want to compare the rows of your ndarray to each other? Do you have to split the data up or do fancy indexing tricks?

No - you want the various over functions. For instance, medover takes the median along the first dimension. You've got prodover for products, sumover for sums ... and 34 others in PDL::Ufunc. If the first dimension is not the one you're after, look to the xchg function to swap in the dimension you want.

Wait - which one is the first dimension again? Let's see an example using statsover on a 5x3 ndarray and we'll get either 3 values or 5 values. The averages across the rows are [3 3 3] and down the columns are [1 2 3 4 5].

pdl> $m = xvals(5, 3) + 1
pdl> p $m
[
 [1 2 3 4 5]
 [1 2 3 4 5]
 [1 2 3 4 5]
]
pdl> p statsover $m
[3 3 3] [ 1.58 1.58 1.58] [3 3 3] ...

Right, so X-over works on the rows, then.
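And if it's the columns you are after, a sketch of the xchg trick mentioned earlier might look like this. It rebuilds the same 5x3 ndarray in a standalone script; the variable names are my own.

```perl
use strict;
use warnings;
use PDL;

my $m = xvals(5, 3) + 1;    # the same 5x3 ndarray as in the session above

# average reduces the first dimension: one value per row
my $across = $m->average;
print "rows:    $across\n";     # [3 3 3]

# swap dimensions 0 and 1 first, and the same call gives one value per column
my $down = $m->xchg(0, 1)->average;
print "columns: $down\n";       # [1 2 3 4 5]

# the other over functions follow the same pattern
my $row_sums    = $m->sumover;
my $col_medians = $m->xchg(0, 1)->medover;
print "row sums: $row_sums, column medians: $col_medians\n";
```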

But what if some data is better than others? We can give "weights" to values according to how much more significance they should have. Make an ndarray of ones and then zero the first two columns. That should change the average to (3+4+5)/3 = 4.

pdl> $w = $m->ones->copy
pdl> $w->where($m < 3) .= 0
pdl> p $w
[
 [0 0 1 1 1]
 [0 0 1 1 1]
 [0 0 1 1 1]
]
pdl> p stats $m, $w
4 0.866 3 1 5 0.666 0.816
pdl> p statsover $m, $w
[4 4 4] [1 1 1] [3 3 3] [1 1 1]  ....

A lot of the statistical functions will take an ndarray of weights, the same size as the data of course.
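If you want to convince yourself of what the weights are doing, the weighted mean can be checked by hand as sum(w * x) / sum(w). This is only a sketch of that arithmetic next to the stats call from the session above, not a description of how PDL computes it internally.

```perl
use strict;
use warnings;
use PDL;

my $m = xvals(5, 3) + 1;
my $w = $m->ones->copy;
$w->where($m < 3) .= 0;    # zero weight for the values 1 and 2

# Weighted mean over the whole ndarray, worked out by hand
my $by_hand = ($m * $w)->sum / $w->sum;

# The first value returned by stats is the (weighted) mean
my ($mean) = $m->stats($w);
print "by hand: $by_hand, stats: $mean\n";

# The same arithmetic row by row, to compare with statsover
my $rows_by_hand = ($m * $w)->sumover / $w->sumover;
print "rows by hand: $rows_by_hand\n";
```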

For more detailed statistics ... PDL::Stats

The above will give most people what they want, but sometimes you need more detail. For that we can use the PDL::Stats module. It gives you both biased and unbiased versions of variance and standard deviation, and it adds skew and kurtosis, which describe the shape of the distribution.

It also calculates the standard error of the mean, the sum of squared deviations from the mean, and the covariance.
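As a rough sketch, assuming PDL::Stats is installed, those measures are available as methods from PDL::Stats::Basic. The sample data is hypothetical (the same eight values with a mean of 5), and the variable names are mine.

```perl
use strict;
use warnings;
use PDL;
use PDL::Stats::Basic;

my $data = pdl(2, 4, 4, 4, 5, 5, 7, 9);    # hypothetical sample

my $var_biased   = $data->var;             # divides by N
my $var_unbiased = $data->var_unbiased;    # divides by N-1
my $sd_biased    = $data->stdv;
my $sd_unbiased  = $data->stdv_unbiased;
my $skew         = $data->skew;            # asymmetry of the distribution
my $kurt         = $data->kurt;            # heaviness of the tails
my $se           = $data->se;              # standard error of the mean
my $ss           = $data->ss;              # sum of squared deviations from the mean

print "variance: $var_biased (biased), $var_unbiased (unbiased)\n";
print "std dev:  $sd_biased (biased), $sd_unbiased (unbiased)\n";
print "skew=$skew kurt=$kurt se=$se ss=$ss\n";
```

The long tail towards 9 in this sample should show up as a positive skew.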

Can it answer the questions today's children are concerned with, like, does being really good translate into even more presents? For that you would need the Pearson correlation coefficient which measures the linear correlation between 2 sets of data.
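A minimal sketch of that question, with entirely made-up scores for six children, could use the corr method from PDL::Stats::Basic. The variable names and data are hypothetical.

```perl
use strict;
use warnings;
use PDL;
use PDL::Stats::Basic;

# Hypothetical goodness scores and present counts for six children
my $goodness = pdl(1, 2, 3, 4, 5, 6);
my $presents = pdl(2, 2, 4, 5, 5, 8);

my $r   = $goodness->corr($presents);    # Pearson correlation coefficient
my $cov = $goodness->cov($presents);     # covariance of the two sets
print "r = $r, covariance = $cov\n";
```

A value of r close to 1 says the relationship is close to a rising straight line. It says nothing about whether being good causes the presents.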

Are children from one country much nicer than another?

Student's t-test checks whether the difference between the means of two groups is statistically significant, that is, unlikely to arise if the null hypothesis of no real difference were true. It's not concrete proof of the answer, but it will give you a measure of how confident Santa can be in the elves' geographical allocation of oranges for stockings.
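Here is a hedged sketch with invented niceness scores for five children in each of two countries. It uses t_test from PDL::Stats::Basic, which assumes the two groups have equal variance, and it stops at the t statistic and degrees of freedom; turning those into a p-value needs a t-distribution function, which is not shown here.

```perl
use strict;
use warnings;
use PDL;
use PDL::Stats::Basic;

# Hypothetical niceness scores, five children per country
my $country_a = pdl(7, 8, 6, 9, 7);
my $country_b = pdl(5, 6, 5, 7, 4);

# Independent-samples t-test, assuming equal variances
my ($t, $df) = $country_a->t_test($country_b);
print "t = $t with $df degrees of freedom\n";
```

If you cannot assume equal variances, PDL::Stats::Basic also documents a t_test_nev variant.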

In the absence of a continuous measure of goodness, the binomial test is a one-tailed significance test for a two-outcome distribution, and it should be used for categorical data such as Naughty and Nice.
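To show what a one-tailed binomial test is adding up, here is a sketch in plain Perl. It does not call PDL::Stats at all; `binomial_upper_tail` is a hypothetical helper written for this illustration, and it would be far too slow for a billion names.

```perl
use strict;
use warnings;

# Probability of $k or more successes in $n trials,
# when each trial succeeds with probability $p
sub binomial_upper_tail {
    my ($k, $n, $p) = @_;
    my $tail = 0;
    for my $i ($k .. $n) {
        # n choose i, built up as a product to avoid huge factorials
        my $choose = 1;
        $choose *= ($n - $i + $_) / $_ for 1 .. $i;
        $tail += $choose * $p**$i * (1 - $p)**($n - $i);
    }
    return $tail;
}

# Hypothetical question: if Naughty and Nice were equally likely,
# how surprising is it to find 8 or more Nice children out of 10?
my $p_value = binomial_upper_tail(8, 10, 0.5);
printf "P(8 or more Nice out of 10) = %.4f\n", $p_value;
```

That works out to 56/1024, a little over 5%, so 8 out of 10 is suggestive but not overwhelming.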

That should give you enough pointers into the documentation to get you started on your Data Elf journey. But before publishing your results on this generation of children, have a read through Statistics Done Wrong and make sure that you have enough data so that your analysis isn't underpowered.

Remember this season to be significantly nice!

By en:User:Qwfp (original); Pbroks13 (redraw) - Fisher iris versicolor sepalwidth.png, CC BY-SA 3.0

Tagged in : statistics, correlation, significance

Boyd Duffee

Boyd has wanted to learn PDL for many years and realizing that dream is bringing him joy. He has done mad things to Complex Networks with NLP and is moving on to DSP and Time Series Analysis. He's interested in Data Science, Complex Networks and walks in the woods.