History
This will be based on the book The Lady Tasting Tea by David Salsburg
A lot of this book is rich with early-20th-century history, way too much for me to take in as a beginner. So for now (I know I’m saying that a lot) I’ll gloss over it and just note the main points I come across. I’m still in information-gathering mode, so I won’t stress over not tying this together yet.
Regression
While studying the difference between father and son heights, Francis Galton observed that extreme values tend to regress to the mean, aka ‘central tendency’. A less confusing way of explaining this: it’s unrealistic for each generation to keep getting significantly taller or shorter than the last. If man 1 is 5’10, and his son man 2 is 6’6, it’s unlikely that man 3 will be over 6’6 or significantly under 5’10.
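This is easy to see in a toy simulation. The numbers below (mean height, spread, and the father-son correlation) are made up for illustration, not Galton's actual data: sons of unusually tall fathers still come out above average, but closer to the mean than their fathers.

```python
import random

random.seed(0)

MEAN, SD = 70, 3  # assumed population mean height (inches) and spread

# Model a son's height as pulled partly toward his father's height and
# partly back toward the population mean (correlation r), plus noise.
def son_height(father, r=0.5):
    noise_sd = SD * (1 - r ** 2) ** 0.5
    return MEAN + r * (father - MEAN) + random.gauss(0, noise_sd)

father = MEAN + 2 * SD                       # a father 2 SDs above average (76")
sons = [son_height(father) for _ in range(10_000)]
avg_son = sum(sons) / len(sons)

# Sons are still above average, but less extreme than their father.
print(f"father: {father:.1f}  average son: {avg_son:.1f}")
```

With r = 0.5, the sons of 76-inch fathers average out around 73 inches: above the mean of 70, but regressed halfway back toward it.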
Here’s an interesting video of this concept using a device created by Galton specifically for this demonstration.
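The device in that video (a Galton board) is also easy to fake in code. In this sketch each ball bounces left or right at every peg with equal probability, so its final bin is just its count of rightward bounces, and the bins pile up into the familiar bell shape:

```python
import random

random.seed(2)

ROWS, BALLS = 10, 10_000   # pegs each ball hits, number of balls dropped

# A ball's final bin is its number of rightward bounces, so the bin
# counts follow a binomial distribution (approximately normal).
bins = [0] * (ROWS + 1)
for _ in range(BALLS):
    rights = sum(random.random() < 0.5 for _ in range(ROWS))
    bins[rights] += 1

# Crude text histogram: middle bins dominate, the tails thin out.
for slot, count in enumerate(bins):
    print(f"bin {slot:2d}: {'#' * (count // 100)}")
```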
Fisher vs Pearson
The ultimate duel of statistics was between Ronald Fisher and Karl Pearson.
Fisher based his theory on Darwinism, the idea that there are patterns within the evolution of species. This prompted him to subscribe to eugenics (the culling of ‘inferior’ races so only the ‘superior’ would reign; basically what Hitler wanted).
Based on how he spelled his first name (which was originally spelled the normal way, Carl), it shouldn’t be hard to guess what Pearson stood for: Marxism.
I am completely ignorant of what Marxism is. My experience with it is only from memes and Tweets. My current understanding is it’s a response to capitalism (typically aligned with the political left), which I’m assuming just means that followers of this doctrine wish to debunk the concept of work. No more businesses, and the word ‘employ’ and all its variations would cease to exist. All of our resources would be distributed by the government. I may be confusing this with socialism.
So each man’s stance on its own should show you why these guys were never meant to get along.
This is very little to show for being halfway through the book. Like I said, it’s stuffed with history that’s too much to focus on right now. Let me get a better understanding of the forest before I start looking at individual trees.
Where I’m at Now
I’ve read Naked Statistics, which gave me a good overview of the formulas. I’ve also completed the fundamentals course on the Brilliant app. I’m still lost on the ultimate purpose of this subject, so for now I’ll just paste the notes I’ve taken. These will mainly be tidbits and formulas I’ve learned. I’ll bullet these since they’re mostly random.
Tidbits
- Pie charts aren’t as effective at showing subtle differences between categories
- Horizontal bar charts are better for longer category names
- Segmented bar charts are ideal for displaying frequencies and percentages
- Histograms are best for grouped numerical data
- When designing a histogram, the intervals (bins) must have equal widths, so each bar ends exactly where the next one begins
- Frequency is represented by the height of the bar
- Regression line = line of best fit
- r = correlation coefficient; ranges from -1 to 1
- R² = coefficient of determination
- Measures how close the data is to the regression line; the proportion of variance in y that is predictable from x
- Sum of squared errors (SSE), aka the residual sum of squares, measures the variation in the dependent variable that the line fails to explain (careful: some sources call this SSR, but here SSR means the explained/regression sum, as in SST = SSR + SSE)
- Lowest SSE = best regression line
- Total sum of squares (SST) measures variability in the y-variable as a whole
- Sum of the squared differences between each observed value and the mean
- R² is highest when the ratio of SSE to SST is smallest (R² = 1 − SSE/SST)
- Areas in geometric visualizations must be in the same proportions as the statistical ratios they represent
- Keep in mind, cumulative frequencies are carried over; whatever individual point you look at includes all the previous ones
- A steep segment (in a cumulative-frequency bar-line graph) marks where most of the change happens, pointing to the dominant factor in an issue
- Mean absolute deviation (MnAD) is used to determine how much the data deviates from the mean on average; the mean of how much we deviate from the mean
- Median absolute deviation (MdAD)
- Variance is typically greater than MnAD, but can be less than or equal to it (e.g. when the deviations are smaller than 1)
- Sample variance is typically smaller than population variance
- Using the population variance formula (dividing by n) on a sample would understate the variance, since samples generally show less variability than the population; that’s why the sample formula divides by n − 1 instead
- Significance pertains to the probability of a result occurring by chance
- The significance level (α) is the cutoff probability for how extreme a result must be before we reject the null
- Typically .1, .05, or .01
- On a normally distributed curve, 68% of sample means are within 1 SE; 95% within 2 SE
- Type I error rejects true null
- Probability = alpha (α)
- Type II error fails to reject a false null
- Probability = beta (β)
- Affected by:
- Value of α
- Sample size
- Actual effect
- The p-value is the probability of getting a result at least this extreme purely from randomness (i.e., if the null hypothesis is true)
- Higher p-value = more likely the result is just noise rather than a real effect
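To make the significance / Type I error bullets above concrete, here's a sketch with made-up population numbers. It draws many samples from a population where the null hypothesis is true, and counts how often the sample mean still lands more than 2 standard errors from the truth. That false-alarm rate is α, roughly 5% (the exact normal tail beyond 2 SE is about 4.6%):

```python
import random

random.seed(1)

MU, SIGMA, N = 100, 15, 25        # assumed population mean, SD, sample size
SE = SIGMA / N ** 0.5             # standard error of the sample mean

TRIALS = 10_000
false_alarms = 0
for _ in range(TRIALS):
    # Every sample comes from the *null* population, so any "extreme"
    # result we flag here is by definition a Type I error.
    sample_mean = sum(random.gauss(MU, SIGMA) for _ in range(N)) / N
    if abs(sample_mean - MU) > 2 * SE:
        false_alarms += 1

alpha_hat = false_alarms / TRIALS
print(f"simulated Type I error rate: {alpha_hat:.3f}")
```

Flipping this around also demonstrates the 68/95 tidbit: about 95% of the simulated sample means land within 2 SE of the true mean.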
Formulas
- SST = SSR + SSE ~ total variability = explained variability + unexplained variability
- SSE = for each point, plug x into the best-fit line, subtract the prediction from the observed y, square the result, then sum
- SST = Σ(y − ȳ)², summed over the same points
- ȳ = mean of the y-values
- MnAD = (sum of each value’s absolute distance from the mean) ÷ (number of values)
- MdAD = take each value’s absolute distance from the median, then order those distances least to greatest and find their median
- Variance = same as MnAD, but square each deviation instead of taking its absolute value
- Standard deviation = square root of variance
- Z-score: z = (x − μ)/σ
- Population variance: σ² = Σ(x − μ)²/N
- Standard error: σ/√n
- SE for a sample proportion: √(p(1 − p)/n)
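As a sanity check on the formulas above, here's a sketch on a tiny made-up dataset: the deviation measures first, then the SST = SSR + SSE identity and R² for a least-squares line (the data values are arbitrary, chosen so the arithmetic comes out clean):

```python
# --- Deviation measures on a made-up dataset ---
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n                                   # 5.0

mnad = sum(abs(x - mean) for x in data) / n            # mean absolute deviation
variance = sum((x - mean) ** 2 for x in data) / n      # population variance
sd = variance ** 0.5                                   # standard deviation
z_of_9 = (9 - mean) / sd                               # z-score of the value 9
se = sd / n ** 0.5                                     # standard error of the mean

def median(values):
    s = sorted(values)
    mid = len(s) // 2
    return (s[mid - 1] + s[mid]) / 2 if len(s) % 2 == 0 else s[mid]

mdad = median([abs(x - median(data)) for x in data])   # median absolute deviation

# --- SST = SSR + SSE and R² for a simple regression ---
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 7]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Least-squares slope and intercept for the best-fit line
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar
preds = [intercept + slope * x for x in xs]

sse = sum((y - p) ** 2 for y, p in zip(ys, preds))     # unexplained variability
sst = sum((y - y_bar) ** 2 for y in ys)                # total variability
ssr = sst - sse                                        # explained variability
r_squared = ssr / sst                                  # same as 1 - SSE/SST

print(f"mean={mean} MnAD={mnad} MdAD={mdad} var={variance} sd={sd}")
print(f"z(9)={z_of_9} SE={se:.3f}")
print(f"SST={sst} SSR={ssr:.2f} SSE={sse:.2f} R²={r_squared:.3f}")
```

Note that variance (4.0) comes out larger than MnAD (1.5) here, matching the tidbit above, and R² lands near 1 because the points sit close to the line.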
I’ll just keep going through Lady Tasting Tea and the Brilliant courses. Got a long way to go before I can start using this.