Statistics
p-value
A p-value (probability value) is a statistical measure used to determine the significance of results in hypothesis testing.
In null-hypothesis significance testing,
the p-value is the probability of obtaining test results at least as extreme as the result actually observed,
under the assumption that the null hypothesis is true.
A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.
Key points about the p-value:
- Null Hypothesis (H₀): Typically, the null hypothesis is the default position that there is no effect or no difference.
- Small p-value (< α): If the p-value is small (commonly < 0.05), it suggests that the observed data is unlikely under the null hypothesis, leading researchers to reject the null hypothesis. In other words, there is evidence to suggest that the effect is statistically significant.
- Large p-value (≥ α): If the p-value is large, it indicates that the observed data is consistent with the null hypothesis, and there is no strong evidence against it.
However, the p-value does not measure the magnitude of an effect or the probability that the null hypothesis is true.
It only tells you whether the data is unusual under the assumption of the null hypothesis.
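To illustrate (not part of the original text), here is a minimal sketch of null-hypothesis significance testing in Python with SciPy; the sample values, the null hypothesis of a zero mean, and the 0.05 significance level are all assumptions chosen for the example.

```python
# A minimal sketch of a one-sample t-test with SciPy on made-up data.
# H0: the population mean is 0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.4, scale=1.0, size=30)  # hypothetical measurements

t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
alpha = 0.05

print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: such an extreme result would be unlikely if H0 were true.")
else:
    print("Fail to reject H0: the data are consistent with the null hypothesis.")
```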
What is the Central Limit Theorem and why is it important?
"Suppose that we are interested in estimating the average height among all people.
Collecting data for every person in the world is impossible.
While we can't obtain a height measurement from everyone in the population,
we can still sample some people.
The question now becomes,
what can we say about the average height of the entire population given a single sample?
The Central Limit Theorem addresses this question exactly."
The theorem states that, given a sufficiently large sample size, the distribution of sample means is approximately normal and centered on the population mean, regardless of the shape of the underlying population distribution.
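As a quick illustration (not from the original text), here is a minimal simulation sketch with NumPy; the exponential population, sample size of 50, and number of samples are arbitrary assumptions chosen to show the effect.

```python
# A minimal sketch of the Central Limit Theorem: sample means drawn from a
# clearly non-normal (exponential) population are approximately normal.
import numpy as np

rng = np.random.default_rng(42)
sample_size = 50
n_samples = 5_000

# Draw many samples from a skewed population and record each sample mean.
samples = rng.exponential(scale=2.0, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

print("population mean (theoretical):", 2.0)
print("mean of sample means:         ", sample_means.mean())
print("std of sample means:          ", sample_means.std())
print("theoretical standard error:   ", 2.0 / np.sqrt(sample_size))
```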
What is sampling? What are some sampling methods?
"Data sampling is a statistical analysis technique used to select,
manipulate and analyze a representative subset of data points to identify patterns
and trends in the larger data set being examined."
Some sampling methods:
- Simple random sampling: every member of the population has an equal chance of being selected.
- Systematic sampling: select every k-th member from an ordered list, starting from a random point.
- Stratified sampling: divide the population into strata (subgroups) and sample from each stratum.
- Cluster sampling: divide the population into clusters and randomly select entire clusters.
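As a sketch (not from the original text), the snippet below contrasts simple random and stratified sampling with pandas; the imbalanced "group" column and sampling fractions are made-up assumptions for illustration.

```python
# A minimal sketch comparing simple random and stratified sampling with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=1_000, p=[0.6, 0.3, 0.1]),
    "value": rng.normal(size=1_000),
})

# Simple random sampling: every row has the same chance of being selected.
simple = df.sample(n=100, random_state=0)

# Stratified sampling: sample 10% within each group so every stratum is represented.
stratified = df.groupby("group", group_keys=False).sample(frac=0.1, random_state=0)

print(simple["group"].value_counts(normalize=True))
print(stratified["group"].value_counts(normalize=True))
```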
What is the difference between type I vs type II error?
"A type I error occurs when the null hypothesis is true, but is rejected.
A type II error occurs when the null hypothesis is false, but erroneously fails to be rejected."
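To make the two error types concrete (not from the original text), here is a minimal simulation sketch; the effect size of 0.5, sample size of 30, and alpha of 0.05 are assumptions chosen for the example.

```python
# A minimal sketch estimating type I and type II error rates by simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, n_trials = 0.05, 30, 2_000

# Type I error: H0 is true (true mean = 0) but the test rejects it.
type_1 = np.mean([
    stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue < alpha
    for _ in range(n_trials)
])

# Type II error: H0 is false (true mean = 0.5) but the test fails to reject it.
type_2 = np.mean([
    stats.ttest_1samp(rng.normal(0.5, 1.0, n), 0.0).pvalue >= alpha
    for _ in range(n_trials)
])

print("estimated type I error rate: ", type_1)   # should be close to alpha
print("estimated type II error rate:", type_2)
```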
What is linear regression?
Linear regression is a good tool for quick predictive analysis:
for example, the price of a house depends on a myriad of factors,
such as its size or its location. To see the relationship between these variables,
we can build a linear regression model,
which estimates the line of best fit between them and can help determine whether
each factor has a positive or negative relationship with the price.
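As an illustration (not from the original text), here is a minimal sketch of fitting a linear regression with scikit-learn; the "size_sqft" and "distance_km" features (the latter a stand-in for location) and the synthetic prices are made-up assumptions.

```python
# A minimal sketch of linear regression with scikit-learn on made-up housing data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
size_sqft = rng.uniform(500, 3_000, size=200)
distance_km = rng.uniform(1, 30, size=200)
price = 50_000 + 120 * size_sqft - 2_000 * distance_km + rng.normal(0, 20_000, size=200)

X = np.column_stack([size_sqft, distance_km])
model = LinearRegression().fit(X, price)

# The sign of each coefficient indicates a positive or negative relationship.
print("coefficients (size, distance):", model.coef_)
print("intercept:", model.intercept_)
print("R^2:", model.score(X, price))
```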
What are the assumptions required for linear regression?
There are four major assumptions:
- There is a linear relationship between the dependent variables and the regressors, meaning the model you are creating actually fits the data
- The errors or residuals of the data are normally distributed and independent from each other
- There is minimal multicollinearity between explanatory variables
- Homoscedasticity: the variance around the regression line is the same for all values of the predictor variable (a sketch for checking these assumptions follows this list)
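The sketch below (not from the original text) spot-checks these assumptions with statsmodels and SciPy on made-up data; the particular diagnostics shown (Shapiro-Wilk for residual normality, variance inflation factors for multicollinearity, Breusch-Pagan for homoscedasticity) are one common choice, not the only one.

```python
# A minimal sketch of checking linear regression assumptions on synthetic data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(42)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(scale=1.0, size=300)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()

# Normality of residuals: Shapiro-Wilk (large p-value -> no evidence against normality).
print("Shapiro-Wilk p-value:", stats.shapiro(model.resid).pvalue)

# Multicollinearity: variance inflation factors close to 1 mean little collinearity.
print("VIFs:", [variance_inflation_factor(X, i) for i in range(1, X.shape[1])])

# Homoscedasticity: Breusch-Pagan (large p-value -> no evidence of heteroscedasticity).
print("Breusch-Pagan p-value:", het_breuschpagan(model.resid, X)[1])
```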
What is a statistical interaction?
"Basically, an interaction is when the effect of one factor (input variable)
on the dependent variable (output variable) differs among levels of another factor."
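As an illustration (not from the original text), the sketch below fits a model with an interaction term using the statsmodels formula API; the "dose", "group", and "outcome" variables and the effect sizes are made-up assumptions.

```python
# A minimal sketch of an interaction term: the slope of "dose" differs by "group".
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 400
group = rng.choice(["control", "treatment"], size=n)
dose = rng.uniform(0, 10, size=n)
slope = np.where(group == "treatment", 3.0, 1.0)  # effect of dose depends on group
outcome = 5.0 + slope * dose + rng.normal(scale=2.0, size=n)

df = pd.DataFrame({"group": group, "dose": dose, "outcome": outcome})

# "dose * group" expands to dose + group + dose:group; the dose:group coefficient
# measures how the effect of dose changes between levels of group.
model = smf.ols("outcome ~ dose * group", data=df).fit()
print(model.params)
```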
What is selection bias?
"Selection (or 'sampling') bias occurs in an 'active,'
sense when the sample data that is gathered and prepared for modeling has characteristics
that are not representative of the true, future population of cases...