Wangsheng's World
Basic Data Science Knowledge
Some materials come from the Internet.
Statistics
p-value
A p-value (probability value) is a statistical measure used to determine the significance of results in hypothesis testing. In null-hypothesis significance testing, the p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is true.
A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.
Key points about the p-value:
  • Null Hypothesis (H₀): Typically, the null hypothesis is the default position that there is no effect or no difference.
  • Small p-value (< α): If the p-value is small (commonly < 0.05), it suggests that the observed data is unlikely under the null hypothesis, leading researchers to reject the null hypothesis. In other words, there is evidence to suggest that the effect is statistically significant.
  • Large p-value (≥ α): If the p-value is large, it indicates that the observed data is consistent with the null hypothesis, and there is no strong evidence against it.
However, the p-value does not measure the magnitude of an effect or the probability that the null hypothesis is true. It only tells you whether the data is unusual under the assumption of the null hypothesis.
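As a quick illustration, here is a minimal sketch of a one-sample t-test with SciPy; the data and the null value of 2.0 are purely hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.3, scale=1.0, size=50)  # hypothetical measurements

# H0: the population mean is 2.0
t_stat, p_value = stats.ttest_1samp(sample, popmean=2.0)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject H0: such an extreme result would be unlikely if H0 were true.")
else:
    print("Fail to reject H0: the data are consistent with the null hypothesis.")
```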
What is the Central Limit Theorem and why is it important?
"Suppose that we are interested in estimating the average height among all people. Collecting data for every person in the world is impossible. While we can't obtain a height measurement from everyone in the population, we can still sample some people. The question now becomes, what can we say about the average height of the entire population given a single sample. The Central Limit Theorem addresses this question exactly."
What is sampling? What are some sampling methods?
"Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined."
Some Sampling Methods:
  • Simple random sampling: every member of the population has an equal chance of being selected.
  • Systematic sampling: select every k-th member from an ordered list of the population.
  • Stratified sampling: divide the population into strata (groups) and sample from each stratum.
  • Cluster sampling: divide the population into clusters and randomly select entire clusters.
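A minimal pandas sketch of two of these methods (simple random and stratified) on a hypothetical customer table; the column names and sizes are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "region": rng.choice(["north", "south", "east", "west"], size=1_000),
    "spend": rng.gamma(shape=2.0, scale=50.0, size=1_000),
})

# Simple random sampling: every row has the same chance of being selected.
simple_sample = customers.sample(n=100, random_state=0)

# Stratified sampling: sample 10% within each region so every stratum stays represented.
stratified_sample = (
    customers.groupby("region", group_keys=False)
             .apply(lambda g: g.sample(frac=0.1, random_state=0))
)

print(simple_sample["region"].value_counts())
print(stratified_sample["region"].value_counts())
```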
What is the difference between Type I vs Type II error?
"A Type I error occurs when the null hypothesis is true, but is rejected. A Type II error occurs when the null hypothesis is false, but erroneously fails to be rejected." In other words, a Type I error is a false positive, and a Type II error is a false negative.
What is linear regression?
Linear regression is a good tool for quick predictive analysis: for example, the price of a house depends on a myriad of factors, such as its size or its location. To see the relationship between these variables, we can build a linear regression, which fits the line of best fit between them and can help determine whether the relationship between the two variables is positive or negative.
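A minimal sketch with scikit-learn, using hypothetical house sizes and prices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
size_sqm = rng.uniform(50, 200, size=200)                       # hypothetical house sizes
price = 3_000 * size_sqm + 50_000 + rng.normal(0, 20_000, 200)  # hypothetical prices

model = LinearRegression().fit(size_sqm.reshape(-1, 1), price)
print("slope:", model.coef_[0])        # positive slope -> positive relationship
print("intercept:", model.intercept_)
print("predicted price for a 120 sqm house:", model.predict([[120]])[0])
```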
What are the assumptions required for linear regression?
There are four major assumptions:
  • There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data
  • The errors or residuals of the data are normally distributed and independent from each other
  • There is minimal multicollinearity between explanatory variables
  • Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable.
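A minimal sketch of checking two of these assumptions (normally distributed residuals and homoscedasticity) on a fitted model; the data, the Shapiro-Wilk test, and the simple low/high split are illustrative choices, not the only way to do it.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(300, 1))
y = 4.0 * X[:, 0] + rng.normal(0, 1.0, 300)  # hypothetical data that satisfies the assumptions

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Normality of residuals: a large Shapiro-Wilk p-value gives no evidence against normality.
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)

# Homoscedasticity: residual spread should look similar across the range of the predictor.
low, high = residuals[X[:, 0] < 5], residuals[X[:, 0] >= 5]
print("residual std (low X):", low.std(), " residual std (high X):", high.std())
```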
What is a statistical interaction?
"Basically, an interaction is when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor."
What is selection bias?
"Selection (or 'sampling') bias occurs in an 'active,' sense when the sample data that is gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases...
Data Analysis
How does data cleaning play a vital role in analysis?
Data cleaning can help in analysis because:
  • Cleaning data from multiple sources helps to transform it into a format that data analysts or data scientists can work with.
  • Data Cleaning helps to increase the accuracy of the model in machine learning.
  • It is a cumbersome process: as the number of data sources increases, the time needed to clean the data grows quickly with both the number of sources and the volume of data they generate.
  • Cleaning can take up to 80% of the time spent on an analysis, making it a critical part of the analysis task.
Differentiate between univariate, bivariate and multivariate analysis
Univariate, bivariate and multivariate analyses are descriptive statistical techniques that are differentiated by the number of variables involved at a given point in time. Univariate analysis looks at a single variable: for example, a pie chart of sales by territory involves only one variable, so the analysis can be referred to as univariate analysis.
Bivariate analysis examines the relationship between two variables at a time, as in a scatterplot. For example, analyzing the volume of sales against spending can be considered an example of bivariate analysis.
Multivariate analysis deals with the study of more than two variables to understand the effect of variables on the responses.
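A minimal pandas sketch of the three kinds of analysis on a hypothetical sales table (column names and values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
sales = pd.DataFrame({
    "territory": rng.choice(["A", "B", "C"], size=500),
    "volume": rng.poisson(lam=30, size=500),
    "spending": rng.gamma(shape=2.0, scale=100.0, size=500),
})

# Univariate: one variable at a time (distribution of sales volume).
print(sales["volume"].describe())

# Bivariate: relationship between two variables (volume vs. spending).
print(sales["volume"].corr(sales["spending"]))

# Multivariate: more than two variables considered together (summaries by territory).
print(sales.groupby("territory")[["volume", "spending"]].mean())
```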
Machine Learning
What is Machine Learning?
Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. It is closely related to computational statistics and is used to devise complex models and algorithms that lend themselves to prediction; in commercial use, this is known as predictive analytics.
What is Supervised Learning?
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples.
Algorithms: Support Vector Machines, Regression, Naive Bayes, Decision Trees, K-nearest Neighbor Algorithm and Neural Networks.
E.g., if you built a fruit classifier, the labels will be "this is an orange, this is an apple and this is a banana", based on showing the classifier examples of apples, oranges and bananas.
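A minimal scikit-learn sketch of the fruit-classifier idea; the two features and their values are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Features: [weight in grams, skin smoothness from 0 to 1] -- hypothetical values
X = [[150, 0.90], [170, 0.85],   # apples
     [130, 0.40], [140, 0.45],   # oranges
     [115, 0.70], [120, 0.75]]   # bananas
y = ["apple", "apple", "orange", "orange", "banana", "banana"]  # the labels supervise the learning

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[160, 0.88]]))  # expected to come back as "apple"
```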
What is Unsupervised learning?
Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labelled responses.
Algorithms: Clustering, Anomaly Detection, Neural Networks and Latent Variable Models
E.g., in the same example, a clustering algorithm would group the fruits, without being given any labels, into categories such as "fruits with soft skin and lots of dimples", "fruits with shiny hard skin" and "elongated yellow fruits".
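A minimal sketch of the clustering counterpart with k-means: the same kind of fruit features, but no labels are given; the values are invented so the three groups are well separated.

```python
from sklearn.cluster import KMeans

# Features: [weight in grams, skin smoothness from 0 to 1] -- no labels this time
X = [[180, 0.90], [175, 0.85],   # (apples, though the algorithm is never told this)
     [140, 0.40], [138, 0.45],   # (oranges)
     [110, 0.70], [115, 0.75]]   # (bananas)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # three cluster ids; the groups are found, but never named as fruits
```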
What is 'Naive' in a Naive Bayes?
The Naive Bayes Algorithm is based on the Bayes' Theorem. Bayes' Theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
The Algorithm is "naive" because it makes assumptions that may or may not turn out to be correct.
Explain SVM algorithm in detail
SVM stands for Support Vector Machine; it is a supervised machine learning algorithm that can be used for both regression and classification. If you have n features in your training data set, SVM plots each example as a point in n-dimensional space, with the value of each feature being the value of a particular coordinate. SVM then finds hyperplanes that separate the different classes, using the provided kernel function to handle cases where the classes are not linearly separable.
What are the different kernels in SVM?
Four commonly used kernels in SVM are:
  • Linear Kernel
  • Polynomial Kernel
  • Radial Basis Function (RBF) Kernel
  • Sigmoid Kernel
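A minimal scikit-learn sketch comparing these kernels on a small synthetic dataset; the dataset and the default hyperparameters are illustrative choices.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A two-class dataset that is not linearly separable
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "test accuracy:", round(clf.score(X_test, y_test), 3))
```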