Data Careers in US

45 Statistics Concepts That Frequently Come Up in Data Science Interviews

by USDK 2023. 3. 2.

 



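Data science is a fast-growing field that involves a great deal of data analysis, modeling, and interpretation, and a solid foundation in statistics is essential. Data scientist technical interviews include many questions on a wide range of statistical concepts. This post briefly summarizes 45 statistics concepts that are frequently asked about in data science interviews.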

1️⃣ Basics


1. Univariate statistics - mean, median, mode: Univariate statistics is the analysis of a single variable. Mean, median, and mode are measures of central tendency used to describe the central location of a dataset.

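2. Standard deviation and variance: Standard deviation and variance are measures of variability used to describe how spread out a dataset is. Variance is the average squared deviation from the mean, and standard deviation is its square root, so it is expressed in the same units as the data.

3. Covariance and correlation: Covariance and correlation are measures of how two variables move together. Correlation is covariance divided by the product of the two standard deviations, which makes it unit-free and bounded between -1 and 1.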

4. Population and sample: A population is the entire group of individuals or objects that you're interested in studying. A sample is a subset of the population.

5. Nominal, ordinal, and continuous, discrete data types: These are different types of data. Nominal data is categorical data that has no order or hierarchy. Ordinal data is categorical data that has an order or hierarchy. Continuous data is numerical data that can take on any value within a range. Discrete data is numerical data that can only take on certain values.

6. Outliers: An outlier is a data point that is significantly different from the other data points in the dataset.

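7. Simpson's Paradox: Simpson's Paradox is a phenomenon where a trend appears in each of several groups of data but disappears or reverses when the groups are combined. It usually arises when a confounding variable is unevenly distributed across the groups.

8. Selection Bias: Selection bias is a bias introduced when the sample selection process is not random, so the sample systematically differs from the population it is meant to represent.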
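
To make a few of the concepts above concrete, here is a minimal sketch using only the Python standard library. It computes the summary statistics from items 1 to 3 and then reproduces the reversal described in item 7. All data values, variable names (`hours`, `scores`, `groups`), and helper functions are hypothetical and chosen only for illustration.

```python
import statistics


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both sample standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 2, 3, 4, 5, 6]
scores = [52, 55, 60, 61, 70, 74, 80]

summary = {
    "mean": statistics.mean(hours),
    "median": statistics.median(hours),
    "mode": statistics.mode(hours),
    "variance": statistics.variance(hours),  # sample variance
    "stdev": statistics.stdev(hours),        # square root of the sample variance
}
r = correlation(hours, scores)

# Simpson's paradox with illustrative (successes, total) counts.
# Treatment A has the higher success rate within each group,
# yet B has the higher rate once the groups are pooled.
groups = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}
rates = {g: {t: s / n for t, (s, n) in arms.items()} for g, arms in groups.items()}

pooled = {}
for t in ("A", "B"):
    successes = sum(groups[g][t][0] for g in groups)
    total = sum(groups[g][t][1] for g in groups)
    pooled[t] = successes / total

print(summary, round(r, 3))
print(rates, pooled)
```

The reversal happens because treatment A is applied mostly to the harder "large" group, which drags down its pooled rate.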

 

2️⃣ Probability and Distributions



9. Bayes' Theorem: Bayes' Theorem is a mathematical formula used to update the probability of a hypothesis based on new evidence: P(H|E) = P(E|H) P(H) / P(E).

10. Conditional probability: Conditional probability is the probability of an event occurring given that another event has occurred: P(A|B) = P(A and B) / P(B), defined when P(B) > 0.

11. Normal distribution: The normal distribution is a symmetric, bell-shaped continuous distribution that is fully described by its mean and standard deviation and is used to model many real-world phenomena.

12. Uniform distribution: The uniform distribution is a probability distribution where all outcomes are equally likely.

13. Bernoulli distribution: The Bernoulli distribution is a probability distribution that models a single trial with two possible outcomes.

14. Binomial distribution: The binomial distribution is a probability distribution that models the number of successes in a fixed number of independent trials that share the same success probability.

15. Geometric distribution: The geometric distribution is a probability distribution that models the number of trials needed to get the first success.

16. Poisson distribution: The Poisson distribution is a probability distribution that models the number of events occurring in a fixed interval of time or space, assuming events occur independently at a constant average rate.

17. Exponential distribution: The exponential distribution is a probability distribution that models the time between events occurring in a Poisson process.

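18. Deriving the mean and variance of distributions: The mean of a probability distribution is E[X], obtained by summing (discrete case) or integrating (continuous case) x weighted by its probability. The variance is E[X^2] - (E[X])^2.

19. Central Limit Theorem: The Central Limit Theorem states that for independent, identically distributed observations with finite variance, the distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the underlying distribution.

20. The Birthday problem: The Birthday problem is a probability problem that asks how many people are needed in a room before there is a greater than 50% chance that two people share the same birthday. Assuming 365 equally likely birthdays, the answer is 23.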

21. Card probability problems: Card probability problems involve calculating the probability of getting certain hands or outcomes in a game of cards.

22. Die roll problems: Die roll problems involve calculating the probability of getting certain outcomes when rolling one or more dice.
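
Several of the probability items above can be checked with a few lines of code. The following is a rough sketch using only the Python standard library: a Bayes update for a binary hypothesis, the exact birthday-problem probability, and a die-roll probability found by enumeration. The function names and the screening-test numbers are hypothetical, and the birthday calculation assumes 365 equally likely birthdays.

```python
from fractions import Fraction
from itertools import product


def bayes_posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """P(H | E) for a binary hypothesis via Bayes' theorem."""
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / evidence


def birthday_collision_probability(n, days=365):
    """P(at least two of n people share a birthday), assuming uniform birthdays."""
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def smallest_group_for(threshold=0.5, days=365):
    """Smallest group size whose collision probability exceeds the threshold."""
    n = 1
    while birthday_collision_probability(n, days) <= threshold:
        n += 1
    return n


# Hypothetical screening test: 1% prevalence, 95% sensitivity, 5% false positive rate.
posterior = bayes_posterior(prior=0.01, p_evidence_given_h=0.95,
                            p_evidence_given_not_h=0.05)

# Probability that two fair dice sum to 7, by enumerating all 36 outcomes.
favorable = sum(1 for a, b in product(range(1, 7), repeat=2) if a + b == 7)
p_sum_7 = Fraction(favorable, 36)

print(round(posterior, 3), smallest_group_for(), p_sum_7)
```

Even with a fairly accurate test, the posterior stays low because the prior (prevalence) is small, which is a common interview follow-up.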

23. OLS regression: OLS (Ordinary Least Squares) regression is a statistical method used to estimate the relationship between a dependent variable and one or more independent variables. The goal of OLS regression is to find the line of best fit that minimizes the sum of the squared errors.

24. Confidence vs. prediction intervals: A confidence interval gives a range of plausible values for a population parameter, such as the population mean or the mean response at a given value of the predictors. A prediction interval gives a range that a single future observation is likely to fall within. Because it must also account for the variability of individual observations, a prediction interval is wider than the corresponding confidence interval.

25. Logistic regression: Logistic regression is a statistical method used to model the relationship between a binary dependent variable and one or more independent variables. Logistic regression uses the logistic function to model the probability of the dependent variable being equal to 1.

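26. Regression model assumptions: Regression models have several assumptions that must be met in order for the results to be valid. These assumptions include a linear relationship between predictors and response, independence of the errors, homoscedasticity (constant error variance), normality of the errors, and absence of strong multicollinearity among the predictors.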

27. Model diagnostic checks: Model diagnostic checks are used to assess whether a regression model meets its assumptions. These checks include residual plots, QQ plots, and tests for normality and homoscedasticity.

28. R-Square vs. R-Square Adjusted: R-Square is a measure of the proportion of variation in the dependent variable that is explained by the independent variables in the model. R-Square Adjusted is a modified version of R-Square that penalizes for the number of independent variables in the model.

29. AIC, BIC, Cp Statistics: AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), and Cp (Mallows' Cp) are measures used to compare different regression models. AIC and BIC reward goodness of fit and penalize the number of parameters in the model, with BIC applying the heavier penalty for all but very small samples. Mallows' Cp is used to choose the best subset of independent variables to include in the model. For AIC and BIC, lower values indicate a better trade-off between fit and complexity.

30. Model Interpretation: Model interpretation involves interpreting the coefficients in a regression model. The coefficients represent the change in the dependent variable associated with a one-unit change in the independent variable, holding all other independent variables constant.
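
The regression items above can be illustrated with a small worked example. This is a minimal sketch of simple (one-predictor) OLS using the closed-form solution and only the Python standard library; in practice a library such as statsmodels or scikit-learn would normally be used. The data and function names are hypothetical.

```python
import statistics


def fit_simple_ols(xs, ys):
    """Closed-form OLS fit for y = intercept + slope * x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variation in y explained by the fitted line."""
    my = statistics.mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared; p is the number of predictors, excluding the intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical data that is close to y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_ols(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
adj_r2 = adjusted_r_squared(r2, n=len(xs), p=1)

# Interpretation: a one-unit increase in x is associated with
# an increase of `slope` units in y.
print(intercept, slope, r2, adj_r2)
```

Adjusted R-squared is always at most R-squared, and the gap grows as predictors are added without a matching improvement in fit.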

3️⃣ Hypothesis Testing and A/B Testing



31. Hypothesis statements: Hypothesis statements are statements that define the null and alternative hypotheses for a statistical test. The null hypothesis is the hypothesis that there is no difference between two groups or that a coefficient is equal to zero. The alternative hypothesis is the hypothesis that there is a difference between two groups or that a coefficient is not equal to zero.

32. Z-Test: Z-Test is a statistical test used to compare a sample mean to a known population mean when the population standard deviation is known.

33. T-Test: T-Test is a statistical test used to compare a sample mean to a hypothesized population mean when the population standard deviation is unknown and must be estimated from the sample.

34. T-Test for sample means: T-Test for sample means is a statistical test used to compare the means of two independent samples.

35. Proportion test: Proportion test is a statistical test used to compare the proportions of two independent samples.

36. Paired and unpaired T-Tests: Paired T-Test is a statistical test used to compare the means of two related samples. Unpaired T-Test is a statistical test used to compare the means of two independent samples.

37. Variance test: Variance test is a statistical test used to compare the variances of two independent samples. The F-test is a common example.

38. ANOVA: ANOVA (Analysis of Variance) is a statistical test used to compare the means of three or more independent samples.

39. Chi-Squared test: Chi-Squared test is a statistical test used to determine whether there is a significant association between two categorical variables.

40. Goodness of Fit test for categorical data: Goodness of Fit test is a statistical test used to determine whether the observed frequencies of a categorical variable fit the expected frequencies.

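41. Nominal, ordinal, and continuous, discrete data types: The data type determines which test is appropriate in hypothesis testing and A/B testing. Categorical data (nominal and ordinal) are often analyzed using chi-squared tests or proportion tests, while numerical data are often analyzed using T-Tests, provided the test assumptions hold.

42. Pairwise tests: Pairwise tests compare groups two at a time and are typically run after a test such as ANOVA indicates that some difference exists among more than two groups. Because many comparisons are made, the p-values are usually adjusted for multiple comparisons, for example with a Bonferroni correction or Tukey's HSD.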

43. T-Test assumptions: T-Tests have several assumptions that must be met in order for the results to be valid. These assumptions include normality, homoscedasticity, and independence.

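44. Non-parametric tests: Non-parametric tests are statistical tests that do not assume the data follow a specific parametric distribution such as the normal distribution. Examples include the Mann-Whitney U test, the Wilcoxon signed-rank test, and the Kruskal-Wallis test. These tests are useful when the data does not meet the assumptions of parametric tests.

45. Type 1 and 2 errors: Type 1 error (false positive) occurs when the null hypothesis is rejected when it is actually true; its probability is the significance level, alpha. Type 2 error (false negative) occurs when the null hypothesis is not rejected when it is actually false; its probability is beta, and 1 - beta is the power of the test. It is important to balance the risk of these errors when conducting hypothesis testing.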
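
As a concrete A/B testing example, the proportion test from item 35 can be sketched with the Python standard library alone. The version below is a two-sided z-test with a pooled standard error, which relies on the normal approximation and so assumes reasonably large samples. The function name and the conversion counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for H0: p_a == p_b, using the pooled proportion."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical A/B test: 200 of 2000 users convert in control (A),
# 250 of 2000 convert in treatment (B).
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)

alpha = 0.05  # accepted Type 1 error rate
reject_null = p_value < alpha

print(round(z, 3), round(p_value, 4), reject_null)
```

Choosing `alpha` before looking at the data fixes the Type 1 error rate; the Type 2 error rate then depends on the sample size and the true difference between the groups.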

 

Wrapping Up


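These 45 statistics concepts come up often in data science interviews. Understanding them is essential for passing a data scientist technical interview. Knowing the formulas and definitions matters, but so does the ability to apply them to real problems. Becoming fluent in statistics takes repetition, and I have no doubt that steady practice will lead to good results. Cheering you on 🙌🏻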

 

 

Prepare for graduate school and a job in the US together with mentors who have been through US graduate programs and the local job market themselves.

https://www.datakorlab.com/

 
