Data Science Interview Questions
Before any data science interview, it is important to revise your core concepts so that you are at ease during the interview. Below is a list of questions and answers to help you do that. Note that this is not a proxy for in-depth knowledge in your field, and knowing these answers by heart does not make you a professional. For that you will need to sift through data for long hours.
What is the tradeoff between bias and variance?
Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm you’re using. This can lead to the model underfitting your data, making it hard for it to have high predictive accuracy and for you to generalize your knowledge from the training set to the test set.
Variance is error due to too much complexity in the learning algorithm you’re using. This leads to the algorithm being highly sensitive to high degrees of variation in your training data, which can lead your model to overfit the data. You’ll be carrying too much noise from your training data for your model to be very useful for your test data.
The bias-variance decomposition essentially decomposes the learning error from any algorithm by adding the bias, the variance and a bit of irreducible error due to noise in the underlying dataset: Error = Bias² + Variance + Irreducible Error. Essentially, if you make the model more complex and add more variables, you’ll lose bias but gain some variance — in order to get the optimally reduced amount of error, you’ll have to trade off bias and variance. You don’t want either high bias or high variance in your model.
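If you want to see the tradeoff rather than just recite it, a minimal simulation sketch like the one below can help (the sine ground truth, sample sizes and noise level are all made up for illustration). It fits polynomials of increasing degree to repeated noisy samples and estimates bias² and variance at a single test point.

```python
# A minimal simulation sketch of the bias-variance tradeoff (illustrative only).
# We repeatedly fit polynomial models of different degrees to noisy samples of a
# known function and estimate bias^2 and variance at a fixed test point.
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(2 * np.pi * x)          # assumed ground truth
x_test = 0.3                                        # point at which we measure error
n_runs, n_train, noise = 200, 30, 0.3

for degree in (1, 3, 9):                            # underfit -> reasonable -> overfit
    preds = []
    for _ in range(n_runs):
        x = rng.uniform(0, 1, n_train)
        y = true_fn(x) + rng.normal(0, noise, n_train)
        coefs = np.polyfit(x, y, degree)            # fit polynomial of given degree
        preds.append(np.polyval(coefs, x_test))
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_fn(x_test)) ** 2
    variance = preds.var()
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

Typically the low-degree model shows high bias and low variance, and the high-degree model the reverse.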
What is the difference between supervised and unsupervised machine learning algorithms?
Supervised learning requires labeled training data. For example, in order to do classification (a supervised learning task), you’ll need to first label the data you’ll use to train the model to classify data into your labeled groups. Unsupervised learning, in contrast, does not require labeling data explicitly.
How is KNN different from k-means clustering?
K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. While the mechanisms may seem similar at first, what this really means is that in order for K-Nearest Neighbors to work, you need labeled data into which you want to classify an unlabeled point (thus the nearest neighbor part). K-means clustering requires only a set of unlabeled points and a chosen number of clusters k: the algorithm takes the unlabeled points and gradually learns to cluster them into groups by repeatedly assigning each point to the nearest cluster mean and recomputing those means.
The critical difference here is that KNN needs labeled points and is thus supervised learning, while k-means doesn’t — and is thus unsupervised learning.
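If you want a concrete reminder while revising, here is a minimal scikit-learn sketch of the contrast (the points and labels are invented for illustration): KNN consumes the labels, while k-means only needs the points and a chosen number of clusters.

```python
# Hedged sketch: KNN needs labels (supervised), k-means needs only points and k.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1]])
y = np.array([0, 0, 1, 1])                     # labels only exist for the supervised case

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)          # supervised: uses y
print(knn.predict([[1.0, 1.0]]))                              # -> [0]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # unsupervised: ignores y
print(km.labels_)                                             # cluster assignments
```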
Define precision and recall.
Recall is also known as the true positive rate: the number of positives your model claims compared to the actual number of positives there are throughout the data. Precision is also known as the positive predictive value, and it is a measure of the number of accurate positives your model claims compared to the number of positives it actually claims. It can be easier to think of recall and precision in the context of a case where you’ve predicted that there were 10 apples and 5 oranges in a basket containing only 10 apples. You’d have perfect recall (there are actually 10 apples, and you predicted there would be 10) but 66.7% precision, because out of the 15 fruits you predicted, only 10 (the apples) are correct.
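You can sanity-check the apples-and-oranges numbers with scikit-learn’s metrics; the encoding below (1 meaning “a real apple”) is just one way to frame the example.

```python
# Quick check of the apples/oranges example using sklearn's metrics.
from sklearn.metrics import precision_score, recall_score

# There are 10 real apples; the model claims 15 fruits in total, so 5 of its
# positive calls are wrong.
y_true = [1] * 10 + [0] * 5
y_pred = [1] * 15

print(recall_score(y_true, y_pred))     # 1.0  -> perfect recall
print(precision_score(y_true, y_pred))  # 0.666... -> 10 correct out of 15 claimed
```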
What is Bayes’ Theorem? How is it useful in a machine learning context?
Bayes’ Theorem gives you the posterior probability of an event given what is known as prior knowledge.
Mathematically, it’s expressed as the true positive rate of a condition sample divided by the sum of the true positive rate of the condition sample and the false positive rate of the population. Say a flu test comes back positive for 60% of the people who actually have the flu, it also comes back (falsely) positive for 50% of the people who don’t, and only 5% of the overall population has the flu. Would a positive test really mean you have a 60% chance of having the flu?
Bayes’ Theorem says no. It says that your chance of having the flu after a positive test is (0.6 * 0.05) (true positive rate × prior) / ((0.6 * 0.05) + (0.5 * 0.95)) (the overall probability of a positive test) = 0.0594, or about a 5.94% chance.
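Written out in code, with the probabilities labelled the way the calculation above uses them (these are the question’s made-up numbers, not real flu statistics):

```python
# The flu calculation from above, written out explicitly.
p_flu = 0.05               # prior: 5% of the population has the flu
p_pos_given_flu = 0.60     # test is positive for 60% of people who have the flu
p_pos_given_no_flu = 0.50  # test is (falsely) positive for 50% of people without it

p_pos = p_pos_given_flu * p_flu + p_pos_given_no_flu * (1 - p_flu)
p_flu_given_pos = p_pos_given_flu * p_flu / p_pos
print(round(p_flu_given_pos, 4))   # 0.0594, i.e. roughly a 5.94% chance
```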
Bayes’ Theorem is the basis behind a branch of machine learning that most notably includes the Naive Bayes classifier. That’s something important to consider when you’re faced with machine learning interview questions.
Why is Naive Bayes “naive”?
Despite its practical applications, especially in text mining, Naive Bayes is considered “Naive” because it makes an assumption that is virtually impossible to see in real-life data: the conditional probability is calculated as the pure product of the individual probabilities of components. This implies the absolute independence of features — a condition probably never met in real life.
As a Quora commenter put it whimsically, a Naive Bayes classifier that figured out that you liked pickles and ice cream would probably naively recommend you a pickle ice cream.
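The “naive” part is literally just a product of per-feature conditional probabilities. A toy sketch, with completely invented numbers:

```python
# Minimal sketch of the naive independence assumption: the class-conditional
# probability of the whole feature vector is taken to be the plain product of
# the per-feature probabilities (numbers are invented for illustration).
p_likes_pickles_given_class = 0.7
p_likes_ice_cream_given_class = 0.8

# Naive Bayes simply multiplies, as if the two preferences were independent:
p_both_given_class = p_likes_pickles_given_class * p_likes_ice_cream_given_class
print(round(p_both_given_class, 2))   # 0.56, even if the combination is rare in reality
```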
What is regularization and how is it used to solve the problem of overfitting?
In statistical models, overfitting is a very common problem, and regularization is one of the methods to solve it. Before I go further and write a plain definition of regularization, it is very important for you to understand the problem of overfitting.
Let’s take an example. Say you’ve been given a problem: predict the genre of music a person likes based on their age. You first try a linear regression model with age as the independent variable and music genre as the dependent one. Sadly for you, this model will mostly fail because it is far too simplistic.
Naturally, you want to add more explanatory variables to make your model more interesting, so you go ahead and add the sex and the education of each individual in your dataset. Now you measure its accuracy with a loss metric L(X, Y), where X is your design matrix and Y is the vector of targets (music genre in your case). You find that the results are good but not very accurate.
So you go ahead and add more variables like marital status, location and profession. Much to your surprise, you find that your model has poor predictive power. You have just experienced the problem of overfitting, which means your model sticks too closely to the training data and has probably learned the background noise. In other words, your model has high variance and low bias.
To overcome this problem, we use the technique called regularization. Basically, you penalize the loss function by adding a multiple of a norm of the weight vector w, for example the L¹ (Lasso) norm. You then come up with the following equation:
L(X, Y) + λN(w), where λ is the regularisation term and N is either the L¹ (Lasso) norm, the L² (Ridge) norm, or any other norm.
The biggest reasons for regularisation are:
1. To avoid overfitting by not generating high coefficients for predictors that are sparse.
2. To stabilise the estimates especially when there is collinearity in the data.
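If you want to see the penalised loss written out, here is a minimal numpy sketch of L(X, Y) + λN(w) for a linear model; the data, weights and λ value below are invented for illustration.

```python
# Minimal sketch of the penalised loss for a linear model, with N taken to be
# either the L1 or the L2 norm of the weight vector (illustrative only).
import numpy as np

def penalised_loss(w, X, y, lam, norm="l2"):
    residuals = X @ w - y
    data_loss = np.mean(residuals ** 2)            # L(X, y): mean squared error
    penalty = np.sum(np.abs(w)) if norm == "l1" else np.sum(w ** 2)
    return data_loss + lam * penalty               # larger lam -> stronger shrinkage

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=50)
w = np.array([1.5, 0.5, -0.5])
print(penalised_loss(w, X, y, lam=0.1, norm="l1"))
print(penalised_loss(w, X, y, lam=0.1, norm="l2"))
```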
Explain the difference between L¹ and L² regularization.
Lasso regression uses the L¹ norm for regularisation, while ridge regression uses the L² norm. The main difference between ridge and lasso regression is the shape of the constraint region: for the p = 2 case, the L¹ constraint region is a diamond, whereas the L² region is a circle.
Practically, we can see that L² regularisation spreads the error throughout the vector x (in Ax = B, where A is the observation matrix and B is the output vector), whereas L¹ is happy with a sparse x, meaning that some values in x are exactly 0 while others may be relatively large. The former is sufficient, and indeed suitable, for a variety of statistical problems, but the latter is gaining traction through the field of compressive sensing. From a non-rigorous standpoint, compressive sensing assumes not that observations come from Gaussian-distributed sources of ground truth, but that sparse and simple solutions to equations are preferable or more likely (think Occam’s razor). Hence L¹ acts like a feature selector.
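A quick way to see the feature-selector behaviour is to fit Lasso and Ridge on the same synthetic data and compare coefficients (the data and the alpha value below are made up).

```python
# Hedged sketch: on the same data, Lasso tends to drive some coefficients to
# exactly zero (feature selection), while Ridge only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # only 2 informative features
y = X @ true_w + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
print("lasso:", np.round(lasso.coef_, 2))   # several coefficients exactly 0
print("ridge:", np.round(ridge.coef_, 2))   # small but non-zero coefficients
```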
What’s your favorite <classification/clustering> algorithm, and can you explain it to me in less than a minute?
What’s the difference between Type I and Type II error?
Type I error is a false positive, while Type II error is a false negative. Briefly stated, Type I error means claiming something has happened when it hasn’t, while Type II error means that you claim nothing is happening when in fact something is.
A clever way to think about this is to think of Type I error as telling a man he is pregnant, while Type II error means you tell a pregnant woman she isn’t carrying a baby.
What’s the difference between a generative and discriminative model?
Let’s say you have input data x and you want to classify the data into labels y. A generative model learns the joint probability distribution p(x,y) and a discriminative model learns the conditional probability distribution p(y|x) — which you should read as “the probability of y given x”.
- generative model: joint probability distribution p(x, y)
- discriminative model: conditional probability distribution p(y|x)
Although you might think that generative models should be better than discriminative models, in reality it is seen that discriminative models outperform generative models in classification tasks.
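As a concrete (if rough) illustration, you could compare a generative classifier with a discriminative one on synthetic data, as in the sketch below; the dataset and the resulting accuracies carry no special meaning.

```python
# Hedged sketch: Gaussian Naive Bayes models p(x, y) (generative), while logistic
# regression models p(y | x) directly (discriminative). Data here is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

generative = GaussianNB().fit(X_tr, y_tr)
discriminative = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("naive bayes accuracy:   ", generative.score(X_te, y_te))
print("logistic reg. accuracy: ", discriminative.score(X_te, y_te))
```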
Which is more important to you: model accuracy or model performance?
This question tests your grasp of the nuances of machine learning model performance! Machine learning interview questions often look towards the details. There are models with higher accuracy that can perform worse in predictive power — how does that make sense?
Well, it has everything to do with how model accuracy is only a subset of model performance, and at that, a sometimes misleading one. For example, if you wanted to detect fraud in a massive dataset with a sample of millions, a more accurate model would most likely predict no fraud at all if only a tiny minority of cases were fraud. However, this would be useless for a predictive model — a model designed to find fraud that asserted there was no fraud at all! Questions like this help you demonstrate that you understand model accuracy isn’t the be-all and end-all of model performance.
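A tiny sketch of the fraud scenario makes the point: a “model” that never predicts fraud looks very accurate on imbalanced data while being useless (the class balance below is invented).

```python
# On heavily imbalanced data, always predicting "no fraud" gives high accuracy
# but catches zero fraud.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 990 + [1] * 10)   # 1% of cases are fraud (made-up numbers)
y_pred = np.zeros_like(y_true)            # a "model" that never predicts fraud

print(accuracy_score(y_true, y_pred))     # 0.99 -- looks impressive
print(recall_score(y_true, y_pred))       # 0.0  -- but finds none of the fraud
```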
What’s the F1 score? How would you use it?
The F1 score is a measure of a model’s performance on a classification task. It is the harmonic mean of the model’s precision and recall, with scores tending to 1 being the best and those tending to 0 being the worst. You would use it in classification tasks where the true negatives don’t matter much.
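For instance, plugging the earlier apples example into the formula F1 = 2 · precision · recall / (precision + recall):

```python
# F1 is the harmonic mean of precision and recall; here it is computed for the
# apples example above (precision = 10/15, recall = 1.0).
precision, recall = 10 / 15, 1.0
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))   # 0.8

# sklearn gives the same number directly from the labels:
from sklearn.metrics import f1_score
print(round(f1_score([1] * 10 + [0] * 5, [1] * 15), 3))   # 0.8
```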
Explain how a ROC curve works.
The receiver operating characteristic (ROC) curve is a graphical representation of a binary classifier’s performance at different decision thresholds.
The ROC curve is a two-dimensional curve in which the false positive rate is plotted on the x-axis and the true positive rate on the y-axis. It is useful for visualising and comparing the performance of classification methods.
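A minimal scikit-learn sketch of building a ROC curve (the data is synthetic, so the actual curve and AUC are only illustrative):

```python
# Sweep the decision threshold of a classifier's scores and plot FPR against TPR.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))

plt.plot(fpr, tpr)                  # the ROC curve itself
plt.plot([0, 1], [0, 1], "--")      # chance line for reference
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```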
What is homoscedasticity?
In linear regression, you must ensure that the data is homoscedastic in nature, that is, the variance of the residuals is the same for all points in the data. You can check whether the data is homoscedastic by observing the distance of each point from the regression line: the spread of those distances should be roughly constant for the data to be homoscedastic.
Technically, the data is considered to be homoscedastic if the ratio of the largest variance to the smallest variance is less than 1.5.
But in reality you often have to deal with heteroscedastic data, where the variance is not constant across the data points. Heteroscedastic data has a cone shape that spreads out in one direction, i.e. left to right or right to left. One example of such data is the prediction of annual income from age. More often than not, people in their teens earn close to the minimum wage, so the variance of those data points is small at low ages, but the income gap widens with age: one person could be driving a Ferrari while another cannot even afford a car.
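You can simulate something like the income-by-age example and watch the residual spread grow with age; the numbers below are entirely made up.

```python
# Hedged sketch: simulated income whose noise grows with age, so the residuals
# of a linear fit fan out (heteroscedasticity).
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(18, 65, 500)
income = 1000 * age + rng.normal(scale=200 * age)   # noise grows with age

slope, intercept = np.polyfit(age, income, 1)
residuals = income - (slope * age + intercept)

# Compare residual spread for the youngest vs. oldest part of the sample:
young, old = residuals[age < 33], residuals[age > 50]
print("residual std (young):", round(young.std()))
print("residual std (old):  ", round(old.std()))    # noticeably larger
```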
How do you handle missing data?
Before jumping to the methods of data imputation, we have to understand the reason why data goes missing.
- Missing at random (MAR): the propensity of a value to be missing is not related to the missing value itself, but it is related to some of the other observed data.
- Missing completely at random (MCAR): the fact that a certain value is missing has nothing to do with its hypothetical value or with the values of other variables.
- Missing not at random (MNAR): the missingness is related to the value that is missing itself, for example people with very high incomes being less likely to report their income.
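Once you know which mechanism you are dealing with, a few common imputation choices look like this in pandas (the column names and values below are invented):

```python
# Minimal sketch of common imputation choices.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, 35, np.nan, 41, np.nan],
                   "income": [30_000, np.nan, 52_000, 61_000, 48_000]})

dropped = df.dropna()                                # simplest: drop incomplete rows
mean_filled = df.fillna(df.mean(numeric_only=True))  # fill with column means
ffilled = df.ffill()                                 # carry the last observation forward

print(mean_filled)
```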