5 Key Statistical Techniques Important for a Data Scientist

In my previous article, Data Scientists, What Are Those?, I talked about the essence of a data scientist. There I pointed out four subject areas that a data scientist (DS) is expected to be somewhat skilled at, which I called “the four pillars of expertise”: business domain, computer science (programming), mathematics and communication. It’s important for a data scientist to piece together a compelling narrative, so here I’ll develop one of these pillars, mathematics, and specifically statistics and some of the techniques employed by data scientists. As Josh Wills put it, “data scientist is a person who is better at statistics than any programmer and better at programming than any statistician.”

I’ve loved mathematics ever since I could count. As a kid in elementary school, it was the one subject that gave me a sense of security: the right answer to a simple calculation was the same every time. You’ve all heard the saying that people lie but numbers don’t. With statistics, however, I found that it required a further level of understanding of various theories and assumptions.

What is Statistics?

Statistics is the discipline that concerns the collection, organization, analysis, interpretation and presentation of data. A statistic is a piece of data from a portion of a population. Think of it like this: If you have a bit of information, it’s a statistic. If you look at part of a data set, it’s a statistic. If you know something about 10% of people, that’s a statistic too.

Statistics is a way to understand the data that is collected about us and the world. All of that data is meaningless without a way to interpret it, which is where statistics comes in. Statistics is about data and variables. It’s also about analyzing that data and producing some meaningful information about that data.

The two main branches of statistics used in data analysis are:

Descriptive statistics: anything that describes or summarizes the data you have, such as a mean or a standard deviation.

Inferential statistics: a “best guess” about something larger (a population), based on the data in a sample.
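To make the distinction concrete, here is a minimal sketch using only Python's standard library. The sample of heights is made up for illustration, and the interval uses the normal critical value 1.96 as a rough approximation, so treat it as a sketch rather than a recommended procedure.

```python
import math
import statistics

# Hypothetical sample: heights (cm) of 10 people drawn from a larger population.
sample = [162, 170, 168, 175, 181, 159, 173, 177, 166, 169]

# Descriptive statistics: summarize the data we actually have.
mean = statistics.mean(sample)
stdev = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# Inferential statistics: a "best guess" about the population mean.
# Rough 95% interval using the normal critical value 1.96; with only 10
# observations a t critical value (about 2.26) would be more appropriate.
std_error = stdev / math.sqrt(len(sample))
ci_low = mean - 1.96 * std_error
ci_high = mean + 1.96 * std_error

print(f"mean={mean:.1f}, stdev={stdev:.1f}")
print(f"approx. 95% interval for the population mean: ({ci_low:.1f}, {ci_high:.1f})")
```

The first two numbers only describe the ten people measured; the interval is a statement about the population they were drawn from.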

I’m going to try to keep this high level, although it is easy to get lost in the terminology. I find that it is easier to get a feel for the ideas before jumping into the technical details. The goal is to learn.

There is, therefore, some terminology you should get familiar with before we proceed.

Variables: You might be familiar with variables from algebra, like “x” or “y.” They stand for something (usually a number that you plug in to solve an equation). In statistics, variables are broken down into two types: numerical (quantitative) variables and categorical variables. Numerical variables are the ones you’re most familiar with: numbers. For example, those “x” and “y” variables in algebra stand for a number. Categorical variables instead take one of a fixed set of labels, such as eye color or country.

Some of the statistical techniques include:

Linear Regression:

In statistics, linear regression is a method for predicting a target variable by fitting the best linear relationship between the dependent variable and one or more independent variables. The best fit is found by making the sum of the squared distances between the fitted line and the actual observations as small as possible (ordinary least squares); no other position of the line results in less error. There are two major types of linear regression. In simple linear regression there is one dependent and one independent variable, while in multiple linear regression more than one independent variable is used to predict the dependent variable.

University GPA as a function of High School GPA.

Above is an example of a best fit line of two variables.
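As a rough illustration of simple linear regression, the sketch below fits a least-squares line with the closed-form formulas for slope and intercept. The GPA numbers and the `fit_line` helper are invented for this example; they are not the data behind the figure.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: high school GPA (independent) and university GPA (dependent).
high_school_gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
university_gpa = [2.1, 2.4, 2.9, 3.2, 3.9]

slope, intercept = fit_line(high_school_gpa, university_gpa)

# Use the fitted line to predict the university GPA for a new student.
predicted = slope * 3.2 + intercept
print(f"university_gpa = {slope:.2f} * high_school_gpa + {intercept:.2f}")
print(f"prediction for a high school GPA of 3.2: {predicted:.2f}")
```

Multiple linear regression follows the same idea with several independent variables, but it is usually solved with matrix methods from a numerical library rather than by hand.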

Classification:

In classification, you assign each observation to one of a set of categories (classes) based on its features, which supports more accurate predictions and analysis. Here, let’s focus on logistic regression.

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is binary (for example, pass/fail). Like all regression analyses, it is a predictive analysis. It is used to describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal, interval or ratio-level independent variables, by modeling the probability of one of the two outcomes.
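Here is a minimal sketch of logistic regression with a single independent variable, fitted by plain gradient descent on the log loss. The study-hours data, the learning rate and the number of epochs are assumptions chosen for illustration; real projects would normally use a library implementation.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: hours studied and whether the student passed (1) or failed (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)
prob_low = sigmoid(w * 1.0 + b)   # estimated probability of passing after 1 hour
prob_high = sigmoid(w * 5.0 + b)  # estimated probability of passing after 5 hours
print(f"P(pass | 1 hour) = {prob_low:.2f}, P(pass | 5 hours) = {prob_high:.2f}")
```

The model outputs a probability; turning it into a class label means picking a cutoff, commonly 0.5.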

Resampling Methods:

Resampling is a method that consists of drawing repeated samples from the original data sample. It generates an empirical sampling distribution on the basis of the actual data, using experimental rather than analytical methods. Because the repeated samples are drawn at random from the observed data, the resulting estimates rely on fewer distributional assumptions than analytical formulas, provided the original sample is representative of the population.

In order to understand the concept of resampling, you should understand the terms bootstrapping and cross-validation:

  • Bootstrapping is a technique that helps in many situations, such as validating the performance of a predictive model, building ensemble methods, and estimating the bias and variance of a model. It works by sampling with replacement from the original data and taking the “not chosen” data points as test cases. We can repeat this several times and use the average score as an estimate of our model’s performance.
  • Cross-validation, on the other hand, is a technique for validating model performance by splitting the training data into k parts. We take k - 1 parts as our training set and use the “held out” part as our test set. We repeat this k times, holding out a different part each time. Finally, we take the average of the k scores as our performance estimate.
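Both procedures come down to choosing which rows go into the training set and which into the test set. The sketch below shows one way to generate those index splits; the helper names `bootstrap_split` and `k_fold_splits` are hypothetical, the "model" is just the training mean, and the folds are taken in order (in practice you would usually shuffle the data first).

```python
import random

def bootstrap_split(n, rng):
    """Sample n indices with replacement; the unchosen indices form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test

def k_fold_splits(n, k):
    """Yield (train, test) index lists; every index is held out exactly once."""
    indices = list(range(n))
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

rng = random.Random(0)
boot_train, boot_test = bootstrap_split(10, rng)
folds = list(k_fold_splits(10, 3))

# Cross-validated score for a toy "model" that predicts the training mean.
data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
scores = []
for train_idx, test_idx in k_fold_splits(len(data), 3):
    prediction = sum(data[i] for i in train_idx) / len(train_idx)
    mse = sum((data[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
    scores.append(mse)
cv_score = sum(scores) / len(scores)  # average of the k scores
print(f"cross-validated mean squared error: {cv_score:.2f}")
```

Note that a bootstrap training set contains duplicates, so its test set varies in size from one draw to the next, whereas k-fold test sets are fixed in size and never overlap.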

Tree-Based Methods:

Tree-based methods can be used for both regression and classification problems. These involve stratifying or segmenting the predictor space into a number of simple regions. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision-tree methods. The methods below grow multiple trees which are then combined to yield a single consensus prediction.

  • Bagging is a way to decrease the variance of your prediction by generating additional training data from your original dataset, using combinations with repetitions to produce multisets of the same cardinality (size) as your original data. By increasing the size of your training set this way you can’t improve the model’s predictive force, but you can decrease the variance, narrowly tuning the prediction to the expected outcome.
  • Boosting is an approach that calculates the output using several different models, typically fitted one after another so that each new model concentrates on the cases the earlier ones got wrong, and then combines the results using a weighted average. By balancing the advantages and pitfalls of the individual models through your weighting formula, you can come up with a good predictive force for a wider range of input data, using different narrowly tuned models.
  • The random forest algorithm is actually very similar to bagging. Here too, you draw random bootstrap samples of your training set. However, in addition to the bootstrap samples, you also draw a random subset of features when training the individual trees (in the standard algorithm, a fresh subset at each split); in bagging, each tree gets the full set of features. The random feature selection makes the trees more independent of each other than in regular bagging, which often results in better predictive performance (due to a better bias-variance trade-off), and it is also faster, because each split considers only a subset of features.
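To give a feel for how bagging combines trees, here is a small sketch that uses decision stumps (trees with a single split) on one feature and combines them by majority vote. The stump, the toy data and all function names are assumptions made for illustration. With only one feature there is nothing to subsample, so the random forest step is not shown, and boosting is left out as well.

```python
import random

def fit_stump(xs, ys):
    """Pick the threshold t (and direction) that best splits 1-D data into classes 0/1."""
    best = None
    for t in sorted(set(xs)):
        for above in (0, 1):  # class predicted when x > t
            errors = sum((above if x > t else 1 - above) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, above)
    _, t, above = best
    return t, above

def predict_stump(stump, x):
    t, above = stump
    return above if x > t else 1 - above

def fit_bagged(xs, ys, n_trees, rng):
    """Bagging: fit each stump on a bootstrap sample drawn with replacement."""
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict_bagged(stumps, x):
    """Consensus prediction: majority vote over all stumps."""
    votes = sum(predict_stump(s, x) for s in stumps)
    return 1 if 2 * votes > len(stumps) else 0

# Hypothetical toy data: class 1 tends to have larger x, with one noisy label.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

stumps = fit_bagged(xs, ys, n_trees=25, rng=random.Random(0))
print(predict_bagged(stumps, 1.5), predict_bagged(stumps, 7.5))
```

Each stump sees a slightly different sample and so picks a slightly different threshold; the vote smooths out those differences, which is where the variance reduction comes from.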

Support Vector Machines:

SVM is a classification technique that belongs to the supervised learning models in machine learning. In layman’s terms, it involves finding the hyperplane that best separates two classes of points with the maximum margin. A hyperplane is a line in 2D, a plane in 3D and, more formally, an (n-1)-dimensional flat subspace of an n-dimensional space. Essentially, this is a constrained optimization problem in which the margin is maximized subject to the constraint that the hyperplane perfectly classifies the data (hard margin).

The data points that “support” this hyperplane on either side are called the “support vectors”. In the above picture, the filled blue circle and the two filled squares are the support vectors. For cases where the two classes are not linearly separable, the points are projected into an exploded (higher-dimensional) space where linear separation may be possible. A problem involving multiple classes can be broken down into multiple one-versus-one or one-versus-rest binary classification problems.
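The sketch below trains a linear SVM on a tiny 2-D dataset. Note the swap: instead of solving the hard-margin constrained problem described above, it minimizes the soft-margin hinge loss with subgradient descent, which is simpler to write down. The data (centered on the origin for simplicity), the regularization strength `lam`, the learning rate and the epoch count are all assumptions, and no kernel projection is shown.

```python
def fit_linear_svm(points, labels, lam=0.01, lr=0.05, epochs=2000):
    """Minimize lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))) by subgradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = [lam * w[0], lam * w[1]]
        grad_b = 0.0
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:  # point is inside the margin or misclassified
                grad_w[0] -= y * x1 / n
                grad_w[1] -= y * x2 / n
                grad_b -= y / n
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b

def predict(w, b, point):
    """Classify by which side of the hyperplane w.x + b = 0 the point falls on."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b >= 0 else -1

# Hypothetical, linearly separable toy data with labels -1 and +1.
points = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = fit_linear_svm(points, labels)
predictions = [predict(w, b, p) for p in points]
print("weights:", w, "bias:", b)
print("predictions:", predictions)
```

In this toy set the points closest to the separating line, such as (2, 1) and (1, 2), play the role of support vectors: they are the ones whose margin hovers around 1 and that keep pushing the weights during training.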

Unsupervised Learning:

So far, we have only discussed supervised learning techniques, in which the groups are known and the experience provided to the algorithm is the relationship between actual entities and the groups they belong to. Another set of techniques can be used when the groups (categories) of the data are not known. They are called unsupervised because it is left to the learning algorithm to figure out patterns in the data provided. Clustering is an example of unsupervised learning in which data points are grouped into clusters of closely related items. Below is a list of the most widely used unsupervised learning algorithms:

  • Principal component analysis (PCA): helps produce a low-dimensional representation of the dataset by identifying a set of linear combinations of features that have maximum variance and are mutually uncorrelated. This linear dimensionality reduction technique can be helpful in understanding latent interactions between the variables in an unsupervised setting.
  • k-Means clustering: partitions data into k distinct clusters based on distance to the centroid of a cluster.
  • Hierarchical clustering: builds a multilevel hierarchy of clusters by creating a cluster tree.
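As an illustration of the k-means item in the list above, here is a minimal sketch of the usual iterative procedure (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. The two toy blobs, the fixed iteration count and the `k_means` helper are assumptions for this example; PCA and hierarchical clustering are not shown.

```python
import random

def k_means(points, k, rng, iterations=20):
    """Alternate nearest-centroid assignment and centroid update on 2-D points."""
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centroids, clusters

# Hypothetical data: two well-separated groups of points, with no labels given.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centroids, clusters = k_means(points, k=2, rng=random.Random(0))
for centroid, cluster in zip(centroids, clusters):
    print(f"centroid ({centroid[0]:.2f}, {centroid[1]:.2f}) -> {cluster}")
```

The algorithm never sees which group a point belongs to; it recovers the grouping purely from distances. On less tidy data the result depends on the starting centroids, so it is common to run it several times.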

Conclusion:

There are various statistical techniques employed by data scientists when analyzing data and running different machine learning algorithms. In order to be able to choose the right algorithm for whatever problem you are confronted with, a data scientist needs an intermediate understanding of the statistical theory these algorithms use.
