Data Normalization and Standardization

Sathish Manthani
Jan 31, 2020

Normalization:

Normalization typically means rescaling values into the range [0, 1]. In vector terminology, “normalizing” a vector most often means dividing it by one of its norms; in data preprocessing it usually refers to rescaling by the minimum and range of each column (min-max scaling). This ensures all the elements lie between 0 and 1, bringing every numeric column in the dataset to a common scale.
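As a minimal sketch of min-max scaling (the values here are made up purely for illustration):

```python
import numpy as np

# Made-up sample values for illustration
values = np.array([12.0, 40.0, 55.0, 7.0, 100.0])

# Min-max normalization: (x - min) / (max - min) maps every value into [0, 1]
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)  # the minimum (7) becomes 0.0, the maximum (100) becomes 1.0
```

scikit-learn’s MinMaxScaler applies the same formula column by column on a 2-D array.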

Standardization:

Standardization typically means rescaling the data to have a mean of 0 and a standard deviation of 1 (unit variance). In vector terminology, “standardizing” a vector most often means subtracting a measure of location and dividing by a measure of scale.
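A minimal sketch of z-score standardization, reusing the same made-up numbers:

```python
import numpy as np

values = np.array([12.0, 40.0, 55.0, 7.0, 100.0])

# Z-score standardization: subtract the mean, divide by the standard deviation
standardized = (values - values.mean()) / values.std()
print(standardized.mean().round(6), standardized.std().round(6))  # ~0.0 and 1.0
```

scikit-learn’s StandardScaler does the same thing per column.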

Why should we Standardize / Normalize variables?

Standardization

When we compare measurements that have different units, standardizing the features to be centered at 0 with a standard deviation of 1 is very important. Variables measured at different scales do not contribute equally to the analysis and can end up creating a bias.

A good example of this concept is detailed in [2]: a variable that ranges between 0 and 1,000 will outweigh a variable that ranges between 0 and 1. Using these variables without standardization effectively gives the larger-range variable a weight of 1,000 in the analysis. Typical data standardization procedures are therefore employed to equalize the range and/or variability of the data.

Normalization

On a similar note, the idea of normalization is to change the values of numeric columns in the dataset to a common scale without distorting differences in the ranges of values. Not every machine-learning dataset requires normalization; it is needed only when features have very different ranges.

A good example of this concept is detailed in [2]: consider a dataset with two features, an employee’s age and salary, where age ranges from 0 to 100 while salary ranges from 0 to 100,000 and higher. A salary value is roughly 1,000 times larger than an age value, so the two features are on very different scales. In further analysis, say a multivariate linear regression, salary will intrinsically influence the result more because of its larger magnitude, but that does not necessarily make it a more important predictor. Hence we normalize the data to bring all the variables to the same range.
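To make that concrete, here is a small sketch with hypothetical age/salary records, normalizing each column independently:

```python
import numpy as np

# Hypothetical employee records: columns are [age, salary]
data = np.array([
    [25.0,  40_000.0],
    [38.0,  85_000.0],
    [52.0, 120_000.0],
    [61.0,  95_000.0],
])

# Min-max normalize each column so age and salary both land on a 0-to-1 scale
col_min, col_max = data.min(axis=0), data.max(axis=0)
normalized = (data - col_min) / (col_max - col_min)
print(normalized)  # both columns now range from 0.0 to 1.0
```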

When should we use Normalization and Standardization?

Normalization is a good technique to use when you do not know the distribution of your data or when you know the distribution is not Gaussian (a bell curve).

Standardization, in general, assumes that your data has a Gaussian (bell curve) distribution. This is not entirely true in all scenarios, but the technique is more effective if your attribute distribution is Gaussian.

Fuzzy Matching

Fuzzy matching is a technique used in computer-assisted translation as a special case of record linkage. It works with matches that may be less than 100% perfect when finding correspondences between segments of a text and entries in a database of previous translations. It usually operates at sentence-level segments, but some translation technology allows matching at a phrasal level.

Why do we require Fuzzy matching?

Two significant challenges come up in a large number of data science datasets:

Deduplication: Aligning similar categories or entities in a data set. For example, we may need to combine ‘D J Trump’, ‘D. Trump’ and ‘Donald Trump’ into the same entity (a quick sketch of this case follows the list below).

Record linkage: Joining data sets on a particular entity. For example, joining records of ‘D J Trump’ to the URL of his Wikipedia page.
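As an illustration of the deduplication case, here is a minimal sketch using Python’s standard-library difflib; the names and the 0.6 threshold are arbitrary choices for demonstration:

```python
from difflib import SequenceMatcher

names = ["D J Trump", "D. Trump", "Donald Trump", "Joe Biden"]

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the strings are identical
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Compare every pair and flag likely duplicates above the chosen threshold
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        score = similarity(names[i], names[j])
        if score > 0.6:
            print(f"{names[i]!r} ~ {names[j]!r} (score {score:.2f})")
```

Libraries such as RapidFuzz offer faster implementations of the same pairwise idea.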

Fuzzy string matching:

Fuzzy string matching is a technique for finding strings that match a pattern approximately rather than exactly, also termed approximate string matching. It is a type of search that finds matches even when users misspell words or enter only partial words. Some applications include the following:

A spell checker and typo corrector. For example, when a user types “Trrummpp” into Google, a list of hits is returned along with “Showing results for Trump”. The search query returns relevant results even if the user’s entry contains additional or missing characters or other spelling errors.
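A minimal sketch of this kind of correction, using difflib.get_close_matches against a tiny, made-up vocabulary:

```python
from difflib import get_close_matches

# A tiny, hypothetical dictionary of known terms
vocabulary = ["trump", "truck", "trumpet", "stump"]

query = "trrummpp"  # misspelled user input

# Return the closest known terms, ranked by similarity (the cutoff is arbitrary)
suggestions = get_close_matches(query, vocabulary, n=3, cutoff=0.5)
print(suggestions)  # 'trump' comes back as the best match
```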

Many algorithms can provide fuzzy matching, but they do not perform well on data sets larger than a few thousand records, because they compare each record to every other record in the data set. Most string-matching functions also depend on the length of the two strings being compared and can therefore slow down even further when comparing long texts.

One solution to this problem comes from NLP. It is a simple algorithm that splits the text into ‘chunks’ (or n-grams), counts the occurrence of each chunk in a given sample, and then weights each count by how rare the chunk is across all samples in the data set. This way, useful chunks are filtered from the ‘noise’ of more common ones that occur throughout the text.
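That description reads like TF-IDF weighting over character n-grams. A minimal sketch of the idea with scikit-learn (the names are made up, and a real deployment on millions of records would compare the sparse vectors in blocks rather than building a full similarity matrix):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = ["D J Trump", "D. Trump", "Donald Trump", "Joe Biden"]

# Split each string into overlapping 3-character chunks (character n-grams)
# and weight each chunk by how rare it is across the whole list (TF-IDF)
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
tfidf = vectorizer.fit_transform(names)

# Cosine similarity between the sparse vectors; values close to 1 indicate
# strings that share many rare chunks
similarity = cosine_similarity(tfidf)
print(similarity.round(2))
```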

References:

[1] https://towardsdatascience.com/normalization-vs-standardization-quantitative-analysis-a91e8a79cebf

[2] https://medium.com/@swethalakshmanan14/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff

[3] http://www.dataminingblog.com/standardization-vs-normalization/
