Can you use KNN on categorical variables?
KNN is an algorithm that matches a point with its k closest neighbors in a multi-dimensional space. It can be used with continuous, discrete, ordinal, and categorical data, which makes it particularly useful for handling many kinds of missing data.
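As a minimal sketch of the idea, here is a tiny kNN classifier in plain Python (the `knn_predict` function and the toy training data are illustrative, not from the original text): it ranks training points by distance to a query and returns the majority label among the k nearest.

```python
import math

def knn_predict(train, labels, query, k=3):
    """Return the majority label among the k nearest training points.
    `train` is a list of numeric feature vectors (hypothetical toy data)."""
    dists = sorted(
        (math.dist(row, query), lbl) for row, lbl in zip(train, labels)
    )
    top = [lbl for _, lbl in dists[:k]]
    return max(set(top), key=top.count)

train = [[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [9.0, 8.5]]
labels = ["a", "a", "b", "b"]
print(knn_predict(train, labels, [1.1, 0.9]))  # -> a
```

For missing-data imputation the same neighbor search is used, but the missing value is filled with the mean (or mode) of the neighbors' values instead of a class vote.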
Can a categorical variable have more than two categories?
If the categorical variable has more than two categories, you might be tempted to simply add a third value to the dichotomous variable. The correct way to incorporate such a variable is instead to include a set of indicator (dummy) variables, each taking the value 0 or 1.
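A small sketch of dummy encoding in plain Python (the `dummy_encode` helper is illustrative): each category becomes its own 0/1 column, and in regression settings one category is conventionally dropped as the baseline.

```python
def dummy_encode(values, drop_first=True):
    """One-hot encode a list of categories.
    drop_first=True yields m-1 indicator columns (baseline = first category)."""
    cats = sorted(set(values))
    kept = cats[1:] if drop_first else cats
    return [[1 if v == c else 0 for c in kept] for v in values]

colors = ["red", "green", "blue", "green"]
print(dummy_encode(colors))              # m-1 = 2 columns per row
print(dummy_encode(colors, drop_first=False))  # all m = 3 columns per row
```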
Does it matter whether you use all m or m − 1 dummy variables when using kNN?
In regression models, the major point is to exclude one of the m dummy variables to avoid redundancy. Now comes the surprising part: when using categorical predictors in machine learning algorithms such as k-nearest neighbors (kNN) or classification and regression trees, we keep all m dummy variables.
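A quick illustration of why this matters for distance-based methods (the helper functions are illustrative): with all m dummies, every pair of distinct categories is equidistant; dropping one dummy makes the dropped category artificially closer to the others.

```python
def onehot(value, cats, drop_first=False):
    """Encode one value as 0/1 indicators over the category list."""
    kept = cats[1:] if drop_first else cats
    return [1 if value == c else 0 for c in kept]

def sqdist(a, b):
    """Squared Euclidean distance between two encoded vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

cats = ["blue", "green", "red"]

# All m dummies: every pair of distinct categories is equally far apart.
full = {c: onehot(c, cats) for c in cats}
print(sqdist(full["red"], full["green"]))  # 2
print(sqdist(full["red"], full["blue"]))   # 2

# m-1 dummies (drop "blue"): the dropped category is now closer to the rest.
part = {c: onehot(c, cats, drop_first=True) for c in cats}
print(sqdist(part["red"], part["green"]))  # 2
print(sqdist(part["red"], part["blue"]))   # 1
```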
Which distance metric do we use in Knn for categorical variables?
Euclidean and Manhattan distances are used for continuous variables, whereas Hamming distance is used for categorical variables.
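Hamming distance simply counts the positions at which two categorical records disagree. A minimal sketch (the `hamming` function and example records are illustrative):

```python
def hamming(a, b):
    """Number of positions where two equal-length categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

r1 = ["red", "small", "metal"]
r2 = ["red", "large", "wood"]
print(hamming(r1, r2))  # 2: they differ in size and material
```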
Can you standardize categorical variables?
Normalization/standardization brings all features to a similar scale. When you one-hot encode categorical variables, the resulting columns are already 0/1, so there is no large scale difference (such as 10 vs. 1000) and no need to normalize or standardize them.
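In practice this means you standardize only the continuous columns and leave the 0/1 dummy columns untouched. A minimal sketch with a stdlib z-score helper (the function name and toy values are illustrative):

```python
import statistics

def zscore(col):
    """Standardize a numeric column to zero mean and unit (population) std."""
    mu = statistics.mean(col)
    sd = statistics.pstdev(col)
    return [(x - mu) / sd for x in col]

ages = [20, 30, 40]          # continuous: standardize
is_red = [1, 0, 1]           # one-hot dummy: leave as-is
print(zscore(ages))          # roughly [-1.22, 0.0, 1.22]
print(is_red)                # unchanged
```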
Can categorical data be continuous?
Quantitative variables can be classified as discrete or continuous. Categorical variables contain a finite number of categories or distinct groups and might not have a logical order. If a discrete variable has many levels, it may be best to treat it as continuous.