Data Mining (18 page)

Authors: Mehmed Kantardzic

BOOK: Data Mining

11.92Mb size Format: txt, pdf, ePub

TABLE 3.2.
Training Data Set for Applying
Relief
Algorithm

In this example the number of samples is low and therefore we use all samples (m = n) to estimate the feature scores. Based on the previous results feature F
₁is much more relevant in classification than feature F
₂. If we assign the threshold value of τ = 5, it is possible to eliminate feature F
₂and build the classification model only based on feature F
₁.

Relief is a simple algorithm that relies entirely on a statistical methodology. The algorithm employs few heuristics, and therefore it is efficient—its computational complexity is O(mpn). With m, number of randomly selected training samples, being a user-defined constant, the time complexity becomes O(pn), which makes it very scalable to data sets with both a huge number of samples
n
and a very high dimensionality p. When n is very large, it is assumed that m
n. It is one of the few algorithms that can evaluate features for real-world problems with large feature space and large number of samples. Relief is also noise-tolerant and is unaffected by feature interaction, and this is especially important for hard data-mining applications. However, Relief does not help with removing redundant features. As long as features are deemed relevant to the class concept, they will be selected even though many of them are highly correlated to each other.

One of the Relief problems is to pick a proper value of τ. Theoretically, using the so-called Cebysev’s inequality τ may be estimated:

While the above formula determines τ in terms of α (data-mining model accuracy) and m (the training data set size), experiments show that the score levels display clear contrast between relevant and irrelevant features so τ can easily be determined by inspection.

Relief was extended to deal with multi-class problems, noise, redundant, and missing data. Recently, additional feature-selection methods based on feature weighting are proposed including ReliefF, Simba, and I-Relief, and they are improvements of the basic Relief algorithm.

3.4 ENTROPY MEASURE FOR RANKING FEATURES

A method for unsupervised feature selection or ranking based on entropy measure is a relatively simple technique, but with a large number of features its complexity increases significantly. The basic assumption is that all samples are given as vectors of a feature’s values without any classification of output samples. The approach is based on the observation that removing an irrelevant feature, a redundant feature, or both from a set may not change the basic characteristics of the data set. The idea is to remove as many features as possible and yet maintain the level of distinction between the samples in the data set as if no features had been removed. The algorithm is based on a similarity measure S that is in inverse proportion to the distance D between two n-dimensional samples. The distance measure D is small for close samples (close to 0) and large for distinct pairs (close to one). When the features are numeric, the similarity measure S of two samples can be defined as

where D
_ijis the distance between samples x
_iand x
_jand α is a parameter mathematically expressed as

D is the average distance among samples in the data set. Hence, α is determined by the data. But, in a successfully implemented practical application, it was used a constant value of α = 0.5. Normalized Euclidean distance measure is used to calculate the distance D
_ijbetween two samples x
_iand x
_j:

where n is the number of dimensions and max
_kand min
_kare maximum and minimum values used for normalization of the
k
th dimension.

All features are not numeric. The similarity for nominal variables is measured directly using Hamming distance:

where |x
_ik= x
_jk| is 1 if x
_ik= x
_jk, and 0 otherwise. The total number of variables is equal to n. For mixed data, we can discretize numeric values and transform numeric features into nominal features before we apply this similarity measure. Figure
3.3
a is an example of a simple data set with three categorical features; corresponding similarities are given in Figure
3.3
b.

Figure 3.3.
A tabular representation of similarity measures S. (a) Data set with three features; (b) a table of similarity measures categorical feature S
_ijbetween samples.

Other books

Lawless by Emma Wildes

The Blazing Star by Erin Hunter

Hostages of Hate by Franklin W. Dixon

Proven Guilty by Jim Butcher

Tempting Taylor by Beverly Havlir

Asimov's Science Fiction: April/May 2014 by Penny Publications

Razer's Ride by Jamie Begley

Sorceress Rising (A Gargoyle and Sorceress Tale Book 2) by Lisa Blackwood

Gettin' Lucky (Love and Laughter) by Raye, Kimberly

Dark and Damaged: Eight Tortured Heroes of Paranormal Romance: Paranormal Romance Boxed Set by Colleen Gleason