Data Mining (133 page)

Read Data Mining Online

Authors: Mehmed Kantardzic

BOOK: Data Mining
7.7Mb size Format: txt, pdf, ePub

Cox, E.,
Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration
, Morgan Kaufmann, San Francisco, CA, 2005.

Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration
is a handbook for analysts, engineers, and managers involved in developing data-mining models in business and government. As you will discover, fuzzy systems are extraordinarily valuable tools for representing and manipulating all kinds of data, and genetic algorithms and evolutionary programming techniques drawn from biology provide the most effective means for designing and tuning these systems. You do not need a background in fuzzy modeling or genetic algorithms to benefit, for this book provides it, along with detailed instruction in methods that you can immediately put to work in your own projects. The author provides many diverse examples and also an extended example in which evolutionary strategies are used to create a complex scheduling system.

Laurent, A., M. Lesot, eds.,
Scalable Fuzzy Algorithms for Data Management and Analysis, Methods and Design
, IGI Global, Hershey, PA, 2010.

The book presents innovative, cutting-edge fuzzy techniques that highlight the relevance of fuzziness for huge data sets in the perspective of scalability issues, from both a theoretical and experimental point of view. It covers a wide scope of research areas including data representation, structuring and querying, as well as information retrieval and data mining. It encompasses different forms of databases, including data warehouses, data cubes, tabular or relational data, and many applications, among which are music warehouses, video mining, bioinformatics, semantic Web and data streams.

Li, H. X., V. C. Yen,
Fuzzy Sets and Fuzzy Decision-Making
, CRC Press, Inc., Boca Raton, 1995.

The book emphasizes the applications of fuzzy-set theory in the field of management science and decision science, introducing and formalizing the concept of fuzzy decision making. Many interesting methods of fuzzy decision making are developed and illustrated with examples.

Pal, S. K., S. Mitra,
Neuro-Fuzzy Pattern Recognition: Methods in Soft Computing
, John Wiley & Sons, Inc., New York, 1999.

The authors consolidate a wealth of information previously scattered in disparate articles, journals, and edited volumes, explaining both the theory of neuro-fuzzy computing and the latest methodologies for performing different pattern-recognition tasks using neuro-fuzzy networks—classification, feature evaluation, rule generation, and knowledge extraction. Special emphasis is given to the integration of neuro-fuzzy methods with rough sets and genetic algorithms to ensure a more efficient recognition system.

Pedrycz, W., F. Gomide,
An Introduction to Fuzzy Sets: Analysis and Design
, The MIT Press, Cambridge, 1998.

The book provides a highly readable, comprehensive, self-contained, updated, and well-organized presentation of the fuzzy-set technology. Both theoretical and practical aspects of the subject are given a coherent and balanced treatment. The reader is introduced to the main computational models, such as fuzzy modeling and rule-based computation, and to the frontiers of the field at the confluence of fuzzy-set technology with other major methodologies of soft computing.

15

VISUALIZATION METHODS

Chapter Objectives

  • Recognize the importance of a visual-perception analysis in humans to discover appropriate data-visualization techniques.
  • Distinguish between scientific-visualization and information-visualization techniques (IVT).
  • Understand the basic characteristics of geometric, icon-based, pixel-oriented, and hierarchical techniques in visualization of large data sets
  • Explain the methods of parallel coordinates and radial visualization for n-dimensional data sets.
  • Analyze the requirements for advanced visualization systems in data mining.

How are humans capable of recognizing hundreds of faces? What is our “channel capacity” when dealing with the visual or any other of our senses? How many distinct visual icons and orientations can humans accurately perceive? It is important to factor all these cognitive limitations when designing a visualization technique that avoids delivering ambiguous or misleading information. Categorization lays the foundation for a well-known cognitive technique: the “chunking” phenomena. How many chunks can you hang onto? That varies among people, but the typical range forms “the magical number seven, plus or minus two.” The process of reorganizing large amounts of data into fewer chunks with more bits of information per chunk is known in cognitive science as “recoding.” We expand our comprehension abilities by reformatting problems into multiple dimensions or sequences of chunks, or by redefining the problem in a way that invokes relative judgment, followed by a second focus of attention.

15.1 PERCEPTION AND VISUALIZATION

Perception is our chief means of knowing and understanding the world; images are the mental pictures produced by this understanding. In perception as well as art, a meaningful whole is created by the relationship of the parts to each other. Our ability to see patterns in things and pull together parts into a meaningful whole is the key to perception and thought. As we view our environment, we are actually performing the enormously complex task of deriving meaning out of essentially separate and disparate sensory elements. The eye, unlike the camera, is not a mechanism for capturing images so much as it is a complex processing unit that detects changes, forms, and features, and selectively prepares data for the brain to interpret. The image we perceive is a mental one, the result of gleaning what remains constant while the eye scans. As we survey our three-dimensional (3-D) ambient environment, properties such as contour, texture, and regularity allow us to discriminate objects and see them as constants.

Human beings do not normally think in terms of data; they are inspired by and think in terms of images—mental pictures of a given situation—and they assimilate information more quickly and effectively as visual images than as textual or tabular forms. Human vision is still the most powerful means of sifting out irrelevant information and detecting significant patterns. The effectiveness of this process is based on a picture’s submodalities (shape, color, luminance, motion, vectors, texture). They depict abstract information as a visual grammar that integrates different aspects of represented information. Visually presenting abstract information, using graphical metaphors in an immersive 2-D or 3-D environment, increases one’s ability to assimilate many dimensions of the data in a broad and immediately comprehensible form. It converts aspects of information into experiences our senses and mind can comprehend, analyze, and act upon.

We have heard the phrase “Seeing is believing” many times, although merely seeing is not enough. When you understand what you see, seeing becomes believing. Recently, scientists discovered that seeing and understanding together enable humans to discover new knowledge with deeper insight from large amounts of data. The approach integrates the human mind’s exploratory abilities with the enormous processing power of computers to form a powerful visualization environment that capitalizes on the best of both worlds. A computer-based visualization technique has to incorporate the computer less as a tool and more as a communication medium. The power of visualization to exploit human perception offers both a challenge and an opportunity. The challenge is to avoid visualizing incorrect patterns leading to incorrect decisions and actions. The opportunity is to use knowledge about human perception when designing visualizations. Visualization creates a feedback loop between perceptual stimuli and the user’s cognition.

Visual data-mining technology builds on visual and analytical processes developed in various disciplines including scientific visualization, computer graphics, data mining, statistics, and machine learning with custom extensions that handle very large multidimensional data sets interactively. The methodologies are based on both functionality that characterizes structures and displays data and human capabilities that perceive patterns, exceptions, trends, and relationships.

15.2 SCIENTIFIC VISUALIZATION AND INFORMATION VISUALIZATION

Visualization is defined in the dictionary as “a mental image.” In the field of computer graphics, the term has a much more specific meaning. Technically, visualization concerns itself with the display of behavior and, particularly, with making complex states of behavior comprehensible to the human eye. Computer visualization, in particular, is about using computer graphics and other techniques to think about more cases, more variables, and more relations. The goal is to think clearly, appropriately, with insight, and to act with conviction. Unlike presentations, visualizations are typically interactive and very often animated.

Because of the high rate of technological progress, the amount of data stored in databases increases rapidly. This proves true for traditional relational databases and complex 2-D and 3-D multimedia databases that store images, computer-aided design (CAD) drawings, geographic information, and molecular biology structure. Many of the applications mentioned rely on very large databases consisting of millions of data objects with several tens to a few hundred dimensions. When confronted with the complexity of data, users face tough problems: Where do I start? What looks interesting here? Have I missed anything? What are the other ways to derive the answer? Are there other data available? People think iteratively and ask ad hoc questions of complex data while looking for insights.

Computation, based on these large data sets and databases, creates content. Visualization makes computation and its content accessible to humans. Therefore, visual data mining uses visualization to augment the data-mining process. Some data-mining techniques and algorithms are difficult for decision makers to understand and use. Visualization can make the data and the mining results more accessible, allowing comparison and verification of results. Visualization can also be used to steer the data-mining algorithm.

It is useful to develop a taxonomy for data visualization, not only because it brings order to disjointed techniques, but also because it clarifies and interprets ideas and purposes behind these techniques. Taxonomy may trigger the imagination to combine existing techniques or discover a totally new technique.

Visualization techniques can be classified in a number of ways. They can be classified as to whether their focus is geometric or symbolic, whether the stimulus is 2-D, 3-D, or n-dimensional, or whether the display is static or dynamic. Many visualization tasks involve detection of differences in data rather than a measurement of absolute values. It is the well-known Weber’s Law that states that the likelihood of detection is proportional to the relative change, not the absolute change, of a graphical attribute. In general, visualizations can be used to explore data, to confirm a hypothesis, or to manipulate a view.

In
exploratory visualizations
, the user does not necessarily know what he/she is looking for. This creates a dynamic scenario in which interaction is critical. The user is searching for structures or trends and is attempting to arrive at some hypothesis. In
confirmatory visualizations
, the user has a hypothesis that needs only to be tested. This scenario is more stable and predictable. System parameters are often predetermined and visualization tools are necessary for the user to confirm or refute the hypothesis. In
manipulative (production) visualizations
, the user has a validated hypothesis and so knows exactly what is to be presented. Therefore, he/she focuses on refining the visualization to optimize the presentation. This type is the most stable and predictable of all visualizations.

The accepted taxonomy in this book is primarily based on different approaches in visualization caused by different types of source data. Visualization techniques are divided roughly into two classes, depending on whether physical data are involved. These two classes are
scientific visualization
and
information visualization
.

Scientific visualization
focuses primarily on physical data such as the human body, the earth, and molecules. Scientific visualization also deals with multidimensional data, but most of the data sets used in this field use the spatial attributes of the data for visualization purposes, for example, computer-aided tomography(CAT) and CAD. Also, many of the Geographical Information Systems (GIS) use either the Cartesian coordinate system or some modified geographical coordinates to achieve a reasonable visualization of the data.

Information visualization
focuses on abstract, nonphysical data such as text, hierarchies, and statistical data. Data-mining techniques are primarily oriented toward information visualization. The challenge for nonphysical data is in designing a visual representation of multidimensional samples (where the number of dimensions is greater than three). Multidimensional-information visualizations present data that are not primarily plenary or spatial. One-, two-, and three-dimensional, but also temporal information–visualization schemes can be viewed as a subset of multidimensional information visualization. One approach is to map the nonphysical data to a virtual object such as a cone tree, which can be manipulated as if it were a physical object. Another approach is to map the nonphysical data to the graphical properties of points, lines, and areas.

Using historical developments as criteria, we can divide IVT into two broad categories: traditional IVT and novel IVT. Traditional methods of 2-D and 3-D graphics offer an opportunity for information visualization, even though these techniques are more often used for presentation of physical data in scientific visualization. Traditional visual metaphors are used for a single or a small number of dimensions, and they include:

1.
bar charts
that show aggregations and frequencies;

2.
histograms
that show the distribution of variable values;

3.
line charts
for understanding trends in order;

4.
pie charts
for visualizing fractions of a total;

5.
scatter plots
for bivariate analysis.

Color-coding is one of the most common traditional IVT methods for displaying a 1-D set of values where each value is represented by a different color. This representation becomes a continuous tonal variation of color when real numbers are the values of a dimension. Normally, a color spectrum from blue to red is chosen, representing a natural variation from “cool” to “hot,” in other words, from the smallest to the highest values.

With the development of large data warehouses, data cubes became very popular IVT. A
data cube
, the raw-data structure in a multidimensional database, organizes information along a sequence of categories. The categorizing variables are called dimensions. The data, called measures, are stored in cells along given dimensions. The cube dimensions are organized into hierarchies and usually include a dimension representing time. The hierarchical levels for the dimension time may be year, quarter, month, day, and hour. Similar hierarchies could be defined for other dimensions given in a data warehouse. Multidimensional databases in modern data warehouses automatically aggregate measures across hierarchical dimensions; they support hierarchical navigation, expand and collapse dimensions, enable drill down, drill up, or drill across, and facilitate comparisons through time. In a transaction information in the database, the cube dimensions might be product, store, department, customer number, region, month, year. The dimensions are predefined indices in a cube cell and the measures in a cell are roll-ups or aggregations over the transactions. They are usually sums but may include functions such as average, standard deviation, and percentage.

For example, the values for the dimensions in a database may be

1.
region: north, south, east, west;

2.
product: shoes, shirts;

3.
month: anuary, February, March, … , December.

Then, the cell corresponding to (north, shirt, February) is the total sales of shirts for the northern region for the month of February.

Novel IVT can simultaneously represent large data sets with many dimensions on one screen. The widely accepted classifications of these new techniques are

1.
geometric-projection techniques,

2.
icon-based techniques,

3.
pixel-oriented techniques, and

4.
hierarchical techniques.

Geometric-projection techniques
aim to find interesting projections of multidimensional data sets. We will present some illustrative examples of these techniques.

The Scatter-Plot Matrix Technique is an approach that is very often available in new data-mining software tools. A grid of 2-D scatter plots is the standard means of extending a standard 2-D scatter plot to higher dimensions. If you have 10-D data, a 10 × 10 array of scatter plots is used to provide a visualization of each dimension versus every other dimension. This is useful for looking at all possible two-way interactions or correlations between dimensions. Positive and negative correlations, but only between two dimensions, can be seen easily. The standard display quickly becomes inadequate for extremely large numbers of dimensions, and user interactions of zooming and panning are needed to interpret the scatter plots effectively.

Other books

Real Live Boyfriends by E. Lockhart
Lust Killer by Ann Rule
Shadows of Golstar by Terrence Scott
Cut Dead by Mark Sennen
The Carlton Club by Stone, Katherine
Wry Martinis by Christopher Buckley