Data Mining: Concepts and Techniques — Slides for Textbook ...


9/6/09
Data Mining: Concepts
and Techniques
1
Data Mining:
Concepts and Techniques
— Slides for Textbook —
— Chapter 1 —
© Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj

Acknowledgements
This set of slides started with Han’s tutorial for a UCLA Extension course in February 1998
Other subsequent contributors:
Dr. Hongjun Lu (Hong Kong Univ. of Science and Technology)
Graduate students from Simon Fraser Univ., Canada, notably Eugene Belchev, Jian Pei, and Osmar R. Zaiane
Graduate students from Univ. of Illinois at Urbana-Champaign

CS497JH Schedule (Fall 2002)
Chapter 1. Introduction {W1: L1}
Chapter 2. Data pre-processing {W4: L1-2}
Homework #1 distribution (SQLServer2000)
Chapter 3. Data warehousing and OLAP technology for data mining {W2: L1-2, W3: L1-2}
Homework #2 distribution
Chapter 4. Data mining primitives, languages, and system architectures {W5: L1}
Chapter 5. Concept description: Characterization and comparison {W5: L2, W6: L1}
Chapter 6. Mining association rules in large databases {W6: L2, W7: L1-L21, W8: L1}
Homework #3 distribution
Chapter 7. Classification and prediction {W8: L2, W9: L2, W10: L1}
Midterm {W9: L1}
Chapter 8. Clustering analysis {W10: L2, W11: L1-2}
Homework #4 distribution
Chapter 9. Mining complex types of data {W12: L1-2, W13: L1-2}
Chapter 10. Data mining applications and trends in data mining {W14: L1}
Research/Development project presentation (W14-W15 + final exam period)
Final Project Due

Where to Find the Set of Slides?
Book page (MS PowerPoint files): www.cs.uiuc.edu/~hanj/dmbook
Updated course presentation slides (.ppt): www-courses.cs.uiuc.edu/~cs497jh/
Research papers, DBMiner system, and other related information: www.cs.uiuc.edu/~hanj or www.dbminer.com

Chapter 1. Introduction
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Are all the patterns interesting?
Classification of data mining systems
Major issues in data mining

Necessity Is the Mother of Invention
Data explosion problem
Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing
Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream data management and mining
Data mining with a variety of applications
Web technology and global information systems

What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
(Deductive) query processing
Expert systems or small ML/statistical programs

Why Data Mining? Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting, quality control, competitive analysis
Fraud detection and detection of unusual patterns (outliers)
Other Applications
Text mining (news group, email, documents) and Web mining
Stream data mining
DNA and bio-data analysis

Market Analysis and Management
Where does the data come from?
Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Cross-market analysis
Associations/co-relations between product sales, and prediction based on such associations
Customer profiling
What types of customers buy what products (clustering or classification)
Customer requirement analysis
Identifying the best products for different customers
Predicting what factors will attract new customers
Provision of summary information
Multidimensional summary reports
Statistical summary information (data central tendency and variation)

Corporate Analysis & Risk Management
Finance planning and asset evaluation
Cash flow analysis and prediction
Contingent claim analysis to evaluate assets
Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning
Summarize and compare the resources and spending
Competition
Monitor competitors and market directions
Group customers into classes and apply a class-based pricing procedure
Set pricing strategy in a highly competitive market

Fraud Detection & Mining Unusual Patterns
Approaches: Clustering & model construction for frauds, outlier analysis
Applications: Health care, retail, credit card service, telecomm.
Auto insurance: ring of collisions
Money laundering: suspicious monetary transactions
Medical insurance
Professional patients, ring of doctors, and ring of references
Unnecessary or correlated screening tests
Telecommunications: phone-call fraud
Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
Retail industry
Analysts estimate that 38% of retail shrink is due to dishonest employees
Anti-terrorism
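
The phone-call model above boils down to flagging calls that deviate from an expected norm. A minimal sketch of that idea follows, assuming a simple z-score test on call duration; the function name `flag_unusual_calls`, the threshold of 2.0, and the sample durations are illustrative assumptions, not taken from the slides.

```python
from statistics import mean, stdev

def flag_unusual_calls(durations, threshold=2.0):
    """Return indices of calls whose duration lies more than
    `threshold` sample standard deviations from the mean."""
    mu = mean(durations)
    sigma = stdev(durations)
    if sigma == 0:
        return []  # all calls identical, so nothing deviates
    return [i for i, d in enumerate(durations)
            if abs(d - mu) / sigma > threshold]

# Hypothetical call durations (minutes) for one subscriber.
calls = [3, 4, 2, 5, 3, 4, 3, 2, 4, 95]
unusual = flag_unusual_calls(calls)
print(unusual)  # the 95-minute call (index 9) is flagged
```

A real fraud model would combine several attributes (destination, time of day or week) and would estimate the norm from history that excludes the calls being scored; the single-attribute test here only illustrates the deviation check.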

Other Applications
Sports
IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy
JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
Internet Web Surf-Aid
IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process
[Figure: KDD process flow: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation -> Knowledge]

Steps of a KDD Process
Learning the application domain
Relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining
Summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
Visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
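
The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the record layout, the function names (`select`, `clean`, `transform`, `mine`, `evaluate`), the 100.0 cutoff, and the choice of counting as the mining function (summarization) are all hypothetical.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, region, purchase_amount).
# None marks a missing value; the last row is an exact duplicate.
raw = [
    (1, "east", 120.0),
    (2, "west", None),
    (3, "east", 80.0),
    (4, "north", 200.0),
    (3, "east", 80.0),
]

def select(records, regions):
    """Data selection: keep only the task-relevant records."""
    return [r for r in records if r[1] in regions]

def clean(records):
    """Data cleaning: drop records with missing values and duplicates."""
    seen, out = set(), []
    for r in records:
        if None in r or r in seen:
            continue
        seen.add(r)
        out.append(r)
    return out

def transform(records, cutoff=100.0):
    """Reduction/transformation: drop the id, discretize the amount."""
    return [(region, "high" if amount >= cutoff else "low")
            for _, region, amount in records]

def mine(records):
    """Data mining (summarization): count each (region, level) pattern."""
    return Counter(records)

def evaluate(patterns, min_count=1):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

knowledge = evaluate(mine(transform(clean(select(raw, {"east", "west"})))))
print(knowledge)
```

In practice the process is iterative rather than a single pass: results from pattern evaluation often send the analyst back to selection or cleaning.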

Data Mining and Business Intelligence
[Figure: pyramid of layers with increasing potential to support business decisions toward the top. Layers, bottom to top: Data Sources (paper, files, information providers, database systems, OLTP); Data Warehouses / Data Marts; Data Exploration (OLAP, MDA, statistical analysis, querying and reporting); Data Mining (information discovery); Data Presentation (visualization techniques); Making Decisions. Roles alongside, bottom to top: DBA, Data Analyst, Business Analyst, End User.]
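
Architecture: Typical Data Mining System
[Figure: layered architecture, bottom to top: Databases and Data Warehouse (fed by data cleaning & data integration, and filtering); Database or data warehouse server; Data mining engine; Pattern evaluation; Graphical user interface. A Knowledge-base sits alongside the mining engine and pattern evaluation layers.]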

Data Mining: On What Kinds of Data?
Relational database
Data warehouse
Transactional database
Advanced database and information repository
Object-relational database
Spatial and temporal data
Time-series data
Stream data
Multimedia database
Heterogeneous and legacy database
Text databases & WWW

Data Mining Functionalities
Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Association (correlation and causality)
Diaper → Beer [0.5%, 75%] (support, confidence)
Classification and Prediction
Construct models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage
Presentation: decision-tree, classification rule, neural network
Predict some unknown or missing numerical values
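
The two bracketed numbers on the Diaper/Beer rule are the rule's support and confidence. A minimal sketch of how they are computed is given below; the five transactions are hypothetical, so the resulting values differ from the 0.5% and 75% quoted on the slide.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

# Hypothetical market baskets.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]

rule_support = support(transactions, {"diaper", "beer"})
rule_confidence = confidence(transactions, {"diaper"}, {"beer"})
print(rule_support, rule_confidence)
```

Here three of five baskets contain both items (support 3/5) and three of the four diaper baskets also contain beer (confidence 3/4).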

Data Mining Functionalities (2)
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
Noise or exception? No! Useful in fraud detection and rare-event analysis
Trend and evolution analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
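
To make the cluster-analysis idea concrete, here is a small sketch that groups unlabeled values so that members of a group are close to each other and far from other groups. It uses one-dimensional k-means, which is a technique chosen for illustration and not named on the slide; the house prices, the starting centers, and the function name `kmeans_1d` are assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small 1-D k-means: assign each value to its nearest
    center, then move each center to the mean of its cluster."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical house prices (in thousands), no class labels given.
prices = [100, 110, 120, 300, 310, 320]
centers, clusters = kmeans_1d(prices, centers=[100, 320])
print(centers, clusters)
```

The result depends on the starting centers and on the number of clusters, both of which are supplied by hand here.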

Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns: Not all of them are interesting
Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Can We Find All and Only Interesting Patterns?
Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns?
Heuristic vs. exhaustive search
Association vs. classification vs. clustering
Search for only interesting patterns: An optimization problem
Can a data mining system find only the interesting patterns?
Approaches
First generate all the patterns and then filter out the uninteresting ones
Generate only the interesting patterns: mining query optimization
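
The two approaches can be contrasted on frequent itemsets, taking "interesting" to mean "appears in at least `min_count` transactions". The sketch below is an assumption-laden illustration: the second function uses Apriori-style level-wise growth, which the slide does not name, and the transactions and function names are hypothetical.

```python
from itertools import combinations

def support_count(transactions, itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def generate_then_filter(transactions, min_count):
    """Approach 1: enumerate every itemset, then drop infrequent ones.
    Returns (frequent itemsets, number of candidates examined)."""
    items = sorted(set().union(*transactions))
    candidates = [frozenset(c)
                  for k in range(1, len(items) + 1)
                  for c in combinations(items, k)]
    frequent = {c for c in candidates
                if support_count(transactions, c) >= min_count}
    return frequent, len(candidates)

def generate_only_frequent(transactions, min_count):
    """Approach 2: grow itemsets level by level, extending only those
    already known to be frequent (Apriori-style pruning)."""
    items = sorted(set().union(*transactions))
    level = {frozenset([i]) for i in items}
    frequent, examined = set(), 0
    while level:
        examined += len(level)
        kept = {c for c in level
                if support_count(transactions, c) >= min_count}
        frequent |= kept
        # Join pairs of frequent k-itemsets into (k+1)-itemsets.
        level = {a | b for a in kept for b in kept
                 if len(a | b) == len(a) + 1}
    return frequent, examined

transactions = [
    {"beer", "diaper", "milk"},
    {"beer", "diaper"},
    {"bread", "diaper"},
    {"bread", "milk"},
    {"beer", "bread", "diaper"},
]
all_then_filter, n_all = generate_then_filter(transactions, min_count=2)
pruned, n_pruned = generate_only_frequent(transactions, min_count=2)
print(n_all, n_pruned, all_then_filter == pruned)
```

Both functions return the same itemsets, but the second examines fewer candidates because no superset of an infrequent itemset can be frequent. The gap widens quickly as the number of items grows.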

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]

Data Mining: Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Different views, different classifications
Kinds of data to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted

Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.

OLAP Mining: Integration of Data Mining and Data Warehousing
Data mining systems, DBMS, Data warehouse systems coupling
No coupling, loose-coupling, semi-tight-coupling, tight-coupling
On-line analytical mining data
Integration of mining and OLAP technologies
Interactive mining of multi-level knowledge
Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Integration of multiple mining functions
Characterized classification, first clustering and then association
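
The drilling/rolling and slicing operations mentioned above can be illustrated on a tiny fact table. This is a sketch only: the table, its dimension names, and the helper names `aggregate` and `slice_rows` are hypothetical, and a real OLAP engine would work on precomputed data cubes instead of scanning rows.

```python
from collections import defaultdict

# Hypothetical fact table: (year, quarter, region, city, sales).
DIMS = ("year", "quarter", "region", "city")
facts = [
    (2002, "Q1", "East", "Boston", 100),
    (2002, "Q1", "East", "New York", 150),
    (2002, "Q2", "East", "Boston", 120),
    (2002, "Q1", "West", "Seattle", 90),
    (2002, "Q2", "West", "Seattle", 110),
]

def aggregate(rows, dims):
    """Sum sales grouped by the chosen dimensions. Fewer or coarser
    dimensions give a roll-up; more or finer ones give a drill-down."""
    idx = [DIMS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

def slice_rows(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return [r for r in rows if r[i] == value]

by_city = aggregate(facts, ("region", "city"))    # finer level
by_region = aggregate(facts, ("region",))         # roll-up to region
q1_by_region = aggregate(slice_rows(facts, "quarter", "Q1"), ("region",))
print(by_city, by_region, q1_by_region)
```

Mining at multiple levels of abstraction then amounts to running the same mining function on the output of different roll-up or slice choices.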
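
An OLAM Architecture
[Figure: four-layer OLAM architecture. Layer 1, Data Repository: Databases and Data Warehouse, fed through data cleaning, data integration, and filtering. Layer 2, MDDB: multidimensional database with Meta Data, reached through the Database API (Filtering & Integration). Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, reached through the Data Cube API. Layer 4, User Interface: User GUI API, which accepts a mining query and returns a mining result.]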

Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy

Summary
Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining

A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
Journal of Data Mining and Knowledge Discovery (1997)
1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.

Where to Find References?
Data mining and KDD (SIGKDD: CD-ROM)
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journal: Data Mining and Knowledge Discovery, KDD Explorations
Database systems (SIGMOD: CD-ROM)
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
AI & Machine Learning
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
Journals: Machine Learning, Artificial Intelligence, etc.
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of Statistics, etc.
Visualization
Conference proceedings: CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Recommended Reference Books
R. Agrawal, J. Han, and H. Mannila. Readings in Data Mining: A Database Perspective. Morgan Kaufmann (in preparation)
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001
D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
T. M. Mitchell. Machine Learning. McGraw Hill, 1997
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2001
www.cs.uiuc.edu/~hanj
Thank you!