Description:
Introduction to Link Analysis
September 11, 2007
Analysis of Social Media Seminar
Natalie Glance
Content adapted from:
CS345 Data Mining Class Slides
Anand Rajaraman, Jeffrey D. Ullman
IMA Tutorial on Measuring & Modeling
the Web
Andrew Tomkins
Problem formulation (1998)
Suppose we are given a
collection of documents on some broad topic
e.g., universities, evolution,
iraq
perhaps obtained through
a text search
Can we organize these documents
in some manner?
Link analysis techniques
use links, not text
Link Analysis Algorithms
Page Rank
Hubs and Authorities
Both proposed at about
the same time (1998)
Topic-Specific Page Rank
Spam Detection Algorithms
Mining for Communities
Further Reading: Link Analysis
for Web 2.0
PageRank
The
PageRank citation ranking: Bringing
order to the Web, L. Page,
S. Brin, R. Motwani, T. Winograd, 1999.
The
Anatomy of a Large-Scale Hypertextual
Web Search Engine, S. Brin,
L. Page (1998).
Ranking web pages
Web pages are not equally
âimportantâ
www.silly-billy.com v www.stanford.edu
Inlinks as votes
www.stanford.edu has 23,400 inlinks
www.silly-billy.com has 1 inlink
Are all inlinks equal?
Recursive question!
Simple recursive formulation
Each linkâs vote is proportional
to the importance of its source page
If page P with importance x has n outlinks, each link gets x/n votes
Simple âflowâ model
The web long agoâ¦
Yahoo
Mâsoft
Amazon
y
a
m
y/2
y/2
a/2
a/2
m
y = y
/2 + a /2
a = y
/2 + m
m = a
/2
Solving the flow equations
3 equations, 3 unknowns,
no constants
No unique solution
All solutions equivalent
modulo scale factor
Additional constraint forces
uniqueness
y+a+m = 1
y = 2/5, a = 2/5, m = 1/5
Gaussian elimination method
works for small examples, but we need a better method for large graphs
Matrix formulation
Matrix M has one
row and one column for each web page
Suppose page j has n outlinks
If j links to i, then
Mij=1/n
Else Mij=0
M is a column stochastic matrix
Columns sum to 1
Suppose r is a vector
with one entry per web page
ri is the importance
score of page i
Call it the rank vector
Example
Suppose page j
links to 3 pages, including i
i
j
M
r
r
=
i
1/3
Eigenvector formulation
The flow equations can be
written
r
= Mr
So the rank vector is an
eigenvector of the stochastic web matrix
In fact, its first or principal
eigenvector, with corresponding eigenvalue 1
Example
Yahoo
Mâsoft
Amazon
y 1/2 1/2
0
a 1/2
0 1
m 0
1/2 0
y a
m
y = y
/2 + a /2
a = y
/2 + m
m = a
/2
r = Mr
y
1/2 1/2 0 y
a = 1/2
0 1 a
m
0 1/2 0 m
Power Iteration method
Simple iterative scheme
(aka relaxation)
Suppose there are N web
pages
Initialize: r0
= [1/N,â¦.,1/N]T
Iterate: rk+1
= Mrk
Stop when |rk+1
- rk|1 < ï¥
|x|1
= ï¥1·i·N|xi|
is the L1 norm
Can use any other vector
norm e.g., Euclidean
Power Iteration Example
Yahoo
Mâsoft
Amazon
y 1/2 1/2
0
a 1/2
0 1
m 0
1/2 0
y a
m
y
a =
m
1/3
1/3
1/3
1/3
1/2
1/6
5/12
1/3
1/4
3/8
11/24
1/6
2/5
2/5
1/5
. . .
Random Walk Interpretation
Imagine a random
web surfer
At any time t, surfer is
on some page P
At time t+1, the surfer
follows an outlink from P uniformly at random
Ends up on some page Q linked
from P
Process repeats indefinitely
Let p(t) be a vector
whose ith component is the probability that the surfer is
at page i at time t
p(t) is a probability
distribution on pages
The stationary distribution
Where is the surfer at time
t+1?
Follows a link uniformly
at random
p(t+1) = Mp(t)
Suppose the random walk
reaches a state such that p(t+1) = Mp(t) = p(t)
Then p(t) is called
a stationary distribution for the random walk
Our rank vector r
satisfies r = Mr
So it is a stationary distribution
for the random surfer
Existence and Uniqueness
A central result
from the theory of random walks (aka Markov processes):
For graphs that satisfy
certain conditions, the stationary distribution is unique and eventually
will be reached no matter what the initial probability distribution
at time t = 0.
Spider traps
A group of pages is a spider trap
if there are no links from within the group to outside the group
Random surfer gets trapped
Spider traps violate the
conditions needed for the random walk theorem
Microsoft becomes a spider trap
Yahoo
Mâsoft
Amazon
y 1/2 1/2
0
a 1/2
0 0
m 0
1/2 1
y a
m
y
a =
m
1
1
1
1
1/2
3/2
3/4
1/2
7/4
5/8
3/8
2
0
0
3
. . .
Random teleports
The PageRank solution for
spider traps
At each time step, the random
surfer has two options:
With probability ï¢, follow
a link at random
With probability 1-ï¢, jump
to some page uniformly at random
Common values for ï¢ are
in the range 0.8 to 0.9
Surfer will teleport out
of spider trap within a few time steps
Matrix formulation
Suppose there are N pages
Consider a page j, with
set of outlinks O(j)
We have Mij =
1/|O(j)| when j links to i and Mij = 0 otherwise
The random teleport is equivalent
to
adding a teleport
link from j to every other page
with probability (1-ï¢)/N
reducing the probability
of following each outlink from 1/|O(j)| to ï¢/|O(j)|
Equivalent: tax each page
a fraction (1-ï¢)
of its score and redistribute evenly
PageRank
Construct the NXN matrix
A as follows
Aij = ï¢Mij
+ (1-ï¢)/N
Verify that A is
a stochastic matrix
The page
rank vector r
is the principal eigenvector of this matrix
satisfying r
= Ar
Equivalently, r is
the stationary distribution of the random walk with teleports
Previous example with ï¢=0.8
Yahoo
Mâsoft
Amazon
1/2
1/2 0
1/2
0 0
0 1/2 1
1/3 1/3 1/3
1/3 1/3 1/3
1/3 1/3 1/3
y 7/15 7/15
1/15
a 7/15 1/15
1/15
m 1/15 7/15
13/15
0.8
+ 0.2
y
a =
m
1
1
1
1.00
0.60
1.40
0.84
0.60
1.56
0.776
0.536
1.688
7/11
5/11
21/11
. . .
Dead ends
Pages with no outlinks are
âdead endsâ for the random surfer
Nowhere to go on next
step
Microsoft becomes a dead end
Yahoo
Mâsoft
Amazon
y
a =
m
1
1
1
1
0.6
0.6
0.787
0.547
0.387
0.648
0.430
0.333
0
0
0
. . .
1/2
1/2 0
1/2
0 0
0 1/2 0
1/3 1/3 1/3
1/3 1/3 1/3
1/3 1/3 1/3
y 7/15 7/15
1/15
a 7/15 1/15
1/15
m 1/15 7/15
1/15
0.8
+ 0.2
Non-
stochastic!
Dealing with dead-ends
Teleport
Follow random teleport
links with probability 1.0 from dead-ends
Adjust matrix accordingly
Prune and propagate
Preprocess the graph to
eliminate dead-ends
Might require multiple passes
Compute page rank on reduced
graph
Approximate values for deadends
by propagating values from reduced graph
Computing PageRank
Key step is matrix-vector
multiply
rnew =
Arold
Easy if we have enough main
memory to hold A, rold, rnew
Say N = 1 billion pages
We need 4 bytes for each
entry (say)
2 billion entries for vectors,
approx 8GB
Matrix A has N2
entries
1018 is a large
number!
Comparison of Query for âUniversityâ
Google vs. AltaVista,
circa 1998
Problems with PageRank
Measures generic popularity
of a page
Biased against topic-specific
authorities
Ambiguous queries e.g.,
Michael Jordan
Uses a single measure of
importance
Other models e.g., hubs
and authorities
Susceptible to Link spam
Artificial link topographies
created in order to boost page rank
Topic-Specific Page Rank
Taher H. Haveliwala. Topic-Sensitive
PageRank. 11th International
World Wide Web Conference, 2002
Instead of generic popularity,
can we measure popularity within a topic?
E.g., computer science,
health
Bias the random walk
When the random walker
teleports, he picks a page from a set S of web pages
S contains only pages that
are relevant to the topic
E.g., Open Directory (DMOZ)
pages for a given topic (www.dmoz.org)
For each teleport set S,
we get a different rank vector rS
Matrix formulation
Aij = ï¢Mij
+ (1-ï¢)/|S|
if i in
S
Aij = ï¢Mij
otherwise
Show that A is stochastic
We have weighted all pages
in the teleport set S equally
Could also assign different
weights to them
How well does TSPR work?
Experimental results [Haveliwala
2000]
Picked 16 topics
Teleport sets determined
using DMOZ
E.g., arts, business, sports,â¦
âBlind studyâ using
volunteers
35 test queries
Results ranked using Page
Rank and TSPR of most closely related topic
E.g., bicycling using Sports
ranking
In most cases volunteers
preferred TSPR ranking
Which topic ranking to use?
User can pick from a menu
Use Bayesian classification
schemes to classify query into a topic
Can use the context
of the query
E.g., query is launched
from a web page talking about a known topic
History of queries e.g.,
âbasketballâ followed by âjordanâ
User context e.g., userâs
My Yahoo settings, bookmarks, â¦
Scaling with topics and users
Suppose we wanted to cover
1000âs of topics
Need to compute 1000âs
of different rank vectors
Need to store and retrieve
them efficiently at query time
For good performance vectors
must fit in memory
Even harder when we consider personalization
Each user has their own
teleport vector
One page rank vector per
user!
Web spam
Spamming = any deliberate action solely in order to
boost a web pageâs position in search engine results, incommensurate
with pageâs real value
Spam = web pages that are the result of spamming
This is a very broad defintion
SEO industry might disagree!
SEO = search engine optimization
Approximately 10-15% of
web pages are spam
Boosting techniques
Term spamming
Manipulating the text of
web pages in order to appear relevant to queries
Link spamming
Creating link structures
that boost page rank or hubs and authorities scores
Link Farms
Spammerâs goal
Maximize the page rank of
target page t
Technique
Get as many links from accessible
pages as possible to target page t
Construct âlink farmâ
to get PageRank multiplier effect
Link Farms
Inaccessible
t
Accessible
Own
1
2
M
One of the most common and effective
organizations for a link farm
TrustRank idea
Combating
Web Spam with TrustRank,
Z. Gyongyi, H. Garcia-Molina, J. Pedersen, VLDB, 2004.
Basic principle: approximate isolation
It is rare for a âgoodâ
page to point to a âbadâ (spam) page
Sample a set of âseed
pagesâ from the web
Have an oracle (human) identify
the good pages and the spam pages in the seed set
Expensive task, so must
make seed set as small as possible
Trust Propagation
Call the subset of seed
pages that are identified as âgoodâ the âtrusted pagesâ
Set trust of each trusted
page to 1
Propagate trust through
links
Each page gets a trust
value between 0 and 1
Use a threshold value and
mark all pages below the trust threshold as spam
Example
1
4
7
2
5
3
6
good
bad
Rules for trust propagation
Trust attenuation
The degree of trust conferred
by a trusted page decreases with distance
Trust splitting
The larger the number of
outlinks from a page, the less scrutiny the page author gives each outlink
Trust is âsplitâ across
outlinks
Simple model
Suppose trust of page p
is t(p)
Set of outlinks O(p)
For each q in O(p), p confers
the trust
bt(p)/|O(p)| for 0<b<1
Trust is additive
Trust of p is the sum of
the trust conferred on p by all its inlinked pages
Note similarity to Topic-Specific
Page Rank
Within a scaling factor,
trust rank = biased page rank with trusted pages as teleport set
Picking the seed set
Two conflicting considerations
Human has to inspect each
seed page, so seed set must be as small as possible
Must ensure every âgood
pageâ gets adequate trust rank, so need make all good pages reachable
from seed set by short paths
Approaches to picking seed set
Suppose we want to pick
a seed set of k pages
PageRank
Pick the top k pages by
PageRank
Assume high PageRank pages
are close to other highly ranked pages
We care more about high
PageRank âgoodâ pages
Inverse page rank
Pick the pages with the
maximum number of outlinks
Can make it recursive
Pick pages that link to
pages with many outlinks
Formalize as âinverse
PageRankâ
Construct graph Gâ by
reversing each edge in web graph G
PageRank in Gâ is inverse
page rank in G
Pick top k pages by inverse
PageRank
Hubs and Authorities
J. Kleinberg. Authoritative
sources in a hyperlinked environment. Proc.
9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version
in Journal of the ACM 46(1999).
HITS Model
Interesting documents fall
into two classes
Authorities are pages containing useful information
course home pages
home pages of auto manufacturers
Hubs are pages that link to authorities
course bulletin
list of US auto manufacturers
Idealized view
Hubs
Authorities
Mutually recursive definition
A good hub links to many
good authorities
A good authority is linked
from many good hubs
Model using two scores for
each node
Hub score and Authority
score
Represented as vectors
h and a
Transition Matrix A
HITS uses a matrix A[i,
j] = 1 if page i links to page j, 0 if not
AT,
the transpose of A, is similar to the PageRank matrix M,
but AT has 1âs where M has fractions
Example
Yahoo
Mâsoft
Amazon
y 1
1 1
a 1
0 1
m 0
1 0
y a
m
A =
Hub and Authority Equations
The hub score of page P
is proportional to the sum of the authority scores of the pages it links
to
h
= λAa
Constant λ
is a scale factor
The authority
score of page P is proportional to the sum of the hub scores of the
pages it is linked from
a
= μAT
h
Constant μ
is scale factor
Iterative algorithm
Initialize h,
a to all 1âs
h = Aa
Scale h so that its
max entry is 1.0
a
= ATh
Scale a so that its
max entry is 1.0
Continue until h,
a converge
Example
1 1 1
A = 1 0 1
0 1 0
1 1 0
AT = 1 0 1
1 1 0
a(yahoo)
a(amazon)
a(mâsoft)
=
=
=
1
1
1
1
1
1
1
4/5
1
1
0.75
1
. . .
. . .
. . .
1
0.732
1
h(yahoo)
= 1
h(amazon) =
1
h(mâsoft)
= 1
1
2/3
1/3
1
0.73
0.27
. . .
. . .
. . .
1.000
0.732
0.268
1
0.71
0.29
Existence and Uniqueness
h = λAa
a = μAT
h
h
= λμAAT
h
a = λμATA
a
Under reasonable assumptions about
A,
the dual iterative algorithm converges
to vectors
h* and a* such that:
h*
is the principal eigenvector of the matrix AAT
a*
is the principal eigenvector of the matrix ATA
Bipartite cores
Hubs
Authorities
Most densely-connected core
(primary
core)
Less densely-connected core
(secondary
core)
Secondary cores
A single topic can have
many bipartite cores
corresponding to different
meanings, or points of view
abortion: pro-choice, pro-life
evolution: darwinian, intelligent
design
jaguar: auto, Mac, NFL team,
panthera onca
How to find such secondary
cores?
Non-primary eigenvectors
AAT
and ATA have the same set of eigenvalues
An eigenpair is the pair of eigenvectors with the same
eigenvalue
The primary
eigenpair (largest eigenvalue)
is what we get from the iterative algorithm
Non-primary
eigenpairs correspond to other
bipartite cores
The eigenvalue is a measure
of the density of links in the core
Finding secondary cores
Once we find the primary
core, we can remove its links from the graph
Repeat HITS algorithm on
residual graph to find the next bipartite core
Technically, not exactly
equivalent to non-primary eigenpair model
Creating the graph for HITS
We need a well-connected
graph of pages for HITS to work well
Add all pages linking into
and out of search results
Sample results: âsearch enginesâ
Authorities (in 1998)
.346 http://www.yahoo.com/
Yahoo!
.291 http://www.excite.com/
Excite
.239 http://www.mckinley.com/
Welcome to Magellan!
.231 http://www.lycos.com/
Lycos Home Page
.231 http://www.altavista.digital.com/
AltaVista: Main Page
PageRank and HITS
PageRank and HITS are two
solutions to the same problem
What is the value of an
inlink from S to D?
In the PageRank model, the
value of the link depends on the links into S
In the HITS model, it depends
on the value of the other links out of S
The destinies of Page Rank
and HITS post-1998 were very different
Why?
Trawling the Web for Emerging Communities
[KRR 98]
(slides in separate ppt)
Link analysis over weblogs
How are Blogs different
than the Web:
Time-stamps -> time ordering
of posts/links
Blog-blog links (blogrolls)
& post-post links
Friends lists
Reciprocal links built in
as trackbacks
Rank of post vs. rank of
blog
Search results: timeliness
vs. relevance/authority
Research topics
Information diffusion
Ranking blogs
Cascades
Community mining
Information diffusion & ranking
Implicit Structure and the
Dynamic of Blogspace, Eytan Adar, WWE-2004
Macroscopic & microscopic
patterns of blog epidemics
Implicit & explicit
ranking algorithms that take advantage of infection patterns
iRank acts on the implicit
link structure to find those blogs that initiate these epidemics
Information diffusion in blogs
Information diffusion through
blogspace, Gruhl et al, WWW 2004
How topics spread through
blogs
Macroscopic: long-running
âchatterâ topics vs. bursty âspikeâ topics
Microscopic: model topic
diffusion as infectious disease
Information cascades
Finding
patterns in blog shapes and blog evolution,
M. McGlohon, J. Leskovec, C. Faloutsos, M. Hurst and N. Glance
Study topology of cascades
implicit in blog post linking behavior
Temporal evolution of types
of cascades for given blogs
Ranking blogs: EigenRumor Algorithm
The
EigenRumor Algorithm for Ranking Blogs,
Ko Fujimura, WWE-2005
"EigenRumor" algorithm
scores each blog entry based on eigenvector calculations.
Higher score assigned to
the blog entries submitted by a good blogger but not yet linked to by
any other blogs based on acceptance of the blogger's prior work.
Mining weblog communities
Extracting
Latent Weblog Communities: A Partitioning
Algorithm for Bipartite Graphs,
Kazunari Ishida, WWE-2005
Builds on âTrawling for
communitiesâ KRR 1998
Bipartite graphs derived
from weblog update information and cited webpages
Partitioning method: weakest
pairs (WP) method to divide a bipartite graph into complete bipartite
subgraphs
Discovery
of Blog Communities Based on Mutual
Awareness, Yu-Ru Lin,
Hari Sundaram, Yun Chi, Jun Tatemura and Belle Tseng, WWE 2006.
Compute mutual awareness
matrix based on reciprocal links
Use an iterative, ranking-based
clustering scheme on the mutual awareness matrix to determine the communities
PageRank is used to determine
the seeds at each step, followed by diffusion of association to determine
the community members.
Combining sentiment & link analysis
Modeling
Trust and Influence in the Blogosphere
Using Link Polarity,
A. Kale, A. Karandikar, P. Kolari, A. Java, T. Finin and A. Joshi, ICWSM
2007
Steps:
Identify polarity of blog-blog
link
Use trust propagation models
to spread sentiment from subset of connected blogs to other blogs
Seed with computed base
polarity of known source blog (blog or blog post)
Take into account magnitude
of strength or weakness of sentiment
Uses lexicon of positive
and negative words
Beyond blogs
Why
We Twitter: Understanding Microblogging
Usage and Communities,
Akshay Java, Xiaodan Song, Tim Finin, and Belle Tseng, Joint 9th
WEBKDD and 1st SNA-KDD Workshop, August 2007.
Dataset includes about 1350K
posts from over 75K users
Presents usage trends, network
properties, top hubs and authorities, community structure and geographic
distribution.