Text Analytics
Models for Information Retrieval 1
Ulf Leser
Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
• Other IR Models
Ulf Leser: Text Analytics, Winter Semester 2010/2011
Information Retrieval Core
• The core question in IR: Which of a given set of (normalized) documents is relevant for a given query?
• Ranking: How relevant is each document of a given set for a given query?

[Figure: Query and document base each pass through normalization; matching / relevance scoring then produces the result.]
IR Models
• How is relevance judged?
• Different models of relevancy
[Figure: Taxonomy of IR models by user task [BYRN99]
• Retrieval (ad hoc, filtering)
  – Classic models: Boolean, Vector-Space, Probabilistic
  – Set-theoretic: Fuzzy, Extended Boolean
  – Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
  – Probabilistic: Inference Network, Belief Network
  – Structured models: Non-Overlapping Lists, Proximal Nodes
• Browsing: Flat, Structure Guided, Hypertext]
Notation
• All of the models we discuss use the “Bag of Words” view
• Definition
  – Let K be the set of all terms; k∈K is a term
  – Let D be the set of all documents; d∈D is a document
  – Let w be the function that maps a document to its set of distinct terms in an arbitrary, yet fixed order (previously: bow)
    • After pre-processing: stemming, stop-word removal, etc.
  – Let vd be a vector for d (or a query q) with
    • vd[i]=0 if the i’th term ki in w(D) is not contained in w(d)
    • vd[i]=1 if the i’th term ki in w(D) is contained in w(d)
  – Often, we shall use weights instead of 0/1 memberships
    • Let wij≥0 be the weight of term ki in document dj (wij=vj[i])
    • wij=0 if ki∉dj
Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
• Other IR Models
Boolean Model
• Simple model based on set theory
• Queries are specified as Boolean expressions
– Terms are atoms
– Terms are connected by AND, OR, NOT, (XOR, ...)
• There are no real weights; wij∈{0,1}
– All terms contribute equally to the content of a doc
• Relevance of a document is either 0 or 1
– Computation by Boolean operations on the doc and query vector
– Represent q as set Sq of vectors
• Sq: All combinations of keywords that make a doc relevant for q
• Example: q=“k1∧(k2∨¬k3)” => S={(1,1,1),(1,1,0),(1,0,0)}
– d is relevant for q iff vd∈Sq
Properties
• Simple, clear semantics, widely used in (early) systems
• Disadvantages
– No partial matching
• Suppose query k1∧k2∧…∧k9
• A doc d containing k1,…,k8 but not k9 is rejected just like a doc containing none of the terms
– No ranking
– Query terms cannot be weighted using domain knowledge
– Average users don’t like Boolean expressions
• Results: Often unsatisfactory
– Too many documents (too few restrictions)
– Too few documents (too many restrictions)
• Several extensions exist, but generally not satisfactory
A Note on Implementation
• Our definition only gave one way of defining the semantics of a Boolean query; it should not be used for implementations
• The set Sq of vectors can be extremely large
  – Suppose q=k1 and |K|=1000 => Sq contains 2^999 vectors
• Improvement: Evaluate only those terms that are used in the query
• Usual way of implementation: Indexing
  – Assume we have an index with fast operation find: K→P(D)
  – Search each term ki of the query, resulting in a set Di of docs with Di⊆D
  – Evaluate the query in the usual order using set operations on the Di’s
    • ki ∧ kj : Di ∩ Dj
    • ki ∨ kj : Di ∪ Dj
    • NOT ki : D\Di
• Cost-based optimization
  – Evaluate those sub-expressions first which will hopefully produce small results
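The index-based evaluation can be sketched in a few lines; the inverted index contents and document ids below are illustrative assumptions, not from the lecture:

```python
# Minimal sketch of index-based Boolean evaluation (toy data assumed):
# the inverted index maps each term to the set of docs containing it,
# and AND / OR / NOT become intersection, union, and complement.

D = {1, 2, 3, 4, 5}                      # all document ids
index = {
    "k1": {1, 2, 3},                     # find(k1) = D1
    "k2": {2, 4},                        # find(k2) = D2
    "k3": {3},                           # find(k3) = D3
}

# Evaluate q = k1 AND (k2 OR NOT k3) via set operations
result = index["k1"] & (index["k2"] | (D - index["k3"]))
print(sorted(result))                    # docs satisfying q
```

A cost-based optimizer would reorder these set operations so that small intermediate results are produced first.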
Negation in the Boolean Model
• Evaluating “NOT ki: D\Di“ is expensive
– Result often is very, very large (|D\Di|≈|D|)
• If ki is not a stop word (which we removed anyway)
• Most other terms appear in almost no documents
• Recall Zipf’s Law – the tail of the distribution
• Solution 1: Disallow negation
– This is what web search engines do (exception?)
• Solution 2: Allow only in the form “ki ∧ NOT kj”
Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
• Other IR Models
Vector Space Model
• Salton, G., Wong, A. and Yang, C. S. (1975). "A Vector
Space Model for Automatic Indexing." Communications of
the ACM 18(11): 613-620.
– A breakthrough in IR
– Still most popular model today
• General idea
– Fix a vocabulary K
– View each doc and query as a point in a |K|-dimensional space
– Rank docs according to distance from the query in that space
• Main advantages
– Ability to rank docs (according to distance)
– Natural capability for partial matching
Vector Space
• Each term is one dimension
  – Different suggestions for determining co-ordinates (= weights)
  – Different suggestions for distance between two points
• The closest docs are the most relevant ones
  – Rationale: vectors correspond to themes which are loosely related to sets of terms
  – Distance between vectors ~ distance between themes

[Figure: documents plotted in term space, with example themes: Star, Astronomy, Movie stars, Mammals, Diet. Source: http://www.vilceanu.ne]
Computing the Angle between Two Vectors
• Recall
  – The scalar product between vectors v and w:

    v ∘ w = |v| * |w| * cos(v, w)

  – with |v| = √(Σi=1..n vi²)
  – and v ∘ w = Σi=1..n vi * wi
  – We use:

    cos(v, w) = (v ∘ w) / (|v| * |w|)
The Simplest Approach
• Co-ordinates are 0/1
• Distance is measured as the cosine of the angle between doc d and query q

    sim(d, q) = cos(vd, vq) = (vd ∘ vq) / (|vd| * |vq|)
              = Σ(vd[i] * vq[i]) / (√(Σ vd[i]²) * √(Σ vq[i]²))

  – The query norm √(Σ vq[i]²) is the same for all docs and can be dropped for ranking
• Including term weights is easy (wait a second)
Example Data
• Assume stop word removal (wir, in, …) and stemming (verkaufen -> kauf, wollen -> woll, …)

      Text                                               verkauf  haus  italien  gart  blüh  woll  miet
  1   Wir verkaufen Häuser in Italien                    1        1     1
  2   Häuser mit Gärten zu vermieten                              1              1                  1
  3   Häuser: In Italien, um Italien, um Italien herum            1     1
  4   Die italienschen Gärtner sind im Garten                           1        1
  5   Der Garten in unserem italienschen Haus blüht               1     1        1     1
  Q   Wir wollen ein Haus mit Garten in Italien mieten            1     1        1           1      1
Ranking without Length Normalization

    sim(d, q) = vd ∘ vq = Σi=1..n vd[i] * vq[i]

• sim(d1,q) = 1*0+1*1+1*1+0*1+0*1+0*0+0*1 = 2
• sim(d2,q) = 1+1+1 = 3
• sim(d3,q) = 1+1 = 2
• sim(d4,q) = 1+1 = 2
• sim(d5,q) = 1+1+1 = 3

Q: Wir wollen ein Haus mit Garten in Italien mieten
Rank 1: d2: Häuser mit Gärten zu vermieten
        d5: Der Garten in unserem italienschen Haus blüht
Rank 2: d4: Die italienschen Gärtner sind im Garten
        d3: Häuser: In Italien, um Italien, um Italien herum
        d1: Wir verkaufen Häuser in Italien
Ranking with Length Normalization

    sim(d, q) = Σ(vd[i] * vq[i]) / √(Σ vd[i]²)

• sim(d1,q) = (1*0+1*1+1*1+0*1+0*1+0*0+0*1) / √3 ≈ 1.15
• sim(d2,q) = (1+1+1) / √3 ≈ 1.73
• sim(d3,q) = (1+1) / √2 ≈ 1.41
• sim(d4,q) = (1+1) / √2 ≈ 1.41
• sim(d5,q) = (1+1+1) / √4 = 1.5

Q: Wir wollen ein Haus mit Garten in Italien mieten
Rank 1: d2: Häuser mit Gärten zu vermieten
Rank 2: d5: Der Garten in unserem italienschen Haus blüht
Rank 3: d4: Die italienschen Gärtner sind im Garten
        d3: Häuser: In Italien, um Italien, um Italien herum
Rank 5: d1: Wir verkaufen Häuser in Italien
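This ranking can be reproduced with a short script over the example data; the query norm is dropped since it is constant per query:

```python
import math

# Reproduces the length-normalized ranking from the example:
# sim(d,q) = sum_i vd[i]*vq[i] / sqrt(sum_i vd[i]^2), 0/1 weights.

docs = {
    "d1": {"verkauf", "haus", "italien"},
    "d2": {"haus", "gart", "miet"},
    "d3": {"haus", "italien"},
    "d4": {"italien", "gart"},
    "d5": {"haus", "italien", "gart", "blüh"},
}
query = {"haus", "italien", "gart", "woll", "miet"}

def sim(doc_terms, query_terms):
    dot = len(doc_terms & query_terms)       # sum of vd[i]*vq[i]
    return dot / math.sqrt(len(doc_terms))   # normalize by |vd|

ranking = sorted(docs, key=lambda d: -sim(docs[d], query))
# d2 first (3/sqrt(3) ≈ 1.73), then d5 (3/sqrt(4) = 1.5)
```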
Introducing Term Weights
• Definition
Let D be a document collection, K be the set of all terms in
D, d∈D and k∈K
– The term frequency tfdk is the frequency of k in d
– The document frequency dfk is the number of docs in D containing k
• Or sometimes the number of occurrences of k in D
– The inverse document frequency is defined as idfk = |D| / dfk
• Actually, it should rather be called the inverse relative document
frequency
• In practice, one usually uses idfk = log(|D| / dfk)
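The definitions can be stated directly in code; the toy corpus below is an assumption for illustration:

```python
import math

# Sketch of the definitions above on a toy corpus (an assumption):
# tf_dk = frequency of k in d, df_k = number of docs containing k,
# idf_k = log(|D| / df_k), the variant usually used in practice.

corpus = [
    ["haus", "italien", "haus"],
    ["haus", "gart"],
    ["italien"],
]

def tf(k, d):
    return d.count(k)

def df(k, D):
    return sum(1 for d in D if k in d)

def idf(k, D):
    return math.log(len(D) / df(k, D))

# "haus": tf=2 in the first doc, df=2, idf=log(3/2)
```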
Ranking with TF Scoring

    sim(d, q) = Σ(vd[i] * vq[i]) / √(Σ vd[i]²)

With term frequencies (italien appears three times in d3, gart twice in d4):
• sim(d1,q) = (1*0+1*1+1*1+0*1+0*1+0*0+0*1) / √3 ≈ 1.15
• sim(d2,q) = (1+1+1) / √3 ≈ 1.73
• sim(d3,q) = (1+3) / √10 ≈ 1.26
• sim(d4,q) = (1+2) / √5 ≈ 1.34
• sim(d5,q) = (1+1+1) / √4 = 1.5

Q: Wir wollen ein Haus mit Garten in Italien mieten
Rank 1: d2: Häuser mit Gärten zu vermieten
Rank 2: d5: Der Garten in unserem italienschen Haus blüht
Rank 3: d4: Die italienschen Gärtner sind im Garten
Rank 4: d3: Häuser: In Italien, um Italien, um Italien herum
Rank 5: d1: Wir verkaufen Häuser in Italien
Improved Scoring: TF*IDF
• One obvious problem: The longer the document, the higher the chances of being ranked high
  – Solution: Normalize on document length (also yields 0 ≤ wij ≤ 1)

    wij = tfij / |di| = tfij / Σk=1..|di| tfik

• Second obvious problem: Some terms are everywhere in D, don’t help to discriminate, and should be scored less
  – Solution: Use IDF scores

    wij = (tfij / |di|) * idfj
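The weighting above transcribes directly into code; this sketch uses the un-logged idf = |D|/df, matching the worked example that follows:

```python
# Sketch of the TF*IDF weight above: w_ij = (tf_ij / |d_i|) * idf_j,
# with |d_i| the total number of term occurrences in d_i and
# idf_j = |D| / df_j (the un-logged variant used in the example).

def tfidf_weight(tf_ij, doc_len, num_docs, df_j):
    return (tf_ij / doc_len) * (num_docs / df_j)

# e.g. "italien" in d3: tf=3, |d3|=4, |D|=5, df=4 -> (3/4)*(5/4)
```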
Example TF*IDF

    wij = (tfij / |di|) * idfj = (tfij / |di|) * (|D| / dfk)

    sim(d, q) = Σ(vd[i] * vq[i]) / √(Σ vd[i]²)

IDF values (|D|=5): verkauf: 5, haus: 5/4, italien: 5/4, gart: 5/3, blüh: 5, woll: 0 (occurs in no document), miet: 5
Normalized term frequencies tfij/|di|:
• d1: verkauf 1/3, haus 1/3, italien 1/3
• d2: haus 1/3, gart 1/3, miet 1/3
• d3: haus 1/4, italien 3/4
• d4: italien 1/3, gart 2/3
• d5: haus 1/4, italien 1/4, gart 1/4, blüh 1/4

• sim(d1,q) = (5/4*1/3 + 5/4*1/3) / √0.3 ≈ 1.51
• sim(d2,q) = (5/4*1/3 + 5/3*1/3 + 5*1/3) / √0.3 ≈ 4.80
• sim(d3,q) = (5/4*1/4 + 5/4*3/4) / √0.63 ≈ 1.57
• sim(d4,q) = (5/4*1/3 + 5/3*2/3) / √0.56 ≈ 2.08
• sim(d5,q) = (5/4*1/4 + 5/4*1/4 + 5/3*1/4) / √0.25 ≈ 2.08

Q: Wir wollen ein Haus mit Garten in Italien mieten
Rank 1: d2: Häuser mit Gärten zu vermieten
Rank 2: d5: Der Garten in unserem italienschen Haus blüht
        d4: Die italienschen Gärtner sind im Garten
Rank 4: d3: Häuser: In Italien, um Italien, um Italien herum
Rank 5: d1: Wir verkaufen Häuser in Italien
TF*IDF Short Form
• Give terms in a doc d high weights which are …
– frequent in d and
– infrequent in D
• IDF deals with the consequences of Zipf’s law
– The few very frequent (and unspecific) terms get lower scores
– The many infrequent (and specific) terms get higher scores
Further Improvements
• Scoring the query in the same way as the documents
  – Note: Repeating words in a query then makes a difference
• Empirically estimated “best” scoring function [BYRN99]
  – Different length normalization:

    wij = tfij / maxk tfkj

  – Use log of IDF with a slightly different measure (nj: # of docs containing kj); rare terms no longer totally distort the score:

    wij = (tfij / maxk tfkj) * log(|D| / nj)
Why not?
• Why not use Euclidean distance?
• Length of vectors would be much more important
• Since queries usually are very short, very short documents would always win
• The cosine measure normalizes by the lengths of both vectors
Shortcomings
• We assumed that all terms are independent, i.e., that their
vectors are orthogonal
– Clearly wrong: some terms are semantically closer than others
• The appearance of “red” in a doc with “wine” doesn’t mean much
• But “wine” is an important match for “drinks”
– Extension: Topic-based Vector Space Model (LSI)
• No treatment of homonyms (as for most other methods)
– Different senses = different dimensions
– We would need to split words into their senses
– See word disambiguation (later)
• Term-order independent
– But order carries semantic meaning (object? subject?)
– Order important for disambiguation
A Note on Implementation
• Assume we want to retrieve the top-r docs
• Algorithm
  – Look up all terms ki of the query in an inverted file index
  – Build the union of all documents which contain at least one ki
    • Hold in a list sorted by score (initialized with 0)
  – Walk through terms ki in order of decreasing IDF-weights
    • For each document dj: Add wji*IDFi to its current score sj
    • Go through docs in order of current score
  – Early stop
    • Look at sjr − sjr+1: Can the doc at rank r+1 still overtake the doc at rank r?
    • Can be estimated from the distribution of TF and IDF values
• Early stop might produce the wrong order of the top-r docs, but cannot produce wrong docs
  – Why not?
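The term-at-a-time accumulator part of this algorithm (without the early-stop test) can be sketched as follows; the posting lists and idf values are illustrative assumptions:

```python
from collections import defaultdict

# Term-at-a-time sketch of the top-r algorithm above (no early stop):
# walk query terms in decreasing idf order, adding w_ji * idf_i to the
# running score of each doc in the term's posting list.

postings = {                 # term -> {doc: tf-based weight w_ji}
    "italien": {"d1": 0.33, "d3": 0.75},
    "gart": {"d2": 0.33, "d4": 0.67},
}
idf = {"italien": 1.25, "gart": 1.67}

def top_r(query_terms, r):
    scores = defaultdict(float)
    # high-idf (rare, discriminative) terms contribute first
    for t in sorted(query_terms, key=lambda t: -idf.get(t, 0.0)):
        for doc, w in postings.get(t, {}).items():
            scores[doc] += w * idf[t]
    return sorted(scores, key=lambda d: -scores[d])[:r]
```

With an early-stop criterion, the loop over terms would terminate once no document outside the current top r can still overtake rank r.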
A Different View on the VSM
• Query evaluation actually searches for the top-r nearest neighbors (for some similarity measure)
• Can be achieved using multidimensional indexing
  – kD-Trees, Grid files, etc.
  – Hope: No sequential scan of (all, many) docs, but directed search according to distances
• Severe problem: High dimensionality
A Concrete (and Popular) Model
• Okapi BM25
  – Okapi: First system which used it (1980s)
  – BM25: Best-Match, version 25 (roughly)
• Good results in several TREC evaluations

    sim(d, q) = Σk∈q IDF(k) * (tfkd * (k1+1)) / (tfkd + k1 * (1 − b + b * |d|/a))

    IDF(k) = log( (|D| − dfk + 0.5) / (dfk + 0.5) )

  – k1, b constants (often b=0.75, k1=0.2)
  – a is the average document length in D
  – dfk is the number of documents containing k
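A sketch of BM25 scoring as given above; the default constants k1=1.2 and b=0.75 follow common practice elsewhere and are an assumption, not the slide’s values:

```python
import math

# BM25 sketch: sim(d,q) = sum_{k in q} IDF(k) *
#   tf_kd*(k1+1) / (tf_kd + k1*(1 - b + b*|d|/a))
# with IDF(k) = log((|D| - df_k + 0.5) / (df_k + 0.5)).

def bm25(query, tf_d, doc_len, avg_len, num_docs, df, k1=1.2, b=0.75):
    score = 0.0
    for k in query:
        tf = tf_d.get(k, 0)
        if tf == 0 or k not in df:
            continue  # term absent from doc or corpus contributes nothing
        idf = math.log((num_docs - df[k] + 0.5) / (df[k] + 0.5))
        norm = tf + k1 * (1 - b + b * doc_len / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score
```

Note how b controls the strength of length normalization: b=0 ignores document length entirely, b=1 normalizes fully by |d|/a.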
Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
Interactive IR
• Recall: IR is a process, not a query
• Relevance feedback
  – User poses query
  – System answers
  – User judges the relevance of the top-k returned docs
  – System considers feedback and generates a new (improved) ranked answer

[Figure: feedback loop between Query, IR System, and (Ranked) Results]

• User never needs to pose another query
• The new query is generated by the system
  – Loop until satisfaction
Relevance Feedback
• Basic assumptions
– Relevant docs are somewhat similar to each other – the common
core should be emphasized
– Irrelevant docs are different from relevant docs – the differences
should be deemphasized
• System adapts to feedback by
– Query expansion: Add new terms to the query
• From the relevant documents
• Alternative: add “NOT” with terms from irrelevant docs
– Term re-weighting: Assign new weights to terms
  • Up-weight terms from the relevant documents, down-weight other terms
Formal Approach
• Assume we know the set R of all relevant docs, and N=D\R
• What would a good query for R look like?

    vq = (1/|R|) * Σd∈R vd − (1/|N|) * Σd∈N vd

• Idea
  – Weight terms high that are in R
  – Weight terms low that are not in R, i.e., that are in N
• Remark
  – It is not clear that this is the “best” of all queries
    • Relevance feedback is a heuristic
    • Especially when only a sample of R is known or the corpus changes
Rocchio Algorithm
• Usually we do not know the real R
• Best bet: Let R (N) be the set of docs marked as relevant (irrelevant) by the user
• Adapt query vector as follows

    vq_new = α*vq + β*(1/|R|) * Σd∈R vd − γ*(1/|N|) * Σd∈N vd

  – This implicitly performs query expansion and term re-weighting
Example
• α=0.5, β=0.5, γ=0
• d1 = "information science" = (0.2, 0.8)
• d2 = "retrieval systems" = (0.9, 0.1)
• q = "retrieval of information" = (0.7, 0.3)
• If d1 were marked relevant:
  q’ = ½*q + ½*d1 = (0.45, 0.55)
• If d2 were marked relevant:
  q’’ = ½*q + ½*d2 = (0.80, 0.20)

[Figure: d1, d2, q, q’, q’’ plotted in the plane spanned by the dimensions “Information” and “Retrieval”. Source: A. Nürnberger: IR]
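The example can be checked with a direct implementation of the Rocchio update (the γ-term simply vanishes here since no documents were marked irrelevant):

```python
# Sketch of the Rocchio update from the previous slide:
# q_new = alpha*q + beta*mean(R) - gamma*mean(N)

def rocchio(q, rel, nonrel, alpha, beta, gamma):
    dim = len(q)
    def centroid(vecs):
        if not vecs:
            return [0.0] * dim
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    r, n = centroid(rel), centroid(nonrel)
    return [alpha * q[i] + beta * r[i] - gamma * n[i] for i in range(dim)]

q = [0.7, 0.3]          # "retrieval of information"
d1 = [0.2, 0.8]         # "information science"
d2 = [0.9, 0.1]         # "retrieval systems"
q1 = rocchio(q, [d1], [], 0.5, 0.5, 0.0)   # ≈ (0.45, 0.55)
q2 = rocchio(q, [d2], [], 0.5, 0.5, 0.0)   # ≈ (0.80, 0.20)
```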
Variations
• How to choose α, β, γ?
  – Tuning with gold standard sets – very difficult
  – Educated guess or black magic
• Alternative algorithm
  – Intuition: Non-relevant docs are heterogeneous and tear in every direction – better to take only the worst offender (the non-relevant doc most similar to the query) instead of all of them

    vq_new = α*vq + β*(1/|R|) * Σd∈R vd − γ * vd*   where d* = argmaxd∈N sim(vq, vd)
Effects
• Advantages
  – Improved results (many studies) compared to single queries
  – Comfortable: Users need not generate new queries themselves
  – Iterative process converging to the best possible answer
  – Especially helpful for increasing recall (due to query expansion)
• Disadvantages
– Still requires some work by the user
• Excite: Only 4% used relevance feedback (“more of this” button)
– Writing a new query based on returned results might be faster (and
easier and more successful) than classifying results
– Based on the assumption that relevant docs are similar
• What if user searches for all meanings of “jaguar”?
– Query very long already after one iteration – slow retrieval
Collaborative Filtering
• What other information could we use for improving
retrieval performance?
• Collaborative filtering: Give the user what other, similar
users liked
– “Customers who bought this book also bought …”
– In IR: Find users posing similar queries and look at what they did with the answers
  • In e-Commerce: Which products did they buy? (very reliable)
• In IR, we need to approximate
  – Documents a user clicked on (if recordable)
  – Did the user look at the second page? (Low credit for first results)
  – Did the user pose a “refinement query” next?
  – …
  – All these measures are not very reliable
• Problems: What is a similar user? What is a similar query?
Thesaurus-based Query Expansion
[M07, CS276]
• Expand query with synonyms and hyponyms of each term
– feline → feline cat
– One may weight added terms less than original query terms
• Often used in scientific IR systems (Medline)
• Requires a high quality thesaurus
• General observation
– Increases recall
– May significantly decrease precision
• “interest rate” → “interest rate fascinate evaluate”
– Do synonyms really exist?