Text Analytics
Models for Information Retrieval 1
Ulf Leser
Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
• Other IR Models
Ulf Leser: Text Analytics, Winter Semester 2010/2011
Information Retrieval Core
• The core question in IR: Which of a given set of (normalized) documents is relevant for a given query?
• Ranking: How relevant is each document of a given set for a given query?

[Figure: Query and document base each pass through normalization; matching / relevance scoring then produces the result.]
IR Models
• How is relevance judged?
• Different models of relevancy
[Figure: Taxonomy of IR models by user task [BYRN99]
• Retrieval (ad hoc, filtering)
  – Classic models: Boolean, Vector-Space, Probabilistic
  – Set-theoretic: Fuzzy, Extended Boolean
  – Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
  – Probabilistic: Inference Network, Belief Network
  – Structured models: Non-Overlapping Lists, Proximal Nodes
• Browsing: Flat, Structure Guided, Hypertext]
Notation
• All of the models we discuss use the “Bag of Words” view
• Definition
  – Let K be the set of all terms; k∈K is a term
  – Let D be the set of all documents; d∈D is a document
  – Let w be the function that maps a document to its set of distinct terms in an arbitrary, yet fixed order (previously: bow)
    • After pre-processing: stemming, stop-word removal, etc.
  – Let vd be a vector for d (or a query q) with
    • vd[i]=0 if the i’th term ki in w(D) is not contained in w(d)
    • vd[i]=1 if the i’th term ki in w(D) is contained in w(d)
  – Often, we shall use weights instead of 0/1 memberships
    • Let wij≥0 be the weight of term ki in document dj (wij=vj[i])
    • wij=0 if ki∉dj
Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
• Other IR Models
Boolean Model
• Simple model based on set theory
• Queries are specified as Boolean expressions
– Terms are atoms
– Terms are connected by AND, OR, NOT, (XOR, ...)
• There are no real weights; wij∈{0,1}
– All terms contribute equally to the content of a doc
• Relevance of a document is either 0 or 1
– Computation by Boolean operations on the doc and query vector
– Represent q as set Sq of vectors
• Sq: All combinations of keywords that make a doc relevant for q
• Example: q=“k1∧(k2∨¬k3)” => S={(1,1,1),(1,1,0),(1,0,0)}
– d is relevant for q iff vd∈Sq
Properties
• Simple, clear semantics, widely used in (early) systems
• Disadvantages
– No partial matching
• Suppose query k1∧k2∧…∧k9
• A doc d containing k1,…,k8 but not k9 is rejected just like a doc containing none of the terms
– No ranking
– Query terms cannot be weighted using domain knowledge
– Average users don’t like Boolean expressions
• Results: Often unsatisfactory
– Too many documents (too few restrictions)
– Too few documents (too many restrictions)
• Several extensions exist, but generally not satisfactory
A Note on Implementation
• Our definition only gave one way of defining the semantics of a Boolean query; it should not be used for implementations
• The set Sq of vectors can be extremely large
  – Suppose q=k1 and |K|=1000 => Sq contains 2^999 vectors
• Improvement: Evaluate only those terms that are used in the query
• Usual way of implementation: Indexing
  – Assume we have an index with fast operation find: K→P(D)
  – Search each term ki of the query, resulting in a set Di of docs with Di⊆D
  – Evaluate the query in the usual order using set operations on the Di’s
    • ki ∧ kj : Di ∩ Dj
    • ki ∨ kj : Di ∪ Dj
    • NOT ki : D\Di
• Cost-based optimization
  – Evaluate those sub-expressions first which will hopefully produce small results
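The index-based evaluation can be sketched in a few lines; the inverted index contents and document ids below are illustrative assumptions, not from the lecture:

```python
# Minimal sketch of index-based Boolean evaluation (toy data assumed):
# the inverted index maps each term to the set of docs containing it,
# and AND / OR / NOT become intersection, union, and complement.

D = {1, 2, 3, 4, 5}                      # all document ids
index = {
    "k1": {1, 2, 3},                     # find(k1) = D1
    "k2": {2, 4},                        # find(k2) = D2
    "k3": {3},                           # find(k3) = D3
}

# Evaluate q = k1 AND (k2 OR NOT k3) via set operations
result = index["k1"] & (index["k2"] | (D - index["k3"]))
print(sorted(result))                    # docs satisfying q
```

A cost-based optimizer would reorder these set operations so that small intermediate results are produced first.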
Negation in the Boolean Model
• Evaluating “NOT ki: D\Di“ is expensive
– Result often is very, very large (|D\Di|≈|D|)
• If ki is not a stop word (which we removed anyway)
• Most other terms appear in almost no documents
• Recall Zipf’s Law – the tail of the distribution
• Solution 1: Disallow negation
– This is what web search engines do (exception?)
• Solution 2: Allow only in the form “ki ∧ NOT kj”
Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
• Other IR Models
Vector Space Model
• Salton, G., Wong, A. and Yang, C. S. (1975). "A Vector
Space Model for Automatic Indexing." Communications of
the ACM 18(11): 613-620.
– A breakthrough in IR
– Still most popular model today
• General idea
– Fix a vocabulary K
– View each doc and query as a point in a |K|-dimensional space
– Rank docs according to distance from the query in that space
• Main advantages
– Ability to rank docs (according to distance)
– Natural capability for partial matching
Vector Space
• Each term is one dimension
  – Different suggestions for determining co-ordinates (= weights)
  – Different suggestions for distance between two points
• The closest docs are the most relevant ones
  – Rationale: vectors correspond to themes which are loosely related to sets of terms
  – Distance between vectors ~ distance between themes

[Figure: documents plotted in term space, with example themes: Star, Astronomy, Movie stars, Mammals, Diet. Source: http://www.vilceanu.ne]
Computing the Angle between Two Vectors
• Recall
  – The scalar product between vectors v and w:

    v ∘ w = |v| * |w| * cos(v, w)

  – with |v| = √(Σi=1..n vi²)
  – and v ∘ w = Σi=1..n vi * wi
  – We use:

    cos(v, w) = (v ∘ w) / (|v| * |w|)
The Simplest Approach
• Co-ordinates are 0/1
• Distance is measured as the cosine of the angle between doc d and query q

    sim(d, q) = cos(vd, vq) = (vd ∘ vq) / (|vd| * |vq|)
              = Σ(vd[i] * vq[i]) / (√(Σ vd[i]²) * √(Σ vq[i]²))

  – The query norm √(Σ vq[i]²) is the same for all docs and can be dropped for ranking
• Including term weights is easy (wait a second)
Example Data
• Assume stop word removal (wir, in, …) and stemming (verkaufen -> kauf, wollen -> woll, …)

      Text                                               verkauf  haus  italien  gart  blüh  woll  miet
  1   Wir verkaufen Häuser in Italien                    1        1     1
  2   Häuser mit Gärten zu vermieten                              1              1                  1
  3   Häuser: In Italien, um Italien, um Italien herum            1     1
  4   Die italienschen Gärtner sind im Garten                           1        1
  5   Der Garten in unserem italienschen Haus blüht               1     1        1     1
  Q   Wir wollen ein Haus mit Garten in Italien mieten            1     1        1           1      1
Ranking without Length Normalization

    sim(d, q) = vd ∘ vq = Σi=1..n vd[i] * vq[i]

• sim(d1,q) = 1*0+1*1+1*1+0*1+0*1+0*0+0*1 = 2
• sim(d2,q) = 1+1+1 = 3
• sim(d3,q) = 1+1 = 2
• sim(d4,q) = 1+1 = 2
• sim(d5,q) = 1+1+1 = 3

Q: Wir wollen ein Haus mit Garten in Italien mieten
Rank 1: d2: Häuser mit Gärten zu vermieten
        d5: Der Garten in unserem italienschen Haus blüht
Rank 2: d4: Die italienschen Gärtner sind im Garten
        d3: Häuser: In Italien, um Italien, um Italien herum
        d1: Wir verkaufen Häuser in Italien
Ranking with Length Normalization

    sim(d, q) = Σ(vd[i] * vq[i]) / √(Σ vd[i]²)

• sim(d1,q) = (1*0+1*1+1*1+0*1+0*1+0*0+0*1) / √3 ≈ 1.15
• sim(d2,q) = (1+1+1) / √3 ≈ 1.73
• sim(d3,q) = (1+1) / √2 ≈ 1.41
• sim(d4,q) = (1+1) / √2 ≈ 1.41
• sim(d5,q) = (1+1+1) / √4 = 1.5

Q: Wir wollen ein Haus mit Garten in Italien mieten
Rank 1: d2: Häuser mit Gärten zu vermieten
Rank 2: d5: Der Garten in unserem italienschen Haus blüht
Rank 3: d4: Die italienschen Gärtner sind im Garten
        d3: Häuser: In Italien, um Italien, um Italien herum
Rank 5: d1: Wir verkaufen Häuser in Italien
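This ranking can be reproduced with a short script over the example data; the query norm is dropped since it is constant per query:

```python
import math

# Reproduces the length-normalized ranking from the example:
# sim(d,q) = sum_i vd[i]*vq[i] / sqrt(sum_i vd[i]^2), 0/1 weights.

docs = {
    "d1": {"verkauf", "haus", "italien"},
    "d2": {"haus", "gart", "miet"},
    "d3": {"haus", "italien"},
    "d4": {"italien", "gart"},
    "d5": {"haus", "italien", "gart", "blüh"},
}
query = {"haus", "italien", "gart", "woll", "miet"}

def sim(doc_terms, query_terms):
    dot = len(doc_terms & query_terms)       # sum of vd[i]*vq[i]
    return dot / math.sqrt(len(doc_terms))   # normalize by |vd|

ranking = sorted(docs, key=lambda d: -sim(docs[d], query))
# d2 first (3/sqrt(3) ≈ 1.73), then d5 (3/sqrt(4) = 1.5)
```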
Introducing Term Weights
• Definition
Let D be a document collection, K be the set of all terms in
D, d∈D and k∈K
– The term frequency tfdk is the frequency of k in d
– The document frequency dfk is the number of docs in D containing k
• Or sometimes the number of occurrences of k in D
– The inverse document frequency is defined as idfk = |D| / dfk
• Actually, it should rather be called the inverse relative document
frequency
• In practice, one usually uses idfk = log(|D| / dfk)
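The definitions can be stated directly in code; the toy corpus below is an assumption for illustration:

```python
import math

# Sketch of the definitions above on a toy corpus (an assumption):
# tf_dk = frequency of k in d, df_k = number of docs containing k,
# idf_k = log(|D| / df_k), the variant usually used in practice.

corpus = [
    ["haus", "italien", "haus"],
    ["haus", "gart"],
    ["italien"],
]

def tf(k, d):
    return d.count(k)

def df(k, D):
    return sum(1 for d in D if k in d)

def idf(k, D):
    return math.log(len(D) / df(k, D))

# "haus": tf=2 in the first doc, df=2, idf=log(3/2)
```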
Ranking with TF Scoring

    sim(d, q) = Σ(vd[i] * vq[i]) / √(Σ vd[i]²)

With term frequencies (italien appears three times in d3, gart twice in d4):
• sim(d1,q) = (1*0+1*1+1*1+0*1+0*1+0*0+0*1) / √3 ≈ 1.15
• sim(d2,q) = (1+1+1) / √3 ≈ 1.73
• sim(d3,q) = (1+3) / √10 ≈ 1.26
• sim(d4,q) = (1+2) / √5 ≈ 1.34
• sim(d5,q) = (1+1+1) / √4 = 1.5

Q: Wir wollen ein Haus mit Garten in Italien mieten
Rank 1: d2: Häuser mit Gärten zu vermieten
Rank 2: d5: Der Garten in unserem italienschen Haus blüht
Rank 3: d4: Die italienschen Gärtner sind im Garten
Rank 4: d3: Häuser: In Italien, um Italien, um Italien herum
Rank 5: d1: Wir verkaufen Häuser in Italien
Improved Scoring: TF*IDF
• One obvious problem: The longer the document, the higher the chances of being ranked high
  – Solution: Normalize on document length (also yields 0 ≤ wij ≤ 1)

    wij = tfij / |di| = tfij / Σk=1..|di| tfik

• Second obvious problem: Some terms are everywhere in D, don’t help to discriminate, and should be scored less
  – Solution: Use IDF scores

    wij = (tfij / |di|) * idfj
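The weighting above transcribes directly into code; this sketch uses the un-logged idf = |D|/df, matching the worked example that follows:

```python
# Sketch of the TF*IDF weight above: w_ij = (tf_ij / |d_i|) * idf_j,
# with |d_i| the total number of term occurrences in d_i and
# idf_j = |D| / df_j (the un-logged variant used in the example).

def tfidf_weight(tf_ij, doc_len, num_docs, df_j):
    return (tf_ij / doc_len) * (num_docs / df_j)

# e.g. "italien" in d3: tf=3, |d3|=4, |D|=5, df=4 -> (3/4)*(5/4)
```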
Example TF*IDF

    wij = (tfij / |di|) * idfj = (tfij / |di|) * (|D| / dfk)

    sim(d, q) = Σ(vd[i] * vq[i]) / √(Σ vd[i]²)

IDF values (|D|=5): verkauf: 5, haus: 5/4, italien: 5/4, gart: 5/3, blüh: 5, woll: 0 (occurs in no document), miet: 5
Normalized term frequencies tfij/|di|:
• d1: verkauf 1/3, haus 1/3, italien 1/3
• d2: haus 1/3, gart 1/3, miet 1/3
• d3: haus 1/4, italien 3/4
• d4: italien 1/3, gart 2/3
• d5: haus 1/4, italien 1/4, gart 1/4, blüh 1/4

• sim(d1,q) = (5/4*1/3 + 5/4*1/3) / √0.3 ≈ 1.51
• sim(d2,q) = (5/4*1/3 + 5/3*1/3 + 5*1/3) / √0.3 ≈ 4.80
• sim(d3,q) = (5/4*1/4 + 5/4*3/4) / √0.63 ≈ 1.57
• sim(d4,q) = (5/4*1/3 + 5/3*2/3) / √0.56 ≈ 2.08
• sim(d5,q) = (5/4*1/4 + 5/4*1/4 + 5/3*1/4) / √0.25 ≈ 2.08

Q: Wir wollen ein Haus mit Garten in Italien mieten
Rank 1: d2: Häuser mit Gärten zu vermieten
Rank 2: d5: Der Garten in unserem italienschen Haus blüht
        d4: Die italienschen Gärtner sind im Garten
Rank 4: d3: Häuser: In Italien, um Italien, um Italien herum
Rank 5: d1: Wir verkaufen Häuser in Italien
TF*IDF Short Form
• Give terms in a doc d high weights which are …
– frequent in d and
– infrequent in D
• IDF deals with the consequences of Zipf’s law
– The few very frequent (and unspecific) terms get lower scores
– The many infrequent (and specific) terms get higher scores
Further Improvements
• Scoring the query in the same way as the documents
  – Note: Repeating words in a query then makes a difference
• Empirically estimated “best” scoring function [BYRN99]
  – Different length normalization:

    wij = tfij / maxk tfkj

  – Use log of IDF with a slightly different measure (nj: # of docs containing kj); rare terms no longer totally distort the score:

    wij = (tfij / maxk tfkj) * log(|D| / nj)
Why not?
• Why not use Euclidean distance?
• Length of vectors would be much more important
• Since queries usually are very short, very short documents would always win
• The cosine measure normalizes by the lengths of both vectors
Shortcomings
• We assumed that all terms are independent, i.e., that their
vectors are orthogonal
– Clearly wrong: some terms are semantically closer than others
• The appearance of “red” in a doc with “wine” doesn’t mean much
• But “wine” is an important match for “drinks”
– Extension: Topic-based Vector Space Model (LSI)
• No treatment of homonyms (as for most other methods)
– Different senses = different dimensions
– We would need to split words into their senses
– See word disambiguation (later)
• Term-order independent
– But order carries semantic meaning (object? subject?)
– Order important for disambiguation
A Note on Implementation
• Assume we want to retrieve the top-r docs
• Algorithm
  – Look up all terms ki of the query in an inverted file index
  – Build the union of all documents which contain at least one ki
    • Hold in a list sorted by score (initialized with 0)
  – Walk through terms ki in order of decreasing IDF-weights
    • For each document dj: Add wji*IDFi to its current score sj
    • Go through docs in order of current score
  – Early stop
    • Look at sjr − sjr+1: Can the doc at rank r+1 still overtake the doc at rank r?
    • Can be estimated from the distribution of TF and IDF values
• Early stop might produce the wrong order of the top-r docs, but cannot produce wrong docs
  – Why not?
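The term-at-a-time accumulator part of this algorithm (without the early-stop test) can be sketched as follows; the posting lists and idf values are illustrative assumptions:

```python
from collections import defaultdict

# Term-at-a-time sketch of the top-r algorithm above (no early stop):
# walk query terms in decreasing idf order, adding w_ji * idf_i to the
# running score of each doc in the term's posting list.

postings = {                 # term -> {doc: tf-based weight w_ji}
    "italien": {"d1": 0.33, "d3": 0.75},
    "gart": {"d2": 0.33, "d4": 0.67},
}
idf = {"italien": 1.25, "gart": 1.67}

def top_r(query_terms, r):
    scores = defaultdict(float)
    # high-idf (rare, discriminative) terms contribute first
    for t in sorted(query_terms, key=lambda t: -idf.get(t, 0.0)):
        for doc, w in postings.get(t, {}).items():
            scores[doc] += w * idf[t]
    return sorted(scores, key=lambda d: -scores[d])[:r]
```

With an early-stop criterion, the loop over terms would terminate once no document outside the current top r can still overtake rank r.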
A Different View on the VSM
• Query evaluation actually searches for the top-r nearest neighbors (for some similarity measure)
• Can be achieved using multidimensional indexing
  – kD-Trees, Grid files, etc.
  – Hope: No sequential scan of (all, many) docs, but directed search according to distances
• Severe problem: High dimensionality
A Concrete (and Popular) Model
• Okapi BM25
  – Okapi: First system which used it (1980s)
  – BM25: Best-Match, version 25 (roughly)
• Good results in several TREC evaluations

    sim(d, q) = Σk∈q IDF(k) * (tfkd * (k1+1)) / (tfkd + k1 * (1 − b + b * |d|/a))

    IDF(k) = log( (|D| − dfk + 0.5) / (dfk + 0.5) )

  – k1, b constants (often b=0.75, k1=0.2)
  – a is the average document length in D
  – dfk is the number of documents containing k
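A sketch of BM25 scoring as given above; the default constants k1=1.2 and b=0.75 follow common practice elsewhere and are an assumption, not the slide’s values:

```python
import math

# BM25 sketch: sim(d,q) = sum_{k in q} IDF(k) *
#   tf_kd*(k1+1) / (tf_kd + k1*(1 - b + b*|d|/a))
# with IDF(k) = log((|D| - df_k + 0.5) / (df_k + 0.5)).

def bm25(query, tf_d, doc_len, avg_len, num_docs, df, k1=1.2, b=0.75):
    score = 0.0
    for k in query:
        tf = tf_d.get(k, 0)
        if tf == 0 or k not in df:
            continue  # term absent from doc or corpus contributes nothing
        idf = math.log((num_docs - df[k] + 0.5) / (df[k] + 0.5))
        norm = tf + k1 * (1 - b + b * doc_len / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score
```

Note how b controls the strength of length normalization: b=0 ignores document length entirely, b=1 normalizes fully by |d|/a.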
Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
Interactive IR
• Recall: IR is a process, not a query
• Relevance feedback
  – User poses query
  – System answers
  – User judges the relevance of the top-k returned docs
  – System considers feedback and generates a new (improved) ranked answer

[Figure: feedback loop between Query, IR System, and (Ranked) Results]

• User never needs to pose another query
• The new query is generated by the system
  – Loop until satisfaction
Relevance Feedback
• Basic assumptions
– Relevant docs are somewhat similar to each other – the common
core should be emphasized
– Irrelevant docs are different from relevant docs – the differences
should be deemphasized
• System adapts to feedback by
– Query expansion: Add new terms to the query
• From the relevant documents
• Alternative: add “NOT” with terms from irrelevant docs
– Term re-weighting: Assign new weights to terms
  • Up-weight terms from the relevant documents, down-weight other terms
Formal Approach
• Assume we know the set R of all relevant docs, and N=D\R
• What would a good query for R look like?

    vq = (1/|R|) * Σd∈R vd − (1/|N|) * Σd∈N vd

• Idea
  – Weight terms high that are in R
  – Weight terms low that are not in R, i.e., that are in N
• Remark
  – It is not clear that this is the “best” of all queries
    • Relevance feedback is a heuristic
    • Especially when only a sample of R is known or the corpus changes
Rocchio Algorithm
• Usually we do not know the real R
• Best bet: Let R (N) be the set of docs marked as relevant (irrelevant) by the user
• Adapt query vector as follows

    vq_new = α*vq + β*(1/|R|) * Σd∈R vd − γ*(1/|N|) * Σd∈N vd

  – This implicitly performs query expansion and term re-weighting
Example
• α=0.5, β=0.5, γ=0
• d1 = "information science" = (0.2, 0.8)
• d2 = "retrieval systems" = (0.9, 0.1)
• q = "retrieval of information" = (0.7, 0.3)
• If d1 were marked relevant:
  q’ = ½*q + ½*d1 = (0.45, 0.55)
• If d2 were marked relevant:
  q’’ = ½*q + ½*d2 = (0.80, 0.20)

[Figure: d1, d2, q, q’, q’’ plotted in the plane spanned by the dimensions “Information” and “Retrieval”. Source: A. Nürnberger: IR]
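The example can be checked with a direct implementation of the Rocchio update (the γ-term simply vanishes here since no documents were marked irrelevant):

```python
# Sketch of the Rocchio update from the previous slide:
# q_new = alpha*q + beta*mean(R) - gamma*mean(N)

def rocchio(q, rel, nonrel, alpha, beta, gamma):
    dim = len(q)
    def centroid(vecs):
        if not vecs:
            return [0.0] * dim
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    r, n = centroid(rel), centroid(nonrel)
    return [alpha * q[i] + beta * r[i] - gamma * n[i] for i in range(dim)]

q = [0.7, 0.3]          # "retrieval of information"
d1 = [0.2, 0.8]         # "information science"
d2 = [0.9, 0.1]         # "retrieval systems"
q1 = rocchio(q, [d1], [], 0.5, 0.5, 0.0)   # ≈ (0.45, 0.55)
q2 = rocchio(q, [d2], [], 0.5, 0.5, 0.0)   # ≈ (0.80, 0.20)
```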
Variations
• How to choose α, β, γ?
  – Tuning with gold standard sets – very difficult
  – Educated guess or black magic
• Alternative algorithm
  – Intuition: Non-relevant docs are heterogeneous and tear in every direction – better to take only the worst offender (the non-relevant doc most similar to the query) instead of all of them

    vq_new = α*vq + β*(1/|R|) * Σd∈R vd − γ * vd*   where d* = argmaxd∈N sim(vq, vd)
Effects
• Advantages
  – Improved results (many studies) compared to single queries
  – Comfortable: Users need not generate new queries themselves
  – Iterative process converging to the best possible answer
  – Especially helpful for increasing recall (due to query expansion)
• Disadvantages
– Still requires some work by the user
• Excite: Only 4% used relevance feedback (“more of this” button)
– Writing a new query based on returned results might be faster (and
easier and more successful) than classifying results
– Based on the assumption that relevant docs are similar
• What if user searches for all meanings of “jaguar”?
– Query very long already after one iteration – slow retrieval
Collaborative Filtering
• What other information could we use for improving
retrieval performance?
• Collaborative filtering: Give the user what other, similar
users liked
– “Customers who bought this book also bought …”
– In IR: Find users posing similar queries and look at what they did with the answers
  • In e-Commerce: Which products did they buy? (very reliable)
• In IR, we need to approximate
  – Documents a user clicked on (if recordable)
  – Did the user look at the second page? (Low credit for first results)
  – Did the user pose a “refinement query” next?
  – …
  – All these measures are not very reliable
• Problems: What is a similar user? What is a similar query?
Thesaurus-based Query Expansion
[M07, CS276]
• Expand query with synonyms and hyponyms of each term
– feline → feline cat
– One may weight added terms less than original query terms
• Often used in scientific IR systems (Medline)
• Requires a high quality thesaurus
• General observation
– Increases recall
– May significantly decrease precision
• “interest rate” → “interest rate fascinate evaluate”
– Do synonyms really exist?