Text Analytics – Models for Information Retrieval
Ulf Leser: Text Analytics, Winter Semester 2010/2011

Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
• Other IR Models

Information Retrieval Core
• The core question in IR: Which of a given set of (normalized) documents is relevant for a given query?
• Ranking: How relevant is each document of a given set for a given query?
• [Figure: query and document base are both normalized, then matched / relevance-scored to produce the result]

IR Models
• How is relevance judged? There are different models of relevance [BYRN99]:
  – Set-theoretic: Boolean (classic), Fuzzy, Extended Boolean
  – Algebraic: Vector Space (classic), Generalized Vector, Latent Semantic Indexing, Neural Networks
  – Probabilistic: Probabilistic (classic), Inference Network, Belief Network
  – Structured models: Non-Overlapping Lists, Proximal Nodes
  – Browsing: Flat, Structure Guided, Hypertext
• User tasks: retrieval (ad hoc, filtering) and browsing

Notation
• All of the models we discuss use the "Bag of Words" view
• Definition
  – Let K be the set of all terms; k∈K is a term
  – Let D be the set of all documents; d∈D is a document
  – Let w be the function that maps a document to its set of distinct terms in an arbitrary, yet fixed order (previously: bow)
• After pre-processing: stemming, stop-word removal, etc.
  – Let vd be a vector for d (or a query q) with
    • vd[i]=0 if the i'th term ki in w(D) is not contained in w(d)
    • vd[i]=1 if the i'th term ki in w(D) is contained in w(d)
  – Often, we shall use weights instead of 0/1 memberships
    • Let wij≥0 be the weight of term ki in document dj (wij=vj[i])
    • wij=0 if ki∉dj

Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
• Other IR Models

Boolean Model
• Simple model based on set theory
• Queries are specified as Boolean expressions
  – Terms are atoms
  – Terms are connected by AND, OR, NOT (XOR, ...)
• There are no real weights; wij∈{0,1}
  – All terms contribute equally to the content of a doc
• The relevance of a document is either 0 or 1
  – Computation by Boolean operations on the doc and query vectors
  – Represent q as a set Sq of vectors
    • Sq: all combinations of keywords that make a doc relevant for q
    • Example: q = "k1∧(k2∨¬k3)" ⇒ Sq = {(1,1,1),(1,1,0),(1,0,0)}
  – d is relevant for q iff vd∈Sq

Properties
• Simple, clear semantics, widely used in (early) systems
• Disadvantages
  – No partial matching
    • Suppose the query k1∧k2∧...∧k9
    • A doc d containing k1∧k2∧...∧k8 is rejected just as firmly as one containing none of the terms
  – No ranking
  – Query terms cannot be weighted using domain knowledge
  – Average users don't like Boolean expressions
• Results are often unsatisfactory
  – Too many documents (too few restrictions)
  – Too few documents (too many restrictions)
• Several extensions exist, but they are generally not satisfactory

A Note on Implementation
• Our definition only gave one way of fixing the semantics of a Boolean query; it should not be used for an implementation
• The set Sq of vectors can be extremely large
  – Suppose q=k1 and |K|=1000 ⇒ Sq contains
2^999 vectors
• Improvement: evaluate only those terms that are used in the query
• Usual way of implementation: indexing
  – Assume we have an index with a fast operation find: K → P(D)
  – Search each term ki of the query, resulting in a set Di⊆D of docs
  – Evaluate the query in the usual order, using set operations on the Di
    • ki ∧ kj : Di ∩ Dj
    • ki ∨ kj : Di ∪ Dj
    • NOT ki : D \ Di
• Cost-based optimization
  – Evaluate first those sub-expressions which will hopefully produce small results

Negation in the Boolean Model
• Evaluating "NOT ki : D \ Di" is expensive
  – The result often is very, very large (|D \ Di| ≈ |D|)
    • If ki is not a stop word (which we removed anyway)
    • Most other terms appear in almost no documents
    • Recall Zipf's law – the tail of the distribution
• Solution 1: disallow negation
  – This is what web search engines do (exceptions?)
• Solution 2: allow negation only in the form "ki ∧ NOT kj"

Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
• Other IR Models

Vector Space Model
• Salton, G., Wong, A. and Yang, C. S. (1975). "A Vector Space Model for Automatic Indexing." Communications of the ACM 18(11): 613-620.
  – A breakthrough in IR
  – Still the most popular model today
• General idea
  – Fix a vocabulary K
  – View each doc and query as a point in a |K|-dimensional space
  – Rank docs according to their distance from the query in that space
• Main advantages
  – Ability to rank docs (by distance)
  – Natural capability for partial matching

Vector Space
• Each term is one dimension
  – Different suggestions for determining the coordinates (= weights)
  – Different suggestions for the distance between two points
• The closest docs are the most relevant ones
  – Rationale: vectors correspond to themes which are loosely related to sets of terms
  – Distance between vectors ~ distance between themes
  – [Figure: example dimensions such as "Star", "Astronomy", "Movie stars", "Mammals", "Diet"]

Computing the Angle between Two Vectors
• Recall the scalar product between vectors v and w:
  v ∘ w = |v| * |w| * cos(v,w)  with  |v| = √(∑ vi²)  and  v ∘ w = ∑_{i=1..n} vi * wi
• We use: cos(v,w) = (v ∘ w) / (|v| * |w|)

The Simplest Approach
• Coordinates are 0/1
• Distance is measured as the cosine of the angle between doc d and query q:
  sim(d,q) = cos(vd, vq) = (vd ∘ vq) / (|vd| * |vq|) = ∑(vd[i] * vq[i]) / (√(∑ vd[i]²) * √(∑ vq[i]²))
  – The query norm √(∑ vq[i]²) is the same for all docs and can be dropped for ranking
• Including term weights is easy (wait a second)

Example Data
• Assume stop-word removal (wir, in, ...) and stemming (verkaufen → kauf, wollen → woll, ...)

                                                         verkauf  haus  italien  gart  miet  blüh  woll
  d1  Wir verkaufen Häuser in Italien                       1      1      1
  d2  Häuser mit Gärten zu vermieten                               1              1     1
  d3  Häuser: In Italien, um Italien, um Italien herum             1      1
  d4  Die italienschen Gärtner sind im Garten                             1       1
  d5  Der Garten in unserem italienschen Haus blüht                1      1       1           1
  q   Wir wollen ein Haus mit Garten in Italien mieten             1      1       1     1            1

Ranking without Length Normalization
sim(d,q) = vd ∘ vq = ∑_{i=1..n} vd[i] * vq[i]

Using the 0/1 term vectors from the example above:
• sim(d1,q) = 1*0 + 1*1 + 1*1 + 0*1 + 0*1 + 0*0 + 0*1 = 2
• sim(d2,q) = 1+1+1 = 3
• sim(d3,q) = 1+1 = 2
• sim(d4,q) = 1+1 = 2
• sim(d5,q) = 1+1+1 = 3

Resulting ranking for q ("Wir wollen ein Haus mit Garten in Italien mieten"):
  Rank 1: d2 (Häuser mit Gärten zu vermieten), d5 (Der Garten in unserem italienschen Haus blüht)
  Rank 2: d4 (Die italienschen Gärtner sind im Garten), d3 (Häuser: In Italien, um Italien, um Italien herum), d1 (Wir verkaufen Häuser in Italien)

Ranking with Length Normalization

sim(d,q) = ∑(vd[i] * vq[i]) / √(∑ vd[i]²)

• sim(d1,q) = (1*0 + 1*1 + 1*1 + 0*1 + 0*1 + 0*0 + 0*1) / √3 ≈ 1.15
• sim(d2,q) = (1+1+1) / √3 ≈ 1.73
• sim(d3,q) = (1+1) / √2 ≈ 1.41
• sim(d4,q) = (1+1) / √2 ≈ 1.41
• sim(d5,q) = (1+1+1) / √4 = 1.5

Resulting ranking: 1. d2 (1.73), 2. d5 (1.5), 3. d4 and d3 (1.41), 5. d1 (1.15)

Introducing Term Weights
• Definition: Let D be a document collection, K the set of all terms in D, d∈D and k∈K
  – The term frequency tf_dk is the frequency of k in d
  – The document frequency df_k is the number of docs in D containing k
    • Or sometimes the number of occurrences of k in D
  – The inverse document frequency is defined as idf_k = |D| / df_k
    • Actually, it should rather be called the inverse relative document frequency
    • In practice, one usually uses idf_k = log(|D| / df_k)

Ranking with TF Scoring

sim(d,q) = ∑(vd[i] * vq[i]) / √(∑ vd[i]²), now with term frequencies as coordinates (italien occurs three times in d3, gart twice in d4)

• sim(d1,q) = (1*0 + 1*1 + 1*1 + 0*1 + 0*1 + 0*0 + 0*1) / √3 ≈ 1.15
• sim(d2,q) = (1+1+1) / √3 ≈ 1.73
• sim(d3,q) = (1+3) / √10 ≈ 1.26
• sim(d4,q) = (1+2) / √5 ≈ 1.34
• sim(d5,q) = (1+1+1) / √4 = 1.5

Resulting ranking: 1. d2 (1.73), 2. d5 (1.5), 3. d4 (1.34), 4. d3 (1.26), 5. d1 (1.15)

Improved Scoring: TF*IDF
• One obvious problem: the longer the document, the higher its chances of being ranked high
  – Solution: normalize by document length (this also yields 0 ≤ wij ≤ 1):
    wij = tf_ij / |d_i|  with  |d_i| = ∑_{k=1..|d_i|} tf_ik
• Second obvious problem: some terms appear everywhere in D, do not help to discriminate, and should be scored lower
  – Solution: use IDF scores:
    wij = (tf_ij / |d_i|) * idf_j

Example TF*IDF

wij = (tf_ij / |d_i|) * idf_j = (tf_ij / |d_i|) * (|D| / df_j);  sim(d,q) = ∑(vd[i] * vq[i]) / √(∑ vd[i]²)

IDF values (|D| = 5): verkauf 5, haus 5/4, italien 5/4, gart 5/3, miet 5, blüh 5
Length-normalized tf weights:
  d1: verkauf 1/3, haus 1/3, italien 1/3
  d2: haus 1/3, gart 1/3, miet 1/3
  d3: haus 1/4, italien 3/4
  d4: italien 1/3, gart 2/3
  d5: haus 1/4, italien 1/4, gart 1/4, blüh 1/4
  q (0/1): haus 1, italien 1, gart 1, miet 1, woll 1

• sim(d1,q) = (5/4 * 1/3 + 5/4 * 1/3) / √0.3 ≈ 1.51
• sim(d2,q) = (5/4 * 1/3 + 5/3 * 1/3 + 5 * 1/3) / √0.3 ≈ 4.80
• sim(d3,q) = (5/4 * 1/4 + 5/4 * 3/4) / √0.63 ≈ 1.57
• sim(d4,q) = (5/4 * 1/3 + 5/3 * 2/3) / √0.56 ≈ 2.08
• sim(d5,q) = (5/4 * 1/4 + 5/4 * 1/4 + 5/3 * 1/4) / √0.25 ≈ 2.08
(the denominators √(∑ vd[i]²) are computed from the length-normalized tf weights)

Resulting ranking: 1. d2 (4.80), 2. d5 and d4 (2.08), 4. d3 (1.57), 5. d1 (1.51)

TF*IDF Short Form
• Give terms in a doc d high weights which are
  – frequent in d, and
  – infrequent in D
• IDF deals with the consequences of Zipf's law
  – The few very frequent (and unspecific) terms get lower scores
  – The many infrequent (and specific) terms get higher scores

Further Improvements
• Score the query in the same way as the documents
  – Note: repeating words in a query then makes a difference
• Empirically estimated "best" scoring function [BYRN99]
  – Different length normalization, and log of IDF with a slightly different measure (n_j: number of docs containing k_j), so that rare terms no longer totally distort the score:
    wij = (tf_ij / max_j tf_ij) * log(|D| / n_j)

Why Not Euclidean Distance?
• The length of the vectors would become much more important
• Since queries usually are very short, very short documents would always win
• The cosine measure normalizes by the length of both vectors

Shortcomings
• We assumed that all terms are independent, i.e., that their vectors are orthogonal
  – Clearly wrong: some terms are semantically closer than others
    • The appearance of "red" in a doc with "wine" doesn't mean much
    • But "wine" is an important match for "drinks"
  – Extension: topic-based Vector Space Model (LSI)
• No treatment of homonyms (as in most other methods)
  – Different senses = different dimensions
  – We would need to split words into their senses
  – See word sense disambiguation (later)
• Term-order independence
  – But order carries semantic meaning (object? subject?)
  – Order is important for disambiguation

A Note on Implementation
• Assume we want to retrieve the top-r docs
• Algorithm
  – Look up all terms ki of the query in an inverted file index
  – Build the union of all documents which contain at least one ki
    • Hold them in a list sorted by score (initialized with 0)
  – Walk through the terms ki in order of decreasing IDF weights
    • For each document dj: add wji * IDFi to its current score sj
    • Go through the docs in order of current score
  – Early stop
    • Look at the gap between the scores at ranks r and r+1: can the doc at rank r+1 still overtake the doc at rank r?
    • Can be estimated from the distribution of TF and IDF values
• An early stop might produce the wrong order within the top-r docs, but cannot produce wrong docs – why not?

A Different View on the VSM
• Query evaluation actually searches for the top-r nearest neighbors (for some similarity measure)
• Can be achieved using multidimensional index structures
  – kd-trees, grid files, etc.
  – Hope: no sequential scan of (all, many) docs, but a directed search according to distances
• Severe problem: high dimensionality

A Concrete (and Popular) Model
• Okapi BM25
  – Okapi: the first system which used it (1980s)
  – BM25: Best Match, version 25 (roughly)
• Good results in several TREC evaluations

sim(d,q) = ∑_{k∈q} IDF(k) * (tf_kd * (k1 + 1)) / (tf_kd + k1 * (1 − b + b * |d|/a))

  – k1, b are constants (often b = 0.75, k1 = 0.2)
  – a is the average document length in D
  – IDF(k) = log( (|D| − df_k + 0.5) / (df_k + 0.5) )

Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing

Interactive IR
• Recall: IR is a process, not a single query
• Relevance feedback
  – The user poses a query
  – The system answers
  – The user judges the relevance of the top-k returned docs
  – The system considers the feedback and generates a new (improved) ranked answer
  – Loop until satisfaction
• The user never needs to pose another query – the new queries are generated by the system

Relevance Feedback
• Basic assumptions
  – Relevant docs are somewhat similar to each other – the common core should be emphasized
  – Irrelevant docs are different from relevant docs – the differences should be de-emphasized
• The system adapts to the feedback by
  – Query expansion: add new terms to the query
    • From the relevant documents
    • Alternative: add "NOT" with terms from the irrelevant docs
  – Term re-weighting: assign new weights to terms
    • Up-weight terms from the relevant documents, down-weight the other terms

Formal Approach
• Assume we knew the set R of all relevant docs, and N = D \ R
• How would a good query for R look like?
vq = (1/|R|) * ∑_{d∈R} vd − (1/|N|) * ∑_{d∈N} vd

• Idea
  – Weight terms high that are in R
  – Weight terms low that are not in R, i.e., that are in N
• Remark
  – It is not clear that this is the "best" of all queries
    • Relevance feedback is a heuristic
    • Especially when only a sample of R is known or the corpus changes

Rocchio Algorithm
• Usually, we do not know the real R
• Best bet: let R (N) be the set of docs marked as relevant (irrelevant) by the user
• Adapt the query vector as follows:

vq_new = α * vq + β * (1/|R|) * ∑_{d∈R} vd − γ * (1/|N|) * ∑_{d∈N} vd

  – This implicitly performs both query expansion and term re-weighting

Example (α = 0.5, β = 0.5, γ = 0; dimensions: Retrieval, Information)
  d1 = "information science" = (0.2, 0.8)
  d2 = "retrieval systems" = (0.9, 0.1)
  q = "retrieval of information" = (0.7, 0.3)
• If d1 were marked relevant: q' = ½*q + ½*d1 = (0.45, 0.55)
• If d2 were marked relevant: q'' = ½*q + ½*d2 = (0.80, 0.20)
(Source: A. Nürnberger, IR)

Variations
• How to choose α, β, γ?
  – Tuning with gold-standard sets – very difficult
  – Educated guess, or black magic
• Alternative algorithm
  – Intuition: non-relevant docs are heterogeneous and pull in every direction – it is better to use only the worst-ranked one instead of all of them:

vq_new = α * vq + β * (1/|R|) * ∑_{d∈R} vd − γ * vd'  with  d' = argmin_{d∈N} sim(vq, vd)

Effects
• Advantages
  – Improved results (many studies) compared to single queries
  – Comfortable: users need not formulate new queries themselves
  – Iterative process converging to the best possible answer
  – Especially helpful for increasing recall (due to query expansion)
• Disadvantages
  – Still requires some work by the user
    • Excite: only 4% used relevance feedback (the "more of this" button)
  – Writing a new query based on the returned results might be faster (and easier, and more successful) than classifying results
  – Based on the assumption that relevant docs are similar
    • What if a user searches for all meanings of "jaguar"?
  – The query is very long already after one iteration – slow retrieval

Collaborative Filtering
• What other information could we use to improve retrieval performance?
• Collaborative filtering: give the user what other, similar users liked
  – "Customers who bought this book also bought ..."
  – In IR: find users posing similar queries and look at what they did with the answers
    • In e-commerce: which products did they buy? (very reliable)
  – In IR, we need to approximate:
    • Documents a user clicked on (if recordable)
    • Did the user look at the second page? (low credit for the first results)
    • Did the user pose a "refinement query" next?
    • All these measures are not very reliable
• Problems: What is a similar user? What is a similar query?
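The Rocchio update from the slides above can be sketched in a few lines. This is a minimal illustration, not the lecture's reference implementation: vectors are represented as term → weight dicts, the default values for α, β, γ are common textbook choices, and clipping negative weights to zero is an extra practical assumption.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: q_new = alpha*q + beta*centroid(R) - gamma*centroid(N).

    query, and each doc in relevant/irrelevant, is a dict term -> weight.
    Negative result weights are dropped (a common practical choice,
    not prescribed by the slides).
    """
    terms = set(query)
    for d in relevant + irrelevant:
        terms |= set(d)

    def centroid(docs):
        # Average vector of a set of docs; empty set contributes nothing.
        if not docs:
            return {}
        return {t: sum(d.get(t, 0.0) for d in docs) / len(docs) for t in terms}

    cr, cn = centroid(relevant), centroid(irrelevant)
    new = {t: alpha * query.get(t, 0.0)
              + beta * cr.get(t, 0.0)
              - gamma * cn.get(t, 0.0)
           for t in terms}
    return {t: w for t, w in new.items() if w > 0}
```

With the slide example (α=0.5, β=0.5, γ=0, q=(0.7, 0.3), d1=(0.2, 0.8)) the update reproduces q' = (0.45, 0.55).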
Thesaurus-based Query Expansion [M07, CS276]
• Expand the query with synonyms and hyponyms of each term
  – feline → feline cat
  – One may weight the added terms less than the original query terms
• Often used in scientific IR systems (Medline)
• Requires a high-quality thesaurus
• General observation
  – Increases recall
  – May significantly decrease precision
    • "interest rate" → "interest rate fascinate evaluate"
    • Do synonyms really exist?
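To tie the vector-space machinery of this lecture together, here is a minimal sketch of tf*idf weighting with cosine ranking, w_ij = (tf_ij / |d_i|) * log(|D| / df_j). The helper names (`tokenize`, `rank`) are illustrative assumptions, and the whitespace tokenizer merely stands in for the stemming and stop-word removal assumed on the slides; query terms are weighted with the collection's idf.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace tokenizer; a real system would stem and remove stop words.
    return text.lower().split()

def rank(query, docs):
    """Rank docs against query by cosine similarity of tf*idf vectors."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # df[t] = number of docs containing t
    df = Counter(t for toks in tokenized for t in set(toks))

    def vec(toks):
        # w = (tf / doc length) * log(|D| / df); skip terms unseen in D
        tf = Counter(toks)
        return {t: (c / len(toks)) * math.log(n / df[t])
                for t, c in tf.items() if t in df}

    def cos(v, w):
        num = sum(v[t] * w.get(t, 0.0) for t in v)
        nv = math.sqrt(sum(x * x for x in v.values()))
        nw = math.sqrt(sum(x * x for x in w.values()))
        return num / (nv * nw) if nv and nw else 0.0

    qv = vec(tokenize(query))
    return sorted(((cos(vec(toks), qv), i) for i, toks in enumerate(tokenized)),
                  reverse=True)
```

Applied to the lecture's (already stemmed) example collection with the query "haus gart italien miet woll", this ranker puts d2 first, in line with the TF*IDF example on the slides.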