Text Analytics – Models for Information Retrieval
Ulf Leser: Text Analytics, Lecture, Summer Semester 2008

Processing Pipeline
[Pipeline diagram, after Frakes et al. 1992: Corpus, Text Preprocessing, Format Conversion, Parsing, Normalization, Filtering, Linguistic Annotation, Named Entity Recognition, Named Entity Normalization, Relationship Extraction, Term Weighting, IR Database, Document Retrieval, Ranking, Feedback, Interface, User, Document IDs]

Format Conversion
[Figure: sample text extracted from a PDF, two interleaved columns: "New generation of e-commerce applications require data schemas that are constantly evolving and sparsely populated. The conventional horizontal row representation fails to meet these require…" / "In relational database systems, data objects are conventionally stored using a horizontal scheme. A data object is represented as …"]
• Transform text into ASCII / Unicode
• Formats
– PDF, XML, DOC, ODP, images, …
• Problems
– Formatting instructions, special characters, formulas, figures and captions, tables, section headers, footnotes, page numbers, margin text, …
• Diplomacy: To what extent can one reconstruct the original document from the normalized representation?

Assumptions
• A document is a sequence of sentences
• A sentence is a sequence of tokens
• A token is the smallest unit of text
– E.g. words, numbers
• A term is a token or a set of tokens with a semantic meaning
– E.g. single tokens, proper names
– “San” is a token, but not a term
– “San Francisco” is two tokens but one term
– Dictionaries usually contain terms, not tokens
• A word is either a token or a term – depends on the context
• A concept is the mental representation of a real-world thing
– Concepts are represented by terms/words
– Terms may represent multiple concepts: homonyms
– Concepts may be represented by multiple terms: synonyms

Case – A Difficult Case
• Should text be converted to lower case letters?
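A minimal sketch of what is at stake in that question (the token examples are illustrative, not from the slides): folding case conflates distinct surface forms such as an abbreviation and a pronoun, or a proper name and a common noun.

```python
# Sketch: naive case folding merges tokens that carry different meanings.
# "US" (country) vs. "us" (pronoun), "Bill" (name) vs. "bill" (invoice);
# the uppercase hint marking sentence starts is lost as well.
def fold(tokens):
    return [t.lower() for t in tokens]

tokens = ["The", "US", "sent", "us", "a", "bill", "from", "Bill"]
folded = fold(tokens)

before = len(set(tokens))   # 8 distinct surface forms
after = len(set(folded))    # 6 -> "US"/"us" and "Bill"/"bill" collapsed
```

The smaller vocabulary is exactly the advantage (smaller index, simpler queries) and the disadvantage (lost distinctions) discussed next.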
• Advantages
– Makes queries simpler
– Decreases index size a lot
• Disadvantages
– No abbreviations
– Loses important hints for sentence splitting
• Sentences start with upper case letters
– Loses important hints for tokenization, NER, …
• Often: convert to lower case after all other steps

Sentence Splitting
• Most linguistic analysis works on the sentence level
• Sentences group together objects and statements
– But mind references: “What is this white thing? It is a house.”
• Naive approach: Reg-Exp search for “[.?!;] ” (note the blank)
– Abbreviations
• “C. elegans is a worm which …”; “This does not hold for the U.S.”
– Errors (due to previous normalization steps)
• “is not clear.Conclusions.We reported on …”
– Proper names
• “DOT.NET is a technique for …”
– Direct speech
• “By saying “It is the economy, stupid!”, B. Clinton meant that …”
– …

Tokenization
• Split a sentence into tokens
– But in a way that allows for recognition of multi-token terms later
• Simple approach: search for “ ” (blanks)
– “In San Francisco it rained a lot this year.”
– “A state-of-the-art Z-9 Firebird was purchased on 3/12/1995.”
– “SQL commands comprise SELECT … FROM … WHERE clauses; the latter may contain functions such as leftstr(STRING, INT).”
– “This LCD-TV-Screen cost 3.100,99 USD.”
– “[Bis[1,2-cyclohexanedionedioximato(1-)-O][1,2-cyclohexanedione dioximato(2-)-O]methyl-borato(2-)N,N0,N00,N000,N0000,N00000)-chlorotechnetium) belongs to a family of …”
• Typical approach
– Treat hyphens/parentheses as blanks
– Remove “.” (of abbreviations; after sentence splitting)
– Hope that errors are rare and don’t matter much

Stems versus Lemmas
• Common idea: Normalize words to a basal, “normal” form
– car, cars -> car; gives, gave, give -> give
• Stemming reduces words to a base form
– The base form often is not a word in the language
– Ignores detached prepositions (“Ich kaufe ein” -> “einkaufen” or “kauf”)
– Rather simple
• Lemmatization reduces words to their lemma
– Finds the linguistic root of a word
– Must be a proper word of the language
– Rather difficult, requires morphological analysis

Stop Words
• Words that are so frequent that their removal (hopefully) does not change the meaning of a doc for retrieval/mining purposes
– English: Top-2: 10% of all tokens; Top-6: 20%; Top-50: 50%
– English (top 10; LOB corpus): the, of, and, to, a, in, that, is, was, it
– German: aber, als, am, an, auch, auf, aus, bei, bin, …
• Removing stop words reduces a positional token index by ~40%
– Search engines do remove stop words (but which?)
• Hope: Increase in precision due to fewer spurious hits
– Clearly, this also depends on the usage of stop words in the query
• Variations
– Remove the top 10, 100, 1000, … words
– Use language-specific or corpus-specific stop word lists

Zipf's Law
• Let f be the frequency of a word and r its rank in the list of all words sorted by frequency
• Zipf’s law: f ~ k/r for some constant k
• Example
– [Figure: word ranks in Moby Dick]
– Good fit to Zipf’s law
– Some domain dependency (“whale”)
• Zipf’s law is a fairly good approximation in most corpora

Proper Names – Named Entity Recognition
• Proper names are different from ordinary words
– Very often comprise more than one token
– Often not contained in lexica
– Appear and disappear all the time
– Often contain special characters
– Are very important for information retrieval and text mining
• Search for “Bill Clinton” (not “Clinton ordered the bill”)
• Extract all relationships between “a disease” and the DMD gene
• Recognizing proper names is difficult – more later
– Named entity recognition
• “dark” is a proper name
• “broken wing” is no proper name; “broken-wing gene” possibly is one
– Disambiguation
• “dark” is a gene of the fruit fly (and a common English word)

Document Summarization
• Preprocessing may go far beyond what we discussed
• Statistical summarization
– Remove all tokens that are not representative for the text
– Only keep words which are more frequent than expected by chance
• “Chance”: over all documents in the collection (corpus view) or in the language (naive view)
– Makes the amount of text much smaller; hopefully doesn’t influence IR performance
• Semantic summarization
– “Understand” the text and create a summary of its content
• Annotation / classification
– Have a human understand the text and categorize it / describe it with words from a controlled dictionary
– That’s what libraries are crazy about and Yahoo is famous for

Content of this Lecture
• IR Models
• Boolean Model
• Vector Space Model
• Relevance Feedback in the VSM
• Probabilistic Model
• Latent Semantic Indexing
• Other IR Models

Information Retrieval Core
• The core question in IR: Which of a given set of (normalized) documents is relevant for a given query?
• Ranking: How relevant is each document of a given set for a given query?
– [Figure: query and document base are both normalized, then matched / relevance-scored to produce the result]

IR Models
• Modeling: How is relevance judged? What model of relevancy is assumed?
• Taxonomy [BYRN99]
– User task: Retrieval (ad hoc, filtering) or Browsing
– Classic models: Boolean, Vector-Space, Probabilistic
– Set-theoretic: Fuzzy, Extended Boolean
– Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
– Probabilistic: Inference Network, Belief Network
– Structured models: Non-Overlapping Lists, Proximal Nodes
– Browsing: Flat, Structure Guided, Hypertext

Notation
• All of the models we discuss use the “Bag of Words” view
• Definition
– Let K be the set of all terms, k∈K is a term
– Let D be the set of all documents, d∈D is a document
– Let vd be a vector for d with
• vd[i]=0 if the i’th term ki in w(D) is not contained in d
• vd[i]=1 if the i’th term ki in w(D) is contained in d
• Let vq be a vector for q in the same sense
– Often, we shall use weights instead of 0/1 memberships
– Let wij≥0 be the weight of term ki in document dj
• To what extent does ki contribute to the content of dj?
• wij=0 if ki∉dj
• vd and vq are defined accordingly

Content of this Lecture
• IR Models • Boolean Model • Vector Space Model • Relevance Feedback in the VSM • Probabilistic Model • Latent Semantic Indexing • Other IR Models

Boolean Model
• Simple model based on set theory
• Queries are specified as Boolean expressions
– Terms are atoms
– Terms are connected by AND, OR, NOT (XOR, …)
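The 0/1 document vectors just defined, together with a Boolean query over them, can be sketched in a few lines (toy documents; the doc and term names are illustrative). The query is the slides' example q = k1 ∧ (k2 ∨ ¬k3), evaluated directly per document, so relevance is 0 or 1.

```python
# Binary bag-of-words vectors v_d over a fixed vocabulary K,
# plus direct evaluation of the Boolean query k1 AND (k2 OR NOT k3).
docs = {
    "d1": {"k1", "k2", "k3"},   # vector (1,1,1) -> in S
    "d2": {"k1"},               # vector (1,0,0) -> in S
    "d3": {"k2", "k3"},         # vector (0,1,1) -> not in S
}
K = sorted({t for terms in docs.values() for t in terms})

def vec(d):
    # v_d[i] = 1 iff the i'th term of K occurs in d
    return [1 if k in docs[d] else 0 for k in K]

def relevant(d):
    v = dict(zip(K, vec(d)))
    return bool(v["k1"] and (v["k2"] or not v["k3"]))

hits = [d for d in docs if relevant(d)]  # ["d1", "d2"]
```

The three vectors match the set S = {(1,1,1), (1,1,0), (1,0,0)} used on the next slide: d1 and d2 fall into S and are relevant, d3 does not.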
• There are no real weights; wij∈{0,1}
– All terms contribute equally to the content of a doc
• Relevance of a document is either 0 or 1
– Computation by Boolean operations on the doc and query vector
– Represent q as a set S of vectors over all terms
• Example: q = k1 ∧ (k2 ∨ ¬k3) => S = {(1,1,1), (1,1,0), (1,0,0)}
• S describes all possible combinations of keywords that make a doc relevant
– Document d is relevant for q iff vd∈S

Properties
• Simple, clear semantics, widely used in (early) systems
• Disadvantages
– No partial matching (the expression must evaluate to true)
• Suppose the query k1∧k2∧…∧k9
• A doc d with k1∧k2∧…∧k8 is as much refused as one with none of the terms
– No ranking (extensions exist, but are not satisfactory)
– Query terms cannot be weighted
– Average users don’t like Boolean expressions
• Search engines: “+bill +clinton –lewinsky tax”
• Results: often unsatisfactory
– Too many documents (too few restrictions), and no ranking to sort them
– Too few documents (too many restrictions)

A Note on Implementation
• Our definition only gave one way of defining the semantics of a Boolean query; it should not be used for implementations
• The set S of vectors can be extremely large
– Suppose q=k1 and |K|=1000 => S has 2^999 elements
• Improvement: evaluate only those terms that are used in the query
• Usual way of implementation: indexing
– Assume we have an index with a fast operation find: K→P(D)
– Search each term ki of the query, resulting in a set Di⊆D of docs
– Evaluate the query in the usual order using set operations on the Di’s
• ki ∧ kj : Di ∩ Dj
• ki ∨ kj : Di ∪ Dj
• NOT ki : D \ Di
• Use cost-based optimization
– Evaluate first those subexpressions which hopefully yield small results

Negation in the Boolean Model
• Evaluating “NOT ki : D\Di” is expensive
– The result often is very, very large (|D\Di| ≈ |D|), because almost all terms appear in almost no documents
• Recall Zipf’s law – the tail of the distribution
• Assuming ki is not a stop word, which we removed anyway
• Solution 1: Disallow negation
– This is what web search engines do (exceptions?)
• Solution 2: Allow negation only in the form “ki ∧ NOT kj”
– Evaluation: compute Di and remove those docs that contain kj

Content of this Lecture
• IR Models • Boolean Model • Vector Space Model • Relevance Feedback in the VSM • Probabilistic Model • Latent Semantic Indexing • Other IR Models

Vector Space Model
• Salton, G., Wong, A. and Yang, C. S. (1975). "A Vector Space Model for Automatic Indexing." Communications of the ACM 18(11): 613-620.
– Historically a breakthrough in IR
– Still the most popular model today
• General idea
– Fix a vocabulary K
• Typically the set of all different terms in D
– View each doc and each query as a point in a |K|-dimensional space
– Rank docs according to their distance from the query
• Main advantages
– Ability to rank docs (according to relevance)
– Natural capability for partial matching

Vector Space
• Each term in K is one dimension
– Different suggestions for determining the co-ordinates (= weights)
– Different suggestions for determining the distance between two points
• The closest docs are the most relevant ones
– Rationale: vectors correspond to themes, which are loosely related to sets of terms
– Distance between vectors ~ distance between themes
– [Figure: docs on themes such as “Star / Astronomy / Movie stars” vs. “Mammals / Diet” in term space]

Preliminaries 1
• Definition: Let D be a document collection, K the set of all terms in D, d∈D and k∈K.
– The term frequency tfdk is the frequency of k in d
– The document frequency dfk is the number of docs in D containing k
• Or sometimes the number of occurrences of k in D
– The inverse document frequency idfk = |D| / dfk
• Actually, it should rather be called the inverse relative document frequency
• In practice, one usually uses idfk = log(|D| / dfk)
• Remarks
– idf is a popular weighting term, hence the proper definition
– Clearly, tfdk is a natural measure for the weight wdk
• But not the only or the best one

Preliminaries 2
• Recall the scalar product between vectors v and w:
– v ∘ w = |v| · |w| · cos(v, w)
– with |v| = √(∑ vi²)
– and v ∘ w = ∑i=1..n vi · wi

The Simplest Approach
• Co-ordinates are set as term frequencies
• Distance is measured as the cosine of the angle between doc d and query q:
  sim(d,q) = cos(vd, vq) = (vd ∘ vq) / (|vd| · |vq|) = ∑(vd[i] · vq[i]) / (√∑vd[i]² · √∑vq[i]²)
– |vq| is the same for every doc and can be dropped for ranking
– Euclidean distance: wait for later
• Properties
– Including term weights increases retrieval performance
– Computes a (seemingly reasonable) ranking
– Also returns partial matches

Example Data
• Assume stop word removal (wir, in, …) and stemming (verkaufen -> kauf, wollen -> woll, …)
– d1: Wir verkaufen Häuser in Italien
– d2: Häuser mit Gärten zu vermieten
– d3: Häuser: In Italien, um Italien, um Italien herum
– d4: Die italienschen Gärtner sind im Garten
– d5: Der Garten in unserem italienschen Haus blüht
– Q: Wir wollen ein Haus mit Garten in Italien mieten

      verkauf  haus  italien  gart  miet  blüh  woll
  d1     1      1       1
  d2            1               1     1
  d3            1       3
  d4                    1       2
  d5            1       1       1           1
  Q             1       1       1     1           1

Ranking without Length Normalization
• sim(d,q) = vd ∘ vq = ∑i=1..n vd[i] · vq[i]
• sim(d1,q) = 1·0 + 1·1 + 1·1 + 0·1 + 0·1 + 0·0 + 0·1 = 2
• sim(d2,q) = 1+1+1 = 3
• sim(d3,q) = 1+3 = 4
• sim(d4,q) = 1+2 = 3
• sim(d5,q) = 1+1+1 = 3
• Resulting ranking (no good?)
– 1. d3: Häuser: In Italien, um Italien, um Italien herum
– 2. d5: Der Garten in unserem italienschen Haus blüht
– 2. d4: Die italienschen Gärtner sind im Garten
– 2. d2: Häuser mit Gärten zu vermieten
– 5. d1: Wir verkaufen Häuser in Italien

Ranking with Length Normalization
• sim(d,q) = ∑(vq[i] · vd[i]) / √∑vd[i]²
• sim(d1,q) = (1·0+1·1+1·1+0·1+0·1+0·0+0·1) / √3 ≈ 1.15
• sim(d2,q) = (1+1+1) / √3 ≈ 1.73
• sim(d3,q) = (1+3) / √10 ≈ 1.26
• sim(d4,q) = (1+2) / √5 ≈ 1.34
• sim(d5,q) = (1+1+1) / √4 = 1.5
• Resulting ranking (no good?)
– 1. d2: Häuser mit Gärten zu vermieten
– 2. d5: Der Garten in unserem italienschen Haus blüht
– 3. d4: Die italienschen Gärtner sind im Garten
– 4. d3: Häuser: In Italien, um Italien, um Italien herum
– 5. d1: Wir verkaufen Häuser in Italien

Improved Scoring: TF*IDF
• One obvious problem: the longer the document, the higher the chance of being ranked high
– Solution: normalize on document length (also yields 0 ≤ wij ≤ 1):
  wij = tfij / |di|, with |di| = ∑k=1..|di| tfik
• Second obvious problem: some terms are just everywhere in D and should be scored less
– Solution: use IDF scores:
  wij = (tfij / |di|) · idfj

TF*IDF Short Form
• Give terms in a doc d high weights which are
– frequent in d and
– infrequent in D
• IDF deals with the consequences of Zipf’s law
– The few very frequent (and unspecific) terms get low scores
– The many infrequent (and specific) terms get higher scores

Example TF*IDF
• wij = (tfij / |di|) · idfj = (tfij / |di|) · (|D| / dfj); sim(d,q) = ∑(vq[i] · vd[i]) / √∑vd[i]²
• IDF values: verkauf 5, haus 5/4, italien 5/4, gart 5/3, miet 5, blüh 5 (woll does not occur in D)
• tf/|d| values: d1: 1/3 each for verkauf, haus, italien; d2: 1/3 each for haus, gart, miet; d3: haus 1/4, italien 3/4; d4: italien 1/3, gart 2/3; d5: 1/4 each for haus, italien, gart, blüh; Q: weight 1 per query term
• sim(d1,q) = (5/4·1/3 + 5/4·1/3) / √0.3 ≈ 1.51
• sim(d2,q) = (5/4·1/3 + 5/3·1/3 + 5·1/3) / √0.3 ≈ 4.80
• sim(d3,q) = (5/4·1/4 + 5/4·3/4) / √0.63 ≈ 1.57
• sim(d4,q) = (5/4·1/3 + 5/3·2/3) / √0.56 ≈ 2.08
• sim(d5,q) = (5/4·1/4 + 5/4·1/4 + 5/3·1/4) / √0.25 ≈ 2.08
• Resulting ranking
– 1. d2: Häuser mit Gärten zu vermieten
– 2. d5: Der Garten in unserem italienschen Haus blüht
– 2. d4: Die italienschen Gärtner sind im Garten
– 4. d3: Häuser: In Italien, um Italien, um Italien herum
– 5. d1: Wir verkaufen Häuser in Italien

Further Improvements
• Score the query in the same way as the documents
– Note: repeating words in a query then makes a difference
• Empirically estimated “best” scoring function [BYRN99]
– Different length normalization:
  wij = tfij / max tfij  (maximum term frequency in the document)
– Use the log of IDF with a slightly different measure (nj: number of docs containing kj); rare terms no longer totally distort the score:
  wij = (tfij / max tfij) · log(|D| / nj)

Why not Simpler?
• Why not use simple Euclidean distance?
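As an aside, the tf·idf weighting with cosine ranking described above can be sketched end to end. The toy data is modeled loosely on the slides' house/garden example; the simplified idf = |D|/df and a 0/1-weighted query are carried over from the slides, and all names are illustrative.

```python
import math

docs = {
    "d1": ["haus", "italien", "verkauf"],
    "d2": ["haus", "gart", "miet"],
    "d3": ["haus", "italien", "italien", "italien"],
}
query = ["haus", "gart", "italien", "miet", "woll"]

D = len(docs)
df = {}                                  # document frequency per term
for terms in docs.values():
    for t in set(terms):
        df[t] = df.get(t, 0) + 1

def weights(terms):
    # w = (tf / |d|) * idf, with the simplified idf = |D| / df
    n = len(terms)
    return {t: (terms.count(t) / n) * (D / df[t])
            for t in set(terms) if t in df}

def cosine(wd, wq):
    dot = sum(wd.get(t, 0.0) * wq.get(t, 0.0) for t in set(wd) | set(wq))
    nd = math.sqrt(sum(v * v for v in wd.values()))
    nq = math.sqrt(sum(v * v for v in wq.values()))
    return dot / (nd * nq) if nd and nq else 0.0

wq = {t: 1.0 for t in query if t in df}  # query terms scored 0/1; "woll" drops out
ranked = sorted(docs, key=lambda d: cosine(weights(docs[d]), wq), reverse=True)
# d2 wins: it matches three query terms, two of them rare in the collection
```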
• The length of vectors would become much more important
• Since queries usually are very short, very short documents would always win
• The cosine measure normalizes by the length of both vectors

Metaphor: Comparison to Clustering
• Clustering: find groups of objects that are close to each other and far apart from all others
• Rephrasing the IR problem
– A query partitions D into two clusters: relevant R / not relevant N
– q should lie in the heart of R: similarity
– Docs in R should be close to each other: large wij·qi values
– Docs in R should be far apart from other points: idf
– Goal of scoring: balance between intra- and inter-cluster similarity

Shortcomings
• We assumed that all terms are independent, i.e., that their vectors are orthogonal
– Clearly wrong: terms are semantically close or not
• The appearance of “red” in a doc with “wine” doesn’t mean much
• But “wine” is a non-zero match for “drinks”
– Extension: Topic-based Vector Space Model
• No treatment of homonyms (as in most other methods)
– We would need to split words into their senses
– See word sense disambiguation (later)
• Order independence
– But order carries semantic meaning (object? subject?)
– Order is important for disambiguation

A Note on Implementation
• Again: assume we have an index with the operation find: K→P(D)
• Assume we want to retrieve the top-r docs
• Core algorithm
– Look up all terms ki of the query
– Build the union of all documents which contain at least one ki
• Hold them in a list sorted by current score (initialized with 0)
– Walk through the terms ki in order of decreasing IDF weights
• For each document dj: add wji·IDFi to its current score sj
• Go through the docs in order of current score
– Early stop
• Look at sjr – sjr+1: can the doc at rank r+1 still overtake the doc at rank r?
• Can be estimated from the distribution of TF and IDF values
• An early stop might produce the wrong order of the top-r docs, but not wrong docs – why?

A Different View on VSD
• Query evaluation actually searches for the top-r nearest neighbors
• This can be approached using techniques from multidimensional indexing
– kd-trees, Grid Files, etc.
– Hope: no sequential scan of (all, many) docs, but a directed search according to distances
• But a severe problem: high dimensionality

Content of this Lecture
• IR Models • Boolean Model • Vector Space Model • Relevance Feedback in the VSM • Probabilistic Model • Latent Semantic Indexing

Interactive IR
• Recall: IR is the process of finding the right information
– This includes more than a single query
• Relevance feedback
– User poses a query
– System answers
– User judges the relevance of the top-k returned docs
– System considers the feedback and generates a new (improved) answer
– Loop until satisfaction
• The user never needs to pose another query; each new query is generated by the system

Relevance Feedback
• User judges the current answers
• Basic assumptions
– Relevant docs are somewhat similar to each other – the common core should be emphasized
– Irrelevant docs are different from relevant docs – the differences should be de-emphasized
• System adapts to the feedback by
– Query expansion: add new terms to the query (from the relevant documents)
– Term re-weighting: assign new weights to terms (from the relevant documents)

Rationale
• Assume we know the set R of all relevant docs
• What would a good query for R look like?
• Optimal query vector:
  vq = (1/|R|) · ∑d∈R d − (1/(|D|−|R|)) · ∑d∉R d
• Idea
– Weight terms high that are in R
– Weight terms low that are not in R, i.e., that are in N = D\R
• Remark
– It is not clear that this is the best of all queries
– Relevance feedback is a heuristic

Rocchio Algorithm
• Usually we do not know the real R
• Best bet: let R (N) be the set of docs marked as relevant (irrelevant)
• Adapt the query vector as follows:
  vq_new = α·vq + β·(1/|R|)·∑d∈R d − γ·(1/|N|)·∑d∈N d
• This implies query expansion and term re-weighting
– How to choose α, β, γ?
• Alternative:
  vq_new = α·vq + β·(1/|R|)·∑d∈R d − γ·dmax, where dmax = argmaxd∈N sim(q,d)
– Intuition: non-relevant docs are heterogeneous and tear the query in every direction; better to take only the worst one

Example (simplified)
• α=0.5, β=0.5, γ=0
• d1 = “information science” = (0.2, 0.8)
• d2 = “retrieval systems” = (0.9, 0.1)
• q = “retrieval of information” = (0.7, 0.3)
• If d1 is marked relevant: q’ = ½·q + ½·d1 = (0.45, 0.55)
• If d2 is marked relevant: q’’ = ½·q + ½·d2 = (0.80, 0.20)
• [Figure: vectors in the retrieval/information plane; source: A. Nürnberger: IR]

Results
• Advantages
– Improved results (many studies) compared to single queries
– Users need not generate new queries themselves
– Iterative process converging to the best possible answer
– Especially helpful for increasing recall (due to query expansion)
• Disadvantages
– Requires some work by the user
• Excite: only 4% used relevance feedback (the “more of this” button)
– Writing a new query based on the returned results might be faster (and easier and more successful) than classifying results
– Based on the assumption that relevant docs are similar
• What if the user searches for all meanings of “jaguar”?
– The query is very long already after one iteration, which makes retrieval slow

Collaborative Filtering
• What other information could we use for improving retrieval performance?
• Collaborative filtering: give the user what other, similar users liked
– “Customers who bought this book also bought …”
– In IR: find users posing similar queries and look at which documents they eventually looked at
• In e-commerce, it is the buying itself (very reliable)
• In IR, we need to approximate
– Documents a searcher clicked on (if recordable)
– Did the user look at the second page? (Low credit for the first results)
– Did the user pose a “refinement query” next?
– All of these are not very reliable
• Problems: What is a similar user? What is a similar query?

Thesaurus-based Query Expansion [M07, CS276]
• Expand the query with synonyms and hyponyms of each term
– feline → feline cat
– One may weight added terms less than original query terms
• Often used in scientific IR systems (Medline)
• Requires a high-quality thesaurus
• General observation
– Increases recall
– May significantly decrease precision
• “interest rate” → “interest rate fascinate evaluate”
– Do synonyms really exist?
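The Rocchio update can be sketched directly, reusing the simplified example from the slides (α=β=0.5, γ=0); the function signature and vector representation are illustrative choices.

```python
# Rocchio: v_q_new = alpha*v_q + beta*mean(relevant) - gamma*mean(non-relevant)
def rocchio(q, relevant, nonrelevant, alpha=0.5, beta=0.5, gamma=0.0):
    dims = len(q)
    def mean(vs):
        if not vs:
            return [0.0] * dims
        return [sum(v[i] for v in vs) / len(vs) for i in range(dims)]
    r, n = mean(relevant), mean(nonrelevant)
    return [alpha * q[i] + beta * r[i] - gamma * n[i] for i in range(dims)]

q  = [0.7, 0.3]   # "retrieval of information"
d1 = [0.2, 0.8]   # "information science"
d2 = [0.9, 0.1]   # "retrieval systems"

q1 = rocchio(q, [d1], [])   # d1 marked relevant -> (0.45, 0.55)
q2 = rocchio(q, [d2], [])   # d2 marked relevant -> (0.80, 0.20)
```

The updated query moves toward whichever document was marked relevant, which is exactly the query expansion / re-weighting effect described above.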