Text Analytics
Language Modeling
Ulf Leser

Content of this Lecture
• Language Models
• Markov Models
• Data sparsity
• Case study: Dialect classification
• Most material from [MS99], Chapter 6

Problem
• Given a prefix of a sentence: predict the next word
  – "At 5 o'clock, we usually drink …"
  – "tea" – quite likely
  – "beer" – quite unlikely
  – "a beer" – slightly more likely, but still unlikely
  – "biscuits" – semantically wrong
  – "the windows need cleaning" – syntactically wrong
• Similar to Shannon's game: given a series of characters, predict the next one (used in communication theory)
• Abstract formulation: Given a language L and the prefix P[1..n] of a sequence S, S∈L: predict S[n+1]
• Usually, this is a ranking problem – there is no crisp solution

Applications
• Speech/character recognition
  – Given the transcribed prefix of a sentence – which words do we expect next?
• Automatic translation
  – Given the translated prefix of a sentence – which words do we expect next?
• T9 on mobiles
  – T9: use the prediction of the next word as a-priori (background) probability for enhancing interactive prediction
• Helps to disambiguate between different options
• Helps to make useful suggestions
• Helps to point to likely errors (observation ≠ expectation)

Language Models
• Usual approach: learn/use a model of the language
• Classical approach: grammars
  – Every language has a grammar (regular, context-free, …)
  – Grammars can be learned from examples
    • Not trivial, underdetermined, not covered here
  – But: usually, multiple continuations of a prefix are allowed
  – (Deterministic) grammars do not help in deciding which continuation is the most probable one
• Better: probabilistic grammars
  – Probabilistic automata: transitions have a relative frequency
  – Also called "sequential models"

N-Grams over Words
• A very simple, yet difficult to beat, statistical approach to language modeling is the n-gram model
  – "Indeed, it is difficult to beat a trigram model on the purely linear task of predicting the next word" [MS99]
• Definition: An n-gram (over words) is a sequence of n words.
• Count the frequencies of all n-grams in a corpus
  – Slide a window of size n over the text and keep a counter for each n-gram ever seen
• Given a prefix, predict the most probable continuation(s) based on these frequencies – how?

N-Grams for Language Modeling
• Assume a sentence prefix with n−1 words ⟨w1,…,wn−1⟩
• Look up the counts of all n-grams starting with ⟨w1,…,wn−1⟩, i.e., the n-grams ⟨w1,…,wn−1,wn⟩
• Choose the most frequent wn
• More formally: compute, for every possible wn,
$$p(w_n) = p(w_n \mid w_1,\dots,w_{n-1}) = \frac{p(w_1,\dots,w_n)}{p(w_1,\dots,w_{n-1})}$$
• Choose the wn which maximizes p(wn)
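A minimal sketch of this counting-and-lookup scheme (the toy corpus and function names are illustrative, not from the lecture):

```python
from collections import Counter

def train_ngrams(tokens, n=3):
    """Slide a window of size n over the token stream and count each n-gram."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict_next(counts, prefix):
    """Return the word w maximizing count(prefix + w), i.e. the w with the
    highest estimated p(w | prefix)."""
    candidates = Counter()
    for gram, c in counts.items():
        if gram[:-1] == tuple(prefix):
            candidates[gram[-1]] += c
    return candidates.most_common(1)[0][0] if candidates else None

tokens = "we drink tea . we drink tea . we drink beer .".split()
trigrams = train_ngrams(tokens, n=3)
print(predict_next(trigrams, ["we", "drink"]))  # -> 'tea' (seen twice vs. once)
```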
Stochastic Processes
• Which n should we choose?
• Consider language generation as a sequential stochastic process
• At each stage, the process generates a new word
  – Like a DFA, but transitions have probabilities
• Question: How big is the memory? How many previous words does the process use to determine the next step?
  – 0: Markov chain of order 0: no memory at all
  – 1: Markov chain of order 1: the next word depends only on the previous word
  – 2: Markov chain of order 2: the next word depends only on the 2 previous words
  – …

Which n?
• In language modeling, one usually chooses n=3–4
• That seems small, but most language effects are local
  – But not all: "Dan swallowed the large, shiny, red …" (Car? Pill? Strawberry?)
• Furthermore: we cannot obtain sensible relative frequencies for higher orders
  – Not enough training data – see later

Content of this Lecture
• Language Models
• Markov Models
• Data sparsity
• Case study: Dialect classification
• Most material from [MS99], Chapter 6

History and Applications
• Andrej Andrejewitsch Markov (1856–1922)
  – Russian mathematician
  – Developed Markov models (or Markov chains) as a method for analyzing language
  – Markov, A. A. (1913). "Example of a statistical investigation of the text 'Eugene Onegin' illustrating the dependence between events in a chain" (original in Russian). Bulletin de l'Académie Impériale des Sciences de St.-Pétersbourg: 153–162.
• Markov models and Hidden Markov Models are popular in
  – Language modeling, part-of-speech tagging
  – Speech recognition
  – Named entity recognition / information extraction
  – Biological sequence analysis

Markov Models
• Definition: Assume an alphabet Σ. A Markov model of order 1 is a sequential stochastic process with |Σ| states s1, …, sn such that
  – Every state emits exactly one symbol from Σ
  – No two states emit the same symbol
  – For a sequence ⟨w1,w2,…⟩ of states, the following holds:
$$p(w_n = s_n \mid w_{n-1} = s_{n-1},\dots,w_1 = s_1) = p(w_n = s_n \mid w_{n-1} = s_{n-1})$$
  – For order k, the right-hand side is instead $p(w_n = s_n \mid w_{n-1} = s_{n-1},\dots,w_{n-k} = s_{n-k})$
  – The $a_{i,j} = p(w_n = s_j \mid w_{n-1} = s_i)$ are called transition probabilities
• Remarks
  – In our case, Σ = {set of all words of a language}
  – Computing start probabilities is an issue we essentially ignore

Visualization
• Since every state emits exactly one word, we can merge states and words
• State transition graph
  – Nodes are states (labeled with their emission)
  – Arcs are transitions, labeled with a non-zero probability
• Example: "I go home", "I go shopping", "I am shopping", "I go abroad", "Go shopping"
  – [Figure: transition graph with arcs I→go (0.75), I→am (0.25), am→shopping (1), go→home (0.25), go→shopping (0.5), go→abroad (0.25)]

Probability of a Sequence of States (= a Sentence)
• Assume a Markov model M of order 1 and a sequence S of states with |S| = n
• With which probability was S generated by M, i.e., what is the value of p(S|M)?
$$p(S \mid M) = p(w_1 = S[1]) \cdot \prod_{i=2}^{n} p(w_i = S[i] \mid w_{i-1} = S[i-1]) = a_{0,S[1]} \cdot \prod_{i=2}^{n} a_{S[i-1],S[i]}$$

Example
• Using the transition graph from the previous slide:
$$p(\text{“I go home”}) = p(w_1 = \text{“I”}) \cdot p(w_2 = \text{“go”} \mid w_1 = \text{“I”}) \cdot p(w_3 = \text{“home”} \mid w_2 = \text{“go”}) = 1 \cdot 0.75 \cdot 0.25 = 0.1875$$
• Problem: pairs we have not seen in training get probability 0
  – Example: "I am abroad"
  – With this small "corpus", almost all transitions get p = 0
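The same computation as a small sketch; the artificial start state "^" with probability 1 for "I" is a simplification, in line with the lecture's decision to ignore start probabilities:

```python
# Transition probabilities of the toy model above; "^" is an artificial start state.
A = {
    ("^", "I"): 1.0,
    ("I", "go"): 0.75, ("I", "am"): 0.25,
    ("go", "home"): 0.25, ("go", "shopping"): 0.50, ("go", "abroad"): 0.25,
    ("am", "shopping"): 1.0,
}

def sequence_probability(sentence, A):
    """Multiply the transition probabilities along the sentence; transitions
    never seen in training contribute a factor of 0."""
    p, prev = 1.0, "^"
    for w in sentence.split():
        p *= A.get((prev, w), 0.0)
        prev = w
    return p

print(sequence_probability("I go home", A))    # 1 * 0.75 * 0.25 = 0.1875
print(sequence_probability("I am abroad", A))  # 0.0 -- ("am", "abroad") unseen
```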
Higher Order Markov Models
• Markov models of order k
  – The probability of being in state s after n steps depends on the k predecessor states sn−1, …, sn−k:
$$p(w_n = s_n \mid w_{n-1} = s_{n-1},\dots,w_1 = s_1) = p(w_n = s_n \mid w_{n-1} = s_{n-1},\dots,w_{n-k} = s_{n-k})$$
• We can trivially transform any model M of order k (k>1) into a Markov model M′ of order 1
  – M′ has |M|^k states (all combinations of k states)
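A sketch of this transformation for k=2, under the assumption that the model is stored as a dictionary of transition probabilities (names and layout are illustrative): each pair of states becomes one state of the order-1 model, and the probability carries over unchanged.

```python
def order2_to_order1(A2):
    """Rewrite order-2 transitions p(w3 | w1, w2) as order-1 transitions
    between pair states, (w1, w2) -> (w2, w3), with the same probability."""
    A1 = {}
    for (w1, w2, w3), p in A2.items():
        A1[((w1, w2), (w2, w3))] = p
    return A1

# Toy order-2 model: after "I go", "home" and "shopping" are equally likely.
A2 = {("I", "go", "home"): 0.5, ("I", "go", "shopping"): 0.5}
print(order2_to_order1(A2))
# {(('I','go'), ('go','home')): 0.5, (('I','go'), ('go','shopping')): 0.5}
```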
Predicting the Next State
• The problem of language modeling is a bit different – and simpler
• We do not want to reason about an entire sequence, but only about the next state, given some previous states
  – We borrow terminology from Markov models, but not (yet) algorithms
$$p(w_n) = p(w_n \mid w_1,\dots,w_{n-1}) = p(w_n \mid w_{n-1}) = \frac{p(w_{n-1},w_n)}{p(w_{n-1})} \sim p(w_{n-1},w_n)$$
• This is the most frequent bi-gram with prefix wn−1

Problem
• We learn our transition probabilities from a limited sample
• Thus, we actually only estimate the true transition probabilities
  – Which we will never know, regardless of how big our training corpus is
• This introduces a systematic error which we can try to remove
  – Sample selection is important
  – The problem is researched a lot in statistics
• Extreme cases
  – Transitions we do not see at all in the corpus get a probability of 0 and will never be predicted
  – This does not mean that they are non-existent in the language
• Our model cannot (yet) adequately cope with data sparsity

Importance of Data Sparsity
• How many n-grams exist?
  – Assume a language of 20,000 words
  – n=1: 20,000; n=2: 4E8; n=3: 8E12; n=4: 1.6E17; …
• In "normal" corpora over large alphabets (like natural languages), almost all n-grams (n>4) are very sparse
  – The exponential growth cannot be balanced by "use larger corpora"
  – Of course, not all n-grams really exist, but still
  – Especially rare n-grams are prone to be overlooked
• Trade-off
  – Large n: more expressive model, but frequency estimates are bad
  – Small n: less expressive model, but better estimates

Solutions we will not Discuss in Detail
• Reduce the number of words using stemming
  – Might help to go from n=3–4 to n=4–5
  – But very important grammatical clues are lost
• More abstract: use some form of "binning" of tokens into classes and compute n-grams over token classes, not tokens
  – All numbers → one class
  – All verbs → one class (POS tags)
  – All verbs related to "movement" → one class

Content of this Lecture
• N-gram models
• Data sparsity
• Case study: Dialect classification

Example
• Unigrams: always predict the most frequent word in the corpus; does not differentiate
• Bi-grams: the correct word often ranks high, but not always
• Tri-grams: has a perfect hit, but already suffers from sparsity
• Four-grams: unusable
• Corpus: a fraction of Jane Austen's oeuvre, ~600,000 tokens

Statistical Estimators
• We were a bit sloppy so far
• We want
$$p(w_n) = p(w_n \mid w_1,\dots,w_{n-1}) = \frac{p(w_1,\dots,w_n)}{p(w_1,\dots,w_{n-1})}$$
• But we only have count(w1,…,wn)
• So far, we always implicitly assumed
$$p(w_1,\dots,w_n) = \frac{count(w_1,\dots,w_n)}{N}$$
  – N: the number of all observed n-grams

MLE for N-gram Models
• This is called a Maximum Likelihood Estimator (MLE)
• The MLE gives maximum likelihood to the training data
  – Provably, if n-grams are independent, which they are not
  – Gives probability zero to all events not in the training data
  – The probability mass is spent entirely on the training data
    • Recall: probabilities must sum up to 1
  – Prone to overfitting
• Does not include background knowledge external to the corpus
• We need to smooth the estimates to account for the restrictedness of the sample

Smoothing I: Laplace's Law
• Give some probability mass to unseen events
• Oldest (and simplest) suggestion: "adding 1"
$$p_{LAP}(w_1,\dots,w_n) = \frac{count(w_1,\dots,w_n) + 1}{N + B}$$
  – Where B is the number of possible n-grams, i.e., K^n for a vocabulary of size K
  – All n-grams get a probability ≠ 0
• But: moves too much mass to the unknown
  – Estimates for seen n-grams are scaled down dramatically
  – Estimates for unseen n-grams are small, but there are very many of them
    • And many of them are truly impossible
  – In a corpus of 40M words with K ≈ 400,000, 99.7% of the total probability mass is spent on unseen events
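A minimal sketch of the add-one estimate; the counts and sizes below are toy values. Replacing the 1 by a smaller λ gives Lidstone's law (next slide).

```python
def p_laplace(ngram, counts, N, B):
    """Add-one estimate: (count + 1) / (N + B), where N is the number of
    observed n-grams and B the number of possible ones."""
    return (counts.get(ngram, 0) + 1) / (N + B)

counts = {("we", "drink", "tea"): 2, ("we", "drink", "beer"): 1}
N = sum(counts.values())  # observed n-grams
K, n = 20_000, 3          # vocabulary size and n-gram length (from the slides)
B = K ** n                # 8e12 possible trigrams
print(p_laplace(("we", "drink", "tea"), counts, N, B))   # seen: mass shrinks a lot
print(p_laplace(("we", "drink", "wine"), counts, N, B))  # unseen: small, but != 0
```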
Smoothing II: Lidstone's Law
• Laplace is not suitable if there are many possible events, but few seen ones
• Lidstone's law gives less probability mass to unseen events
$$p_{LID}(w_1,\dots,w_n) = \frac{count(w_1,\dots,w_n) + \lambda}{N + \lambda B}$$
  – Small λ: more mass is given to seen events
  – A typical estimate is λ = 0.5
  – Appropriate values can be learned (next slide)
• Still: the estimate for seen events is linear in the MLE estimate
  – Not a good approximation of empirical distributions
• More advanced techniques: Good-Turing estimator, n-gram interpolations

Learning Appropriate Values for λ
• We "simulate" seen and unseen events
• Divide the corpus into two disjoint parts C1 and C2
• Count the frequencies of n-grams in C1
• Count c: how many n-grams in C2 are not present in C1?
• Set λ = c/B
  – The probability of an n-gram (in C2) to be considered as non-existent although in reality it does exist

Option III: Back-Off Models
• If we cannot find an n-gram with count ≠ 0, use an (n−1)-gram
  – Or an (n−2)-gram, …
• Thus, in case p(w1,…,wn) = 0, we "back off" to a simpler model:
$$p(w_n \mid w_1,\dots,w_{n-1}) = \frac{p(w_1,\dots,w_n)}{p(w_1,\dots,w_{n-1})} \approx \frac{p(w_2,\dots,w_n)}{p(w_2,\dots,w_{n-1})} \approx \frac{p(w_3,\dots,w_n)}{p(w_3,\dots,w_{n-1})} \approx \dots$$
  – Stop at the first (n−k)-gram with a non-zero count
• Alternative: always look at the different n's, with different weights (interpolation)
$$p(w_n) = \lambda_1 \frac{p(w_{n-2},w_{n-1},w_n)}{p(w_{n-2},w_{n-1})} + \lambda_2 \frac{p(w_{n-1},w_n)}{p(w_{n-1})} + \lambda_3 \frac{count(w_n)}{N}$$
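A sketch of the back-off rule, assuming one dictionary that holds counts for all n-gram lengths (names are illustrative; practical back-off models such as Katz's also discount the estimates, which is omitted here):

```python
def backoff_prob(gram, counts):
    """Estimate p(w_n | context) from the longest context suffix whose
    extended n-gram has a non-zero count; otherwise drop the left-most word."""
    while len(gram) > 1:
        num = counts.get(gram, 0)       # count(w_1, ..., w_n)
        den = counts.get(gram[:-1], 0)  # count(w_1, ..., w_{n-1})
        if num > 0 and den > 0:
            return num / den
        gram = gram[1:]                 # back off to the (n-1)-gram
    return 0.0  # a real model would fall back to a smoothed unigram here

counts = {
    ("drink", "tea"): 3, ("drink",): 4,  # the bigram level is populated ...
    # ... but the trigram ("usually", "drink", "tea") was never seen
}
print(backoff_prob(("usually", "drink", "tea"), counts))  # -> 0.75 = p(tea|drink)
```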
Content of this Lecture
• N-gram models
• Data sparsity
• Case study on n-gram power: Dialect classification
  – Kristian Basler, Yvonne Grunow, Ralph Vidal, Stephan Wagner, Hagen Zahn: Seminararbeit, 2006/2007

How do they know?
• [Figure]

Dialect Classification
• Let's use n-grams for something different
• Given a text, determine its human language
• If the set of possible languages is fixed, this is a classification problem: assign a text its class (language)
  – More on classification: later in this course
• General approach
  – Gather example pages for each language
  – Learn a model for each language
  – Score a new page wrt. each language model
  – The model that scores highest determines the class/language
• What type of models can we use?

Dialects and Languages
• We make it more difficult: classifying dialects
• "A language is a dialect with an army and a navy"
  – Probably Max Weinreich, 1945
• There is no clear distinction
  – German – Dutch: gradual change when approaching and crossing the border
  – "Serbian", "Croatian", "Serbo-Croatian" – how languages may emerge
  – Myth? In the Middle Ages, one could travel from Spain to Norway without ever meeting two people within a one-day march who would not understand each other [source unknown]
• Differences: usually, dialects are regionally bound and have no standard written form

Examples Over Time
• Althochdeutsch (750–1050): "Gilaubiu in got fater almachtigon scepphion himmilis enti erda"
• Mittelhochdeutsch (1050–1450): "Ich geloube an got vater almechtigen schephaer himels unde der erde"
• Neuhochdeutsch (1450–): "Ich glaube an Gott Vater den allmächtigen Schöpfer des Himmels und der Erde"
  – ("I believe in God the Father, the almighty creator of heaven and earth")

German Dialects in the 20th Century
• [Figure: dialect map]

Methods and Data
• Data
  – Corpus of 1 GB of web pages with approx. 60 billion tokens
  – Contained <100 documents in any one of 40 "German" dialects
• Training
  – Manually collected samples from the web
  – Approximately 100 KB of dialect text
  – The same amount of Hochdeutsch (standard German)
• Since it was very difficult to collect sufficient training samples, the task was changed: classify a text as dialect or not
  – But do not identify the concrete dialect itself

Dialects
• Nordniederdeutsch, Niederfränkisch, Westfälisch, Ostfälisch, Mecklenburgisch-Vorpommersch, Mittelpommersch, Ostpommersch, Niederpreußisch, Hochpreußisch, Brandenburgisch, Berlinisch, Ripuarisch, Moselfränkisch, Zentralhessisch, Südhessisch, Nordhessisch, Thüringisch, Nord-Obersächsisch, Obersächsisch, Südmärkisch, Schlesisch, Rheinfränkisch, Pennsylvania German, Ostfränkisch, Niederalemannisch, Schwäbisch, Hochalemannisch, Höchstalemannisch / Walserdeutsch, Nordbairisch, Mittelbairisch

Examples
• Oberlausitzer: "Wu bei dr Spraa, de Arlnsträucher stiehn und Butterblum'n uff dr Wiese blühn, do steigt de Brücke über's Woasser drüber, a schmoaler Waig gitt noa Windschsohland nüber."
• Alemannisch: "Dr Stadtdeil isch 1250 dur d'Fürste zue Fürsteberg gründet worre. Es sin diversi Nammensänderige erfolgt."
• Pommerisch – Plattdeutsch (song): "Wo de Ostseewellen trecken an den Strand, wo de gele Ginster bläuht in'n Dünensand, wo de Möven schriegen grell in't Stormgebruus', dor is mine Heimat, dor bün ik tau Huus."
• Jiddisch: "Schpil ssche mir a tango ojs in jidisch, sol doss sojn missnagdisch oder chassidisch. As di bobele alejn sol kenen doss farschtejn un take a tenzele gejn."
• Schwäbisch: "Auf de' schwäb'sche Eisebahne gibt's gar viele Haltstatione: Schtuagart, Ulm und Biberach, Mekkabeure, Durlesbach."
• Kölsch: "Als kleine Krott küss de op de Welt, häs nix zo sage, häs kei Geld. Weiß nit richtig, wä de bes, do denks, verdammte Dress!"
Dialect Classification
• VSM approach
  – Hochdeutsch documents will have other features than dialect documents
    • The sets of Hochdeutsch words and dialect words have only a small overlap
    • The classifier should learn how to quantify this difference
  – Problem: words we have not seen in the training data are ignored for classification (a large influence if the training data is sparse)
• N-gram approach: use each character n-gram as a feature (for n=2,3,4)
  – Consider: "Rusalka je studentka hudz'by"
  – But: "Als kleine Krott küss de op de Welt"
  – Advantage: fewer unseen character n-grams than unseen words
  – Advantage: far fewer features

Results
• [Figure]

Results
• Of 391 dialect texts, 385 were identified correctly
  – 5 FP, 1 FN
• The n-gram and BoW approaches are equally good
  – The n-gram model is much faster and requires less space
• But: even simpler methods achieve equal performance – a simple word count (sketched below)
  – Consider the disjoint sets V (Hochdeutsch) and W (dialect)
  – Count for each document the number of words from V (v) and from W (w)
  – Set score = w/(v+w)
  – If score > threshold, classify the text as dialect
  – Runtime: 14 min (on 1 GB) versus ~20 hours for the SVM
• Conclusion: dialect classification is simple
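A minimal sketch of this word-count baseline; the two mini-vocabularies below are hypothetical stand-ins for the real V and W, and the threshold is a free parameter:

```python
def dialect_score(text, V, W):
    """score = w / (v + w): v and w count tokens from the Hochdeutsch
    vocabulary V and the dialect vocabulary W (V and W are disjoint)."""
    tokens = text.lower().split()
    v = sum(t in V for t in tokens)
    w = sum(t in W for t in tokens)
    return w / (v + w) if v + w else 0.0

# Hypothetical mini-vocabularies for illustration only.
V = {"ich", "glaube", "an", "gott", "und"}  # Hochdeutsch
W = {"isch", "worre", "häs", "nix", "bün"}  # dialect markers
text = "häs nix zo sage, häs kei Geld"
print(dialect_score(text, V, W) > 0.5)  # True -> classify as dialect
```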