Text Analytics

Language Modeling
Ulf Leser
Content of this Lecture
• Language Models
• Markov Models
• Data sparsity
• Case study: Dialect classification
• Most material from [MS99], Chapter 6
Ulf Leser: Text Analytics, Winter Semester 2010/2011
Problem
• Given a prefix of a sentence: Predict the next word
– “At 5 o’clock, we usually drink …”
• “tea” – quite likely
• “beer” – quite unlikely
• “a beer” – slightly more likely, but still
• “biscuits” – semantically wrong
• “the windows need cleaning” – syntactically wrong
• Similar to Shannon’s Game: Given a series of characters,
predict the next one (used in communication theory)
• Abstract formulation: Given a language L and the prefix
P[1..n] of a sequence S, S∈L: Predict S[n+1]
• Usually, this is a ranking problem – no crisp solution
Applications
• Speech/character recognition
– Given the transcribed prefix of a sentence – which words do we
expect next?
• Automatic translation
– Given the translated prefix of a sentence – which words do we expect next?
• T9 on mobiles
– T9: Use prediction of next word as a-priori (background) probability
for enhancing interactive prediction
• Help to disambiguate between different options
• Helps to make useful suggestions
• Helps to point to likely errors (observation ≠ expectation)
Language Models
• Usual approach: Learn/use a model of the language
• Classical approach: Grammars
– Every language has a grammar (regular, context-free, …)
– Grammars can be learned from examples
• Not trivial, underdetermined, not covered here
– But: Usually, multiple continuations of a prefix are allowed
– (Deterministic) Grammars do not help in deciding which is the most
probable one
• Better: Probabilistic grammars
– Probabilistic automata: Transitions have a relative frequency
– Also called “sequential models”
N-Grams over Words
• A very simple yet difficult-to-beat statistical approach to
language modeling is the n-gram model
– “Indeed, it is difficult to beat a trigram model on the purely linear
task of predicting the next word” [MS99]
• Definition
An n-gram (over words) is a sequence of n words.
• Count frequencies of all n-grams in a corpus
– Slide window of size n over the text and keep counter for each ever
seen n-gram
• Given a prefix, predict most probable continuation(s) based
on the frequencies – how?
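The counting step can be sketched in a few lines of Python (an illustrative sketch; the function and variable names are my own):

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Slide a window of size n over the token list and count every n-gram seen."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "i go home i go shopping i am shopping".split()
bigrams = count_ngrams(tokens, 2)
print(bigrams[("i", "go")])  # 2
```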
N-Grams for Language Modeling
• Assume a sentence prefix with n-1 words <w1,…,wn-1>
• Lookup counts of all n-grams starting with <w1,…,wn-1>
– I.e., n-grams <w1,…,wn-1,wn>
• Choose the most frequent wn
• More formally
– Compute, for every possible wn:
  p(wn) = p(wn | w1,…,wn−1) = p(w1,…,wn) / p(w1,…,wn−1)
– Choose the wn which maximizes p(wn)
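This lookup can be sketched as follows, assuming the n-gram counts are already available as a `Counter` of word tuples (names are illustrative):

```python
from collections import Counter

def predict_next(counts, prefix):
    """Among all n-grams starting with the (n-1)-word prefix,
    return the final word of the most frequent one."""
    candidates = {g[-1]: c for g, c in counts.items() if g[:-1] == tuple(prefix)}
    return max(candidates, key=candidates.get) if candidates else None

tokens = "we drink tea we drink tea we drink beer".split()
trigrams = Counter(tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2))
print(predict_next(trigrams, ("we", "drink")))  # tea
```

Since the prefix count is the same for all candidate continuations, maximizing the n-gram count is equivalent to maximizing the conditional probability.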
Stochastic Processes
• Which n should we choose?
• Consider language generation as a sequential stochastic
process
• At each stage, the process generates a new word
– Like a DFA, but transitions have probabilities
• Question: How big is the memory? How many previous
words does the process use to determine the next step?
– 0: Markov chain of order 0: No memory at all
– 1: Markov chain of order 1: Next word only depends on the previous word
– 2: Markov chain of order 2: Next word only depends on the 2 previous words
– …
Which n?
• In language modeling, one usually chooses n=3-4
• That seems small, but most language effects are local
– But not all: “Dan swallowed the large, shiny, red …” (Car? Pill?
Strawberry?)
• Furthermore: We cannot obtain sensible relative frequencies
for higher orders
– Not enough training data – see later
Content of this Lecture
• Language Models
• Markov Models
• Data sparsity
• Case study: Dialect classification
• Most material from [MS99], Chapter 6
History and Applications
• Andrej Andrejewitsch Markov (1856-1922)
– Russian Mathematician
– Developed Markov Models (or Markov Chains) as a method for
analyzing language
– Markov, A. A. (1913). “Example of statistical investigations of the
text ‘Eugen Onegin’, illustrating the connection of events in a chain
(original in Russian).” Bulletin de l'Academie Imperiale des Sciences
de St.-Petersbourg: 153-162.
• Markov Models and Hidden Markov Models are popular in
– Language modeling, part-of-speech tagging
– Speech recognition
– Named entity recognition / information extraction
– Biological sequence analysis
Markov Models
• Definition
Assume an alphabet Σ. A Markov Model of order 1 is a
sequential stochastic process with |Σ| states s1, …, sn with
– Every state emits exactly one symbol from Σ
– No two states emit the same symbol
– For a sequence <w1,w2,…> of states, the following holds
p(wn=sn|wn-1=sn-1, wn-2=sn-2,…, w1=s1) = p(wn=sn|wn-1=sn-1)
– For order k: p(wn=sn|wn-1=sn-1,…, w1=s1) = p(wn=sn|wn-1=sn-1, …, wn-k=sn-k)
– ai,j =p(wn=sj|wn-1=si) are called transition probabilities
• Remarks
– In our case, we have Σ = {set of all words of a language}
– Computing start probabilities is an issue we essentially ignore
Visualization
• Since every state emits exactly one word, we can merge
states and words
• State transition graph
– Nodes are states (labeled with their emission)
– Arcs are transitions labeled with a non-zero probability
• Example
– “I go home”,
“I go shopping”,
“I am shopping”,
“I go abroad”,
“Go shopping”
[State transition graph: “I” → “go” (0.75), “I” → “am” (0.25); “go” → “home” (0.25), “go” → “shopping” (0.50), “go” → “abroad” (0.25); “am” → “shopping” (1)]
Probability of a Sequence of States (=a Sentence)
• Assume a Markov Model M of order 1 and a sequence S of
states with |S|=n
• With which probability was S generated by M, i.e., what is
the value of p(S|M)?
p(S|M) = p(w1=S[1]) * ∏i=2..n p(wi=S[i] | wi−1=S[i−1])
       = a0,S[1] * ∏i=2..n aS[i−1],S[i]
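The product over transition probabilities can be sketched directly (an illustrative sketch; the dictionaries encode the toy model from the previous slide, with the start probability of “I” taken as 1 as in the lecture's example):

```python
def sequence_probability(words, start_p, trans_p):
    """p(S|M) = p(w1=S[1]) * product of a[S[i-1],S[i]] for an order-1 model.
    Unseen transitions get probability 0."""
    p = start_p.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        p *= trans_p.get((prev, cur), 0.0)
    return p

start_p = {"I": 1.0}  # start probability as assumed on the slide
trans_p = {("I", "go"): 0.75, ("I", "am"): 0.25,
           ("go", "home"): 0.25, ("go", "shopping"): 0.5,
           ("go", "abroad"): 0.25, ("am", "shopping"): 1.0}
print(sequence_probability("I go home".split(), start_p, trans_p))  # 0.1875
```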
Example
[State transition graph: “I” → “go” (0.75), “I” → “am” (0.25); “go” → “home” (0.25), “go” → “shopping” (0.50), “go” → “abroad” (0.25); “am” → “shopping” (1)]
• p(“I go home”)
  = p(w1=“I”|w0) * p(w2=“go”|w1=“I”) * p(w3=“home”|w2=“go”)
  = 1 * 0.75 * 0.25 = 0.1875
• Problem: Pairs we have not seen in training get prob. 0
– Example: “I am abroad”
– With this small “corpus”, almost all transitions get p=0
Higher Order Markov Models
• Markov Models of order k
– The probability of being in state s after n steps depends on the k
predecessor states sn-1,…sn-k
p(wn=sn|wn-1=sn-1, wn-2=sn-2,…, w1=s1) = p(wn=sn|wn-1=sn-1, …, wn-k=sn-k)
• We can trivially transform any order k model M (k>1) into
a Markov Model of order 1 (M’)
– M’ has |Σ|^k states (all combinations of states of length k)
Predicting the Next State
• The problem of language modeling is a bit different
– And simpler
• We do not want to reason about an entire sequence, but
only about the next state, given some previous states
– We borrow terminology from Markov models, but not (yet) algs
p(wn) = p(wn | w1,…,wn−1)
      = p(wn | wn−1)
      = p(wn−1,wn) / p(wn−1)
      ~ p(wn−1,wn)

This is the most frequent bi-gram with prefix wn−1.
Problem
• We learn our transition probabilities from a limited sample
• Thus, we actually only estimate the true transition
probabilities
– Which we will never know, regardless how big our training corpus
• Introduces a systematic error which we can try to remove
– Sample selection is important
– Problem is researched a lot in statistics
• Extreme cases
– Transitions we do not see at all in the corpus
– These get a probability of 0 and will never be predicted
– This does not mean that they do not exist in the language
• Our model (yet) cannot adequately cope with data sparsity
Importance of Data Sparsity
• How many n-grams do exist?
– Assume a language of 20,000 words
– n=1: 2E4, n=2: 4E8, n=3: 8E12, n=4: 1.6E17, …
• In “normal” corpora over large alphabets (like natural
languages), almost all n-grams (n>4) are very sparse
– Exponential growth cannot be balanced by “use larger corpora”
– Of course, not all n-grams really exist, but still
– Especially rare n-grams are prone to be overlooked
• Trade-off
– Large n: More expressive model, but frequency estimations are bad
– Small n: Less expressive model, but better estimates
Solutions we will not Discuss in Detail
• Reduce the number of words using stemming
– Might help to go from n=3..4 to n=4…5
– But very important grammatical clues are lost
• More abstract: Use some form of “binning” of tokens into
classes and compute n-grams over token classes, not tokens
– All numbers -> one class
– All verbs -> one class (POS tags)
– All verbs related to “movement” -> one class
Content of this Lecture
• N-Gram models
• Data sparsity
• Case study: Dialect classification
Example
• Unigrams: Always the
most frequent word in
the corpus, does not
differentiate
• Bi-grams: Correct
word often ranks high,
but not always
• Tri-grams: Has a
perfect hit, but already
suffers from sparsity
• Four-grams: Unusable
• Corpus: Fraction of Jane Austen’s oeuvre, ~600,000 tokens
Statistical Estimators
• We were a bit sloppy so far
• We want
p(wn) = p(wn | w1,…,wn−1) = p(w1,…,wn) / p(w1,…,wn−1)
• But we only have
count ( w1 ,..., wn )
• So far, we always implicitly assumed
p(w1,…,wn) = count(w1,…,wn) / N
– N: the total number of observed n-grams
MLE for N-gram Models
• This is called a Maximum Likelihood Estimator (MLE)
• MLE gives maximum likelihood to the training data
– Provably, if n-grams are independent, which they are not
– Gives probability zero to all events not in the training data
– The probability mass is spent entirely on the training data
• Recall: Probabilities must sum up to 1
– Prone to overfitting
• Does not include background knowledge external to the
corpus
• We need to smooth the estimates to account for the
restrictedness of the sample
Smoothing I: Laplace‘s Law
• Give some probability mass to unseen events
• Oldest (and simplest) suggestion: “Adding 1”
pLAP(w1,…,wn) = (count(w1,…,wn) + 1) / (N + B)
– Where B is the number of possible n-grams, i.e., K^n for a vocabulary of K words
– All n-grams get a probability ≠ 0
• But: moves too much mass to the unknown
– Estimates for seen n-grams are scaled down dramatically
– Estimates for unseen n-grams are small, but there are so many
• And many of them are truly impossible
– In a corpus of 40 M words with K~400T, 99.7% of the total
probability mass is spent on unseen events
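The add-one estimate is a one-liner; the sketch below uses toy numbers for N and B (all names and figures are illustrative):

```python
def p_laplace(counts, ngram, N, B):
    """Add-one estimate: (count + 1) / (N + B).
    counts: observed n-gram frequencies, N: total observed n-grams,
    B: number of possible n-grams (K**n for a K-word vocabulary)."""
    return (counts.get(ngram, 0) + 1) / (N + B)

counts = {("we", "drink"): 3, ("drink", "tea"): 2}
N = sum(counts.values())   # 5 observed bigrams
B = 10 ** 2                # toy vocabulary of 10 words -> 100 possible bigrams
print(p_laplace(counts, ("drink", "beer"), N, B))  # unseen bigram gets 1/105
```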
Smoothing II: Lidstone‘s Law
• Laplace not suitable if there are many events, but few seen
• Lidstone’s law gives less probability mass to unseen events
pLID(w1,…,wn) = (count(w1,…,wn) + λ) / (N + λ*B)
– Small λ: More mass is given to seen events
– Typical estimate is λ=0.5
– Appropriate values can be learned (next slide)
• Still: Estimate of seen events is linear in the MLE estimate
– Not a good approximation of empirical distributions
• More advanced techniques: Good-Turing Estimator, n-gram
interpolations
Learning Appropriate Values for λ
• We “simulate” seen and unseen events
• Divide the corpus into two disjoint parts C1 and C2
• Count frequencies of n-grams in C1
• Count c: how many n-grams in C2 are not present in C1
• Set λ = c / B
– This estimates the probability that an n-gram which does exist in
the language (it occurs in C2) is nevertheless unseen in the training
part C1
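A minimal sketch of this held-out estimate, counting distinct n-gram types (an assumption; variants count occurrences, and the toy value of B is my own):

```python
def learn_lambda(c1_tokens, c2_tokens, n, B):
    """Estimate Lidstone's lambda as c/B: c counts the distinct n-grams of
    the held-out part C2 that never occur in the training part C1."""
    ngrams = lambda toks: {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    c = len(ngrams(c2_tokens) - ngrams(c1_tokens))
    return c / B

c1 = "we drink tea we drink tea".split()
c2 = "we drink beer".split()
print(learn_lambda(c1, c2, 2, B=100))  # only ("drink", "beer") is new: 0.01
```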
Option III: Back-Off Models
• If we cannot find an n-gram with count ≠ 0, use an (n−1)-gram
– Or an (n−2)-gram, …
• Thus, in case p(w1,…,wn) = 0, we use instead
– We “back off” to a simpler model
p(wn | w1,…,wn−1) = p(w1,…,wn)/p(w1,…,wn−1) ≈ p(w2,…,wn)/p(w2,…,wn−1) ≈ p(w3,…,wn)/p(w3,…,wn−1) ≈ …
– Stop at the first n-k gram with non-zero count
• Alternative: Always look at different n’s
– With different weights
p(wn) = λ1 * p(wn−2,wn−1,wn)/p(wn−2,wn−1) + λ2 * p(wn−1,wn)/p(wn−1) + λ3 * count(wn)/N
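The weighted combination can be sketched as follows; the λ weights here are assumed fixed values (in practice they would be tuned on held-out data), and all names are illustrative:

```python
from collections import Counter

def interp_prob(w, context, uni, bi, tri, N, lams=(0.6, 0.3, 0.1)):
    """p(w | context) as a weighted sum of tri-, bi-, and unigram estimates."""
    u, v = context                                     # the two preceding words
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p1 = uni[w] / N
    return lams[0] * p3 + lams[1] * p2 + lams[2] * p1

tokens = "we drink tea we drink tea we drink beer".split()
N = len(tokens)
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
# "tea" is a more likely continuation of "we drink" than "beer"
print(interp_prob("tea", ("we", "drink"), uni, bi, tri, N) >
      interp_prob("beer", ("we", "drink"), uni, bi, tri, N))  # True
```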
Content of this Lecture
• N-Gram models
• Data sparsity
• Case study on n-gram power: Dialect classification
– Kristian Basler, Yvonne Grunow, Ralph Vidal, Stephan Wagner,
Hagen Zahn: Seminararbeit, 2006/2007
How do they know?
Dialect Classification
• Let’s use N-Grams for something different
• Given a text, determine its human language
• If the set of possible languages is fixed, this is a
classification problem: Assign a text its class (language)
– More on classification: Later in this course
• General approach
– Gather example pages for each language
– Learn a model for each language
– Score a new page wrt. each language model
– The model that scores highest determines the class/language
• What type of models can we use?
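One possible model type, anticipating the n-gram approach below, is a smoothed character-trigram model per language; this sketch assumes made-up one-sentence training samples, a guessed value for B, and λ=0.5, so it illustrates the pipeline, not the study's actual setup:

```python
import math
from collections import Counter

def char_trigrams(text):
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def score(text, model, B=10000, lam=0.5):
    """Log-probability of a text under a character-trigram model,
    with Lidstone smoothing for unseen trigrams."""
    total = sum(model.values())
    return sum(math.log((model[g] + lam) / (total + lam * B))
               for g in char_trigrams(text).elements())

# Hypothetical training samples per language
models = {"en": char_trigrams("the cat sat on the mat " * 20),
          "de": char_trigrams("die katze sitzt auf der matte " * 20)}

text = "the dog sat on the rug"
print(max(models, key=lambda lang: score(text, models[lang])))  # en
```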
Dialects and Languages
• We make it more difficult: Classifying dialects
• “A language is a dialect with an army and a navy”
– Probably Max Weinreich, 1945
• No clear distinction
– German – Dutch: gradual change when approaching and crossing
the border
– “Serbisch”, “Kroatisch”, “Serbokroatisch” - how languages may
emerge
– Myth? In the Middle Ages, one could travel from Spain to Norway
without ever meeting two people within a one-day march that
would not understand each other [source unknown]
• Differences: Usually, dialects are regionally bound and
have no standard writing
Examples Over Time
Althochdeutsch (750–1050): Gilaubiu in got fater almachtigon scepphion himmilis enti erda
Mittelhochdeutsch (1050–1450): Ich geloube an got vater almechtigen schephaer himels unde der erde
Neuhochdeutsch (1450–): Ich glaube an Gott Vater den allmächtigen Schöpfer des Himmels und der Erde
German Dialects in the 20th Century
Methods and Data
• Data
– Corpus of 1 GB of web pages with approx. 60 billion tokens
– Contained <100 documents in any one of 40 “German” dialects
• Training
– Manually collected samples from the web
– Approximately 100 KB of Dialect text
– Same amount of Hochdeutsch
• Since it was very difficult to collect sufficient training
samples, the task was changed: classify a text as dialect or not
– But do not identify the concrete dialect itself
Dialects
• Nordniederdeutsch
• Niederfränkisch
• Westfälisch
• Ostfälisch
• Mecklenburgisch-Vorpommersch
• Mittelpommersch
• Ostpommersch
• Niederpreußisch
• Hochpreußisch
• Brandenburgisch
• Berlinisch
• Ripuarisch
• Moselfränkisch
• Zentralhessisch
• Südhessisch
• Nordhessisch
• Thüringisch
• Nord-Obersächsisch
• Obersächsisch
• Südmärkisch
• Schlesisch
• Rheinfränkisch
• Pennsylvania German
• Ostfränkisch
• Niederalemannisch
• Schwäbisch
• Hochalemannisch
• Höchstalemannisch / Walserdeutsch
• Nordbairisch
• Mittelbairisch
Examples
• Oberlausitzer
– Wu bei dr Spraa, de Arlnsträucher stiehn und Butterblum´n uff dr Wiese blühn, do steigt de Brücke über´s Woasser drüber, a schmoaler Waig gitt noa Windschsohland nüber.
• Alemannisch
– Dr Stadtdeil isch 1250 dur d'Fürste zue Fürsteberg gründet worre. Es sin diversi Nammensänderige erfolgt:
• Pommerisch - Plattdeutsch (Lied)
– Wo de Ostseewellen trecken an den Strand, wo de gele Ginster bläuht in'n Dünensand, wo de Möven schriegen grell in't Stormgebruus', dor is mine Heimat, dor bün ik tau Huus.
• Jiddisch
– Schpil ssche mir a tango ojs in jidisch, sol doss sojn missnagdisch oder chassidisch. As di bobele alejn sol kenen doss farschtejn un take a tenzele gejn.
• Schwäbisch
– Auf de' schwäb'sche Eisebahne Gibt's gar viele Haltstatione: Schtuagart, Ulm, and Biberach, Mekkabeure, Durlesbach.
• Kölsch
– ls kleine Krott küss de op de Welt, häs nix zo sage, häs kei Geld. Weiß nit richtig, wä de bes, do denks, verdammte Dress!
Dialect Classification
• VSM approach
– Hochdeutsch docs will have other features than dialect docs
• The sets of Hochdeutsch and dialect words have only a small overlap
• The classifier should learn how to quantify this difference
– Problem: Words we have not seen in the training data are ignored
for classification (large influence if the training data is sparse)
• N-gram approach: Use each character n-gram as a feature
(for n=2,3,4)
– Consider: “Rusalka je studentka hudz'by”
– But: “ls kleine Krott küss de op de Welt”
– Advantage: Fewer unseen character n-grams than unseen words
– Advantage: Far fewer features
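Extracting such features is straightforward; the sketch below collects character n-grams for n=2,3,4 into one bag-of-features vector (an illustrative sketch, not the study's actual code):

```python
from collections import Counter

def char_ngram_features(text, ns=(2, 3, 4)):
    """Collect all character n-grams of a text, for n = 2, 3, 4,
    into one feature vector (a bag of n-gram counts)."""
    feats = Counter()
    for n in ns:
        for i in range(len(text) - n + 1):
            feats[text[i:i + n]] += 1
    return feats

feats = char_ngram_features("ls kleine Krott")
print(feats["ott"])  # 1
```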
Results
• Of 391 dialect texts, 385 were identified correctly
– 5 FP, 1 FN
• N-Gram and BoW approach are equally good
– N-gram model is much faster and requires less space
• But: Even simpler methods have equal performance
– Simple word count: Consider the disjoint sets V (Hochdeutsch) and
W (dialect)
– Count for each doc the number of words from V (v) and W (w)
– Set score=w/(v+w)
– If score>threshold, classify text as dialect
– Runtime: 14 min (on 1GB) versus ~20 hours for SVM
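The word-count method can be sketched in a few lines; the two tiny vocabularies here are made-up stand-ins for the real V and W:

```python
def is_dialect(doc, V, W, threshold=0.5):
    """score = w / (v + w): v and w count the doc's tokens from the
    Hochdeutsch vocabulary V and the dialect vocabulary W (disjoint sets)."""
    tokens = doc.lower().split()
    v = sum(t in V for t in tokens)
    w = sum(t in W for t in tokens)
    score = w / (v + w) if v + w else 0.0
    return score > threshold

V = {"ich", "glaube", "an", "gott", "vater"}   # toy Hochdeutsch vocabulary
W = {"krott", "küss", "op", "häs", "nix"}      # toy dialect vocabulary
print(is_dialect("ls kleine Krott küss de op de Welt", V, W))  # True
```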
• Conclusion: Dialect classification is simple
