Text Analytics
Language Modeling
Ulf Leser

Content of this Lecture
• Language Models
• Markov Models
• Data sparsity
• Case study: Dialect classification
• Most material from [MS99], Chapter 6

Problem
• Given a prefix of a sentence: predict the next word
  – "At 5 o'clock, we usually drink …"
  – "tea" – quite likely
  – "beer" – quite unlikely
  – "a beer" – slightly more likely, but still unlikely
  – "biscuits" – semantically wrong
  – "the windows need cleaning" – syntactically wrong
• Similar to Shannon's game: given a series of characters, predict the next one (used in communication theory)
• Abstract formulation: Given a language L and the prefix P[1..n] of a sequence S, S∈L: predict S[n+1]
• Usually, this is a ranking problem – there is no crisp solution

Applications
• Speech/character recognition
  – Given the transcribed prefix of a sentence – which words do we expect next?
• Automatic translation
  – Given the translated prefix of a sentence – which words do we expect next?
• T9 on mobiles
  – T9: use the prediction of the next word as a-priori (background) probability for enhancing interactive prediction
• Helps to disambiguate between different options
• Helps to make useful suggestions
• Helps to point to likely errors (observation ≠ expectation)

Language Models
• Usual approach: learn/use a model of the language
• Classical approach: grammars
  – Every language has a grammar (regular, context-free, …)
  – Grammars can be learned from examples
    • Not trivial, underdetermined, not covered here
  – But: usually, multiple continuations of a prefix are allowed
  – (Deterministic) grammars do not help in deciding which continuation is the most probable one
• Better: probabilistic grammars
  – Probabilistic automata: transitions have a relative frequency
  – Also called "sequential models"

N-Grams over Words
• A very simple, yet difficult to beat, statistical approach to language modeling is the n-gram model
  – "Indeed, it is difficult to beat a trigram model on the purely linear task of predicting the next word" [MS99]
• Definition: An n-gram (over words) is a sequence of n words.
• Count the frequencies of all n-grams in a corpus
  – Slide a window of size n over the text and keep a counter for each n-gram ever seen
• Given a prefix, predict the most probable continuation(s) based on these frequencies – how?

N-Grams for Language Modeling
• Assume a sentence prefix with n−1 words ⟨w1,…,wn−1⟩
• Look up the counts of all n-grams starting with ⟨w1,…,wn−1⟩, i.e., the n-grams ⟨w1,…,wn−1,wn⟩
• Choose the most frequent wn
• More formally: compute, for every possible wn,
$$p(w_n) = p(w_n \mid w_1,\dots,w_{n-1}) = \frac{p(w_1,\dots,w_n)}{p(w_1,\dots,w_{n-1})}$$
• Choose the wn which maximizes p(wn)
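A minimal sketch of this counting-and-lookup scheme (the toy corpus and function names are illustrative, not from the lecture):

```python
from collections import Counter

def train_ngrams(tokens, n=3):
    """Slide a window of size n over the token stream and count each n-gram."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict_next(counts, prefix):
    """Return the word w maximizing count(prefix + w), i.e. the w with the
    highest estimated p(w | prefix)."""
    candidates = Counter()
    for gram, c in counts.items():
        if gram[:-1] == tuple(prefix):
            candidates[gram[-1]] += c
    return candidates.most_common(1)[0][0] if candidates else None

tokens = "we drink tea . we drink tea . we drink beer .".split()
trigrams = train_ngrams(tokens, n=3)
print(predict_next(trigrams, ["we", "drink"]))  # -> 'tea' (seen twice vs. once)
```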
Stochastic Processes
• Which n should we choose?
• Consider language generation as a sequential stochastic process
• At each stage, the process generates a new word
  – Like a DFA, but transitions have probabilities
• Question: How big is the memory? How many previous words does the process use to determine the next step?
  – 0: Markov chain of order 0: no memory at all
  – 1: Markov chain of order 1: the next word depends only on the previous word
  – 2: Markov chain of order 2: the next word depends only on the 2 previous words
  – …

Which n?
• In language modeling, one usually chooses n=3–4
• That seems small, but most language effects are local
  – But not all: "Dan swallowed the large, shiny, red …" (Car? Pill? Strawberry?)
• Furthermore: we cannot obtain sensible relative frequencies for higher orders
  – Not enough training data – see later

Content of this Lecture
• Language Models
• Markov Models
• Data sparsity
• Case study: Dialect classification
• Most material from [MS99], Chapter 6

History and Applications
• Andrej Andrejewitsch Markov (1856–1922)
  – Russian mathematician
  – Developed Markov models (or Markov chains) as a method for analyzing language
  – Markov, A. A. (1913). "Example of a statistical investigation of the text 'Eugene Onegin' illustrating the dependence between events in a chain" (original in Russian). Bulletin de l'Académie Impériale des Sciences de St.-Pétersbourg: 153–162.
• Markov models and Hidden Markov Models are popular in
  – Language modeling, part-of-speech tagging
  – Speech recognition
  – Named entity recognition / information extraction
  – Biological sequence analysis

Markov Models
• Definition: Assume an alphabet Σ. A Markov model of order 1 is a sequential stochastic process with |Σ| states s1, …, sn such that
  – Every state emits exactly one symbol from Σ
  – No two states emit the same symbol
  – For a sequence ⟨w1,w2,…⟩ of states, the following holds:
$$p(w_n = s_n \mid w_{n-1} = s_{n-1},\dots,w_1 = s_1) = p(w_n = s_n \mid w_{n-1} = s_{n-1})$$
  – For order k, the right-hand side is instead $p(w_n = s_n \mid w_{n-1} = s_{n-1},\dots,w_{n-k} = s_{n-k})$
  – The $a_{i,j} = p(w_n = s_j \mid w_{n-1} = s_i)$ are called transition probabilities
• Remarks
  – In our case, Σ = {set of all words of a language}
  – Computing start probabilities is an issue we essentially ignore

Visualization
• Since every state emits exactly one word, we can merge states and words
• State transition graph
  – Nodes are states (labeled with their emission)
  – Arcs are transitions, labeled with a non-zero probability
• Example: "I go home", "I go shopping", "I am shopping", "I go abroad", "Go shopping"
  – [Figure: transition graph with arcs I→go (0.75), I→am (0.25), am→shopping (1), go→home (0.25), go→shopping (0.5), go→abroad (0.25)]

Probability of a Sequence of States (= a Sentence)
• Assume a Markov model M of order 1 and a sequence S of states with |S| = n
• With which probability was S generated by M, i.e., what is the value of p(S|M)?
$$p(S \mid M) = p(w_1 = S[1]) \cdot \prod_{i=2}^{n} p(w_i = S[i] \mid w_{i-1} = S[i-1]) = a_{0,S[1]} \cdot \prod_{i=2}^{n} a_{S[i-1],S[i]}$$

Example
• Using the transition graph from the previous slide:
$$p(\text{“I go home”}) = p(w_1 = \text{“I”}) \cdot p(w_2 = \text{“go”} \mid w_1 = \text{“I”}) \cdot p(w_3 = \text{“home”} \mid w_2 = \text{“go”}) = 1 \cdot 0.75 \cdot 0.25 = 0.1875$$
• Problem: pairs we have not seen in training get probability 0
  – Example: "I am abroad"
  – With this small "corpus", almost all transitions get p = 0
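The same computation as a small sketch; the artificial start state "^" with probability 1 for "I" is a simplification, in line with the lecture's decision to ignore start probabilities:

```python
# Transition probabilities of the toy model above; "^" is an artificial start state.
A = {
    ("^", "I"): 1.0,
    ("I", "go"): 0.75, ("I", "am"): 0.25,
    ("go", "home"): 0.25, ("go", "shopping"): 0.50, ("go", "abroad"): 0.25,
    ("am", "shopping"): 1.0,
}

def sequence_probability(sentence, A):
    """Multiply the transition probabilities along the sentence; transitions
    never seen in training contribute a factor of 0."""
    p, prev = 1.0, "^"
    for w in sentence.split():
        p *= A.get((prev, w), 0.0)
        prev = w
    return p

print(sequence_probability("I go home", A))    # 1 * 0.75 * 0.25 = 0.1875
print(sequence_probability("I am abroad", A))  # 0.0 -- ("am", "abroad") unseen
```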
Higher Order Markov Models
• Markov models of order k
  – The probability of being in state s after n steps depends on the k predecessor states sn−1, …, sn−k:
$$p(w_n = s_n \mid w_{n-1} = s_{n-1},\dots,w_1 = s_1) = p(w_n = s_n \mid w_{n-1} = s_{n-1},\dots,w_{n-k} = s_{n-k})$$
• We can trivially transform any model M of order k (k>1) into a Markov model M′ of order 1
  – M′ has |M|^k states (all combinations of k states)
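A sketch of this transformation for k=2, under the assumption that the model is stored as a dictionary of transition probabilities (names and layout are illustrative): each pair of states becomes one state of the order-1 model, and the probability carries over unchanged.

```python
def order2_to_order1(A2):
    """Rewrite order-2 transitions p(w3 | w1, w2) as order-1 transitions
    between pair states, (w1, w2) -> (w2, w3), with the same probability."""
    A1 = {}
    for (w1, w2, w3), p in A2.items():
        A1[((w1, w2), (w2, w3))] = p
    return A1

# Toy order-2 model: after "I go", "home" and "shopping" are equally likely.
A2 = {("I", "go", "home"): 0.5, ("I", "go", "shopping"): 0.5}
print(order2_to_order1(A2))
# {(('I','go'), ('go','home')): 0.5, (('I','go'), ('go','shopping')): 0.5}
```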
Predicting the Next State
• The problem of language modeling is a bit different – and simpler
• We do not want to reason about an entire sequence, but only about the next state, given some previous states
  – We borrow terminology from Markov models, but not (yet) algorithms
$$p(w_n) = p(w_n \mid w_1,\dots,w_{n-1}) = p(w_n \mid w_{n-1}) = \frac{p(w_{n-1},w_n)}{p(w_{n-1})} \sim p(w_{n-1},w_n)$$
• This is the most frequent bi-gram with prefix wn−1

Problem
• We learn our transition probabilities from a limited sample
• Thus, we actually only estimate the true transition probabilities
  – Which we will never know, regardless of how big our training corpus is
• This introduces a systematic error which we can try to remove
  – Sample selection is important
  – The problem is researched a lot in statistics
• Extreme cases
  – Transitions we do not see at all in the corpus get a probability of 0 and will never be predicted
  – This does not mean that they are non-existent in the language
• Our model cannot (yet) adequately cope with data sparsity

Importance of Data Sparsity
• How many n-grams exist?
  – Assume a language of 20,000 words
  – n=1: 20,000; n=2: 4E8; n=3: 8E12; n=4: 1.6E17; …
• In "normal" corpora over large alphabets (like natural languages), almost all n-grams (n>4) are very sparse
  – The exponential growth cannot be balanced by "use larger corpora"
  – Of course, not all n-grams really exist, but still
  – Especially rare n-grams are prone to be overlooked
• Trade-off
  – Large n: more expressive model, but frequency estimates are bad
  – Small n: less expressive model, but better estimates

Solutions we will not Discuss in Detail
• Reduce the number of words using stemming
  – Might help to go from n=3–4 to n=4–5
  – But very important grammatical clues are lost
• More abstract: use some form of "binning" of tokens into classes and compute n-grams over token classes, not tokens
  – All numbers → one class
  – All verbs → one class (POS tags)
  – All verbs related to "movement" → one class

Content of this Lecture
• N-gram models
• Data sparsity
• Case study: Dialect classification

Example
• Unigrams: always predict the most frequent word in the corpus; does not differentiate
• Bi-grams: the correct word often ranks high, but not always
• Tri-grams: has a perfect hit, but already suffers from sparsity
• Four-grams: unusable
• Corpus: a fraction of Jane Austen's oeuvre, ~600,000 tokens

Statistical Estimators
• We were a bit sloppy so far
• We want
$$p(w_n) = p(w_n \mid w_1,\dots,w_{n-1}) = \frac{p(w_1,\dots,w_n)}{p(w_1,\dots,w_{n-1})}$$
• But we only have count(w1,…,wn)
• So far, we always implicitly assumed
$$p(w_1,\dots,w_n) = \frac{count(w_1,\dots,w_n)}{N}$$
  – N: the number of all observed n-grams

MLE for N-gram Models
• This is called a Maximum Likelihood Estimator (MLE)
• The MLE gives maximum likelihood to the training data
  – Provably, if n-grams are independent, which they are not
  – Gives probability zero to all events not in the training data
  – The probability mass is spent entirely on the training data
    • Recall: probabilities must sum up to 1
  – Prone to overfitting
• Does not include background knowledge external to the corpus
• We need to smooth the estimates to account for the restrictedness of the sample

Smoothing I: Laplace's Law
• Give some probability mass to unseen events
• Oldest (and simplest) suggestion: "adding 1"
$$p_{LAP}(w_1,\dots,w_n) = \frac{count(w_1,\dots,w_n) + 1}{N + B}$$
  – Where B is the number of possible n-grams, i.e., K^n for a vocabulary of size K
  – All n-grams get a probability ≠ 0
• But: moves too much mass to the unknown
  – Estimates for seen n-grams are scaled down dramatically
  – Estimates for unseen n-grams are small, but there are very many of them
    • And many of them are truly impossible
  – In a corpus of 40M words with K ≈ 400,000, 99.7% of the total probability mass is spent on unseen events
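A minimal sketch of the add-one estimate; the counts and sizes below are toy values. Replacing the 1 by a smaller λ gives Lidstone's law (next slide).

```python
def p_laplace(ngram, counts, N, B):
    """Add-one estimate: (count + 1) / (N + B), where N is the number of
    observed n-grams and B the number of possible ones."""
    return (counts.get(ngram, 0) + 1) / (N + B)

counts = {("we", "drink", "tea"): 2, ("we", "drink", "beer"): 1}
N = sum(counts.values())  # observed n-grams
K, n = 20_000, 3          # vocabulary size and n-gram length (from the slides)
B = K ** n                # 8e12 possible trigrams
print(p_laplace(("we", "drink", "tea"), counts, N, B))   # seen: mass shrinks a lot
print(p_laplace(("we", "drink", "wine"), counts, N, B))  # unseen: small, but != 0
```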
Smoothing II: Lidstone's Law
• Laplace is not suitable if there are many possible events, but few seen ones
• Lidstone's law gives less probability mass to unseen events
$$p_{LID}(w_1,\dots,w_n) = \frac{count(w_1,\dots,w_n) + \lambda}{N + \lambda B}$$
  – Small λ: more mass is given to seen events
  – A typical estimate is λ = 0.5
  – Appropriate values can be learned (next slide)
• Still: the estimate for seen events is linear in the MLE estimate
  – Not a good approximation of empirical distributions
• More advanced techniques: Good-Turing estimator, n-gram interpolations

Learning Appropriate Values for λ
• We "simulate" seen and unseen events
• Divide the corpus into two disjoint parts C1 and C2
• Count the frequencies of n-grams in C1
• Count c: how many n-grams in C2 are not present in C1?
• Set λ = c/B
  – The probability of an n-gram (in C2) to be considered as non-existent although in reality it does exist

Option III: Back-Off Models
• If we cannot find an n-gram with count ≠ 0, use an (n−1)-gram
  – Or an (n−2)-gram, …
• Thus, in case p(w1,…,wn) = 0, we "back off" to a simpler model:
$$p(w_n \mid w_1,\dots,w_{n-1}) = \frac{p(w_1,\dots,w_n)}{p(w_1,\dots,w_{n-1})} \approx \frac{p(w_2,\dots,w_n)}{p(w_2,\dots,w_{n-1})} \approx \frac{p(w_3,\dots,w_n)}{p(w_3,\dots,w_{n-1})} \approx \dots$$
  – Stop at the first (n−k)-gram with a non-zero count
• Alternative: always look at the different n's, with different weights (interpolation)
$$p(w_n) = \lambda_1 \frac{p(w_{n-2},w_{n-1},w_n)}{p(w_{n-2},w_{n-1})} + \lambda_2 \frac{p(w_{n-1},w_n)}{p(w_{n-1})} + \lambda_3 \frac{count(w_n)}{N}$$
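A sketch of the back-off rule, assuming one dictionary that holds counts for all n-gram lengths (names are illustrative; practical back-off models such as Katz's also discount the estimates, which is omitted here):

```python
def backoff_prob(gram, counts):
    """Estimate p(w_n | context) from the longest context suffix whose
    extended n-gram has a non-zero count; otherwise drop the left-most word."""
    while len(gram) > 1:
        num = counts.get(gram, 0)       # count(w_1, ..., w_n)
        den = counts.get(gram[:-1], 0)  # count(w_1, ..., w_{n-1})
        if num > 0 and den > 0:
            return num / den
        gram = gram[1:]                 # back off to the (n-1)-gram
    return 0.0  # a real model would fall back to a smoothed unigram here

counts = {
    ("drink", "tea"): 3, ("drink",): 4,  # the bigram level is populated ...
    # ... but the trigram ("usually", "drink", "tea") was never seen
}
print(backoff_prob(("usually", "drink", "tea"), counts))  # -> 0.75 = p(tea|drink)
```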
Content of this Lecture
• N-gram models
• Data sparsity
• Case study on n-gram power: Dialect classification
  – Kristian Basler, Yvonne Grunow, Ralph Vidal, Stephan Wagner, Hagen Zahn: Seminararbeit, 2006/2007

How do they know?
• [Figure]

Dialect Classification
• Let's use n-grams for something different
• Given a text, determine its human language
• If the set of possible languages is fixed, this is a classification problem: assign a text its class (language)
  – More on classification: later in this course
• General approach
  – Gather example pages for each language
  – Learn a model for each language
  – Score a new page wrt. each language model
  – The model that scores highest determines the class/language
• What type of models can we use?

Dialects and Languages
• We make it more difficult: classifying dialects
• "A language is a dialect with an army and a navy"
  – Probably Max Weinreich, 1945
• There is no clear distinction
  – German – Dutch: gradual change when approaching and crossing the border
  – "Serbian", "Croatian", "Serbo-Croatian" – how languages may emerge
  – Myth? In the Middle Ages, one could travel from Spain to Norway without ever meeting two people within a one-day march who would not understand each other [source unknown]
• Differences: usually, dialects are regionally bound and have no standard written form

Examples Over Time
• Althochdeutsch (750–1050): "Gilaubiu in got fater almachtigon scepphion himmilis enti erda"
• Mittelhochdeutsch (1050–1450): "Ich geloube an got vater almechtigen schephaer himels unde der erde"
• Neuhochdeutsch (1450–): "Ich glaube an Gott Vater den allmächtigen Schöpfer des Himmels und der Erde"
  – ("I believe in God the Father, the almighty creator of heaven and earth")

German Dialects in the 20th Century
• [Figure: dialect map]

Methods and Data
• Data
  – Corpus of 1 GB of web pages with approx. 60 billion tokens
  – Contained <100 documents in any one of 40 "German" dialects
• Training
  – Manually collected samples from the web
  – Approximately 100 KB of dialect text
  – The same amount of Hochdeutsch (standard German)
• Since it was very difficult to collect sufficient training samples, the task was changed: classify a text as dialect or not
  – But do not identify the concrete dialect itself

Dialects
• Nordniederdeutsch, Niederfränkisch, Westfälisch, Ostfälisch, Mecklenburgisch-Vorpommersch, Mittelpommersch, Ostpommersch, Niederpreußisch, Hochpreußisch, Brandenburgisch, Berlinisch, Ripuarisch, Moselfränkisch, Zentralhessisch, Südhessisch, Nordhessisch, Thüringisch, Nord-Obersächsisch, Obersächsisch, Südmärkisch, Schlesisch, Rheinfränkisch, Pennsylvania German, Ostfränkisch, Niederalemannisch, Schwäbisch, Hochalemannisch, Höchstalemannisch / Walserdeutsch, Nordbairisch, Mittelbairisch

Examples
• Oberlausitzer: "Wu bei dr Spraa, de Arlnsträucher stiehn und Butterblum'n uff dr Wiese blühn, do steigt de Brücke über's Woasser drüber, a schmoaler Waig gitt noa Windschsohland nüber."
• Alemannisch: "Dr Stadtdeil isch 1250 dur d'Fürste zue Fürsteberg gründet worre. Es sin diversi Nammensänderige erfolgt."
• Pommerisch – Plattdeutsch (song): "Wo de Ostseewellen trecken an den Strand, wo de gele Ginster bläuht in'n Dünensand, wo de Möven schriegen grell in't Stormgebruus', dor is mine Heimat, dor bün ik tau Huus."
• Jiddisch: "Schpil ssche mir a tango ojs in jidisch, sol doss sojn missnagdisch oder chassidisch. As di bobele alejn sol kenen doss farschtejn un take a tenzele gejn."
• Schwäbisch: "Auf de' schwäb'sche Eisebahne gibt's gar viele Haltstatione: Schtuagart, Ulm und Biberach, Mekkabeure, Durlesbach."
• Kölsch: "Als kleine Krott küss de op de Welt, häs nix zo sage, häs kei Geld. Weiß nit richtig, wä de bes, do denks, verdammte Dress!"
Dialect Classification
• VSM approach
  – Hochdeutsch documents will have other features than dialect documents
    • The sets of Hochdeutsch words and dialect words have only a small overlap
    • The classifier should learn how to quantify this difference
  – Problem: words we have not seen in the training data are ignored for classification (a large influence if the training data is sparse)
• N-gram approach: use each character n-gram as a feature (for n=2,3,4)
  – Consider: "Rusalka je studentka hudz'by"
  – But: "Als kleine Krott küss de op de Welt"
  – Advantage: fewer unseen character n-grams than unseen words
  – Advantage: far fewer features

Results
• [Figure]

Results
• Of 391 dialect texts, 385 were identified correctly
  – 5 FP, 1 FN
• The n-gram and BoW approaches are equally good
  – The n-gram model is much faster and requires less space
• But: even simpler methods achieve equal performance – a simple word count (sketched below)
  – Consider the disjoint sets V (Hochdeutsch) and W (dialect)
  – Count for each document the number of words from V (v) and from W (w)
  – Set score = w/(v+w)
  – If score > threshold, classify the text as dialect
  – Runtime: 14 min (on 1 GB) versus ~20 hours for the SVM
• Conclusion: dialect classification is simple
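A minimal sketch of this word-count baseline; the two mini-vocabularies below are hypothetical stand-ins for the real V and W, and the threshold is a free parameter:

```python
def dialect_score(text, V, W):
    """score = w / (v + w): v and w count tokens from the Hochdeutsch
    vocabulary V and the dialect vocabulary W (V and W are disjoint)."""
    tokens = text.lower().split()
    v = sum(t in V for t in tokens)
    w = sum(t in W for t in tokens)
    return w / (v + w) if v + w else 0.0

# Hypothetical mini-vocabularies for illustration only.
V = {"ich", "glaube", "an", "gott", "und"}  # Hochdeutsch
W = {"isch", "worre", "häs", "nix", "bün"}  # dialect markers
text = "häs nix zo sage, häs kei Geld"
print(dialect_score(text, V, W) > 0.5)  # True -> classify as dialect
```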