Ensemble learning on Portuguese POS Tagging

Presenter: Samuel Zeng
Date: 2011/08/31
Objective
• Investigate the performance of the SBCB algorithm and other ensemble learning methods on Portuguese part-of-speech tagging
• Propose a Portuguese part-of-speech tagger based on the SBCB algorithm

Agenda
• Introduction
• Methodology
• Experiment result
• Conclusion
Introduction – POS Tagging

What is part-of-speech tagging?
The process of assigning a part of speech to each word in a sentence, based on both its definition and its context.
Example
Input: He can can a can.
Output: He/pron can/aux can/vb a/det can/n ./punct
Computer perspective:
[Diagram: Training Data → Tagger; Text → Tagger (automatic tagging) → Tagged Text]
• A practical issue in natural language processing (NLP), especially in the development of a machine translation system.
• The performance of POS tagging can interfere with the subsequent analytical tasks in the translation process, and thereby affect translation quality.

Introduction – Classification

In machine learning and pattern recognition,
classification refers to an algorithmic procedure for
assigning a given piece of input data into one of a
given number of categories
[Diagram: Training → Classifier → Prediction]
Introduction – Ensemble learning

Ensemble learning employs multiple classifiers and
combines their prediction capabilities
[Diagram: System Input → Input → Classifier 1, Classifier 2, Classifier 3, …, Classifier n → Combine method → Output. Design aspects labelled in the figure: base classifier design, combine structure, method of combining.]
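As a rough illustration of combining the predictions of several classifiers, the sketch below uses plain majority voting. The three toy classifiers and the `(previous_word, word)` input format are hypothetical stand-ins, not the base classifiers used in this work.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(classifiers, x):
    """Ask every base classifier for a label and combine them by voting."""
    return majority_vote([clf(x) for clf in classifiers])

# Hypothetical toy classifiers; x is a (previous_word, word) pair.
def by_modal(x):
    return "vb" if x[0] in {"can", "will"} else "n"

def by_determiner(x):
    return "n" if x[0] in {"a", "the"} else "vb"

def always_noun(x):
    return "n"

classifiers = [by_modal, by_determiner, always_noun]
after_determiner = ensemble_predict(classifiers, ("a", "can"))
after_modal = ensemble_predict(classifiers, ("can", "can"))
print(after_determiner, after_modal)
```

Voting is only one possible combine method; weighted or stacked combinations fit the same structure.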
Introduction – SBCB learning algorithm
SBCB (Selecting Base Classifiers on Bagging)
• Equipped with an optimization process that selects the optimal classifiers

[Diagram: Dataset → D1, D2, D3, …, Dn → C1, C2, C3, …, Cn → Optimization process → C1, C2, C3, …, Cm → voting]
Optimization Process
• Input: original m classifiers
• The m classifiers predict the instances in the validation set
• Two filtering steps:
– Check accuracy: eliminate classifiers with low accuracy (< 0.5)
– Check diversity: iteratively eliminate the classifier that contributes least to the diversity of the set of classifiers
• Output: optimal n classifiers (n ≤ m)
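The two filtering steps can be sketched as follows. The slides do not state the diversity measure or the stopping rule, so this sketch assumes mean pairwise disagreement as the diversity measure and stops when no removal increases it (or when `min_size` classifiers remain); the function names and toy data are hypothetical.

```python
from itertools import combinations

def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def disagreement(a, b):
    return sum(x != y for x, y in zip(a, b)) / len(a)

def diversity(names, preds):
    """Mean pairwise disagreement (assumed measure); 0.0 for fewer than 2."""
    pairs = list(combinations(names, 2))
    if not pairs:
        return 0.0
    return sum(disagreement(preds[a], preds[b]) for a, b in pairs) / len(pairs)

def select_classifiers(preds, gold, min_accuracy=0.5, min_size=2):
    # Step 1: eliminate classifiers whose validation accuracy is below 0.5.
    kept = [n for n in preds if accuracy(preds[n], gold) >= min_accuracy]
    # Step 2: repeatedly drop the classifier contributing least to diversity,
    # i.e. the one whose removal raises diversity the most (assumed rule).
    while len(kept) > min_size:
        best_name, best_div = None, diversity(kept, preds)
        for n in kept:
            d = diversity([k for k in kept if k != n], preds)
            if d > best_div:
                best_name, best_div = n, d
        if best_name is None:
            break
        kept.remove(best_name)
    return kept

# Toy validation-set predictions from m = 4 hypothetical classifiers.
gold = ["N", "V", "D", "N", "V", "D"]
preds = {
    "c1": ["N", "V", "D", "N", "V", "N"],
    "c2": ["N", "V", "D", "N", "V", "N"],  # duplicates c1, adds no diversity
    "c3": ["N", "V", "D", "V", "V", "D"],
    "c4": ["D", "N", "N", "V", "D", "D"],  # accuracy 1/6, fails step 1
}
selected = select_classifiers(preds, gold)
print(selected)
```

Here c4 is removed by the accuracy check and one of the two identical classifiers by the diversity check, leaving n = 2 of the original m = 4.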
Methodology – Tagging idea
Tagging = Dictionary-based searching + Ambiguous-word classification
Dictionary-based searching (example dictionary):
Num  Word  POS  …
1    De    P    …
2    ..    ..   ..
Ambiguous-word classification: Classifier
Methodology – Tagging idea
Tagging process
Input: Alguma vez se havia de ver a vaidade sem lugar.
• Dictionary lookup (checking process): each unambiguous word is pre-assigned its tag; each ambiguous or out-of-vocabulary word is marked "?"
• Classifier: assigns a tag to each word marked "?"
Output: Alguma/Q vez/N se/SE havia/HV de/P ver/VB a/D vaidade/N sem/P lugar/N
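A minimal sketch of this two-stage process is shown below. The dictionary contents (including which words are ambiguous or missing) and the stand-in `toy_classifier` are hypothetical; in the actual tagger the second stage is the trained classifier.

```python
def tag_sentence(words, dictionary, classify):
    """dictionary maps word -> set of tags; classify(words, i) returns a tag."""
    tags = []
    for i, word in enumerate(words):
        candidates = dictionary.get(word, set())
        if len(candidates) == 1:
            # Unambiguous word: pre-assign the single dictionary tag.
            tags.append(next(iter(candidates)))
        else:
            # Ambiguous (several tags) or out-of-vocabulary (no entry).
            tags.append(classify(words, i))
    return tags

# Hypothetical toy dictionary: "se" and "a" ambiguous, "lugar" missing.
dictionary = {
    "Alguma": {"Q"}, "vez": {"N"}, "se": {"SE", "CONJS"}, "havia": {"HV"},
    "de": {"P"}, "ver": {"VB"}, "a": {"D", "P"}, "vaidade": {"N"},
    "sem": {"P"},
}

def toy_classifier(words, i):
    # Stand-in for the trained classifier (fixed answers, illustration only).
    return {"se": "SE", "a": "D", "lugar": "N"}[words[i]]

words = "Alguma vez se havia de ver a vaidade sem lugar".split()
tags = tag_sentence(words, dictionary, toy_classifier)
print(" ".join(f"{w}/{t}" for w, t in zip(words, tags)))
```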
Methodology – Implementation idea
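[Diagram: components and data flow. The Tagged source feeds the Dictionary component, which produces the Dictionary, and the Learner, which produces the Classifier; the Tagger uses the Dictionary and the Classifier to turn the Test source into a Tagged File.]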
Methodology – Dictionary
• Function:
– To identify, by searching the dictionary, which words in a sentence are ambiguous or unknown
– To pre-assign a POS tag to each unambiguous word
• The dictionary should be collected from a tagged corpus whose tags have been verified as correct.
• Entries of the dictionary:
id  word  pos
…   …     …
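One way such a dictionary might be collected from a tagged corpus is sketched below, assuming the `word/TAG` token format used in the examples. The second corpus sentence and its tags are made up purely to make "a" ambiguous, and `word_status` is a hypothetical helper name.

```python
from collections import defaultdict

def build_dictionary(tagged_sentences):
    """Map each word form to the set of tags it carries in the corpus."""
    dictionary = defaultdict(set)
    for sentence in tagged_sentences:
        for token in sentence.split():
            word, _, tag = token.rpartition("/")  # split on the last slash
            dictionary[word].add(tag)
    return dict(dictionary)

def word_status(word, dictionary):
    """Classify a word as unambiguous, ambiguous or unknown."""
    tags = dictionary.get(word)
    if not tags:
        return "unknown"
    return "unambiguous" if len(tags) == 1 else "ambiguous"

corpus = [
    "Alguma/Q vez/N se/SE havia/HV de/P ver/VB a/D vaidade/N sem/P lugar/N ./.",
    "Foi/VB a/P Lisboa/NPR ./.",  # made-up sentence and tags, illustration only
]
dictionary = build_dictionary(corpus)
for w in ("vez", "a", "tagger"):
    print(w, word_status(w, dictionary))
```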
Methodology – Classifier
• Objective: induce a classifier that assigns a tag to each ambiguous or unknown word in a sentence.
• Ambiguous word: has more than one possible POS; its actual POS depends on the context
• Unknown word: has no entry in the dictionary
• Classifier induction: what we need
– A feature-based dataset
– A classification algorithm
Methodology – Classifier
• Feature list
No.  Feature  Description
0    W0       Current word
1    W-1      First preceding word
2    W-2      Second preceding word
3    W1       First following word
4    P-1      POS of the first preceding word
5    P-2      POS of the second preceding word
6    P1       POS of the first following word
7    Suf3     Suffix (3 characters)
8    SL       Starts with a lower-case letter?
9    SU       Starts with an upper-case letter?
10   CC       Contains any capital letter?
11   AL       All lower case?
12   AU       All upper case?
13   CN       Contains a number?
14   CP       Contains a period?
15   CH       Contains a hyphen?
16   CO       Contains another symbol?
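The word-shape features (7 to 16) could be computed roughly as in the sketch below. The exact definitions are not given in the slides; in particular, treating CO as "any ASCII punctuation other than period and hyphen" is an assumption, and `orthographic_features` is a hypothetical name.

```python
import string

def orthographic_features(word):
    """Features 7-16 of the list, computed from the word form alone."""
    return {
        "Suf3": word[-3:],                       # last three characters
        "SL": word[:1].islower(),                # starts with lower case
        "SU": word[:1].isupper(),                # starts with upper case
        "CC": any(c.isupper() for c in word),    # contains a capital letter
        "AL": word.islower(),                    # all cased characters lower
        "AU": word.isupper(),                    # all cased characters upper
        "CN": any(c.isdigit() for c in word),    # contains a digit
        "CP": "." in word,                       # contains a period
        "CH": "-" in word,                       # contains a hyphen
        # Assumed meaning of "other symbol": punctuation besides "." and "-".
        "CO": any(c in string.punctuation and c not in ".-" for c in word),
    }

features = orthographic_features("Alguma")
print(features)
```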
Methodology – Classifier
Example of feature value extraction:
Alguma/Q vez/N se/SE havia/HV de/P ver/VB a/D vaidade/N sem/P lugar/N ./.
Dictionary checking:
Word     Tag
Alguma   Q
vez      N
se       SE
havia    HV
de       P
ver      VB
a        D
vaidade  N
sem      P
lugar    N
.        .
Additional candidate tags shown for the ambiguous words: CONJS, CL, P
Extracted instances:
Instance  W0  W-1  W-2     W1       …  CLASS
1         se  vez  Alguma  havia    …  SE
2         a   ver  de      vaidade  …  D
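The context features of the two instances can be reproduced with a short sketch like the one below. It reads the tags from the gold-tagged sentence, as would be possible when building training data; the `PAD` placeholder for positions outside the sentence and the function name are hypothetical.

```python
PAD = "<none>"  # hypothetical filler for positions outside the sentence

def context_features(words, tags, i):
    """Features 0-6: neighbouring words and POS tags around position i."""
    def at(seq, j):
        return seq[j] if 0 <= j < len(seq) else PAD
    return {
        "W0": words[i],
        "W-1": at(words, i - 1),
        "W-2": at(words, i - 2),
        "W1": at(words, i + 1),
        "P-1": at(tags, i - 1),
        "P-2": at(tags, i - 2),
        "P1": at(tags, i + 1),
    }

tagged = "Alguma/Q vez/N se/SE havia/HV de/P ver/VB a/D vaidade/N sem/P lugar/N ./."
pairs = [tok.rpartition("/") for tok in tagged.split()]
words = [w for w, _, _ in pairs]
tags = [t for _, _, t in pairs]

# Instances for the two ambiguous words "se" and "a"; CLASS is the gold tag.
instances = []
for i in (words.index("se"), words.index("a")):
    row = context_features(words, tags, i)
    row["CLASS"] = tags[i]
    instances.append(row)
    print(row)
```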
Experiment 1: Classifier evaluation
• Objective: investigate how effective the SBCB algorithm is at tagging the part of speech of ambiguous words
• Preparation:
– Ambiguous-word dataset (88,273 instances)
– Candidate classification algorithms (5)
Classifier   Accuracy (%)
C4.5         71.20
Naïve Bayes  70.81
AdaBoost     83.16
Bagging      83.02
SBCB         84.54
Experiment 2: Tagger evaluation
• Objective: focus on the performance of the entire tagger
• Data:
Type           Sentences  Tokens
Train dataset  31,318     638,437
Test dataset   4,095      84,029
• Result:
Scale  Correctly tagged  Accuracy
50%    68,895            81.99%
70%    73,609            87.60%
80%    74,702            88.90%
90%    77,340            92.04%
100%   77,609            92.36%
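The accuracy column is consistent with dividing the number of correctly tagged tokens by the 84,029 tokens of the test dataset. The quick check below assumes that definition; the variable names are illustrative.

```python
test_tokens = 84_029
correctly_tagged = {
    "50%": 68_895,
    "70%": 73_609,
    "80%": 74_702,
    "90%": 77_340,
    "100%": 77_609,
}

# Accuracy = correctly tagged tokens / tokens in the test dataset.
accuracy = {
    scale: f"{100 * n / test_tokens:.2f}%"
    for scale, n in correctly_tagged.items()
}
for scale, value in accuracy.items():
    print(scale, value)
```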
Conclusion
• Further evaluated the proposed SBCB algorithm through a practical application: the construction of a part-of-speech tagger for Portuguese.
• In the preliminary evaluation, our tagger achieves 92.36% accuracy on the Tycho corpus.

Application
• Online tagging
• Offline tagging
Components shown: Tagger, Dictionary, Learner, Experimenter
