Ensemble learning on Portuguese POS Tagging

Presenter: Samuel Zeng
Date: 2011/08/31
Objective
• Investigate the performance of the SBCB algorithm and other ensemble learning methods on Portuguese part-of-speech tagging
• Propose a Portuguese part-of-speech tagger based on the SBCB algorithm

Agenda
• Introduction
• Methodology
• Experiment result
• Conclusion
Introduction – POS Tagging

What is part-of-speech tagging?
The process of assigning a part of speech to each word in a sentence, based on both its definition and its context
Example
Input: He can can a can.
Output: He/pron can/aux can/vb a/det can/n ./punct

Computer perspective:
[Diagram: training data is used to build the tagger, which automatically tags input text to produce tagged text.]

• A practical issue in natural language processing (NLP), especially in the development of machine translation systems.
• The performance of POS tagging may interfere with the subsequent analytical tasks in the translation process and thereby affect translation quality.

Introduction – Classification

In machine learning and pattern recognition,
classification refers to an algorithmic procedure for
assigning a given piece of input data into one of a
given number of categories
[Diagram: training produces a classifier, which is then used for prediction.]
Introduction – Ensemble learning

Ensemble learning employs multiple classifiers and
combines their prediction capabilities
[Diagram: the system input is passed to base classifiers 1, 2, 3, …, n, and a combining method merges their predictions into the output. Labelled design aspects: base classifier design, combine structure, combine method.]
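One common way to combine base-classifier predictions is majority voting. The sketch below is only an illustration of that idea: the function name `majority_vote` and the tie-breaking rule are assumptions, not details taken from the slides.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most classifiers.

    Assumption: ties go to the tied label that appears first in the list.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Three hypothetical base classifiers vote on the tag of one word.
votes = ["D", "P", "D"]
winner = majority_vote(votes)
print(winner)
```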
Introduction – SBCB learning algorithm
SBCB (Selecting Base Classifiers on Bagging)
• Equipped with an optimization process that selects the optimal classifiers

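[Diagram: datasets D1, D2, D3, … are derived from the original dataset and used to train base classifiers C1, C2, C3, …; the optimization process selects a subset of these classifiers, whose predictions are combined by voting.]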
Optimization Process
• Input: original m classifiers
• The m classifiers predict the instances in the validation set
• Two filtering steps:
– Check accuracy: eliminate classifiers with low accuracy (< 0.5)
– Check diversity: iteratively eliminate the classifier that contributes least to the diversity of the set of classifiers
• Output: optimal n classifiers (n ≤ m)
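The two filtering steps can be sketched as follows. The slides do not state which diversity measure or stopping rule SBCB uses, so this sketch assumes mean pairwise disagreement as the diversity measure and a hypothetical `target_size` parameter as the stopping rule; all function names are illustrative.

```python
from itertools import combinations


def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


def disagreement(a, b):
    return sum(x != y for x, y in zip(a, b)) / len(a)


def diversity(names, preds):
    """Mean pairwise disagreement (an assumed diversity measure)."""
    pairs = list(combinations(names, 2))
    if not pairs:
        return 0.0
    return sum(disagreement(preds[a], preds[b]) for a, b in pairs) / len(pairs)


def select_classifiers(preds, truth, min_accuracy=0.5, target_size=2):
    # Step 1: drop classifiers whose validation accuracy is below the threshold.
    kept = [n for n in preds if accuracy(preds[n], truth) >= min_accuracy]
    # Step 2: repeatedly drop the classifier contributing least to diversity.
    # Assumption: stop once target_size classifiers remain.
    while len(kept) > target_size:
        total = diversity(kept, preds)

        def contribution(name):
            rest = [n for n in kept if n != name]
            return total - diversity(rest, preds)

        kept.remove(min(kept, key=contribution))
    return kept


# Hypothetical validation-set predictions from four classifiers.
truth = ["N", "D", "P", "N", "VB", "D"]
preds = {
    "C1": ["N", "D", "P", "N", "VB", "P"],
    "C2": ["N", "D", "P", "N", "VB", "P"],  # duplicates C1, adds no diversity
    "C3": ["N", "D", "N", "N", "VB", "D"],
    "C4": ["P", "P", "P", "D", "N", "N"],   # accuracy below 0.5
}
selected = select_classifiers(preds, truth)
print(selected)
```

In this toy run C4 fails the accuracy check, and one of the two identical classifiers is removed in the diversity step because dropping it does not reduce disagreement within the set.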
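Methodology – Tagging idea
Tagging = dictionary-based searching + ambiguous word classification
Dictionary-based searching uses a dictionary of the form:
Num   Word   POS   …
1     De     P     …
2     ..     ..    ..
Ambiguous word classification is performed by a classifier.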
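Methodology – Tagging idea
Tagging process
Input: Alguma vez se havia de ver a vaidade sem lugar.
• Dictionary lookup: each word is checked against the dictionary; ambiguous or out-of-vocabulary words are marked "?"
Alguma/Q vez/N se/? havia/HV de/P ver/VB a/? vaidade/N sem/P lugar/N
• Classifier: assigns a tag to each ambiguous or out-of-vocabulary word
Alguma/Q vez/N se/SE havia/HV de/P ver/VB a/D vaidade/N sem/P lugar/N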
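A minimal sketch of this two-stage process, assuming the dictionary maps each word to its set of possible tags. The toy dictionary entries (including the candidate tags given for "se" and "a") and the stub classifier are hypothetical stand-ins, not the presenter's data or model.

```python
def tag_sentence(words, dictionary, classify):
    """Tag unambiguous words by lookup; defer the rest to the classifier."""
    tagged = []
    for i, word in enumerate(words):
        candidates = dictionary.get(word, set())
        if len(candidates) == 1:
            tag = next(iter(candidates))   # unambiguous: take the only tag
        else:
            tag = classify(words, i)       # ambiguous or out-of-vocabulary
        tagged.append((word, tag))
    return tagged


# Hypothetical dictionary; "se" and "a" are given two candidate tags each.
toy_dictionary = {
    "Alguma": {"Q"}, "vez": {"N"}, "se": {"SE", "CONJS"}, "havia": {"HV"},
    "de": {"P"}, "ver": {"VB"}, "a": {"D", "P"}, "vaidade": {"N"},
    "sem": {"P"}, "lugar": {"N"}, ".": {"."},
}


def stub_classifier(words, i):
    # Stand-in for a trained classifier: fixed answers for this example only.
    return {"se": "SE", "a": "D"}.get(words[i], "N")


sentence = "Alguma vez se havia de ver a vaidade sem lugar .".split()
result = tag_sentence(sentence, toy_dictionary, stub_classifier)
print(" ".join(f"{w}/{t}" for w, t in result))
```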
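Methodology – Implementation idea
[Diagram: a tagged source feeds the dictionary component, which builds the dictionary, and the learner, which trains the classifier; the tagger uses the dictionary and the classifier to turn a test source into a tagged file.]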
Methodology – Dictionary
• Function:
– To identify which words in a sentence are ambiguous or unknown by searching the dictionary
– To pre-assign a POS to each unambiguous word
• The dictionary should be collected from a tagged corpus whose tags have been verified as correct.
• Entries of the dictionary:
id    word    pos
…     …       …
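A dictionary of this kind could be collected from a tagged corpus roughly as follows. This is a sketch that assumes the corpus uses the word/TAG format seen elsewhere in these slides; the function name and the second corpus fragment are made up for illustration.

```python
from collections import defaultdict


def build_dictionary(tagged_sentences):
    """Map each word to the set of tags it carries in the tagged corpus."""
    dictionary = defaultdict(set)
    for sentence in tagged_sentences:
        for token in sentence.split():
            # Split on the last "/" so that a token such as "./." still works.
            word, _, tag = token.rpartition("/")
            dictionary[word].add(tag)
    return dict(dictionary)


corpus = [
    "Alguma/Q vez/N se/SE havia/HV de/P ver/VB a/D vaidade/N sem/P lugar/N ./.",
    "a/P vez/N",  # hypothetical fragment, added only to create an ambiguity
]
dictionary = build_dictionary(corpus)
ambiguous = sorted(w for w, tags in dictionary.items() if len(tags) > 1)
print(ambiguous)
```

A word with exactly one tag can be pre-assigned that tag; a word with several tags, or one missing from the dictionary, is passed to the classifier.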
Methodology – Classifier
• Objective: induce a classifier that assigns a tag to each ambiguous word and unknown word in a sentence.
• Ambiguous word: has more than one possible POS, and its particular POS depends on the context
• Unknown word: not found in the dictionary
• Classifier induction, what we need:
– Feature-based dataset
– Classification algorithm
Methodology – Classifier
• Feature list
No.   Feature   Description
0     W0        Current word
1     W-1       First preceding word
2     W-2       Second preceding word
3     W1        First following word
4     P-1       POS of the first preceding word
5     P-2       POS of the second preceding word
6     P1        POS of the first following word
7     Suf3      Suffix (3 characters)
8     SL        Starts with a lower-case letter?
9     SU        Starts with an upper-case letter?
10    CC        Contains any capital letter?
11    AL        All lower case?
12    AU        All upper case?
13    CN        Contains a number?
14    CP        Contains a period?
15    CH        Contains a hyphen?
16    CO        Contains another symbol?
Methodology – Classifier
Example of feature value extraction:
Alguma/Q vez/N se/SE havia/HV de/P ver/VB a/D vaidade/N sem/P lugar/N ./.
Dictionary checking:
Word    Alguma   vez   se   havia   de   ver   a   vaidade   sem   lugar   .
POS     Q        N     SE   HV      P    VB    D   N         P     N       .
Additional candidate tags shown for the ambiguous words: CONJS, CL, P
Resulting instances:
Instance   W0   W-1   W-2      W1        …   CLASS
1          se   vez   Alguma   havia     …   SE
2          a    ver   de       vaidade   …   D
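The feature list can be turned into an extraction routine along these lines. This is a sketch under assumptions the slides do not settle: the "_" padding value at sentence boundaries, the "?" placeholder for neighbouring tags that are still unresolved, and the function name are all illustrative.

```python
def extract_features(words, tags, i):
    """Build the 17 listed features for the word at position i."""

    def at(seq, j):
        # Assumption: positions outside the sentence are padded with "_".
        return seq[j] if 0 <= j < len(seq) else "_"

    w = words[i]
    return {
        "W0": w, "W-1": at(words, i - 1), "W-2": at(words, i - 2),
        "W1": at(words, i + 1),
        "P-1": at(tags, i - 1), "P-2": at(tags, i - 2), "P1": at(tags, i + 1),
        "Suf3": w[-3:],
        "SL": w[:1].islower(),
        "SU": w[:1].isupper(),
        "CC": any(c.isupper() for c in w),
        "AL": w.islower(),
        "AU": w.isupper(),
        "CN": any(c.isdigit() for c in w),
        "CP": "." in w,
        "CH": "-" in w,
        "CO": any(not c.isalnum() and c not in ".-" for c in w),
    }


words = "Alguma vez se havia de ver a vaidade sem lugar .".split()
# Tags after dictionary checking; "?" marks the words left to the classifier.
tags = ["Q", "N", "?", "HV", "P", "VB", "?", "N", "P", "N", "."]

instance_1 = extract_features(words, tags, 2)  # "se"
instance_2 = extract_features(words, tags, 6)  # "a"
print(instance_1["W0"], instance_1["W-1"], instance_1["W-2"], instance_1["W1"])
print(instance_2["W0"], instance_2["W-1"], instance_2["W-2"], instance_2["W1"])
```

The word-window values for both instances match the rows of the instance table; the CLASS column would come from the gold tag in the training corpus.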
Experiment 1: Classifier evaluation
• Objective: investigate how effective the SBCB algorithm is at tagging the part of speech of ambiguous words
• Preparation:
– Ambiguous word dataset (88,273 instances)
– Candidate classification algorithms (5)
Classifier     Accuracy (%)
C4.5           71.20
Naïve Bayes    70.81
AdaBoost       83.16
Bagging        83.02
SBCB           84.54
Experiment 2: Tagger evaluation
• Objective: evaluate the performance of the entire tagger
• Data:
Type            Sentences   Tokens
Train dataset   31,318      638,437
Test dataset    4,095       84,029
• Result:
Scale   Correctly tagged   Accuracy
50%     68,895             81.99%
70%     73,609             87.60%
80%     74,702             88.90%
90%     77,340             92.04%
100%    77,609             92.36%
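The reported accuracies are consistent with dividing the number of correctly tagged tokens by the 84,029 tokens of the test dataset. A short check of that arithmetic (the function name is illustrative):

```python
TEST_TOKENS = 84_029

correctly_tagged = {
    "50%": 68_895,
    "70%": 73_609,
    "80%": 74_702,
    "90%": 77_340,
    "100%": 77_609,
}


def accuracy_pct(correct, total=TEST_TOKENS):
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100 * correct / total, 2)


for scale, correct in correctly_tagged.items():
    print(f"{scale}: {accuracy_pct(correct):.2f}%")
```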
Conclusion
• Further evaluated the proposed SBCB algorithm through a practical application: the construction of a part-of-speech tagger for Portuguese.
• In the preliminary evaluation, our tagger achieves 92.36% accuracy on the Tycho corpus.

Application
• Online Tagging
Tagger
Dictionary
Learner
Experimenter
• Offline Tagging
