Ensemble learning on Portuguese POS Tagging
Presenter: Samuel Zeng
Date: 2011/08/31

Objective
• Investigate the performance of the SBCB algorithm and other ensemble learning methods on Portuguese part-of-speech tagging
• Propose a Portuguese part-of-speech tagger based on the SBCB algorithm

Agenda
• Introduction
• Methodology
• Experiment results
• Conclusion

Introduction – POS Tagging
• What is part-of-speech tagging?
  – The process of assigning a part of speech to each word in a sentence, based on both its definition and its context
• Example
  – Input: He can can a can.
  – Output: He/pron can/aux can/vb a/det can/n ./puct
• Computer perspective
  – Training data → Tagger
  – Text → automatic tagging → Tagged text
• A practical issue in natural language processing (NLP), especially in the development of machine translation systems
  – Errors in POS tagging can interfere with the subsequent analysis steps of the translation process and thereby affect translation quality.
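To make the "He can can a can." example concrete, here is a minimal sketch in Python that reads the word/tag output format and collects the tags seen for each word form. The function names parse_tagged and tags_per_word are illustrative assumptions, not part of the presented system.

```python
from collections import defaultdict


def parse_tagged(text):
    """Split whitespace-separated 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so a word containing '/' survives.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def tags_per_word(pairs):
    """Collect the set of tags observed for each lower-cased word form."""
    seen = defaultdict(set)
    for word, tag in pairs:
        seen[word.lower()].add(tag)
    return dict(seen)


# The tagged output from the slide example.
tagged = "He/pron can/aux can/vb a/det can/n ./puct"
pairs = parse_tagged(tagged)
observed = tags_per_word(pairs)

# "can" appears with three different tags, so its tag depends on context.
print(sorted(observed["can"]))
```

The same word form carrying several tags is exactly what makes context necessary for tagging.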
Introduction – Classification
• In machine learning and pattern recognition, classification is an algorithmic procedure that assigns a given piece of input data to one of a given number of categories
• Training → Classifier → Prediction

Introduction – Ensemble learning
• Ensemble learning employs multiple classifiers and combines their predictions
• Diagram: Input → Classifier 1, Classifier 2, Classifier 3, …, Classifier n → combination method → Output
• Design aspects labelled in the diagram: combination structure, combination method, base classifier design, system input

Introduction – SBCB learning algorithm
• SBCB (Selecting Base Classifiers on Bagging)
• Equipped with an optimization process that selects the optimal classifiers
• Diagram: Dataset → D1, D2, D3, …, Dn → C1, C2, C3, …, Cn → optimization process → C1, C2, C3, …, Cm → voting

Optimization Process
• Input: the original m classifiers
• The m classifiers predict the instances in the validation set
• Two filtering steps:
  – Check accuracy: eliminate classifiers with low accuracy (< 0.5)
  – Check diversity: iteratively eliminate the classifier that contributes least to the diversity of the set of classifiers
• Output: the optimal n classifiers (n ≤ m)

Methodology – Tagging idea
• Tagging = dictionary-based lookup + classification of ambiguous words
• Dictionary (example):

    Num  Word  POS  …
    1    De    P    …
    2    ..    ..   ..

• Tagging process
  – Input: Alguma vez se havia de ver a vaidade sem lugar.
  – Dictionary lookup (checking process): Alguma/Q vez/N se/SE havia/? de/P ver/VB a/D vaidade/N sem/? lugar/N
  – Classifier: assigns a tag to each ambiguous or out-of-vocabulary word (marked ?)
  – Output: Alguma/Q vez/N se/SE havia/HV de/P ver/VB a/D vaidade/N sem/P lugar/N

Methodology – Implementation idea
• Diagram components: Tagged source, Dictionary component, Dictionary, Learner, Classifier, Tagger, Test source, Tagged file

Methodology – Dictionary
• Function:
  – To check which words in a sentence are ambiguous or unknown by searching the dictionary
  – To pre-assign a POS to each unambiguous word
• The dictionary should be collected from a tagged corpus whose tags have been verified as correct.
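The dictionary function described above can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the names build_dictionary and dictionary_check are hypothetical, and the tiny corpus is a toy made up for the example (it is not taken from the Tycho corpus and does not reproduce the slide figure).

```python
from collections import defaultdict


def build_dictionary(tagged_sentences):
    """Map each word form to the set of POS tags seen in a tagged corpus."""
    dictionary = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            dictionary[word].add(tag)
    return dict(dictionary)


def dictionary_check(words, dictionary):
    """Pre-assign a tag to unambiguous words; return None for the rest."""
    result = []
    for word in words:
        tags = dictionary.get(word, set())
        if len(tags) == 1:
            result.append(next(iter(tags)))  # exactly one known tag
        else:
            result.append(None)  # ambiguous (>1 tag) or unknown (no entry)
    return result


# Toy corpus (assumption): "a" is seen with three tags, "havia" never appears.
toy_corpus = [
    [("Alguma", "Q"), ("vez", "N"), ("se", "SE"), ("de", "P"), ("ver", "VB"),
     ("a", "D"), ("vaidade", "N"), ("sem", "P"), ("lugar", "N")],
    [("a", "CL"), ("a", "P")],
]
dictionary = build_dictionary(toy_corpus)

words = "Alguma vez se havia de ver a vaidade sem lugar".split()
pre_tags = dictionary_check(words, dictionary)
print(list(zip(words, pre_tags)))
```

Positions left as None are the ones that would be handed to the classifier; everything else is decided by lookup alone.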
• Entries of the dictionary:

    id  word  pos
    …   …     …

Methodology – Classifier
• Objective: induce a classifier that assigns a tag to each ambiguous or unknown word in a sentence.
• Ambiguous word: has more than one possible POS, and its actual POS depends on the context
• Unknown word: has no entry in the dictionary
• Classifier induction requires:
  – A feature-based dataset
  – A classification algorithm

Methodology – Classifier
• Feature list

    No.  Feature  Description
    0    W0       Local word
    1    W-1      Previous 1st word
    2    W-2      Previous 2nd word
    3    W1       Next 1st word
    4    P-1      POS of previous 1st word
    5    P-2      POS of previous 2nd word
    6    P1       POS of next 1st word
    7    Suf3     Suffix (3)
    8    SL       Starts with lower case?
    9    SU       Starts with upper case?
    10   CC       Contains any capital letter?
    11   AL       All lower case?
    12   AU       All upper case?
    13   CN       Contains number?
    14   CP       Contains period?
    15   CH       Contains hyphen?
    16   CO       Contains other symbol?

Methodology – Classifier
• Example of feature value extraction:
  – Tagged sentence: Alguma/Q vez/N se/SE havia/HV de/P ver/VB a/D vaidade/N sem/P lugar/N ./.
  – Dictionary checking (figure): each word is listed with its tag (Q N SE HV P VB D N P N .); the additional candidate tags CONJS, CL and P are shown for the ambiguous words
  – Resulting instances:

    Instance  W0  W-1  W-2     W1       …  CLASS
    1         se  vez  Alguma  havia    …  SE
    2         a   ver  de      vaidade  …  D

Experiment 1: Classifier evaluation
• Objective: investigate how effective the SBCB algorithm is at tagging the part of speech of ambiguous words
• Preparation:
  – Ambiguous word dataset (88,273 instances)
  – Candidate classifiers (5)
• Result:

    Classifier   Accuracy (%)
    C4.5         71.20
    Naïve Bayes  70.81
    Adaboost     83.16
    Bagging      83.02
    SBCB         84.54

Experiment 2: Tagger evaluation
• Objective: focus on the performance of the entire tagger
• Data:

    Type           Sentences  Tokens
    Train dataset  31,318     638,437
    Test dataset   4,095      84,029

• Result:

    Scale             50%     70%     80%     90%     100%
    Correctly tagged  68895   73609   74702   77340   77609
    Accuracy          81.99%  87.60%  88.90%  92.04%  92.36%

Conclusion
• Further evaluated the proposed SBCB algorithm through a practical application: the construction of a part-of-speech tagger for Portuguese.
• According to the preliminary evaluation results, our tagger achieves 92.36% accuracy on the Tycho corpus.

Application
• Components: Online Tagging, Offline Tagging, Tagger, Dictionary, Learner, Experimenter
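As a supplementary illustration of the 17-entry feature list in the Methodology – Classifier slides, the following Python sketch builds one instance for a word at a given position. The function name extract_features, the "_" padding value for positions outside the sentence or for neighbours without a pre-assigned tag, and the exact reading of each yes/no feature are assumptions; the slides do not specify them.

```python
def extract_features(words, tags, i):
    """Build the 17 features (W0 ... CO) for the non-empty word at position i.

    tags holds the POS pre-assigned by the dictionary for each word,
    or None where the dictionary could not decide.
    """
    def word_at(j):
        # "_" marks a position outside the sentence (padding is an assumption).
        return words[j] if 0 <= j < len(words) else "_"

    def tag_at(j):
        # "_" also marks a neighbour whose tag is not yet known (assumption).
        if 0 <= j < len(tags) and tags[j] is not None:
            return tags[j]
        return "_"

    w = words[i]
    return {
        "W0": w,
        "W-1": word_at(i - 1),
        "W-2": word_at(i - 2),
        "W1": word_at(i + 1),
        "P-1": tag_at(i - 1),
        "P-2": tag_at(i - 2),
        "P1": tag_at(i + 1),
        "Suf3": w[-3:],                           # last three characters
        "SL": w[0].islower(),
        "SU": w[0].isupper(),
        "CC": any(c.isupper() for c in w),
        "AL": w.islower(),                        # all cased characters are lower case
        "AU": w.isupper(),                        # all cased characters are upper case
        "CN": any(c.isdigit() for c in w),
        "CP": "." in w,
        "CH": "-" in w,
        "CO": any(not c.isalnum() and c not in ".-" for c in w),
    }


# Sentence from the slides; None marks the two words used as instances there.
words = "Alguma vez se havia de ver a vaidade sem lugar .".split()
tags = ["Q", "N", None, "HV", "P", "VB", None, "N", "P", "N", "."]

instance_1 = extract_features(words, tags, 2)   # the word "se"
instance_2 = extract_features(words, tags, 6)   # the word "a"
print(instance_1["W0"], instance_1["W-1"], instance_1["W-2"], instance_1["W1"])
```

For positions 2 and 6 the four word features match the two instances shown in the feature value extraction example; the class label (SE, D) would come from the tagged training corpus.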