Mining Comparative Sentences from Social Media Text
Mining Comparative Sentences from Social Media Text
Fabíola S. F. Pereira* and Sandra de Amo
*[email protected]
DMNLP'15, ECML/PKDD 2015 Workshop, Porto, Portugal

Motivation
What should I buy?

Motivation
"Xbox is better than PS4!!"
"PS4 graphics are the best ever"
"Between Xbox and PS4, I prefer WiiU :)"
Texts on social media are a good source of users' preferences: comparative opinions.

Outline
✓ Comparative Opinions
✓ Comparative Sentences Mining Techniques
✓ Datasets
✓ Experiments
✓ Conclusion

Comparative Opinions [Jindal 2006]
Regular opinion: "The graphics quality of this video game is amazing!"
Comparative opinion: "The graphics quality of PS4 is better than that of XBox." / "I prefer Coke to Pepsi."
✓ Different semantic meanings and syntactic forms
✓ Appropriate for establishing an order relation between two entities
✓ Preference mining

Comparative Opinions [Jindal 2006]: feature types and scope
✓ Comparative (in scope): "Wii U games are better than XBox games"
✓ Superlative (in scope): "PS4 is the best ever!"
✓ Opinionated (in scope): "The graphics quality of PS4 is better than that of XBox"
✓ Not-opinionated (in scope): "PS4 is larger than XBox"
✓ Gradable comparisons, which induce an order (in scope): superlative, equative and non-equal gradable comparisons
✗ Non-gradable comparisons (out of scope): "XBox consoles come with movement sensors, but PS4 do not"

Comparative Sentences Mining Techniques
1. N-grams classification (baseline)
2. Classification based on Class Sequential Rules (CSRs) patterns [Jindal and Liu 2006]
3. Genetic algorithm for CSRs mining (proposed technique)

N-grams Classification
Example feature matrix (TF-IDF values as on the slide):

    Sentence     not   lik   movie  poor  quality  horrible  cast  class
    Sentence 1   0.10  0.69  2      0.10  0.53     0.02      0.3   non-comparative
    Sentence 2   0.40  0.0   2      0.17  0.93     0.01      1     non-comparative
    Sentence 3   0.10  0.69  2      0.10  0.53     0.02      0.3   comparative

Classifiers: SVM, Naïve Bayes, MLP, RBF

Pre-processing steps:
✓ Stop-word removal
✓ Stemming
✓ TF-IDF weighting
✓ 1000 features selected with the Information Gain index
✓ Unigrams

(a) Parameters for the n-grams approach:
    Unigrams: Train/Test 10-fold cross-validation; Pre-process: stop words, stemming, infogain; # Features 1000
    MLP-Unigram / RBF-Unigram: Momentum 0.8; Learning rate 0.6; # neurons in hidden layer 15 / 2; # hidden layers 1 / 1
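The unigram baseline above is run in Weka on the slides; as a language-neutral illustration, here is a minimal pure-Python sketch of the TF-IDF weighting step. The toy token lists are hypothetical examples, not taken from the datasets.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document,
    using term frequency normalized by document length and idf = ln(N / df)."""
    n_docs = len(docs)
    df = Counter()                      # number of documents each term occurs in
    for doc in docs:
        for term in set(doc):
            df[term] += 1
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: (tf[term] / len(doc)) * math.log(n_docs / df[term])
                        for term in tf})
    return vectors

# Hypothetical, already stemmed and stopword-filtered sentences
docs = [["xbox", "better", "ps4"],
        ["ps4", "graphic", "best"]]
vecs = tfidf(docs)
```

In the slides this representation is reduced to the 1000 unigrams with the highest Information Gain before being fed to the SVM, Naïve Bayes, MLP or RBF classifiers.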
Class Sequential Rules [Jindal 2006]
Definition. Consider the item set I = {1, 2, 3, 4, 5, 6, 7}. The sequence s1 = <{3}{4,5}> is contained in s2 = <{6}{3,7}{4,5,6}>, while s3 = <{6,4}> is not contained in s2. The input database D for mining is a set of pairs D = {(s1, y1), ..., (sn, yn)}, where si is a sequence and yi ∈ Y is a class label; in this work, Y = {comparative, non-comparative}. A Class Sequential Rule (CSR) is an implication X → y, where X is a sequence and y ∈ Y. An instance (si, yi) covers a CSR if X is a subsequence of si, and satisfies it if, in addition, yi = y. The support of a rule is the fraction of instances in the database that satisfy it; the confidence is the proportion of the instances covering the rule that also satisfy it.

Example database (5 sequences, 2 classes):
    Sequence              Class
    <{1}{3}{5}{7,8,9}>    c1
    <{1}{3}{6}{7,8}>      c1
    <{1,6}{9}>            c2
    <{3}{5,6}>            c2
    <{1}{3}{4}{7,8}>      c2

With minimum support 20% and minimum confidence 40%, one of the mined rules is:
    <{1}{3}{7,8}> → c1  [support = 2/5, confidence = 2/3]

To mine CSRs, it suffices to apply sequential pattern mining algorithms; PrefixSpan [6] is used here.

Class Sequential Rules [Jindal 2006]: Mining CSR Patterns
Building a CSRs database:
"this camera has significantly more noise at iso 100 than the nikon 4500."
this/DT camera/NN has/VBZ significantly/RB more/JJR noise/NN at/IN iso/NN 100/CD than/IN the/DT nikon/NN 4500/CD
→ <{NN}{VBZ}{RB}{moreJJR}{NN}{IN}{NN}> ⇒ comparative

Mining CSR patterns from the database (frequent patterns found with PrefixSpan):
    Sequence                                Class
    <{NN}{VBZ}{RB}{moreJJR}{NN}{IN}{NN}>    comparative
    <{NN}{IN}{NN}>                          non_comparative
    <{PRP}{bestJJS}{NN}>                    comparative
    <{NN}{VBZ}{RB}{NN}{IN}{NN}>             non_comparative

Classifying based on frequent patterns (one binary feature per pattern):
    Sentence     <{NN}{VBZ}{RB}{moreJJR}{NN}{IN}{NN}>  <{NN}{betterJJR}{IN}{NN}>  <{PRP}{bestJJS}{NN}>  class
    Sentence 1   1                                     0                          1                     non-comparative
    Sentence 2   1                                     1                          0                     non-comparative
    Sentence 3   1                                     0                          0                     comparative
Classifiers: SVM, Naïve Bayes, MLP, RBF

Pipeline: labeled sentences → CSR database → frequent pattern mining algorithm → frequent CSRs → matrix of sentences × frequent CSRs → classifier → model.
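The cover/satisfy, support, and confidence definitions, and the pattern-as-feature step, can be made concrete with a short sketch. The sequence database and the rule are the slides' own example; the containment test is the standard order-preserving, itemset-inclusion subsequence check.

```python
def contains(seq, pattern):
    """True if `pattern` is a subsequence of `seq`: its itemsets match, in order,
    itemsets of `seq` by set inclusion (greedy earliest matching suffices)."""
    i = 0
    for itemset in seq:
        if i < len(pattern) and pattern[i] <= itemset:
            i += 1
    return i == len(pattern)

# Example database from the slides: 5 sequences, 2 classes
db = [
    (({1}, {3}, {5}, {7, 8, 9}), "c1"),
    (({1}, {3}, {6}, {7, 8}),    "c1"),
    (({1, 6}, {9}),              "c2"),
    (({3}, {5, 6}),              "c2"),
    (({1}, {3}, {4}, {7, 8}),    "c2"),
]

def support_confidence(pattern, label, db):
    covering = [y for s, y in db if contains(s, pattern)]   # instances covering the rule
    satisfying = [y for y in covering if y == label]        # ...that also satisfy it
    return len(satisfying) / len(db), len(satisfying) / len(covering)

# The mined rule <{1}{3}{7,8}> -> c1
sup, conf = support_confidence(({1}, {3}, {7, 8}), "c1", db)

# Pattern-as-feature step: one binary column per frequent CSR pattern,
# applied to the slides' POS-tag sequence
sentence = ({"NN"}, {"VBZ"}, {"RB"}, {"moreJJR"}, {"NN"}, {"IN"}, {"NN"})
patterns = [
    ({"NN"}, {"VBZ"}, {"RB"}, {"moreJJR"}, {"NN"}, {"IN"}, {"NN"}),
    ({"NN"}, {"IN"}, {"NN"}),
    ({"PRP"}, {"bestJJS"}, {"NN"}),
]
row = [int(contains(sentence, p)) for p in patterns]
```

Here `sup` and `conf` reproduce the slides' support = 2/5 and confidence = 2/3 for that rule.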
(b) Parameters for the CSR approach:
    CSR: Train/Test 10-fold cross-validation; Radius 3; minsup 0.1; minconf 0.6; Seq. pattern algorithm: PrefixSpan
    MLP-CSR / RBF-CSR: Momentum 0.8; Learning rate 1.0; # neurons in hidden layer 7 / 7; # hidden layers 1 / 1

Genetic Algorithm CSR Mining (GA-CSR)
Idea:
1. Mine Class Sequential Rules with a GA-based algorithm
2. Classify sentences according to the final population

Pipeline: labeled sentences → genetic algorithm → final population of CSRs → model.

Genetic Algorithm: Encoding
Chromosome: 0RBS 0JJR 1DT 1DT 0JJ 1NN 0WRB → comparative
Class sequential rule (CSR): <{DT}{DT}{NN}> → comparative

Genetic Algorithm: Flowchart
Initial population → check minsup/minconf → calculate fitness over the encoded training sentences → selection → crossover → mutation → insertion/removal operator → survivor selection → (loop) → final population, evaluated on the encoded test sentences.

Genetic Algorithm: Features
✓ Fitness: Sp * Se
✓ Sp = specificity = TN/(TN+FP)
✓ Se = sensitivity = TP/(TP+FN)
✓ Population randomly generated
✓ Roulette wheel selection
✓ Two-point crossover
✓ Mutation: varying over the gene domain
✓ Insertion/removal operator: probability parameter
✓ Survivor selection: fitness-based, top T (population size)
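On the Encoding slide, a chromosome is a fixed-length string of (bit, POS-tag) genes: decoding keeps only the tags switched on, and fitness multiplies specificity by sensitivity, as on the Features slide. A minimal sketch; the confusion-matrix counts below are hypothetical, not from the experiments.

```python
def decode(chromosome):
    """Chromosome -> CSR antecedent: keep only genes whose bit is 1,
    each surviving POS tag becoming a one-element itemset."""
    return [{tag} for bit, tag in chromosome if bit == 1]

def fitness(tp, tn, fp, fn):
    """Fitness = Se * Sp, with Se = TP/(TP+FN) and Sp = TN/(TN+FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity * specificity

# The slides' example chromosome: 0RBS 0JJR 1DT 1DT 0JJ 1NN 0WRB -> comparative
chromosome = [(0, "RBS"), (0, "JJR"), (1, "DT"), (1, "DT"),
              (0, "JJ"), (1, "NN"), (0, "WRB")]
rule_antecedent = decode(chromosome)       # corresponds to <{DT}{DT}{NN}>

score = fitness(tp=6, tn=8, fp=2, fn=2)    # hypothetical counts
```

Roulette wheel selection, two-point crossover, mutation over the gene domain, and the insertion/removal operator then evolve the population, with the fittest rules surviving.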
(c) Parameters for the GA-CSR approach:
    GA-CSR: Train/Test 10-fold cross-validation; Crossover rate (Tc) 0.8; Mutation rate (Tm) 0.8; Insertion rate (Pi) 0.3; Removal rate (Pr) 0.3; minsup 0.1; minconf 0.6; Population size (Tp) 50; Fitness: Se * Sp
Fig. 2: Parameterization

Datasets
                     DB-Amazon     DB-Twitter
    # sentences      1000          1500
    # comparative    97 (9.7%)     199 (13.26%)
    Texts dates      2003-2007     Dec 2014
    Topic            mp3 players   XBox and PS4
Table 1: Datasets used for tests
✓ Manually labeled
✓ URLs removed

Experiments
The n-grams approach is a baseline that does not take into account elaborated features of the mining problem. Varying the classification algorithm does not impact the results, which reach a maximum accuracy of 68.6% for the RBF neural network.
Fig. 3: Experimental results over DB-Amazon
The second test set ran over DB-Twitter (Figure 4). The curves followed the same trend, but the average accuracy decreased by around 10%. This can be explained by the large amount of noise in Twitter texts; moreover, grammatical errors potentially harm the grammatical pattern approach.
Fig. 4: Experimental results over DB-Twitter

Discussion: Weak Points
✓ Datasets size
✓ Simple genetic algorithm, empirically adjusted
✓ More tests with balanced/unbalanced datasets are needed

Conclusions
✓ Twitter dataset labeled with comparative/non-comparative tweets
✓ Overview of comparative sentence mining techniques over social media text
✓ GA-CSR algorithm proposed
✓ Experimental evaluation

Future Work
✓ Preference mining model
✓ Temporal preferences
✓ Meeting topology and content for mining preferences in social networks

Mining Comparative Sentences from Social Media Text
Fabíola S. F. Pereira* and Sandra de Amo
*[email protected]
www.lsi.ufu.br/~fabiola/comparative-mining
DMNLP'15, ECML/PKDD 2015 Workshop, Porto, Portugal

Tools
✓ Pre-processing (n-grams): Weka
✓ PrefixSpan, GA, CSR: implemented in Java
✓ Classifiers: Weka
✓ Datasets: twitter4j, manually labeled
✓ Available at: www.lsi.ufu.br/~fabiola/comparative-mining