Mining Comparative Sentences from Social Media Text


Mining Comparative Sentences from Social Media Text
Fabíola S. F. Pereira* and Sandra de Amo
*[email protected]
DMNLP'15 ECML/PKDD 2015 Workshop, Porto, Portugal

Motivation

What should I buy?
"Xbox is better than PS4!!"
"PS4 graphics are the best ever"
"Between Xbox and PS4, I prefer WiiU :)"
Texts on social media are a good source for users' preferences – comparative opinions.

Outline
✓ Comparative Opinions
✓ Comparative Sentences Mining Techniques
✓ Datasets
✓ Experiments
✓ Conclusion

Comparative Opinions [Jindal 2006]

Regular opinion: "The graphics quality of this video game is amazing!"
Comparative opinion: "The graphics quality of PS4 is better than that of XBox." / "I prefer Coke to Pepsi."
✓ Different semantic meanings and syntactic forms
✓ Appropriate to establish an order relation between two entities
✓ Preference mining

Comparative Opinions [Jindal 2006]

Feature                       Example                                                        Our scope
Comparative                   "Wii U games are better than XBox games"                       ✓
Superlative                   "PS4 is the best ever!"                                        ✓
Opinionated                   "The graphics quality of PS4 is better than that of XBox"     ✓
Not-opinionated               "PS4 is larger than XBox"                                      ✓
Gradable comparisons (order)  superlative, equative, non-equal gradable comparison           ✓
Non-gradable comparisons      "XBox consoles come with movement sensors, but PS4 does not"   ✗

Outline
✓ Comparative Opinions
✓ Comparative Sentences Mining Techniques
✓ Datasets
✓ Experiments
✓ Conclusion

Comparative Sentences Mining Techniques
1. N-grams classification (baseline)
2. Classification based on Class Sequential Rules (CSRs) patterns [Jindal and Liu 2006]
3. Genetic algorithm for CSRs mining (proposed technique)

N-grams Classification

Example (TF-IDF weights over stemmed unigrams):

            not   lik   movie  poor  quality  horrible  cast  class
Sentence 1  0.10  0.69  2      0.10  0.53     0.02      0.3   non-comparative
Sentence 2  0.40  0.0   2      0.17  0.93     0.01      1     non-comparative
Sentence 3  0.10  0.69  2      0.10  0.53     0.02      0.3   comparative

Note that Sentences 1 and 3 have identical feature vectors but different classes: pure unigram features cannot separate them, which motivates the sequence-based techniques that follow.

Classifiers: SVM, Naïve Bayes, MLP, RBF

N-grams Classification

Pre-processing steps:
✓ Stop words removal
✓ Stemming
✓ TF-IDF
✓ 1000 features extracted – Information Gain index
✓ Unigrams
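Since the baseline reduces each sentence to a TF-IDF-weighted unigram vector, as in the example matrix above, here is a minimal sketch of that weighting step in plain Java. The toy corpus, whitespace tokenizer, and unsmoothed log-IDF formula are illustrative assumptions; the authors' actual pipeline uses Weka.

```java
import java.util.*;

// Sketch: TF-IDF weighting of unigrams over a toy corpus (illustrative only).
public class TfIdf {
    public static void main(String[] args) {
        String[] corpus = {
            "xbox is better than ps4",
            "ps4 graphics are the best ever",
            "i prefer wiiu"
        };
        List<List<String>> docs = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>(); // # docs containing each term
        for (String doc : corpus) {
            List<String> tokens = Arrays.asList(doc.split("\\s+"));
            docs.add(tokens);
            for (String t : new HashSet<>(tokens)) docFreq.merge(t, 1, Integer::sum);
        }
        int n = docs.size();
        for (int i = 0; i < n; i++) {
            Map<String, Double> weights = new HashMap<>();
            for (String t : docs.get(i)) weights.merge(t, 1.0, Double::sum); // raw counts
            for (Map.Entry<String, Double> e : weights.entrySet()) {
                double tf = e.getValue() / docs.get(i).size();
                double idf = Math.log((double) n / docFreq.get(e.getKey()));
                e.setValue(tf * idf); // TF-IDF weight
            }
            System.out.println("doc " + i + ": " + weights);
        }
    }
}
```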
N-grams Classification – Parameters

Unigrams:
  Train/Test   10-fold cross-validation
  Pre-process  stop words, stemming, infogain
  # Features   1000

                           MLP-Unigram   RBF-Unigram
Momentum                   0.8           –
Learning rate              0.6           –
# neurons in hidden layer  15            2
# hidden layers            1             1

(a) Parameters for the n-grams approach
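The "infogain" entry above refers to ranking candidate unigrams by Information Gain and keeping the top 1000. Below is a hedged sketch of that score for one binary presence feature against the comparative/non-comparative label; the contingency counts are invented for illustration, and the actual ranking in the paper is done with Weka.

```java
// Sketch: information gain of a binary feature w.r.t. a binary class label.
public class InfoGain {
    // Shannon entropy (in bits) of a distribution given by raw counts.
    static double entropy(double... counts) {
        double total = 0, h = 0;
        for (double c : counts) total += c;
        for (double c : counts) {
            if (c == 0) continue;
            double p = c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    public static void main(String[] args) {
        // Invented contingency counts: feature present/absent vs. class.
        double presComp = 40, presNon = 10;  // sentences where the unigram occurs
        double absComp = 60, absNon = 890;   // sentences where it does not
        double total = presComp + presNon + absComp + absNon;

        double hClass = entropy(presComp + absComp, presNon + absNon);
        double hGivenFeature =
              (presComp + presNon) / total * entropy(presComp, presNon)
            + (absComp + absNon) / total * entropy(absComp, absNon);

        System.out.printf("IG = %.4f bits%n", hClass - hGivenFeature);
    }
}
```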
Comparative Sentences Mining Techniques
1. N-grams classification (baseline)
2. Classification based on Class Sequential Rules (CSRs) patterns [Jindal and Liu 2006]
3. Genetic algorithm for CSRs mining (proposed technique)

Class Sequential Rules [Jindal 2006]

Definition. For example, consider the item set I = {1, 2, 3, 4, 5, 6, 7, 8}. The sequence s1 = <{3}{4,5}> is contained in the sequence s2 = <{6}{3,7}{4,5,6}>, while the sequence s3 = <{3,8}> is not contained in s2 (item 8 never occurs in any itemset of s2). The input database D for mining is a set of pairs D = {(s1, y1), (s2, y2), ..., (sn, yn)}, where si is a sequence and yi ∈ Y is a class label. In this work, Y = {comparative, non-comparative}. Finally, a CSR is an implication of the form X → y, where X is a sequence and y ∈ Y. An instance (si, yi) covers a CSR if X is a subsequence of si; it satisfies the CSR if, in addition, yi = y. The support of a rule is the fraction of instances in the database that satisfy it; the confidence is the proportion of instances that satisfy the rule among those that cover it.

Example of a sequence database for mining:

Sequence              Class
<{1}{3}{5}{7,8,9}>    c1
<{1}{3}{6}{7,8}>      c1
<{1,6}{9}>            c2
<{3}{5,6}>            c2
<{1}{3}{4}{7,8}>      c2

As an example, consider this database of 5 sequences and 2 classes. With a minimum support of 20% and a minimum confidence of 40%, one of the mined rules is:

Class Sequential Rule (CSR): <{1}{3}{7,8}> → c1   [support = 2/5 and confidence = 2/3]

To mine CSRs, it suffices to apply sequential pattern mining algorithms; this work uses PrefixSpan [6] for the task of mining class sequential rules.
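To make the definitions concrete, here is a small sketch (assuming Java 9+ for List.of) that checks subsequence containment with a greedy earliest-match scan and recomputes the support and confidence of the rule above over the five-sequence example database:

```java
import java.util.*;

// Sketch: support/confidence of a Class Sequential Rule over the toy database
// from the slide. A sequence is a list of itemsets.
public class CsrSupport {
    // True if 'pat' is a subsequence of 'seq' (order-preserving itemset inclusion).
    static boolean contains(List<Set<Integer>> seq, List<Set<Integer>> pat) {
        int j = 0;
        for (Set<Integer> itemset : seq)
            if (j < pat.size() && itemset.containsAll(pat.get(j))) j++;
        return j == pat.size();
    }

    static List<Set<Integer>> seq(int[]... itemsets) {
        List<Set<Integer>> s = new ArrayList<>();
        for (int[] is : itemsets) {
            Set<Integer> itemset = new HashSet<>();
            for (int i : is) itemset.add(i);
            s.add(itemset);
        }
        return s;
    }

    public static void main(String[] args) {
        List<List<Set<Integer>>> db = List.of(
            seq(new int[]{1}, new int[]{3}, new int[]{5}, new int[]{7, 8, 9}),
            seq(new int[]{1}, new int[]{3}, new int[]{6}, new int[]{7, 8}),
            seq(new int[]{1, 6}, new int[]{9}),
            seq(new int[]{3}, new int[]{5, 6}),
            seq(new int[]{1}, new int[]{3}, new int[]{4}, new int[]{7, 8}));
        String[] labels = {"c1", "c1", "c2", "c2", "c2"};

        List<Set<Integer>> rule = seq(new int[]{1}, new int[]{3}, new int[]{7, 8});
        String ruleClass = "c1";

        int covers = 0, satisfies = 0;
        for (int i = 0; i < db.size(); i++) {
            if (contains(db.get(i), rule)) {
                covers++;
                if (labels[i].equals(ruleClass)) satisfies++;
            }
        }
        System.out.println("support    = " + satisfies + "/" + db.size()); // 2/5
        System.out.println("confidence = " + satisfies + "/" + covers);    // 2/3
    }
}
```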
Class Sequential Rules [Jindal 2006] – Mining CSR Patterns

Building a CSRs database…

this camera has significantly more noise at iso 100 than the nikon 4500.

this/DT camera/NN has/VBZ significantly/RB more/JJR noise/NN at/IN iso/NN 100/CD than/IN the/DT nikon/NN 4500/CD

<{NN}{VBZ}{RB}{moreJJR}{NN}{IN}{NN}> ⇒ comparative

Class Sequential Rules [Jindal 2006] – Mining CSR Patterns

Mining CSR patterns from the database (PrefixSpan → frequent patterns)…

Sequence                               Class
<{NN}{VBZ}{RB}{moreJJR}{NN}{IN}{NN}>   comparative
<{NN}{IN}{NN}>                         non_comparative
<{PRP}{bestJJS}{NN}>                   comparative
<{NN}{VBZ}{RB}{NN}{IN}{NN}>            non_comparative

Class Sequential Rules [Jindal 2006] – Mining CSR Patterns

Classifying based on frequent patterns (features)…

            <{NN}{VBZ}{RB}{moreJJR}{NN}{IN}{NN}>   <{NN}{betterJJR}{IN}{NN}>   <{PRP}{bestJJS}{NN}>   class
Sentence 1  1                                      0                           1                      non-comparative
Sentence 2  1                                      1                           0                      non-comparative
Sentence 3  1                                      0                           0                      comparative

Classifiers: SVM, Naïve Bayes, MLP, RBF

[Figure: pipeline diagram — labeled sentences are encoded as CSR sequences, frequent CSRs are mined with a frequent pattern mining algorithm (PrefixSpan), a matrix of sentences × frequent CSRs is built, and a classifier produces the model.]
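Here is a sketch of the two steps just illustrated: encoding a POS-tagged sentence as a sequence and deriving binary pattern features from it. Keeping the comparative keyword glued to its JJR/JJS tag and restricting the sequence to a window of radius 3 around that keyword (which reproduces the example above and matches the "Radius 3" parameter on the next slide) is my reconstruction, not the paper's verbatim procedure.

```java
import java.util.*;

// Sketch: POS-tagged sentence -> CSR sequence (radius-3 window around the
// comparative keyword), then binary features from frequent CSR patterns.
public class CsrFeatures {
    static final Set<String> KEYWORD_TAGS = Set.of("JJR", "JJS");
    static final int RADIUS = 3; // "Radius 3" from the parameter table

    // "noise/NN" -> "NN"; "more/JJR" -> "moreJJR" (keyword kept with its tag).
    static List<String> toSequence(String tagged) {
        List<String> items = new ArrayList<>();
        int keyword = -1; // index of the (assumed single) comparative keyword
        for (String tok : tagged.split("\\s+")) {
            int slash = tok.lastIndexOf('/');
            String word = tok.substring(0, slash), tag = tok.substring(slash + 1);
            if (KEYWORD_TAGS.contains(tag)) keyword = items.size();
            items.add(KEYWORD_TAGS.contains(tag) ? word + tag : tag);
        }
        if (keyword < 0) return items; // no keyword: keep the whole sequence
        return items.subList(Math.max(0, keyword - RADIUS),
                             Math.min(items.size(), keyword + RADIUS + 1));
    }

    // True if 'pat' is a subsequence of 'seq' (single-item itemsets here).
    static boolean contains(List<String> seq, List<String> pat) {
        int j = 0;
        for (String s : seq) if (j < pat.size() && s.equals(pat.get(j))) j++;
        return j == pat.size();
    }

    public static void main(String[] args) {
        String sent = "this/DT camera/NN has/VBZ significantly/RB more/JJR "
                    + "noise/NN at/IN iso/NN 100/CD than/IN the/DT nikon/NN 4500/CD";
        List<String> seq = toSequence(sent);
        System.out.println(seq); // [NN, VBZ, RB, moreJJR, NN, IN, NN]

        // Frequent CSR patterns act as binary features for the classifier.
        List<List<String>> patterns = List.of(
            List.of("NN", "VBZ", "RB", "moreJJR", "NN", "IN", "NN"),
            List.of("NN", "betterJJR", "IN", "NN"),
            List.of("PRP", "bestJJS", "NN"));
        int[] row = new int[patterns.size()];
        for (int k = 0; k < patterns.size(); k++)
            row[k] = contains(seq, patterns.get(k)) ? 1 : 0;
        System.out.println(Arrays.toString(row)); // [1, 0, 0]
    }
}
```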
Class Sequential Rules [Jindal 2006] – Parameters

CSR:
  Train/Test               10-fold cross-validation
  Radius                   3
  minsup                   0.1
  minconf                  0.6
  Seq. pattern algorithm   PrefixSpan

                           MLP-CSR   RBF-CSR
Momentum                   0.8       –
Learning rate              1.0       –
# neurons in hidden layer  7         7
# hidden layers            1         1

(b) Parameters for the CSR approach
Comparative Sentences Mining Techniques
1. N-grams classification (baseline)
2. Classification based on Class Sequential Rules (CSRs) patterns [Jindal and Liu 2006]
3. Genetic algorithm for CSRs mining (proposed technique)

Genetic Algorithm CSR Mining (GA-CSR)

Idea:
1. Mine Class Sequential Rules with a GA-based algorithm
2. Classify sentences according to the final population

[Figure: workflow diagram — labeled sentences encoded as CSR sequences feed the genetic algorithm, whose final population of CSRs constitutes the model.]

Genetic Algorithm – Encoding

Chromosome:  0RBS 0JJR 1DT 1DT 0JJ 1NN 0WRB → comparative

Each gene is an activation flag plus a POS item; the active (1) genes form the rule:

Class sequential rule (CSR):  <{DT}{DT}{NN}> → comparative

Genetic Algorithm – Flowchart

Initial population → calculate fitness on encoded sentences (train), checking minsup/minconf → selection → crossover → mutation → insertion/removal operator → survivor selection → … → final population, applied to encoded sentences (test).

Genetic Algorithm – Features
✓ Fitness: Sp × Se
  ✓ Sp = specificity = TN/(TN+FP)
  ✓ Se = sensitivity = TP/(TP+FN)
✓ Population randomly generated
✓ Roulette wheel selection
✓ Two-point crossover
✓ Mutation: varying over the gene domain
✓ Insertion/Removal operator: probability parameter
✓ Survivor selection: fitness-based, top T (population size)
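Tying the encoding and fitness slides together, here is a sketch that decodes a chromosome's active genes into a CSR and scores it as Sp × Se against labeled tag sequences. The gene layout (activation flag + POS item) follows the encoding example above; the two toy sentences are invented for illustration.

```java
import java.util.*;

// Sketch: decode a GA-CSR chromosome and score it with fitness = Sp * Se.
public class GaCsrFitness {
    // Gene = activation flag + POS item: "1DT" is active, "0JJ" is not.
    static List<String> decode(String[] chromosome) {
        List<String> rule = new ArrayList<>();
        for (String gene : chromosome)
            if (gene.charAt(0) == '1') rule.add(gene.substring(1));
        return rule;
    }

    static boolean contains(List<String> seq, List<String> pat) {
        int j = 0;
        for (String s : seq) if (j < pat.size() && s.equals(pat.get(j))) j++;
        return j == pat.size();
    }

    // Sp = TN/(TN+FP), Se = TP/(TP+FN); a rule "fires" when it is contained.
    static double fitness(List<String> rule, List<List<String>> seqs,
                          boolean[] comparative) {
        int tp = 0, fp = 0, tn = 0, fn = 0;
        for (int i = 0; i < seqs.size(); i++) {
            boolean fires = contains(seqs.get(i), rule);
            if (fires && comparative[i]) tp++;
            else if (fires) fp++;
            else if (comparative[i]) fn++;
            else tn++;
        }
        double se = (tp + fn == 0) ? 0 : (double) tp / (tp + fn);
        double sp = (tn + fp == 0) ? 0 : (double) tn / (tn + fp);
        return sp * se;
    }

    public static void main(String[] args) {
        // Chromosome from the encoding slide: active genes give <{DT}{DT}{NN}>.
        String[] chromosome = {"0RBS", "0JJR", "1DT", "1DT", "0JJ", "1NN", "0WRB"};
        List<String> rule = decode(chromosome);
        System.out.println(rule); // [DT, DT, NN]

        List<List<String>> seqs = List.of(
            List.of("DT", "NN", "VBZ", "DT", "JJR", "NN"),   // comparative
            List.of("PRP", "VBZ", "DT", "NN"));              // non-comparative
        boolean[] comparative = {true, false};
        System.out.println("fitness = " + fitness(rule, seqs, comparative)); // 1.0
    }
}
```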
GA-CSR:
  Train/Test             10-fold cross-validation
  Crossover rate (Tc)    0.8
  Mutation rate (Tm)     0.8
  Insertion rate (Pi)    0.3
  Removal rate (Pr)      0.3
  minsup                 0.1
  minconf                0.6
  Population size (Tp)   50
  Fitness                Se × Sp

(c) Parameters for the GA-CSR approach

Fig. 2: Parameterization
Outline
✓ Comparative Opinions
✓ Comparative Sentences Mining Techniques
✓ Datasets
✓ Experiments
✓ Conclusion

Datasets

                          DB-Amazon     DB-Twitter
# sentences               1000          1500
# comparative sentences   97 (9.7%)     199 (13.26%)
Texts dates               2003-2007     Dec 2014
Topic                     mp3 players   XBox and PS4

Table 1: Datasets used for tests

✓ Manually labeled
✓ URLs removed
Outline
✓ Comparative Opinions
✓ Comparative Sentences Mining Techniques
✓ Datasets
✓ Experiments
✓ Conclusion

Experimental Results

The n-grams approach is a baseline that does not take into account elaborate features of our mining problem. Varying the classification algorithm has little impact on the results, which reached a maximum accuracy of 68.6% with the RBF neural network.

Fig. 3: Experimental results over DB-Amazon

The second test set ran over DB-Twitter (Figure 4). The curves followed the same trend, but the average accuracy decreased by around 10%. This can be explained by the large amount of noise in Twitter texts. Moreover, grammatical errors in tweets potentially harm the grammatical pattern approach.

Fig. 4: Experimental results over DB-Twitter
Discussion – Weak Points
✓ Datasets size
✓ Simple genetic algorithm, empirically adjusted
✓ More tests with balanced/unbalanced datasets are needed

Outline
✓ Comparative Opinions
✓ Comparative Sentences Mining Techniques
✓ Datasets
✓ Experiments
✓ Conclusion

Conclusions
✓ Twitter dataset labeled with comparative/non-comparative tweets
✓ Overview of comparative mining techniques over social media text
✓ GA-CSR algorithm proposed
✓ Experimental evaluation

Future Work
✓ Preference mining model
✓ Temporal preferences
✓ Meeting topology and content for mining preferences in social networks

Mining Comparative Sentences from Social Media Text
Fabíola S. F. Pereira* and Sandra de Amo
*[email protected]
www.lsi.ufu.br/~fabiola/comparative-mining
DMNLP'15 ECML/PKDD 2015 Workshop, Porto, Portugal

Tools
✓ Pre-processing (n-grams): Weka
✓ PrefixSpan, GA, CSR: implemented in Java
✓ Classifiers: Weka
✓ Datasets: twitter4j, manually labeled
✓ Available at: www.lsi.ufu.br/~fabiola/comparative-mining
