Congresso Brasileiro de Software: Teoria e Prática

Transcrição

29 de setembro a 04 de outubro de 2013
Brasília-DF
Anais
SBES 2013
XXVII Simpósio Brasileiro de Engenharia de Software
29 de setembro a 04 de outubro de 2013
Brasília-DF, Brasil
ANAIS
Volume 01
ISSN: 2175-9677
COORDENADOR DO COMITÊ DE PROGRAMA
Auri M. R. Vincenzi, Universidade Federal de Goiás
COORDENAÇÃO DO CBSOFT 2013
Genaína Rodrigues – UnB
Rodrigo Bonifácio – UnB
Edna Dias Canedo - UnB
Realização
Universidade de Brasília (UnB)
Departamento de Ciência da Computação (DIMAp/UFRN)
Promoção
Sociedade Brasileira de Computação (SBC)
Patrocínio
CAPES, CNPq, Google, INES, Ministério da Ciência, Tecnologia e Inovação, Ministério do Planejamento,
Orçamento e Gestão e RNP
Apoio
Instituto Federal Brasília, Instituto Federal Goiás, Loop Engenharia de Computação, Secretaria de
Turismo do GDF, Secretaria de Ciência Tecnologia e Inovação do GDF e Secretaria da Mulher do GDF
SBES 2013
XXVII Brazilian Symposium on Software Engineering (SBES)
September 29 to October 4, 2013
Brasília-DF, Brazil
PROCEEDINGS
Volume 01
ISSN: 2175-9677
PROGRAM CHAIR
Auri M. R. Vincenzi, Universidade Federal de Goiás, Brasil
CBSOFT 2013 GENERAL CHAIRS
Genaína Rodrigues – UnB
Rodrigo Bonifácio – UnB
Edna Dias Canedo - UnB
ORGANIZATION
Universidade de Brasília (UnB)
Departamento de Ciência da Computação (DIMAp/UFRN)
PROMOTION
Brazilian Computing Society (SBC)
SPONSORS
CAPES, CNPq, Google, INES, Ministério da Ciência, Tecnologia e Inovação, Ministério do Planejamento,
Orçamento e Gestão e RNP
SUPPORT
Instituto Federal Brasília, Instituto Federal Goiás, Loop Engenharia de Computação, Secretaria de
Turismo do GDF, Secretaria de Ciência Tecnologia e Inovação do GDF e Secretaria da Mulher do GDF
Autorizo a reprodução parcial ou total desta obra, para fins acadêmicos, desde que citada a fonte
Apresentação
Bem-vindo à XXVII edição do Simpósio Brasileiro de Engenharia de Software (SBES) que, este ano,
é sediada na capital do Brasil, Brasília. Como tem acontecido desde 2010, o SBES 2013 faz parte
do Congresso Brasileiro de Software: Teoria e Prática (CBSoft), que reúne o Simpósio Brasileiro de
Linguagens de Programação (SBLP), o Simpósio Brasileiro de Métodos Formais (SBMF), o Simpósio
Brasileiro de Componentes, Arquiteturas e Reutilização de Software (SBCARS) e a Miniconferência
Latino-Americana de Linguagens de Padrões para Programação (MiniPLoP).
Dentro do SBES, o participante encontra sessões técnicas, o Fórum de Educação em Engenharia de
Software e três palestrantes convidados: dois internacionais e um nacional. Complementando este
programa, o CBSoft oferece uma gama de atividades, incluindo cursos de curta duração, workshops,
tutoriais, uma sessão de ferramentas, a Trilha Industrial e o Workshop de Teses e Dissertações. Nas
sessões técnicas do SBES, trabalhos de pesquisa inéditos são apresentados, cobrindo uma variedade de
temas sobre engenharia de software, mencionados na chamada de trabalhos, amplamente divulgados
na comunidade brasileira e internacional. Um processo de revisão rigoroso permitiu a seleção criteriosa
de artigos com a mais alta qualidade. O Comitê de Programa incluiu 76 membros da comunidade
nacional e internacional de Engenharia de Software. Ao todo, 113 pesquisadores participaram na
revisão dos 70 trabalhos submetidos. Desses, 17 artigos foram aceitos para apresentação e publicação
nos anais do SBES. Pode-se observar que o processo de seleção foi competitivo, o que resultou numa
taxa de aceitação de 24% dos artigos submetidos. Além da publicação dos artigos nos anais, disponíveis
na Biblioteca Digital do IEEE, os oito melhores artigos – escolhidos por um comitê selecionado a partir
do Comitê de Programa – são convidados a submeter uma versão estendida para o Journal of Software
Engineering Research and Development (JSERD).
Para o SBES 2013, os palestrantes convidados são:
Jeff Offutt (George Mason University) - “How the Web Brought Evolution Back Into Design”;
Sam Malek (George Mason University) - “Toward the Making of Software that Learns to Manage
Itself”;
e Thais Vasconcelos Batista (DIMAP-UFRN) - “Arquitetura de Software: uma Disciplina Fundamental
para Construção de Software”.
Finalmente, gostaríamos de agradecer a todos aqueles que contribuíram com esta edição do SBES.
Agradecemos aos membros do Comitê Gestor do SBES e do CBSoft, aos membros do comitê de programa,
aos avaliadores dos trabalhos, às comissões organizadoras e a todos aqueles que de alguma forma
tornaram possível a realização de mais um evento com o padrão de qualidade dos melhores eventos
internacionais. Mais uma vez, bem-vindo ao SBES 2013. Brasília, DF, setembro/outubro de 2013.
Auri Marcelo Rizzo Vincenzi (INF/UFG)
Coordenador do Comitê de Programa da Trilha Principal
Foreword
Welcome to the XXVII edition of the Brazilian Symposium on Software Engineering (SBES), which
this year takes place in the capital of Brazil, Brasilia. As has been the case since 2010, SBES 2013 is
part of the Brazilian Conference on Software: Theory and Practice (CBSoft) that gathers the Brazilian
Symposium on Programming Languages (SBLP), the Brazilian Symposium on Formal Methods (SBMF),
the Brazilian Symposium on Software Components, Architectures and Reuse (SBCARS) and the Latin
American Miniconference on Pattern Languages of Programming (MiniPLoP).
Within SBES, the participant finds the technical sessions, the Forum on Software Engineering
Education and three invited speakers: two international and one national. Complementing this
program, CBSoft provides a range of activities, including short courses, workshops, tutorials, a
tools session, the Industrial Track and the Workshop of Theses and Dissertations. In the main technical
track of SBES, unpublished research papers are presented, covering the range of Software Engineering
topics listed in the call for papers, which was widely advertised in the Brazilian and international
communities. A rigorous peer review process enabled the careful selection of articles of the highest
quality. The Program Committee included 76 members of the Brazilian and international Software
Engineering community. In all, 113 researchers participated in the review of the 70 submitted papers. From
those, 17 articles were accepted for presentation and publication in the SBES proceedings. It can be
seen from these figures that we had a very competitive process that resulted in an acceptance rate of
24% of submitted articles. Besides the publication of articles in the proceedings, available in the IEEE
Digital Library, the top eight articles – chosen by a committee selected from members of the Program
Committee – are invited to submit an extended version to the Journal of Software Engineering Research
and Development (JSERD).
For SBES 2013 the invited speakers are:
“How the Web Brought Evolution Back Into Design” - Jeff Offutt (George Mason University)
“Toward the Making of Software that Learns to Manage Itself” - Sam Malek (George Mason
University)
“Software Architecture: a Core Discipline to Engineer Software” - Thais Vasconcelos Batista
(DIMAP-UFRN)
Finally, we would like to thank all those who contributed to making this edition of SBES possible. We thank the
members of the Steering Committee of the SBES and CBSoft, the program committee members, the
paper reviewers, the organizing committees and all those who in some way made it possible to hold
yet another event with the quality standard of the best international events. Once again,
welcome to SBES 2013. Brasilia, DF, September/October 2013.
Auri Marcelo Rizzo Vincenzi (INF/UFG)
Coordinator of the Program Committee of the Main Track
Comitês Técnicos / Technical Committees
SBES Steering Committee
Alessandro Garcia, PUC-Rio
Auri Marcelo Rizzo Vincenzi, UFG
Marcio Delamaro, USP
Sérgio Soares, UFPE
Thais Batista, UFRN
CBSoft General Committee
Genaína Nunes Rodrigues, UnB
Rodrigo Bonifácio, UnB
Edna Dias Canedo, UnB
CBSoft Local Committee
Diego Aranha, UnB
Edna Dias Canedo, UnB
Fernanda Lima, UnB
Guilherme Novaes Ramos, UnB
Marcus Vinícius Lamar, UnB
George Marsicano, UnB
Giovanni Santos Almeida, UnB
Hilmer Neri, UnB
Luís Miyadaira, UnB
Maria Helena Ximenis, UnB
Comitê do programa / Program Committee
Adenilso da Silva Simão, ICMC - Universidade de São Paulo, Brasil
Alessandro Garcia, PUC-Rio, Brasil
Alfredo Goldman, IME - Universidade de São Paulo, Brasil
Antônio Tadeu Azevedo Gomes, LNCC, Brasil
Antônio Francisco Prado, Universidade Federal de São Carlos, Brasil
Arndt von Staa, PUC-Rio, Brasil
Augusto Sampaio, Universidade Federal de Pernambuco, Brasil
Carlos Lucena, PUC-Rio, Brasil
Carolyn Seaman, Universidade de Maryland, EUA
Cecilia Rubira, Unicamp, Brasil
Christina Chavez, Universidade Federal da Bahia, Brasil
Claudia Werner, COPPE /UFRJ, Brasil
Claudio Sant’Anna, Universidade Federal da Bahia, Brasil
Daltro Nunes, UFRGS, Brasil
Daniel Berry, Universidade de Waterloo, Canadá
Daniela Cruzes, Universidade Norueguesa de Ciência e Tecnologia, Noruega
Eduardo Almeida, Universidade Federal da Bahia, Brasil
Eduardo Aranha, Universidade Federal do Rio Grande do Norte, Brasil
Eduardo Figueiredo, Universidade Federal de Minas Gerais, Brasil
Ellen Francine Barbosa, ICMC - Universidade de São Paulo, Brasil
Fabiano Ferrari, Universidade Federal de São Carlos, Brasil
Fabio Queda Bueno da Silva, Universidade Federal de Pernambuco, Brasil
Fernanda Alencar, Universidade Federal de Pernambuco, Brasil
Fernando Castor, Universidade Federal de Pernambuco, Brasil
Flavia Delicato, Universidade Federal do Rio Grande do Norte, Brasil
Flavio Oquendo, Universidade Européia de Brittany - UBS/VALORIA, França
Glauco Carneiro, Universidade de Salvador, Brasil
Gledson Elias, Universidade Federal da Paraíba, Brasil
Guilherme Travassos, COPPE/UFRJ, Brasil
Gustavo Rossi, Universidade Nacional de La Plata, Argentina
Itana Maria de Souza Gimenes, Universidade Estadual de Maringá, Brasil
Jaelson Freire Brelaz de Castro, Universidade Federal de Pernambuco, Brasil
Jair Leite, Universidade Federal do Rio Grande do Norte, Brasil
João Araújo, Universidade Nova de Lisboa, Portugal
José Carlos Maldonado, ICMC - Universidade de São Paulo, Brasil
José Conejero, Universidade de Extremadura, Espanha
Leila Silva, Universidade Federal de Sergipe, Brasil
Leonardo Murta, UFF, Brasil
Leonor Barroca, Open University, Reino Unido
Luciano Baresi, Politecnico di Milano, Itália
Marcelo Fantinato, Universidade de São Paulo, Brasil
Marcelo de Almeida Maia, Universidade Federal de Uberlândia, Brasil
Marco Aurélio Gerosa, IME-USP, Brasil
Marco Túlio Valente, Universidade Federal de Minas Gerais, Brasil
Marcos Chaim, Universidade de São Paulo, Brasil
Márcio Barros, Universidade Federal do Estado do Rio de Janeiro, Brasil
Mehmet Aksit, Universidade de Twente, Holanda
Nabor Mendonça, Universidade de Fortaleza, Brasil
Nelio Cacho, Universidade Federal do Rio Grande do Norte, Brasil
Nelson Rosa, Universidade Federal de Pernambuco, Brasil
Oscar Pastor, Universidade Politécnica de Valência, Espanha
Otávio Lemos, Universidade Federal de São Paulo, Brasil
Patricia Machado, Universidade Federal de Campina Grande, Brasil
Paulo Borba, Universidade Federal de Pernambuco, Brasil
Paulo Masiero, ICMC - Universidade de São Paulo, Brasil
Paulo Merson, Software Engineering Institute, EUA
Paulo Pires, Universidade Federal do Rio de Janeiro, Brasil
Rafael Bordini, PUCRS, Brasil
Rafael Prikladnicki, PUCRS, Brasil
Regina Braga, Universidade Federal de Juiz de Fora, Brasil
Ricardo Choren, IME-Rio, Brasil
Ricardo Falbo, Universidade Federal de Espírito Santo, Brasil
Roberta Coelho, Universidade Federal do Rio Grande do Norte, Brasil
Rogerio de Lemos, Universidade de Kent, Reino Unido
Rosana Braga, ICMC - Universidade de São Paulo, Brasil
Rosângela Penteado, Universidade Federal de São Carlos, Brasil
Sandra Fabbri, Universidade Federal de São Carlos, Brasil
Sérgio Soares, Universidade Federal de Pernambuco, Brasil
Silvia Abrahão, Universidade Politécnica de Valencia, Espanha
Silvia Vergilio, Universidade Federal do Paraná, Brasil
Simone Souza, ICMC - Universidade de São Paulo, Brasil
Thais Vasconcelos Batista, Universidade Federal do Rio Grande do Norte, Brasil
Tiago Massoni, Universidade Federal de Campina Grande, Brasil
Uirá Kulesza, Universidade Federal do Rio Grande do Norte, Brasil
Valter Camargo, Universidade Federal de São Carlos, Brasil
Vander Alves, Universidade de Brasília, Brasil
revisores externos / External Reviewers
A. César França, Federal University of Pernambuco, Brazil
Americo Sampaio, Universidade de Fortaleza, Brazil
Anderson Belgamo, Universidade Metodista de Piracicaba, Brazil
Andre Endo, ICMC/USP, Brazil
Breno França, UFRJ, Brazil
Bruno Cafeo, Pontifícia Universidade Católica do Rio de Janeiro, Brazil
Bruno Carreiro da Silva, Universidade Federal da Bahia, Brazil
Célio Santana, Universidade Federal Rural de Pernambuco, Brazil
César Couto, CEFET-MG, Brazil
Cristiano Maffort, CEFET-MG, Brazil
Draylson Souza, ICMC-USP, Brazil
Edson Oliveira Junior, Universidade Estadual de Maringá, Brazil
Fernando H. I. Borba Ferreira, Universidade Presbiteriana Mackenzie, Brazil
Frank Affonso, UNESP - Universidade Estadual Paulista, Brazil
Gustavo Henrique Lima Pinto, Federal University of Pernambuco, Brazil
Heitor Costa, Federal University of Lavras, Brazil
Higor Souza, University of São Paulo, Brazil
Igor Steinmacher, Universidade Tecnológica Federal do Paraná, Brazil
Igor Wiese, UTFPR -Universidade Tecnológica Federal do Parana, Brazil
Ingrid Nunes, UFRGS, Brazil
Juliana Saraiva, Federal University of Pernambuco, Brazil
Lucas Bueno, University of São Paulo, Brazil
Luiz Carlos Ribeiro Junior, Universidade de Brasilia – UnB, Brazil
Marcelo Eler, Universidade de São Paulo, Brazil
Marcelo Gonçalves, Universidade de São Paulo, Brazil
Marcelo Morandini, Universidade de São Paulo, Brazil
Mauricio Arimoto, Universidade de São Paulo, Brazil
Milena Guessi, Universidade de São Paulo, Brazil
Paulo Afonso Parreira Júnior, Universidade Federal de São Carlos, Brazil
Paulo Meirelles, IME – USP, Brazil
Pedro Santos Neto, Universidade Federal do Piauí, Brazil
Ricardo Terra, UFMG, Brazil
Roberto Araujo, EACH/USP, Brazil
Sidney Nogueira, Federal University of Pernambuco, Brazil
Vanessa Braganholo, UFF, Brazil
Viviane Santos, Universidade de São Paulo, Brazil
Yijun Yu, Open University, Great Britain
Comitê organizador / Organizing Committee
COORDENAÇÃO GERAL
Genaína Nunes Rodrigues, CIC, UnB
Rodrigo Bonifácio, CIC, UnB
Edna Dias Canedo, CIC, UnB
COMITÊ LOCAL
Diego Aranha, CIC, UnB
Edna Dias Canedo, FGA, UnB
Fernanda Lima, CIC, UnB
Guilherme Novaes Ramos, CIC, UnB
Marcus Vinícius Lamar, CIC, UnB
George Marsicano, FGA, UnB
Giovanni Santos Almeida, FGA, UnB
Hilmer Neri, FGA, UnB
Luís Miyadaira, FGA, UnB
Maria Helena Ximenis, CIC, UnB
COORDENADOR DO COMITÊ DE PROGRAMA SBES 2013
Auri M. R. Vincenzi, Universidade Federal de Goiás, Brasil
palestras convidadas / invited keynotes
TOWARD THE MAKING OF SOFTWARE THAT LEARNS TO MANAGE ITSELF
SAM MALEK
A self-managing software system is capable of adjusting its behavior at runtime in response to changes
in the system, its requirements, or the environment in which it executes. Self-management capabilities
are sought-after to automate the management of complex software in many computing domains,
including service-oriented, mobile, cyber-physical and ubiquitous settings. While the benefits of such
software are plenty, its development has shown to be much more challenging than the conventional
software.
At the state of the art, it is not an impervious engineering problem – in principle – to develop a self-adaptation solution tailored to a given system, which can respond to a bounded set of conditions that
are expected to require automated adaptation. However, any sufficiently complex software system
– once deployed in the field – is subject to a broad range of conditions and many diverse stimuli.
That may lead to the occurrence of behavioral patterns that have not been foreseen previously: in
fact, those may be the ones that cause the most critical problems, since, by definition, they have
not manifested themselves, and have not been accounted for during the previous phases of the
engineering process. A truly self-managing system should be able to cope with such unexpected
behaviors, by modifying or enriching its adaptation logic and provisions accordingly.
In this talk, I will first provide an introduction to some of the challenges of making software systems
self-managing. Afterwards, I will provide an overview of two research projects in my group that have
tackled these challenges through the applications of automated inference techniques (e.g., machine
learning, data mining). The results have been promising, allowing the software engineers to empower
a software system with advanced self-management capabilities with minimal effort. I will conclude the
talk with an outline of future research agenda for the community.
HOW THE WEB BROUGHT EVOLUTION BACK INTO DESIGN
JEFF OFFUTT
To truly understand the effect the Web is having on software engineering, we need to look to the past.
Evolutionary design was near universal in the days before the industrial revolution. The production
costs were very high, but craftsmen were able to implement continuous improvement: every new
object could be better than the last. Software is different; it has a near-zero production cost, allowing
millions of identical copies to be made. Unfortunately, near-zero production cost means software must
be near-perfect “out of the box.” This fact has driven our research agenda for 50 years. But it is no
longer true!
This talk will discuss how near-zero production cost for near-perfect software has driven our research
agenda. Then it will point out how the web has eliminated the need for near-perfect software out of the
box. The talk will finish by describing how this shift is changing software development and research,
and speculate on how this shift will change our future research agenda.
SOFTWARE ARCHITECTURE: A CORE DISCIPLINE TO ENGINEER SOFTWARE
THAIS BATISTA
Software architecture has emerged in the last decades as an important discipline of software
engineering, dealing with the design decisions that define the organization of a system and have a
long-lasting impact on its quality attributes. The architectural description documents these decisions
and is used as a blueprint for other activities in the software engineering process, such as
implementation, testing, and evaluation. In this talk we will discuss the role of software architecture as
a core activity to engineer software, its influence on other activities of software development, and the
new trends and challenges in this area.
PALESTRANTES / keynotes
Sam Malek (George Mason University)
Sam Malek is an Associate Professor in the Department of Computer Science at George Mason
University. He is also the director of Software Design and Analysis Laboratory at GMU, a faculty
associate of the C4I Center, and a member of DARPA's Computer Science Study Panel. Malek's
general research interests are in the field of software engineering, and to date his focus has spanned
the areas of software architecture, autonomic software, and software dependability. Malek received
his PhD and MS degrees in Computer Science from the University of Southern California, and his BS
degree in Information and Computer Science from the University of California, Irvine. He has received
numerous awards for his research contributions, including the National Science Foundation CAREER
award (2013) and the GMU Computer Science Department Outstanding Faculty Research Award (2011).
He has managed research projects totaling more than three million dollars in funding received from
NSF, DARPA, IARPA, ARO, FBI, and SAIC. He is a member of the ACM, ACM SIGSOFT, and IEEE.
Jeff Offutt (George Mason University)
Dr. Jeff Offutt is Professor of Software Engineering at George Mason University and holds part-time
visiting faculty positions at the University of Skövde, Sweden, and at Linköping University, Linköping,
Sweden. Offutt has invented numerous test strategies, has published over 150 refereed research
papers (h-index of 51 on Google Scholar), and is co-author of Introduction to Software Testing. He
is editor-in-chief of Wiley’s journal of Software Testing, Verification and Reliability; co-founded the
IEEE International Conference on Software Testing, Verification, and Validation; and was its founding
steering committee chair. He was awarded the George Mason University Teaching Excellence Award,
Teaching With Technology, in 2013, and was named a GMU Outstanding Faculty member in 2008
and 2009. For the last ten years he has led the 25-year-old MS program in Software Engineering, and
led the efforts to create PhD and BS programs in Software Engineering. His current research interests
include software testing, analysis and testing of web applications, secure software engineering, object-oriented program analysis, usable software security, and software evolution. Offutt received the PhD in
computer science in 1988 from the Georgia Institute of Technology and is on the web at http://www.cs.gmu.edu/~offutt/.
Thais Batista (UFRN)
Thais Batista is an Associate Professor at the Federal University of Rio Grande do Norte (UFRN) since
1996. She holds a Ph.D. in Computer Science from the Catholic University of Rio de Janeiro (PUC-Rio),
Brazil, 2000. In 2004-2005 she was a post-doctoral researcher at Lancaster University, UK. Her
main research areas are software architecture, distributed systems, middleware, and cloud computing.
Índice de Artigos / Table of Contents
Criteria for Comparison of Aspect-Oriented Requirements Engineering Approaches: Critérios para Comparação de Abordagens para Engenharia de Requisitos Orientada a Aspectos
16
Paulo Afonso Parreira Júnior, Rosângela Aparecida Dellosso Penteado

Using Transformation Rules to Align Requirements and Architectural Models
26
Monique Soares, Carla Silva, Gabriela Guedes, Jaelson Castro, Cleice Souza, Tarcisio Pereira

An automatic approach to detect traceability links using fuzzy logic
36
Andre Di Thommazo, Thiago Ribeiro, Guilherme Olivatto, Vera Werneck, Sandra Fabbri

Determining Integration and Test Orders in the Presence of Modularization Restrictions
46
Wesley Klewerton Guez Assunção, Thelma Elita Colanzi, Silvia Regina Vergilio, Aurora Pozo

Functional Validation Driven by Automated Tests / Validação Funcional Dirigida por Testes Automatizados
56
Thiago Delgado Pinto, Arndt von Staa

Visualization, Analysis, and Testing of Java and AspectJ Programs with Multi-Level System Graphs
64
Otavio Augusto Lazzarini Lemos, Felipe Capodifoglio Zanichelli, Robson Rigatto, Fabiano Ferrari, Sudipto Ghosh

A Method for Model Checking Context-Aware Exception Handling
74
Lincoln S. Rocha, Rossana M. C. Andrade, Alessandro F. Garcia

Prioritization of Code Anomalies based on Architecture Sensitiveness
84
Roberta Arcoverde, Everton Guimarães, Isela Macía, Alessandro Garcia, Yuanfang Cai

Are domain-specific detection strategies for code anomalies reusable? An industry multi-project study: Reuso de Estratégias Sensíveis a Domínio para Detecção de Anomalias de Código: Um Estudo de Múltiplos Casos
94
Alexandre Leite Silva, Alessandro Garcia, Elder José Reioli, Carlos José Pereira de Lucena

F3T: From Features to Frameworks Tool
104
Matheus Viana, Rosangela Penteado, Antônio do Prado, Rafael Durelli

A Metric of Software Size as a Tool for IT Governance
114
Marcus Vinícius Borela de Castro, Carlos Alberto Mamede Hernandes

An Approach to Business Processes Decomposition for Cloud Deployment: Uma Abordagem para Decomposição de Processos de Negócio para Execução em Nuvens Computacionais
124
Lucas Venezian Povoa, Wanderley Lopes de Souza, Antonio Francisco do Prado, Luís Ferreira Pires, Evert F. Duipmans

On the Influence of Model Structure and Test Case Profile on the Prioritization of Test Cases in the Context of Model-based Testing
134
Joao Felipe S. Ouriques, Emanuela G. Cartaxo, Patrícia D. L. Machado

The Impact of Scrum on Customer Satisfaction: An Empirical Study
144
Bruno Cartaxo, Allan Araujo, Antonio Sa Barreto, Sergio Soares

Identifying a Subset of TMMi Practices to Establish a Streamlined Software Testing Process
152
Kamilla Gomes Camargo, Fabiano Cutigi Ferrari, Sandra Camargo Pinto Ferraz Fabbri

On the Relationship between Features Granularity and Non-conformities in Software Product Lines: An Exploratory Study
162
Iuri Santos Souza, Rosemeire Fiaccone, Raphael Pereira de Oliveira, Eduardo Santana de Almeida

An Extended Assessment of Data-driven Bayesian Networks in Software Effort Prediction
172
Ivan A. P. Tierno, Daltro J. Nunes
Criteria for Comparison of Aspect-Oriented
Requirements Engineering Approaches
Critérios para Comparação de Abordagens para Engenharia de
Requisitos Orientada a Aspectos
Paulo Afonso Parreira Júnior 1, 2, Rosângela Aparecida Dellosso Penteado 2
1 Bacharelado em Ciência da Computação – UFG (Câmpus Jataí) - Jataí – Goiás, Brasil
2 Departamento de Computação - UFSCar - São Carlos - São Paulo, Brasil
{paulo_junior, rosangela}@dc.ufscar.br
Resumo— Early-aspects referem-se a requisitos de software que
se encontram espalhados ou entrelaçados com outros requisitos e
são tratados pela Engenharia de Requisitos Orientada a Aspectos
(EROA). Várias abordagens para EROA têm sido propostas nos
últimos anos e possuem diferentes características, limitações e
pontos fortes. Sendo assim, torna-se difícil a tomada de decisão por
parte de: i) engenheiros de software, quanto à escolha da
abordagem mais apropriada às suas necessidades; e ii)
pesquisadores em EROA, quando o intuito for entenderem as
diferenças existentes entre suas abordagens e as existentes na
literatura. Este trabalho tem o objetivo de apresentar um conjunto
de critérios para comparação de abordagens para EROA, criado
com base nas variabilidades e características comuns dessas
abordagens. Além disso, tais critérios são aplicados a seis
abordagens e os resultados obtidos podem servir como um guia
para que usuários escolham a abordagem que melhor atenda às
suas necessidades, bem como facilite a realização de pesquisas na
área de EROA.
Palavras-chave — Engenharia de Software Orientada a
Aspectos, Critérios para Comparação, Avaliação Qualitativa, Early
Aspects.
Abstract— Early-aspects consist of software requirements that
are spread or tangled with other requirements and can be treated by
Aspect-Oriented Requirements Engineering (AORE). Many AORE
approaches have been proposed in recent years and have different
features, strengths and limitations. Thus, decision making becomes difficult for: i) software engineers,
regarding the choice of the most appropriate approach to their needs; and ii) AORE researchers, when
the intent is to understand the differences between their own approaches and other ones in the literature. This
paper aims to present a set of comparison criteria for AORE
approaches, based on common features and variability of these
approaches. Such criteria are applied to six of the main AORE approaches and the results can serve as
a guide so that users can choose the approach that best meets their needs, as well as to facilitate the
conduct of research in AORE.
Keywords — Aspect-Oriented Requirements Engineering,
Comparison Criteria, Qualitative Evaluation, Early Aspects.
I. INTRODUÇÃO
O aumento da complexidade do software e a sua
aplicabilidade nas mais diversas áreas requerem que a
Engenharia de Requisitos (ER) seja realizada de modo
abrangente e completo, a fim de: i) contemplar todas as
necessidades dos stakeholders [1]; e ii) possibilitar que os
engenheiros de software tenham o completo entendimento da
funcionalidade do software, dos serviços e restrições
existentes e do ambiente sobre o qual ele deve operar [2].
Um requisito de software define uma propriedade ou
capacidade que atende às regras de negócio de um software
[1]. Um conjunto de requisitos relacionados com um mesmo
objetivo, durante o desenvolvimento do software, define o
conceito de “interesse” (concern). Por exemplo, um interesse
de segurança pode contemplar diversos requisitos relacionados
a esse objetivo, que é garantir que o software seja seguro.
Idealmente, cada interesse do software deveria estar
alocado em um módulo específico do software, que
satisfizesse aos seus requisitos. Quando isso ocorre, diz-se que
o software é bem modularizado, pois todos os seus interesses
estão claramente separados [2]. Entretanto, há alguns tipos de
interesses (por exemplo, desempenho, segurança, persistência,
entre outros) para os quais essa alocação não é possível apenas
utilizando as abstrações usuais da engenharia de software,
como casos de uso, classes e objetos, entre outros. Tais
interesses são denominados “interesses transversais” ou “early
aspect” e referem-se aos requisitos de software que se
encontram espalhados ou entrelaçados com outros requisitos.
A falta de modularização ocasionada pelos requisitos
espalhados e entrelaçados tende a dificultar a manutenção e a
evolução do software, pois prejudica a avaliação do
engenheiro de software quanto aos efeitos provocados pela
inclusão, remoção ou alteração de algum requisito sobre os
demais [1]. A Engenharia de Requisitos Orientada a Aspectos
(EROA) é uma área de pesquisa que objetiva promover
melhorias com relação à Separação de Interesses (Separation
of Concerns) [3] durante as fases iniciais do desenvolvimento
do software, oferecendo estratégias mais adequadas para
identificação, modularização e composição de interesses
transversais.
Várias abordagens para EROA têm sido desenvolvidas nos
últimos anos [4][5][7][8][9][10][11][12][13][14], cada uma
com diferentes características, limitações e pontos fortes.
Além disso, avaliações qualitativas ou quantitativas dessas
abordagens foram realizadas [1][2][15][16][17][19][20].
Mesmo com a grande variedade de estudos avaliativos, apenas
alguns aspectos das abordagens para EROA são considerados.
Assim, para se ter uma visão mais abrangente sobre uma
determinada abordagem há necessidade de se recorrer a outros
estudos. Por exemplo, as informações sobre as atividades da
EROA contempladas em uma abordagem são obtidas na
publicação na qual ela foi proposta ou em alguns estudos
comparativos que a envolvem. Porém, nem sempre essas
publicações apresentam informações precisas sobre a
escalabilidade e/ou cobertura e a precisão dessa abordagem,
sendo necessário recorrer a estudos de avaliação quantitativa.
Na literatura há escassez de estudos que realizam a
comparação de abordagens para EROA por meio de um
conjunto bem definido de critérios. Também é difícil
encontrar, em um mesmo trabalho, a comparação de
características qualitativas e quantitativas das abordagens.
Esses fatos dificultam a tomada de decisão por parte de: i)
engenheiros de software, quanto à escolha da abordagem mais
apropriada às suas necessidades; e ii) pesquisadores em
EROA, para entenderem as diferenças existentes entre suas
abordagens e as demais existentes na literatura.
Este trabalho apresenta um conjunto de oito critérios para
facilitar a comparação de abordagens para EROA. Esses
critérios foram desenvolvidos com base nas variabilidades e
características comuns de diversas abordagens, bem como nos
principais trabalhos relacionados à avaliação qualitativa e
quantitativa dessas abordagens. Os critérios elaborados
contemplam: (1) o tipo de simetria de cada abordagem; (2) as
atividades da EROA e (3) interesses contemplados por ela; (4)
as técnicas utilizadas para realização de suas atividades; (5) o
nível de envolvimento necessário para sua aplicação, por parte
do usuário; (6) sua escalabilidade; (7) nível de apoio
computacional disponível; e (8) as avaliações já realizadas
sobre tal abordagem.
A fim de verificar a aplicabilidade dos critérios propostos,
seis das principais abordagens para EROA disponíveis na
literatura são comparadas: Separação Multidimensional de
Interesses [8]; Theme [9][10]; EA-Miner [4][5]; Processo
baseado em XML para Especificação e Composição de
Interesses Transversais [7]; EROA baseada em Pontos de
Vista [13][14]; e Aspect-Oriented Component Requirements
Engineering (AOCRE) [11]. O resultado obtido com essa
comparação pode servir como um guia para que usuários
possam compreender de forma mais clara e abrangente as
principais características, qualidades e limitações dessas
abordagens para EROA, escolhendo, assim, aquela que melhor atenda às suas necessidades.
O restante deste artigo está organizado da seguinte forma.
Na Seção 2 é apresentada uma breve descrição sobre EROA,
com enfoque sobre suas principais atividades. Na Seção 3 é
apresentada uma visão geral sobre as abordagens para EROA
comparadas neste trabalho. O conjunto de critérios para
comparação de abordagens para EROA está na Seção 4. A
aplicação dos critérios sobre as abordagens apresentadas é
exibida e uma discussão dessa aplicação é mostrada na Seção
5. Os trabalhos relacionados estão na Seção 6 e, por fim, as
conclusões e trabalhos futuros são apresentados na Seção 7.
II. ENGENHARIA DE REQUISITOS ORIENTADA A ASPECTOS
O princípio da Separação de Interesses tem por premissa a
identificação e modularização de partes do software relevantes
a um determinado conceito, objetivo ou propósito [3].
Abordagens tradicionais para desenvolvimento de software,
como a Orientação a Objetos (OO), foram criadas com base
nesse princípio, porém, certos interesses de escopo amplo (por
exemplo, segurança, sincronização e logging) não são fáceis
de serem modularizados e mantidos separadamente durante o
desenvolvimento do software. O software gerado pode conter
representações entrelaçadas, que dificultam o seu
entendimento e a sua evolução [7].
Uma abordagem efetiva para ER deve conciliar a
separação de interesses com a necessidade de atender aos
interesses de escopo amplo [8]. A EROA surge como uma
tentativa de se contemplar esse objetivo por meio da utilização
de estratégias específicas para modularização de interesses que
são difíceis de serem isolados em módulos individuais. Um
“interesse” encapsula um ou mais requisitos especificados
pelos stakeholders e um “interesse transversal” ou “early
aspect” é um interesse que se intercepta com outros interesses
do software. A explícita modularização de interesses
transversais em nível de requisitos permite que engenheiros de
software raciocinem sobre tais interesses de forma isolada
desde o início do ciclo de vida do software, o que pode
facilitar a criação de estratégias para sua modularização.
Na Figura 1 está ilustrado o esquema de um processo
genérico para EROA, proposto por Chitchyan et al. [4], que foi
desenvolvido com base em outros processos existentes na
literatura [8][9][12][14] (os retângulos de bordas arredondadas
representam as atividades do processo).
Figura 1. Processo genérico para EROA (adaptado de Chitchyan et al. [4]).
A partir de um conjunto inicial de requisitos disponível, a
atividade Identificação de Interesses identifica e classifica
interesses do software como base ou transversais. Em seguida,
a atividade Identificação de Relacionamento entre Interesses
permite que o engenheiro de software conheça as influências e
as restrições impostas pelos interesses transversais sobre os
outros interesses do software. A atividade Triagem auxilia na
decisão sobre quais desses interesses são pertinentes ao
software e se há repetições na lista de interesses identificados.
A atividade Refinamento de Interesses ocorre quando houver
necessidade de se alterar o conjunto de interesses e
relacionamentos já identificados.
Os interesses classificados como pertinentes são então
representados durante a atividade Representação de Interesses
em um determinado formato (template), de acordo com a
abordagem para EROA utilizada. Esse formato pode ser um
texto, um modelo de casos de uso, pontos de vista, entre
outros. Por exemplo, no trabalho de Rashid et al. [13][14],
interesses são representados por meio de pontos de vista; no
de Baniassad e Clarke [9][10] são utilizados temas. Durante a
representação dos interesses, o engenheiro de software pode
identificar a necessidade de refinamento, ou seja, de
incluir/remover interesses e/ou relacionamentos. Isso
ocorrendo, ele pode retornar para as atividades anteriores do
processo da Figura 1. Finalmente, os interesses representados
em um determinado template precisam ser compostos e
analisados para a detecção dos conflitos entre interesses do
software. Essas análises são feitas durante as atividades de
Composição de Interesses e de Análise, Identificação e
Resolução de Conflitos. Em seguida, os conflitos identificados
são resolvidos com o auxílio dos stakeholders.
Em geral, as atividades descritas no processo da Figura 1
são agregadas em quatro atividades maiores, a saber:
“Identificação”, “Representação” e “Composição” de
interesses e “Análise e Resolução de Conflitos”. Essas
atividades são utilizadas como base para apresentação das
características das abordagens para EROA na Seção 3 deste
trabalho.
III. ABORDAGENS PARA EROA
A escolha das seis abordagens para EROA analisadas neste
trabalho foi realizada por meio de um processo de Revisão
Sistemática (RS), cujo protocolo foi omitido neste trabalho,
devido a restrições de espaço. Tais abordagens têm sido
consideradas maduras por outros autores em seus estudos
comparativos [2][15][17], bem como foram divulgadas em
veículos e locais de publicação de qualidade e avaliadas de
forma quantitativa com sistemas reais. Apenas os principais
conceitos dessas abordagens são apresentados; mais detalhes
podem ser encontrados nas referências aqui apresentadas.
A. Separação Multidimensional de Interesses
Esta abordagem propõe que requisitos devem ser
decompostos de forma uniforme com relação a sua natureza
funcional, não funcional ou transversal [8]. Tratando todos os
interesses da mesma forma, pode-se então escolher qualquer
conjunto de interesses como base para analisar a influência
dos outros interesses sobre essa base.
i) Identificação e Representação de Interesses. Tem por
base a observação de que certos interesses, como por exemplo,
mobilidade, recuperação de informação, persistência, entre
outros aparecem frequentemente durante o desenvolvimento
de software. Assim, os autores dividiram o espaço de
interesses em dois: i) o dos metainteresses, que consiste em
um conjunto abstrato de interesses típicos, como os que foram
mencionados acima; e ii) o do sistema, que contempla os
interesses específicos do sistema do usuário. Para se utilizar
esta abordagem, os requisitos do sistema devem ser analisados
pelo engenheiro de requisitos e categorizados com base nos
interesses existentes no espaço de metainteresses, gerando
assim os interesses concretos. Para representação dos
interesses, tanto os abstratos (metainteresses) quanto os
concretos, são utilizados templates XML.
ii) Composição de Interesses. Após a representação dos
interesses, regras de composição são definidas para se
especificar como um determinado interesse influencia outros
requisitos ou interesses do sistema. As regras de composição
também são especificadas por meio de templates XML. Na
Figura 2 é apresentado um exemplo de regra de composição
na qual o interesse “Recuperação de Informações” afeta todos
os requisitos do interesse de “Customização” (especificado
pelo atributo id = “all”), o requisito 1 do interesse
“Navegação” e o requisito 1 do interesse “Mobilidade”
(especificados pelo atributo id = "1"), incluindo seus sub-requisitos (especificado pelo atributo children = "include").
<?xml version="1.0" ?>
<Composition>
  <Requirement concern="InformationRetrieval" id="all">
    <Constraint action="provide" operator="during">
      <Requirement concern="Customisability" id="all" />
      <Requirement concern="Navigation" id="1" />
      <Requirement concern="Mobility" id="1" children="include" />
    </Constraint>
    <Outcome action="fulfilled" />
  </Requirement>
</Composition>
Figura 2. Regras de composição para o interesse "Recuperação da Informação" (adaptado de Moreira et al. [8]).

iii) Análise e Resolução de Conflitos. É realizada a partir da observação das interações de um interesse com os outros do sistema. Sejam C1, C2, C3, ..., Cn os interesses concretos de um determinado sistema e SC1, SC2, SC3, ..., SCn os conjuntos de interesses que eles entrecortam, respectivamente.
Para se identificar os conflitos entre C1 e C2 deve-se
analisar a Interseção de Composição SC1 ∩ SC2. Uma
Interseção de Composição é definida por: seja o interesse Ca
membro de SC1 e SC2. Ca aparece na interseção de
composição SC1 ∩ SC2 se, e somente se, C1 e C2 afetarem o
mesmo conjunto de requisitos presentes em Ca. Por exemplo,
na Figura 2, nota-se que o interesse “Recuperação de
Informações” afeta o requisito 1 do interesse “Navegação”.
Supondo que o interesse “Mobilidade” também afete esse
requisito, então SC(Recuperação de Informações) ∩ SC(Mobilidade) = {"Navegação"}.
Os conflitos são analisados com base no tipo de
contribuição que um interesse pode exercer sobre outro com
relação a uma base de interesses. Essas contribuições podem
ser negativas (-), positivas (+) ou neutras. Uma matriz de
contribuição é construída, de forma que cada célula apresenta
o tipo da contribuição (+ ou -) dos interesses em questão com
relação aos interesses do conjunto de interseções de
composição localizado dentro da célula. Uma célula vazia
denota a não existência de relacionamento entre os interesses.
Se a contribuição é neutra, então apenas o conjunto de
interseções de composição é apresentado.
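A título de ilustração, e não como parte da abordagem original, o esboço em Python a seguir mostra como a interseção de composição e uma matriz de contribuição simplificada poderiam ser calculadas a partir dos conjuntos de requisitos afetados por cada interesse; os dados, nomes de funções e estruturas são hipotéticos.

# Esboço ilustrativo (hipotético): interseção de composição e matriz de
# contribuição simplificada, a partir dos requisitos afetados por cada interesse.
from itertools import combinations

# Para cada interesse concreto, os interesses entrecortados e os requisitos
# afetados em cada um deles (dados meramente ilustrativos).
afetados = {
    "RecuperacaoDeInformacoes": {"Customizacao": {"all"}, "Navegacao": {"1"}, "Mobilidade": {"1"}},
    "Mobilidade": {"Navegacao": {"1"}, "Customizacao": {"2"}},
}

# Contribuições (+, - ou None para neutra) informadas pelo engenheiro de requisitos.
contribuicoes = {("RecuperacaoDeInformacoes", "Mobilidade"): "-"}

def intersecao_de_composicao(c1, c2, afetados):
    # Interesses afetados por c1 e por c2 sobre o MESMO conjunto de requisitos.
    sc1, sc2 = afetados[c1], afetados[c2]
    return {ca for ca in sc1.keys() & sc2.keys() if sc1[ca] == sc2[ca]}

def matriz_de_contribuicao(afetados, contribuicoes):
    # Cada célula guarda o sinal da contribuição e a interseção de composição;
    # pares sem interseção ficam fora da matriz (célula vazia).
    matriz = {}
    for c1, c2 in combinations(sorted(afetados), 2):
        intersecao = intersecao_de_composicao(c1, c2, afetados)
        if intersecao:
            sinal = contribuicoes.get((c1, c2)) or contribuicoes.get((c2, c1))
            matriz[(c1, c2)] = (sinal, intersecao)
    return matriz

print(matriz_de_contribuicao(afetados, contribuicoes))
# {('Mobilidade', 'RecuperacaoDeInformacoes'): ('-', {'Navegacao'})}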
B. Theme
A abordagem Theme [9][10] apoia EROA em dois níveis:
a) de requisitos, por meio da Theme/Doc, que fornece
visualizações para requisitos textuais que permitem expor o
relacionamento entre comportamentos em um sistema; b) de
projeto, por meio da Theme/UML, que permite ao
desenvolvedor modelar os interesses base e transversais de um
sistema e especificar como eles podem ser combinados.
i) Identificação de Interesses. Para esta atividade o
engenheiro de software dispõe da visualização de ações, um
tipo de visualização dos requisitos do sistema proposto pelos
autores. Duas entradas são obrigatórias para se gerar uma
visualização de ações: i) uma lista de ações-chaves, isto é,
verbos identificados pelo engenheiro de software ao analisar o
documento de requisitos; e ii) o conjunto de requisitos do
sistema. Na Figura 3 é apresentada a visualização de ações
criada a partir de um conjunto de requisitos e de uma lista de
ações-chaves de um pequeno sistema de gerenciamento de
cursos [9]. As ações-chaves são representadas por losangos e
os requisitos do texto, por caixas com bordas arredondadas.
Se um requisito contém uma ação-chave em sua descrição,
então ele é associado a essa ação-chave por meio de uma seta
da caixa com borda arredondada para o losango correspondente à ação.
Figura 3. Exemplo de uma visualização de ações [9].
A ideia é utilizar essa visualização para separar e isolar
ações e requisitos em dois grupos: 1) o grupo “base” que é
autocontido, ou seja, não possui requisitos que se referem a
ações do outro grupo; e 2) o grupo “transversal” que possui
requisitos que se referem a ações do grupo base. Para atingir
essa separação em grupos, o engenheiro de software deve
examinar os requisitos para classificá-los em um dos grupos.
Caso o engenheiro de software decida que uma ação principal
entrecorta as demais ações do requisito em questão, então uma
seta de cor cinza com um ponto em uma de suas extremidades
é traçada da ação que entrecorta para a ação que é
entrecortada. Na Figura 3, denota-se que a ação logged
entrecorta as ações unregister, give e register.
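Apenas para ilustrar o mecanismo descrito, e não a implementação da Theme/Doc, o esboço em Python abaixo associa cada ação-chave aos requisitos que a mencionam, que é a informação básica exibida em uma visualização de ações; os requisitos e as ações-chave são hipotéticos.

# Esboço ilustrativo (hipotético): associação entre ações-chave e requisitos,
# informação básica de uma visualização de ações.
import re

requisitos = {
    "R1": "Students can register for courses.",
    "R2": "Students can unregister from courses.",
    "R3": "All register and unregister actions must be logged.",
}
acoes_chave = ["register", "unregister", "logged"]

def visao_de_acoes(requisitos, acoes_chave):
    # Para cada ação-chave, lista os requisitos cuja descrição a contém.
    visao = {}
    for acao in acoes_chave:
        padrao = re.compile(rf"\b{re.escape(acao)}\b", re.IGNORECASE)
        visao[acao] = [rid for rid, texto in requisitos.items() if padrao.search(texto)]
    return visao

print(visao_de_acoes(requisitos, acoes_chave))
# {'register': ['R1', 'R3'], 'unregister': ['R2', 'R3'], 'logged': ['R3']}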
ii) Representação e Composição de Interesses. Para
essas atividades utiliza-se Theme/UML, que trabalha com o
conceito de temas - elementos utilizados para representar
interesses e que podem ser do tipo base ou transversal. Os
temas base encapsulam as funcionalidades do domínio do
problema, enquanto que os transversais encapsulam os
interesses que afetam os temas base. A representação gráfica
de um tema é um pacote da UML denotado com o estereótipo
<<theme>>. Os temas transversais são representados por meio
de gabaritos da UML, que permitem encapsular o
comportamento transversal independentemente do tema base,
ou seja, sem considerar os pontos reais do sistema que serão
afetados. Um gabarito é representado graficamente por um
pacote da UML com um parâmetro no canto superior direito,
um template.
Após a especificação do sistema em temas base e
transversais, é necessário realizar a composição deles. Para
isso utiliza-se o relacionamento de ligação (bind), que
descreve para quais eventos ocorridos nos temas base o
comportamento do tema transversal deve ser disparado.
Para auxiliar o engenheiro de software a descobrir e
representar os temas e seus relacionamentos, visualizações de
temas (theme view) são utilizadas. Elas diferem das
visualizações de ações, pois não apresentam apenas requisitos
e ações, mas também entidades do sistema (informadas pelo
engenheiro de software) que serão utilizadas na modelagem
dos temas.
iii) Análise e Resolução de Conflitos. Os trabalhos
analisados sobre a abordagem Theme não apresentaram
detalhes sobre a realização dessa atividade.
C. EA-Miner
A abordagem EA-Miner segue o processo genérico
apresentado na Figura 1, o qual foi definido pelos mesmos
autores dessa abordagem. Além disso, os autores propuseram
uma suíte de ferramentas que apoiam as atividades desse
processo [4][5].
Essas ferramentas exercem dois tipos de papéis: i) gerador
de informações: que analisa os documentos de entrada e os
complementa com informações linguísticas, semânticas,
estatísticas e com anotações; e ii) consumidor de informações:
que utiliza as anotações e informações adicionais atribuídas ao
conjunto de entrada para múltiplos tipos de análise.
A principal geradora de informações da abordagem EA-Miner é a ferramenta WMATRIX [6], uma aplicação web para
Processamento de Linguagem Natural (PLN), que é utilizada
por essa abordagem para identificação de conceitos do
domínio do sistema.
i) Identificação de Interesses. É realizada pela ferramenta
EA-Miner (Early Aspect Mining), que recebe o mesmo nome
da abordagem. Para identificação de interesses transversais
não funcionais, EA-Miner constrói uma árvore de requisitos
não funcionais com base no catálogo de Chung e Leite [18].
Os interesses transversais são identificados pela equivalência
semântica entre as palavras do documento de requisitos e as
categorias desse catálogo. Para identificação de interesses
transversais funcionais, EA-Miner utiliza uma estratégia
semelhante à da abordagem Theme, detectando a ocorrência
de verbos repetidos no documento de requisitos, o que pode
sugerir a presença de interesses transversais funcionais.
ii) Representação e Composição de Interesses. Para esta
atividade, utiliza-se a ferramenta ARCADE (Aspectual
Requirements Composition and Decision). Com ela, o
engenheiro de software pode selecionar quais requisitos são
afetados pelos interesses do sistema, escolher os
relacionamentos existentes entre eles e, posteriormente, gerar
regras de composição. ARCADE utiliza a mesma ideia de regra de composição da abordagem "Separação Multidimensional de Interesses" [8].
iii) Análise e Resolução de Conflitos. ARCADE possui
também um componente analisador de conflitos, o qual
identifica sobreposição entre aspectos com relação aos
requisitos que eles afetam. O engenheiro de requisitos é
alertado sobre essa sobreposição e decide se os aspectos
sobrepostos prejudicam ou favorecem um ao outro.
D. Processo baseado em XML para Especificação e
Composição de Interesses Transversais
O processo de Soeiro et al. [7] é composto das seguintes
atividades: identificar, especificar e compor interesses.
i) Identificação de Interesses. Ocorre por meio da análise
da descrição do sistema feita por parte do engenheiro de
software. Os autores indicam que a identificação dos
interesses pode ser auxiliada pelo uso de catálogos de
requisitos não funcionais, como o proposto por Chung e Leite
[18]. Para cada entrada do catálogo, deve-se decidir se o
interesse em questão existe ou não no sistema em análise.
ii) Representação e Composição de Interesses. Para
essas atividades foram criados templates XML com o intuito
de coletar e organizar todas as informações a respeito de um
interesse. A composição dos interesses do sistema ocorre por
regras de composição, que consistem dos seguintes elementos:
• Term: pode ser um interesse ou outra regra de composição.
• Operator: define o tipo de operação (>>, [> ou ||). C1 >> C2 refere-se a uma composição sequencial e significa que o comportamento de C2 inicia-se se, e somente se, C1 tiver terminado com sucesso. C1 [> C2 significa que C2 interrompe o comportamento de C1 quando começa a executar. C1 || C2 significa que o comportamento de C1 está sincronizado com o de C2.
• Outcome: expressa o resultado das restrições impostas pelos operadores comentados anteriormente (ver esboço ilustrativo após esta lista).
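A título de ilustração (estrutura hipotética, não definida pelos autores da abordagem), uma regra de composição com os elementos Term, Operator e Outcome poderia ser representada em Python como no esboço abaixo; os operadores são apenas rotulados, sem semântica executável.

# Esboço ilustrativo (hipotético): representação dos elementos Term, Operator
# e Outcome de uma regra de composição.
from dataclasses import dataclass
from typing import Union

# Um termo é um interesse (nome) ou outra regra de composição.
Term = Union[str, "RegraDeComposicao"]

OPERADORES = {
    ">>": "composição sequencial: C2 inicia somente se C1 terminar com sucesso",
    "[>": "interrupção: C2 interrompe o comportamento de C1 ao começar a executar",
    "||": "sincronização: os comportamentos de C1 e C2 são sincronizados",
}

@dataclass
class RegraDeComposicao:
    termo1: Term
    operador: str   # ">>", "[>" ou "||"
    termo2: Term
    outcome: str    # resultado das restrições impostas pelo operador

    def __post_init__(self):
        if self.operador not in OPERADORES:
            raise ValueError(f"Operador desconhecido: {self.operador}")

# Exemplo hipotético: "Autenticacao" deve terminar com sucesso antes de "Compra",
# e essa composição é sincronizada com o interesse "Logging".
regra = RegraDeComposicao(
    RegraDeComposicao("Autenticacao", ">>", "Compra", outcome="compra autorizada"),
    "||",
    "Logging",
    outcome="operacoes registradas",
)
print(regra)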
iii) Análise e Resolução de Conflitos. Os trabalhos
analisados sobre essa abordagem não apresentaram detalhes
sobre a realização desta atividade.
E. EROA baseada em Pontos de Vista
Rashid et al. [13][14] propuseram uma abordagem para
EROA baseada em pontos de vista (viewpoints). São utilizados
templates XML para especificação dos pontos de vista, dos
interesses transversais e das regras de composição entre
pontos de vista e interesses transversais do sistema. Além
disso, a ferramenta ARCADE automatiza a tarefa de
representação dos conceitos mencionados anteriormente com
base nos templates XML pré-definidos na abordagem.
A primeira atividade dessa abordagem consiste na
Identificação e Especificação dos Requisitos do Sistema e,
para isso, pontos de vista são utilizados.
i) Identificação e Representação de Interesses. É
realizada por meio da análise dos requisitos iniciais do sistema
pelo engenheiro de software. De modo análogo ao que é feito
com os pontos de vista, interesses também são especificados
em arquivos XML. Após a identificação dos pontos de vista e
dos interesses, é necessário detectar quais desses interesses são
candidatos a interesses transversais. Para isso cria-se uma
matriz de relacionamento, na qual os interesses do sistema são
colocados em suas linhas e os pontos de vista, nas colunas.
Cada célula dessa matriz, quando marcada, representa que um
determinado interesse exerce influência sobre os requisitos do
ponto de vista da coluna correspondente daquela célula. Sendo
assim, é possível observar quais pontos de vista são
entrecortados pelos interesses do sistema. Segundo os autores,
quando um interesse entrecorta os requisitos de vários pontos
de vista do sistema, isso pode indicar que se trata de um
interesse transversal.
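Apenas como ilustração da ideia da matriz de relacionamento (dados e limiar hipotéticos, não definidos pelos autores), o esboço em Python abaixo marca como candidatos a interesses transversais aqueles que entrecortam mais de um ponto de vista.

# Esboço ilustrativo (hipotético): matriz de relacionamento entre interesses
# (linhas) e pontos de vista (colunas) para apontar candidatos a transversais.
pontos_de_vista = ["Cliente", "Operador", "Administrador"]

# Pontos de vista cujos requisitos são influenciados por cada interesse.
influencia = {
    "Seguranca": {"Cliente", "Operador", "Administrador"},
    "Persistencia": {"Cliente", "Administrador"},
    "Relatorios": {"Administrador"},
}

def matriz_de_relacionamento(influencia, pontos_de_vista):
    # Célula marcada com "X" quando o interesse influencia o ponto de vista.
    return {
        interesse: ["X" if pv in afetados else "" for pv in pontos_de_vista]
        for interesse, afetados in influencia.items()
    }

def candidatos_a_transversais(influencia, minimo=2):
    # Interesses que entrecortam pelo menos 'minimo' pontos de vista (limiar hipotético).
    return [i for i, afetados in influencia.items() if len(afetados) >= minimo]

print(matriz_de_relacionamento(influencia, pontos_de_vista))
print(candidatos_a_transversais(influencia))  # ['Seguranca', 'Persistencia']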
ii) Composição de Interesses e Análise e Resolução de
Conflitos. Após a identificação dos candidatos a interesses
transversais e dos pontos de vista do sistema, os mesmos
devem ser compostos por meio de regras de composição e,
posteriormente, a análise e resolução de conflitos deve ser
realizada. A definição das regras de composição e da atividade
de análise e resolução de conflitos segue a mesma ideia da
abordagem “Separação Multidimensional de Interesses” [8].
F. Aspect-oriented Component Requirements Engineering
(AOCRE)
Whittle e Araújo [11] desenvolveram um processo de alto
nível para criar e validar interesses transversais e não
transversais. O processo se inicia com um conjunto de
requisitos adquiridos pela aplicação de técnicas usuais para
este fim.
i) Identificação e Representação de Interesses. Os
interesses funcionais e não funcionais são identificados a
partir dos requisitos do sistema. Os interesses funcionais são
representados por meio de casos de uso da UML e os
interesses não funcionais, por um template específico com as
informações: i) fonte do interesse (stakeholders, documentos,
entre outros); ii) requisitos a partir dos quais ele foi
identificado; iii) sua prioridade; iv) sua contribuição para
outro interesse não funcional; e v) os casos de uso (interesses
funcionais) afetados por ele.
Com base na análise do relacionamento entre interesses
funcionais e não funcionais, os candidatos a interesses
transversais são identificados e, posteriormente, refinados em
um conjunto de cenários. Cenários transversais (derivados dos
interesses transversais) são representados por IPSs
(Interaction Pattern Specifications) e cenários não transversais
são representados por diagramas de sequência da UML.
IPS é um tipo de Pattern Specifications (PSs) [23], um
modo de se representar formalmente características estruturais
e comportamentais de um determinado padrão. PSs são
definidas por um conjunto de papéis (roles) da UML e suas
respectivas propriedades. Dado um modelo qualquer, diz-se
que ele está em conformidade com uma PS se os elementos
desse modelo, que desempenham os papéis definidos na PS,
satisfazem a todas as propriedades definidas para esses papéis.
IPSs servem para especificar formalmente a interação entre
papéis de um software.
ii) Composição de Interesses. Cenários transversais são
compostos com cenários não transversais. A partir desse
conjunto de cenários compostos e de um algoritmo
desenvolvido pelos autores da abordagem, é gerado um
conjunto de máquinas de estados executáveis que podem ser
simuladas em ferramentas CASE para validar tal composição.
iii) Análise e Resolução de Conflitos. Os trabalhos
analisados sobre essa abordagem não apresentaram detalhes
sobre a realização dessa atividade.
IV. CRITÉRIOS PARA COMPARAÇÃO DE ABORDAGENS PARA EROA
O conjunto de critérios apresentado nesta seção foi
elaborado de acordo com: i) a experiência dos autores deste
trabalho que conduziram o processo de RS; ii) os trabalhos
relacionados à avaliação de abordagens para identificação de
interesses transversais [1][2][15][16][17][19][20]; e iii) os
trabalhos originais que descrevem as abordagens selecionadas
para comparação [4][5][7][8][9][10][11][13][14].
A confecção desse conjunto de critérios seguiu o seguinte
procedimento: i) a partir da leitura dos trabalhos relacionados
à avaliação de abordagens para identificação de interesses
transversais (obtidos por meio da RS) foi criado um conjunto
inicial de critérios; ii) esse conjunto foi verificado pelos
autores deste trabalho e aprimorado com novos critérios ou
adaptado com os já elencados; e iii) os critérios elencados
foram aplicados às abordagens apresentadas na Seção 3.
A. Tipo de Simetria: Assimétrica ou Simétrica
Abordagens para EROA podem ser classificadas como: a)
assimétricas – quando há distinção e tratamento explícitos
para os interesses transversais e não transversais; b) simétricas
– quando todos os interesses são tratados da mesma maneira.
É importante conhecer tal característica das abordagens para
EROA, pois ela fornece indícios sobre:
• a representatividade da abordagem em questão: em geral, abordagens assimétricas possuem melhor representatividade, uma vez que os modelos gerados por meio delas possuem elementos que fazem distinção explícita entre interesses transversais e não transversais. Isso pode favorecer o entendimento desses modelos e, consequentemente, do software sob análise; e
• a compatibilidade com outras abordagens para EROA: conhecer se uma abordagem é simétrica ou não pode auxiliar pesquisadores e profissionais a refletirem sobre o esforço necessário para adaptar essa abordagem às suas necessidades. Por exemplo, criando mecanismos para integrá-la com outras abordagens já existentes.
Para cada abordagem analisada com esse critério as
seguintes informações devem ser coletadas: nome da
abordagem em questão, tipo de simetria (simétrica ou
assimétrica) e descrição. Essa última informação especifica os
elementos de abstração utilizados para tratar com interesses
transversais e não transversais, o que explica a sua
classificação como simétrica ou assimétrica.
Para todos os critérios mencionados nas próximas
subseções, o nome da abordagem em análise foi uma das
informações coletadas e não será comentada.
B. Cobertura: Completa ou Parcial
Com esse critério pretende-se responder à seguinte
questão: “A abordagem contempla as principais atividades
preconizadas pela EROA?” Neste trabalho, considera-se como
completa a abordagem que engloba as principais atividades
descritas no processo genérico para EROA apresentado na
Figura 1, isto é, “Identificação”, “Representação” e
“Composição” de interesses e “Análise e Resolução de
Conflitos”. Uma abordagem parcial é aquela que trata apenas
com um subconjunto (não vazio) dessas atividades. Para cada
abordagem analisada com esse critério deve-se obter o tipo de
cobertura. Se for cobertura parcial, deve-se destacar as
atividades contempladas pela abordagem.
C. Propósito: Geral ou Específico
Este critério tem a finalidade de avaliar uma abordagem
quanto ao seu propósito, ou seja, se é específica para algum
tipo de interesse (por exemplo, interesses transversais
funcionais, interesses transversais não funcionais, interesses
de persistência, segurança, entre outros) ou se é de propósito
geral. Se o propósito da abordagem for específico, deve-se
destacar os tipos de interesses contemplados por ela.
D. Técnicas Utilizadas
Este critério elenca as técnicas utilizadas pela abordagem
para realização de suas atividades. Por exemplo, para a
atividade de identificação de interesses transversais não
funcionais, uma abordagem A pode utilizar técnicas de PLN,
juntamente com um conjunto de palavras-chave, enquanto que
outra abordagem B pode utilizar apenas catálogos de
requisitos não funcionais e análise manual dos engenheiros de
software.
Para esse critério, as seguintes informações são obtidas: i)
atividade da EROA contemplada pela abordagem; e ii) tipo de
técnicas utilizadas para realização dessa atividade.
E. Nível de Envolvimento do Usuário: Amplo ou Pontual
O envolvimento do usuário é amplo quando há
participação efetiva do usuário na maior parte das atividades
propostas pela abordagem, sem que ele seja auxiliado por
qualquer tipo de recurso ou artefato que vise a facilitar o seu
trabalho. Essa participação efetiva pode ocorrer por meio da:
i) inclusão de informações extras; ii) realização de análises
sobre artefatos de entrada e/ou saída; e iii) tradução de
informações de um formato para outro.
Um exemplo de participação efetiva do usuário ocorre
quando ele deve fornecer informações adicionais, além
daquelas constantes no documento de requisitos do sistema
para identificação dos interesses do sistema (por exemplo, um
conjunto de palavras-chave a ser confrontado com o texto do
documento de requisitos). Outro exemplo seria se a
representação de interesses do sistema fosse feita
manualmente, pelo usuário, de acordo com algum template
pré-estabelecido (em um arquivo XML ou diagrama da UML).
Um envolvimento pontual significa que o usuário pode
intervir no processo da EROA para tomar certos tipos de
decisões. Por exemplo, resolver um conflito entre dois
interesses que se relacionam. Sua participação, porém, tem a
finalidade de realizar atividades de níveis mais altos de
abstração, que dificilmente poderiam ser automatizadas. É
importante analisar tal critério, pois o tipo de envolvimento do
usuário pode impactar diretamente na escalabilidade da
abordagem e na produtividade proporcionada pela mesma. O
envolvimento excessivo do usuário pode tornar a abordagem
mais dependente da sua experiência e propensa a erros.
Para comparação das abordagens com base neste critério,
deve-se observar: i) o tipo de envolvimento do usuário exigido
pela abordagem; e ii) a descrição das atividades que o usuário
deve desempenhar.
F. Escalabilidade
Com esse critério, pretende-se conhecer qual é o porte dos
sistemas para os quais a abordagem em análise tem sido
aplicada. Embora algumas abordagens atendam satisfatoriamente a sistemas de pequeno porte, não há
garantias que elas sejam eficientes para sistemas de médio e
grande porte. Os problemas que podem surgir quando o
tamanho do sistema cresce muito, em geral, estão
relacionados: i) à complexidade dos algoritmos utilizados pela
abordagem; ii) à necessidade de envolvimento do usuário, que
dependendo do esforço requisitado, pode tornar impraticável a
aplicação da abordagem em sistema de maior porte; e iii) à
degradação da cobertura e precisão da abordagem; entre
outros.
Para esse critério as seguintes informações devem ser
coletadas: i) o nome do sistema utilizado no estudo de caso em
que a abordagem foi avaliada; ii) os tipos de documentos
utilizados; iii) as medidas de tamanho/complexidade do
sistema (em geral, quando se trata de documentos textuais, os
tamanhos são apresentados em números de páginas e/ou
palavras); e iv) a referência da publicação na qual foi relatada
a aplicação desse sistema à abordagem em questão.
G. Apoio Computacional
Para quais de suas atividades a abordagem em análise
oferece apoio computacional? Essa informação é importante,
principalmente, se o tipo de envolvimento dos usuários
exigido pela abordagem for amplo. Em muitos casos, durante
a avaliação de uma abordagem para EROA, percebe-se um
relacionamento direto entre os critérios “Tipo de
Envolvimento do Usuário” e “Apoio Computacional”. Se a
abordagem exige envolvimento amplo do usuário,
consequentemente, ela deve possuir fraco apoio
computacional; se exige envolvimento pontual, possivelmente
deve oferecer apoio computacional adequado. Porém, essa
relação precisa ser observada com cuidado, pois pode haver
casos em que o fato de uma atividade exigir envolvimento
pontual do usuário para sua execução não esteja diretamente
ligado à execução automática da mesma.
Por exemplo, sejam A e B duas abordagens para EROA
que exijam que o usuário informe um conjunto de palavras-chave para identificação de interesses em um documento de
requisitos. A abordagem A possui um apoio computacional
que varre o texto do documento de requisitos, selecionando
algumas palavras mais relevantes (utilizando-se de técnicas de
PLN) que possam ser utilizadas pelo engenheiro de software
como palavras-chave. A abordagem B não possui apoio
computacional algum, porém disponibiliza uma ontologia com
termos do domínio do sistema em análise e um dicionário de
sinônimos desses termos, que podem ser utilizados pelo
engenheiro de software como diretrizes para elencar o
conjunto de palavras-chave exigido pela abordagem. Neste
caso, as duas abordagens poderiam ser classificadas como
pontuais com relação ao critério “Tipo de Envolvimento do
Usuário”, mesmo B não possuindo apoio computacional.
Entretanto, um engenheiro de software que esteja utilizando a
abordagem A, provavelmente, terminará a tarefa de definição
do conjunto de palavras-chave em menor tempo do que outro
que esteja utilizando a abordagem B. Assim, deve-se conhecer
quais atividades da abordagem para EROA são automatizadas.
Para cada abordagem comparada com esse critério deve-se
obter: i) as atividades da EROA contempladas pela abordagem
em questão; ii) os nomes dos apoios computacionais utilizados
para automatização dessas atividades; e iii) a referência da
publicação, na qual o apoio computacional foi
proposto/apresentado. Uma abordagem pode oferecer mais de
um apoio computacional para uma mesma atividade.
H. Tipo de Avaliação da Abordagem
A quais tipos de avaliação a abordagem em questão para
EROA tem sido submetida? Para as avaliações realizadas, há
um relatório adequado sobre a acurácia da abordagem,
ressaltando detalhes importantes como cobertura, precisão e
tempo necessário para execução das atividades dessa
abordagem?
Para Wohlin et al. [22], a avaliação qualitativa está
relacionada à pesquisa sobre o objeto de estudo, sendo os
resultados apresentados por meio de informações descritas em
linguagem natural, como neste artigo. A avaliação
quantitativa, geralmente, é conduzida por meio de estudos de
caso e experimentos controlados, e os dados obtidos podem
ser comparados e analisados estatisticamente. Estudos de caso
e experimentos visam a observar um atributo específico do
objeto de estudo e estabelecer o relacionamento entre atributos
diferentes, porém, em experimentos controlados o nível de
controle é maior do que nos estudos de caso.
Para este critério deve-se destacar: i) o(s) tipo(s) de
avaliação(ões) realizada(s) sobre a abordagem, listando a
referência da avaliação conduzida (os tipos de avaliação são:
qualitativa, estudo de caso e experimento controlado); e ii) os
resultados obtidos com essa(s) avaliação(ões) realizada(s).
Para o item (ii) sugere-se a coleta dos valores médios obtidos
para as seguintes métricas: cobertura, precisão e tempo de
aplicação da abordagem. Tais métricas foram sugeridas, pois
são amplamente utilizadas para medição da eficácia de
produtos e processos em diversas áreas de pesquisa, tais como
recuperação da informação e processamento de linguagem
natural, entre outras. Na área de EROA, essas métricas têm
sido utilizadas em trabalhos relacionados à identificação de
interesses tanto em nível de código [21], quanto em nível de
requisitos [2][15].
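Apenas a título de ilustração, e assumindo a notação usual da área de recuperação de informação (VP = verdadeiros positivos, FP = falsos positivos, FN = falsos negativos), as métricas de precisão e cobertura podem ser expressas como:

\[ \text{precisão} = \frac{VP}{VP + FP} \qquad\qquad \text{cobertura} = \frac{VP}{VP + FN} \]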
A análise conjunta dos dados deste critério com os do critério "Escalabilidade" pode revelar informações importantes sobre a eficácia e eficiência de uma abordagem para EROA.
V. AVALIAÇÃO DAS ABORDAGENS PARA EROA
As abordagens para EROA apresentadas na Seção 3 foram
comparadas com base nos critérios apresentados na Seção 4.
As siglas utilizadas para o nome das abordagens são: i) SMI - Separação Multidimensional de Interesses; ii) EA-Miner - Early-Aspect Mining; iii) Theme - Abordagem Theme; iv) EROA/XML - Processo baseado em XML para Especificação e Composição de Interesses Transversais; v) EROA/PV - EROA baseada em Pontos de Vista; e vi) AOCRE - Aspect-Oriented Component Requirements Engineering.
Na Tabela 1 encontra-se a avaliação dessas abordagens
quanto ao tipo de simetria, com breve justificativa para o tipo
escolhido.
Tabela 1. TIPO DE SIMETRIA DAS ABORDAGENS PARA EROA.
Abordagem | Tipo de Simetria | Descrição
SMI | Simétrica | Tanto os interesses transversais quanto os não transversais são tratados de modo uniforme. Todos são denominados "interesses" e podem influenciar/restringir uns aos outros.
EA-Miner | Assimétrica | Os interesses transversais são tratados como aspectos e os não transversais, como pontos de vista.
Theme | Assimétrica | Os interesses transversais são tratados como temas transversais e os não transversais, como temas base.
EROA/XML | Simétrica | Tanto os interesses transversais quanto os não transversais são tratados apenas como interesses (concerns).
EROA/PV | Assimétrica | Os interesses transversais são tratados como aspectos e os não transversais, como pontos de vista.
AOCRE | Assimétrica | Os interesses transversais são tratados como IPSs e os não transversais, como diagramas de sequência.
Todas as abordagens analisadas foram consideradas como
de propósito geral, pois contemplam tanto interesses
funcionais quanto não funcionais. Quanto à cobertura, SMI,
EA-Miner e EROA/PV são completas, uma vez que atendem
às principais atividades da EROA definidas no processo da
Figura 1. As abordagens Theme, EROA/XML e AOCRE
foram consideradas parciais, uma vez que não apresentam
apoio à atividade de Análise e Resolução de Conflitos.
As técnicas utilizadas por cada atividade das abordagens
comparadas são descritas na Tabela 2. Nota-se que as técnicas
mais utilizadas para a atividade “Identificação de Interesses”
são o uso de palavras-chave e catálogos para interesses não
funcionais. A técnica baseada em palavras-chave é fortemente
dependente da experiência dos engenheiros de software que a aplicam. Por exemplo, um profissional com pouca experiência no domínio do software em análise ou sobre os conceitos de interesses transversais pode gerar conjuntos vagos de palavras-chave, que podem produzir muitos falsos positivos/negativos.
Além disso, técnicas como essas são ineficazes para detecção
de interesses implícitos, isto é, interesses que não aparecem
descritos no texto do documento de requisitos.
Já para “Representação” e “Composição” de interesses, a
maioria das abordagens optou por criar seus próprios modelos
de representação e composição de interesses utilizando para
isso a linguagem XML. O uso de XML é, muitas vezes,
justificado pelos autores das abordagens por permitir a
definição/representação de qualquer tipo de informação de
forma estruturada e por ser uma linguagem robusta e flexível.
Outra forma de representação, utilizada pelas abordagens
Theme e AOCRE, ocorre por meio de modelos bem
conhecidos da UML, como diagramas de sequência e estados,
para realização dessas atividades. Para a atividade “Análise e
Resolução de Conflitos”, também parece haver um consenso
na utilização de matrizes de contribuição e templates XML.
Tabela 2. TÉCNICAS UTILIZADAS PARA REALIZAÇÃO DAS ATIVIDADES CONTEMPLADAS PELAS ABORDAGENS PARA EROA.
Abordagem | Identificação de Interesses | Representação de Interesses | Composição de Interesses | Análise & Resolução de Conflitos
1 | Palavras-chave e Técnicas de Visualização | Temas e Diagramas UML | Temas e Templates UML | -
2 | Palavras-chave e Catálogo de INF | Templates XML | Templates XML e Regras de Composição | Matriz de Contribuição e Templates XML
3 | Catálogo de INF | Template XML Estendido | Templates XML e Regras de Composição | Matriz de Contribuição e Templates XML
4 | Catálogo de INF | Template XML | Templates XML e Regras de Composição | -
5 | Matriz de Relacionamento | Pontos de Vista e Templates XML | Templates XML e Regras de Composição | Matriz de Contribuição e Templates XML
6 | Diagramas de Casos de Uso, Diagramas de Sequência e Template específico para INF | Diagramas de Sequência e IPSs | IPSs e Máquinas de Estado | -
Legenda: 1) Theme; 2) EA-Miner; 3) SMI; 4) EROA/XML; 5) EROA/PV; 6) AOCRE; INF: Interesses não funcionais.
O tipo de envolvimento do usuário requerido pelas abordagens analisadas é apresentado na Tabela 3. A maioria das abordagens (SMI, Theme, EROA/XML, EROA/PV e AOCRE) foi classificada como exigindo envolvimento amplo de seus usuários. Isto pode ser um fator impactante na escalabilidade e na acurácia dessas abordagens quando sistemas de larga escala forem analisados com elas.
A abordagem EA-Miner, entretanto, requer interferência pontual do usuário, sendo a sua participação em atividades mais estratégicas do que mecânicas. Isso se deve, em parte, à utilização de uma suíte de ferramentas computacionais de apoio à execução desta abordagem.
Tabela 3. TIPO DE ENVOLVIMENTO DO USUÁRIO REQUERIDO PELAS ABORDAGENS PARA EROA.
Abordagem | Envolvimento | Atividades Desenvolvidas
SMI | A | Especificação dos interesses concretos do sistema a partir de um conjunto de metainteresses; Representação dos interesses em templates de arquivos XML; Definição de regras de composição; Definição da contribuição (positiva ou negativa) e das prioridades de um interesse sobre o(s) outro(s).
EA-Miner | P | Tomada de decisão com relação às palavras ambíguas detectadas no documento de requisitos; Definição de regras de composição; Definição da contribuição (positiva ou negativa) e das prioridades de um interesse sobre o(s) outro(s).
Theme | A | Definição de um conjunto de ações e entidades-chave; Análise manual das visualizações geradas pela abordagem com o objetivo de encontrar interesses base e transversais; Construção manual dos temas a partir das visualizações geradas pela abordagem; Definição de regras de composição.
EROA/XML | A | Identificação manual dos interesses do sistema; Representação dos interesses em templates de arquivos XML; Definição de regras de composição.
EROA/PV | A | Identificação manual dos interesses do sistema; Representação dos interesses em templates de arquivos XML; Definição de regras de composição; Definição da contribuição (positiva ou negativa) e das prioridades de um interesse sobre o(s) outro(s).
AOCRE | A | Identificação manual dos interesses do sistema; Representação dos interesses em cenários; Definição de diagramas de sequência e IPSs.
Legenda: A (Amplo); P (Pontual)
Na Tabela 4 estão descritas as ferramentas disponibilizadas por cada abordagem para automatização de suas atividades. Nota-se que a abordagem mais completa em termos de apoio computacional é a EA-Miner, pois todas as suas atividades são automatizadas em parte ou por completo.
Por exemplo, a atividade de composição de interesses é
totalmente automatizada pela ferramenta ARCADE. O usuário
precisa apenas selecionar os interesses a serem compostos e
toda regra de composição é gerada automaticamente.
ARCADE trabalha com base nos conceitos da abordagem
SMI, automatizando as suas atividades. Nota-se ainda que as
atividades melhor contempladas com recursos computacionais
são “Representação” e “Composição de Interesses”. Dessa
forma, as atividades para EROA que exigem maior atenção da
comunidade científica para confecção de apoios
computacionais são “Identificação de Interesses” e “Análise e
Resolução de Conflitos". A aplicação dos critérios escalabilidade e tipo de avaliação às abordagens analisadas é apresentada nas Tabelas 5 e 6.
Tabela 4. APOIO COMPUTACIONAL DAS ABORDAGENS PARA EROA.
Abordagem | Atividade | Apoio Computacional | Ref.
Theme | Representação de Interesses | Plugin Eclipse Theme/UML | [9]
EA-Miner | Identificação de Interesses | EA-Miner, WMATRIX e RAT (Requirement Analysis Tool) | [4]
EA-Miner | Triagem | KWIP (Key Word In Phrase) | [4]
EA-Miner | Representação e Composição de Interesses e Análise e Resolução de Conflitos | ARCADE | [4]
SMI | Representação e Composição de Interesses e Análise e Resolução de Conflitos | ARCADE | [4]
EROA/XML | Representação e Composição de Interesses | APOR (AsPect-Oriented Requirements tool) | [7]
EROA/PV | Representação e Composição de Interesses e Análise e Resolução de Conflitos | ARCADE | [4]
AOCRE | Composição de Interesses | Algoritmo proposto pelos autores | [11]
Tabela 5. ESCALABILIDADE DAS ABORDAGENS PARA EROA.
Abordagem | Sistema | Documentos Utilizados | Tamanho | Ref.
SMI, EROA/XML e EROA/PV | Health Watcher | Documento de Requisitos e Casos de Uso | 19 páginas; 3.900 palavras | [2]
EA-Miner e Theme | Complaint System | Documento de Requisitos | 230 páginas | [15]
EA-Miner e Theme | ATM System | Documento de Requisitos | 65 páginas | [15]
As abordagens SMI, Theme e EROA/PV são as
abordagens mais avaliadas, tanto qualitativamente quanto
quantitativamente. Isso ocorre, pois essas foram algumas das
primeiras abordagens para EROA. Outros pontos interessantes
são que: i) EROA/XML não havia ainda sido avaliada
qualitativamente, de acordo com a revisão de literatura
realizada neste trabalho; e ii) não foram encontrados estudos
quantitativos que contemplassem a abordagem AOCRE.
Quanto à escalabilidade ressalta-se que a maioria delas,
com exceção da AOCRE, foi avaliada com documentos de
requisitos de médio e grande porte. EA-Miner e Theme foram
avaliadas com documentos de requisitos mais robustos (295
páginas de documentos, no total).
Com os valores presentes na Tabela 6 percebe-se que as
abordagens SMI, Theme, EROA/XML e EROA/PV
apresentaram os maiores tempos para execução das atividades
da EROA em proporção ao tamanho do documento de
requisitos. EA-Miner foi classificada neste trabalho como a
única abordagem que exige envolvimento pontual de seus
usuários. Infere-se que, pelo envolvimento pontual de seus
usuários, ela apresentou os melhores resultados com relação
ao tempo para realização das atividades da EROA.
Com base nos estudos de casos realizados observa-se que
quanto à cobertura e à precisão das abordagens, a identificação
de interesses base é melhor do que a de interesses transversais.
A justificativa para isso é que interesses base são mais
conhecidos e entendidos pela comunidade científica [2]. Além
disso, tais requisitos aparecem no documento de requisitos de
forma explícita, mais bem localizada e isolada, facilitando sua
identificação. Dessa forma, a atividade de identificação de
interesses em documentos de requisitos configura-se ainda um
problema de pesquisa relevante e desafiador e que merece a
atenção da comunidade científica.
VI. TRABALHOS RELACIONADOS
A literatura contém diversos trabalhos com o objetivo de
avaliar qualitativa ou quantitativamente as abordagens para
EROA. Herrera et al. [15] apresentaram uma análise quanto à
acurácia das abordagens EA-Miner e Theme, quando são
utilizados documentos de requisitos de dois sistemas de
software reais. As métricas relacionadas à eficácia e à
eficiência das abordagens, como cobertura, precisão e tempo,
foram as que receberam maior enfoque. Sendo assim, poucos
aspectos qualitativos das abordagens analisadas foram
levantados, como tipo de simetria, cobertura, entre outros.
Nessa mesma linha, Sampaio et al. [2] apresentaram um
estudo quantitativo para as abordagens: EROA/PV, SMI,
EROA/XML e Goal-based AORE. Foi avaliada a acurácia e a
eficiência dessas abordagens. Por se tratarem de abordagens
com características bem distintas, os autores elaboraram
também um mapeamento entre os principais conceitos delas e
criaram um esquema de nomenclatura comum para EROA.
Outros trabalhos, em formato de surveys [1][16][17]
[19][20], foram propostos com o intuito de comparar
abordagens para EROA, descrevendo as principais
características de cada abordagem. Entretanto, cada um desses
trabalhos considerou apenas um conjunto restrito e distinto de
características dessas abordagens, criando assim, um gap que
dificulta a compreensão mais abrangente das características
comuns e específicas de cada abordagem.
Singh e Gill [20] e Chitchyan et al. [1] fizeram a
caracterização de algumas abordagens para EROA, sem
utilizar um conjunto de critérios. Bakker et al. [19]
compararam algumas abordagens com relação: i) ao objetivo
da abordagem; ii) às atividades contempladas; iii) ao apoio
computacional oferecido; iv) aos artefatos utilizados; e v) à
rastreabilidade. Porém, não há informações sobre a acurácia
dessas abordagens, nem sobre os estudos avaliativos
realizados com elas. Bombonatti e Melnikoff [16]
compararam as abordagens considerando apenas os tipos de
interesses (funcionais ou não funcionais) e atividades da
EROA contemplados por essas abordagens. Rashid et al. [17]
comparam as abordagens para EROA sob o ponto de vista dos
objetivos da Engenharia de Requisitos, separação de
interesses, rastreabilidade, apoio à verificação de consistência,
entre outros.
A principal diferença deste trabalho em relação aos demais
comentados anteriormente está no fato de que o conjunto de
critérios proposto contempla não apenas os pontos qualitativos
comuns e específicos das abordagens para EROA analisadas,
mas proporciona um vínculo com informações quantitativas
obtidas por outros pesquisadores em trabalhos relacionados.
Tabela 6. TIPOS DE AVALIAÇÃO REALIZADOS COM AS ABORDAGENS PARA EROA.
Abordagem | Tipo de Avaliação (Q / EC / EXC) | Cobertura | Precisão | Tempo
SMI | Q: [1][16][17][20]; EC: [2]; EXC: - | IB: 100%; ITF: 50%; ITNF: 70% | IB: 88%; ITF: 100%; ITNF: 77% | 104 min
EA-Miner | Q: [1][17]; EC: [15]; EXC: - | Complaint System: IB: 64%; ITF: 64%; ITNF: 45%; ATM System: IB: 86%; ITF: 80%; ITNF: 100% | Complaint System: IB: 31%; ITF: 78%; ITNF: 71%; ATM System: IB: 35%; ITF: 63%; ITNF: 71% | Complaint System: 70 min; ATM System: 140 min
Theme | Q: [1][17][19][20]; EC: [15]; EXC: - | Complaint System: IB: 73%; ITF: 55%; ITNF: 73%; ATM System: IB: 86%; ITF: 73%; ITNF: 40% | Complaint System: IB: 48%; ITF: 86%; ITNF: 80%; ATM System: IB: 50%; ITF: 91%; ITNF: 50% | Complaint System: 760 min; ATM System: 214 min
EROA/XML | Q: -; EC: [2]; EXC: - | IB: 100%; ITF: 50%; ITNF: 55% | IB: 88%; ITF: 100%; ITNF: 100% | 173 min
EROA/PV | Q: [1][16][17][19][20]; EC: [2]; EXC: - | IB: 100%; ITF: 0%; ITNF: 100% | IB: 70%; ITF: 0%; ITNF: 83% | 62 min
AOCRE | Q: [1][17][20]; EC: -; EXC: - | - | - | -
Legenda: IB: Interesses Base; ITF: Interesses Transversais Funcionais; ITNF: Interesses Transversais Não Funcionais. Q: Qualitativa. EC: Estudo de Caso. EXC: Experimento Controlado.
Além disso, tais critérios compreendem um framework
comparativo que pode ser estendido para contemplar outros
tipos de critérios relacionados à área de ER.
VII. CONSIDERAÇÕES FINAIS
A grande variedade de abordagens para EROA existentes
na literatura, com características diferentes, tem tornado difícil
a escolha da mais adequada às necessidades dos usuários. Este
trabalho apresentou um conjunto de critérios para comparação
de abordagens para EROA, concebidos com base nas
características comuns e especificidades das principais
abordagens disponíveis na literatura, bem como em trabalhos
científicos que avaliaram algumas dessas abordagens. Além
disso, realizou-se a aplicação desses critérios sobre seis
abordagens bem conhecidas. Essa comparação pode servir
como guia para que o engenheiro de software escolha a
abordagem para EROA mais adequada às suas necessidades.
Também foram destacados alguns dos pontos fracos das
abordagens analisadas, como por exemplo, a baixa precisão e
cobertura para interesses transversais não funcionais.
Como trabalhos futuros, pretende-se: i) expandir o
conjunto de critérios aqui apresentado a fim de se contemplar
características específicas para cada uma das fases da EROA;
ii) aplicar o conjunto de critérios expandido às abordagens já
analisadas com o intuito de se obter novas informações sobre
elas, bem como a novos tipos de abordagens existentes na
literatura; iii) desenvolver uma aplicação web que permita aos
engenheiros de software e pesquisadores da área de EROA
pesquisarem e/ou divulgarem seus trabalhos utilizando o
conjunto de critérios elaborados; e iv) por último, propor uma
nova abordagem que reutilize os pontos fortes e aprimore os
pontos fracos de cada abordagem analisada.
REFERÊNCIAS
[1] Chitchyan, R.; Rashid, A.; Sawyer, P.; Garcia, A.; Alarcon, M. P.; Bakker, J.; Tekinerdogan, B.; Clarke, S.; Jackson, A. "Report synthesizing state-of-the-art in aspect-oriented requirements engineering, architectures and design". Lancaster University: Lancaster, p. 1-259, 2005. Technical Report.
[2] Sampaio, A.; Greenwood, P.; Garcia, A. F.; Rashid, A. "A Comparative Study of Aspect-Oriented Requirements Engineering Approaches". In 1st International Symposium on Empirical Software Engineering and Measurement (ESEM '07), p. 166-175, 2007.
[3] Dijkstra, E. W. "A Discipline of Programming". Pearson Prentice Hall, 217 p., ISBN: 978-0132158718, 1976.
[4] Chitchyan, R.; Sampaio, A.; Rashid, A.; Rayson, P. "A tool suite for aspect-oriented requirements engineering". In International Workshop on Early Aspects at ICSE. ACM, p. 19-26, 2006.
[5] Sampaio, A.; Chitchyan, R.; Rashid, A.; Rayson, P. "EA-Miner: a Tool for Automating Aspect-Oriented Requirements Identification". Int'l Conf. Automated Software Engineering (ASE), ACM, pp. 353-355, 2005.
[6] WMATRIX. Corpus Analysis and Comparison Tool. Disponível em: http://ucrel.lancs.ac.uk/wmatrix/. Acessado em: Abril de 2013.
[7] Soeiro, E.; Brito, I. S.; Moreira, A. "An XML-Based Language for Specification and Composition of Aspectual Concerns". In 8th International Conference on Enterprise Information Systems (ICEIS), 2006.
[8] Moreira, A.; Rashid, A.; Araújo, J. "Multi-Dimensional Separation of Concerns in Requirements Engineering". In 13th International Conference on Requirements Engineering (RE), p. 285-296, 2005.
[9] Baniassad, E.; Clarke, S. "Theme: An approach for aspect-oriented analysis and design". In 26th Int. Conf. on Software Engineering (ICSE'04), 2004.
[10] Clarke, S.; Baniassad, E. "Aspect-Oriented Analysis and Design: The Theme Approach". Addison-Wesley, 2005.
[11] Whittle, J.; Araújo, J. "Scenario Modeling with Aspects". IEEE Software, v. 151(4), p. 157-172, 2004.
[12] Yijun, Y.; Leite, J. C. S. P.; Mylopoulos, J. "From Goals to Aspects: Discovering Aspects from Requirements Goal Models". In International Conference on Requirements Engineering (RE), 2004.
[13] Rashid, A.; Moreira, A.; Araújo, J. "Modularisation and composition of aspectual requirements". In 2nd International Conference on Aspect-Oriented Software Development (AOSD'03). ACM, 2003.
[14] Rashid, A.; Sawyer, P.; Moreira, A.; Araújo, J. "Early Aspects: a Model for Aspect-Oriented Requirements Engineering". In International Conference on Requirements Engineering (RE), 2002.
[15] Herrera, J. et al. "Revealing Crosscutting Concerns in Textual Requirements Documents: An Exploratory Study with Industry Systems". In 26th Brazilian Symposium on Software Engineering, p. 111-120, 2012.
[16] Bombonatti, D. L. G.; Melnikoff, S. S. S. "Survey on early aspects approaches: non-functional crosscutting concerns integration in software systems". In 4th World Scientific and Engineering Academy and Society (WSEAS), Wisconsin, USA, p. 137-142, 2010.
[17] Rashid, A.; Chitchyan, R. "Aspect-oriented requirements engineering: a roadmap". In 13th Int. Workshop on Early Aspects (EA), p. 35-41, 2008.
[18] Chung, L.; Leite, J. S. P. "Non-Functional Requirements in Software Engineering". Springer, 441 p., 2000.
[19] Bakker, J.; Tekinerdoğan, B.; Akist, M. "Characterization of Early Aspects Approaches". In Early Aspects: Aspect-Oriented Requirements Engineering and Architecture Design, 2005.
[20] Singh, N.; Gill, N. S. "Aspect-Oriented Requirements Engineering for Advanced Separation of Concerns: A Review". International Journal of Computer Science Issues (IJCSI), v. 8(5), 2011.
[21] Kellens, A.; Mens, K.; Tonella, P. "A survey of automated code-level aspect mining techniques". Transactions on Aspect-Oriented Software Development IV, v. 4640, p. 143-162, 2007.
[22] Wohlin, C.; Runeson, P.; Höst, M.; Regnell, B.; Wesslén, A. "Experimentation in Software Engineering: an Introduction". 2000.
[23] France, R.; Kim, D.; Ghosh, S.; Song, E. "A UML-based pattern specification technique". IEEE Trans. Software Engineering, v. 30(3), pp. 193-206, 2004.
Using Transformation Rules to Align Requirements and Architectural Models
Monique Soares, Carla Silva, Gabriela Guedes, Jaelson Castro, Cleice Souza, Tarcisio Pereira
Centro de Informática
Universidade Federal de Pernambuco – UFPE
Recife, Brasil
{mcs4, ctlls, ggs, jbc, tcp}@cin.ufpe.br
Abstract— In previous works we have defined the STREAM strategy to align requirements and architectural models. It includes four activities and several transformation rules that can be used to support the systematic generation of a structural architectural model from goal-oriented requirements models. The activities include the Preparation of Requirements Models, Generation of Architectural Solutions, Selection of Architectural Solution and Refinement of the Architecture. The first two activities are time consuming and rely on four horizontal and four vertical transformation rules which are currently performed manually, requiring much attention from the analyst. For example, the first activity consists of the refactoring of the goal models, while the second one derives architectural models from the refactored i* (iStar) models. In this paper we automate seven out of the eight transformation rules of the first two activities of the STREAM approach. The transformation language used to implement the rules was QVTO. We rely on a running example to illustrate the use of the automated rules. Hence, our approach has the potential to improve the process productivity and the quality of the models produced.
Keywords - Requirements Engineering, Software Architecture, Transformation Rules, Automation

I. INTRODUCTION
The STREAM (A STrategy for Transition between
REquirements Models and Architectural Models) is a
systematic approach to integrate requirements engineering
and architectural design activities, based on model
transformation, to generate architectural models from
requirements models [1]. It generates structural architectural
models, described in Acme [4] (the target language), from
goal-oriented requirements models, expressed in i* (iStar)
[3] (i.e. the source language). This approach has four
activities, namely: Prepare Requirements Models, Generate
Architectural Solutions, Select Architectural Solution and
Refine Architecture.
The first two activities are time consuming and rely on
horizontal and vertical transformation rules (HTRs and
VTRs), respectively. Currently, these transformation rules are applied manually, requiring much attention from the analyst. However, they are amenable to automation, which could reduce not only the human effort required to generate the target models, but also the number of errors
produced during the process. Hence, our proposal is to use
the QVT [2] transformation language to properly define the
rules, and also to develop some tool support to execute them.
Therefore, two research questions are addressed by this
paper: Is it possible to automate the transformation rules
defined in the first two STREAM activities, namely:
Prepare Requirements Models, Generate Architectural
Solutions? And, if so, how could these rules be automated?
Hence, the main objective of this paper is to automate the transformation rules defined by the first two phases of the STREAM process¹. To achieve this goal it is
necessary to:
 describe the transformation rules using a suitable
transformation language;
 make the vertical and horizontal transformation rules
compatible with the modeling environment used to
create the goal-oriented requirements models, i.e. the
iStarTool [6];
 make the vertical transformation rules compatible
with the modeling environment used to create the
structural architectural models, i.e. the AcmeStudio
[4].
In order to automate the HTRs and VTRs proposed by
the STREAM process, it was necessary to choose a language
that would properly describe the transformation rules and
transform the models used in the STREAM approach. We opted for the QVTO (Query/View/Transformation Operational) language [2], a transformation language that is integrated with the Eclipse environment [16] and that is better supported
and maintained.
Note that the input of the first activity of the STREAM process is an i* goal model. The iStarTool [6] is used to generate the XMI file of the goal-oriented requirements
model. This file is read by the Eclipse QVTO plugin, which
generates the XMI file of the Acme architectural model.
Note that this file is consistent with the metamodel created
in accordance with the AcmeStudio tool.
The rest of the paper is organized as follows. Section II
presents the theoretical background. Section III describes the
horizontal transformation rules in QVTO. In Section IV, we
present the vertical transformation rules in QVTO. In order
to illustrate our approach, in Section V we use the BTW
example [10]. Section VI presents some related works.
Finally, Section VII concludes the paper with a brief
explanation of the contributions achieved and the proposal of
future work.
¹ Note that it is out of the scope of this paper to support the other two phases of the approach (Select Architectural Solution, Refine Architecture).
TABLE I. EXAMPLE OF HORIZONTAL TRANSFORMATION RULES ADAPTED FROM [8]
(Columns: Rule | Original Model | Resulting model after applying the rule. Rows HTR1 to HTR4, shown as i* model diagrams.)

II. BACKGROUND
In this section we present the baseline of this research: the original rules from the STREAM approach and the model transformation language (QVT) used to implement HTRs and VTRs of STREAM.
A. STREAM
STREAM is a systematic approach to generate
architectural models from requirements models based on
model transformation [1]. The source and target modelling
languages are i* for requirements modelling and Acme for
architectural description, respectively.
The STREAM process consists of the following
activities: 1) Prepare requirements models, 2) Generate
architectural solutions, 3) Choose an architectural solution
and 4) Derive architecture.
Horizontal Transformation Rules (HTRs) are part of the
first activity. They are useful to increase the modularity of
the i* requirements models. Vertical Transformation Rules
(VTRs) are proposed in second activity. They are used to
derive architectural models from the modularized i*
requirements model. Non-functional requirements (NFRs)
are used in the third activity to select one of the possible
architectural descriptions obtained. Depending on the NFR
to be satisfied, some architectural patterns can be applied, in
activity 4.
The first STREAM activity is concerned with improving
the modularity of the expanded system actor. It allows
delegation of different parts of a problem to different
software actors (instead of having a unique software actor).
In particular, it is sub-divided into three steps: (i) analysis of
internal elements (identify which internal elements can be
extracted from the original software actor and relocated to a
new software actor); (ii) application of horizontal
transformation rules (the actual extraction and relocation of
the identified internal elements); and, (iii) evaluation of the
i* model (checking if the model needs to be modularized
again, i.e., return to the step 1).
In order to develop these steps, it is necessary to use,
respectively:
•
Heuristics to guide the decomposition of the
software's actor;
•
A set of rules to transform i* models;
•
Metrics for assessing the degree of modularization
of both the initial and modularized i* models.
This is a semi-automatic process, since not all the activities can be automated. For example, step 1 of the first activity cannot be automated because the analyst is the one in charge of choosing the sub-graph to be moved to another actor. The Horizontal Transformation Rule 1 (HTR1) moves a previously selected sub-graph. Hence, HTR1 cannot be fully automated because it always depends on the sub-graph chosen by the analyst. Observe that after applying HTR1, the resulting model may not be in compliance with the i* syntax. So, the next HTRs are used to correct possible syntax errors.
The Horizontal Transformation Rule 2 (HTR2) moves a means-end link crossing the actor's boundary. HTR2 considers
the situation where the sub-graph moved to another actor has
the root element as a “means” in a means-end relationship.
The Horizontal Transformation Rule 3 (HTR3) moves a
contribution link crossing the actor's boundary. HTR3 considers the situation where the sub-graph moved to another actor has a contribution relationship with other elements that were not moved.
The Horizontal Transformation Rule 4 (HTR4) moves a task-decomposition link crossing the actor's boundary.
HTR4 considers the situation where the sub-graph moved
has a task-decomposition relationship with other elements
that were not moved. Table 1 shows examples of these rules.
The graph to be moved in HTR1 is highlighted with a dashed
line and labelled with G.
The transformation rules are intended to delegate internal elements of the software actor to other (new) software actors. This delegation must ensure that the new actors have a dependency relationship with the original actor. Thus, the original model and the final model are supposed to be semantically equivalent.
At the end of the first activity, the actors representing the software are easier to understand and maintain, since there are more actors with fewer internal elements.
In the second STREAM activity (Generate Architectural Solutions), transformation rules are used to transform an i*
requirements model into an initial Acme architectural model.
In this case, we use the VTRs.
In order to facilitate the understanding, we have
separated the vertical transformation rules into four rules.
VTR1 maps the i* actors into Acme components. VTR2
maps the i* dependencies into Acme connectors. VTR3
maps a depender actor as a required port of Acme
connector. And last but not least, VTR4 maps the dependee
actor to a provided port of an Acme connector.
Note the goal of this paper is to fully automate three
HTRs (HTR2, HTR3 and HTR4) and all VTRs proposed by
STREAM. HTR1 is not amenable to automation. First, we specify them in QVTO [2]. It is worth noting that, to create the i* models, we have relied on the iStarTool [6].
B. QVT
The QVT language has a hybrid declarative/imperative
nature. The declarative part is divided into a two-tier
architecture, which forms the framework for the execution
semantics of the imperative part [5]. It has the following
layers:
• A user-friendly Relations metamodel and language, which supports complex object pattern matching and object template creation.
• A Core metamodel and language defined using minimal extensions to EMOF and OCL.
In addition to the declarative languages (Relations and Core), there are two mechanisms for invoking imperative implementations of Relations or Core transformations: a standard language (Operational Mappings) as well as non-standard implementations (Black-box MOF Operation).
The QVT Operational Mappings language allows either the definition of transformations using a complete imperative approach (i.e. operational transformations) or a hybrid approach in which the declarative transformations are complemented with imperative operations (which implement the relations).
The operational transformation represents the definition
of a unidirectional transformation that is expressed
imperatively. This defines a signature indicating the models
involved in the transformation and defines an input operation
for its implementation (called main). An operational
transformation can be instantiated as an entity with
properties and operations, such as a class.
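To make this notation concrete for readers unfamiliar with QVTO, the following minimal skeleton is given purely as an illustration; the metamodel URI and the Model/Actor types are hypothetical placeholders and do not correspond to the actual metamodels used in this paper.

-- A minimal, purely illustrative QVTO skeleton (not code from this paper).
-- 'IStar', its URI and the 'Model'/'Actor' types are hypothetical placeholders.
modeltype IStar uses 'http://example.org/istar';

-- Signature: a single model that is read and updated in place.
transformation SkeletonExample(inout model : IStar);

-- Entry operation invoked by the QVTO engine.
main() {
    model.rootObjects()[Model].actors->forEach(a) {
        a.map prefixActorName();
    };
}

-- An in-place mapping over an actor (illustrative only).
mapping inout Actor::prefixActorName() {
    self.name := 'A_' + self.name;
}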
III. AUTOMATION OF HORIZONTAL TRANSFORMATIONS
The first activity of the STREAM process presents some
transformation rules that can be defined precisely using the
QVT (Query / View / Transformation) transformation
language [5], in conjunction with OCL (Object Constraint
Language) [9] to represent constraints.
The transformation process requires the definition of
transformation rules and metamodels for the source and
target languages. The first STREAM activity uses the
HTRs, which aim to improve the modularity of the i* models
and have the i* language as source and target language.
The rules were defined in QVTO and executed through a
plugin for the Eclipse platform. Transformations were
specified based on the i* language metamodel considered by
the iStarTool. In QVT, it is necessary to establish a reference
to the metamodel to be used.
As explained in Section II, the steps of the first activity of the STREAM process (Refactor Requirements Models) are: Analysis of internal elements; Application of horizontal transformation rules; and Evaluation of i* models.
The Horizontal Transformation Rules activity takes as input two artefacts: the i* model and the selection of internal elements. The former is the i* system model, and the latter is the choice of elements to be modularized made by the Requirements Engineer. The output artefact produced by the activity is a refactored and more modularized i* model.
Modularization is performed by a set of horizontal transformation rules. Each rule performs a small and localized transformation that produces a new model that decomposes the original model. Both the original and the produced model are described in i*. Thus, the four horizontal transformation rules proposed by [8] are amenable to implementation.
First the analyst uses the iStarTool to produce the i* requirements model. Then HTR1 can be performed manually by him/her, also using the iStarTool. The analyst may choose to move the sub-graph to a new actor or to an existing actor, and then moves the sub-graph. This delegation must ensure that the new actors and the original actor have a relationship of dependency. Thus, the original model and the final model are supposed to be semantically equivalent. Upon completion of HTR1, the artefact generated is used in automated transformations that perform all the other HTRs at once. This is needed when the obtained model is syntactically wrong. Table 1 describes the different types of relationship between the components that have been moved to another actor and a member of the actor to which they belonged. If the relationship is a means-end link, HTR2 should be applied. If the relationship is a contribution link, HTR3 is used. In the situation where a task-decomposition is present, HTR4 is recommended.
In the next section we detail how each of these HTRs was
implemented in QVTO.
A. HTR2- Move a means-end link across the actor's
boundary
If, after applying HTR1, there is a means-end link crossing the actors' boundaries, HTR2 corrects this syntax error, since means-end links can exist only inside an actor's boundary. The means-end link is usually used to connect a task (means) to a goal (end). Thus, HTR2 makes a copy of the task inside the actor that has the goal, in such a way that the means-end link is now inside the boundary of the actor that has the goal (Actor X in Table 1). After that, the rule establishes a dependency from that copied task to the task inside the new actor (Actor Z in Table 1).
To accomplish this rule, HTR2 checks if there is at least one means-end link crossing the actors' boundaries (line 7 of the code present in Table 2). If so, it then checks if this means-end link has, as source and target attributes, elements present in the boundaries of different actors. If this condition holds (line 10), HTR2 creates a copy of the source element inside the boundary of the actor which possesses the target element of the means-end link (atorDaHora variable in line 19). A copy of the same source element is also placed outside the actors' boundaries to become a dependum (line 18). Then, a dependency is created from the element copied inside the actor to the dependum element (line 20) and from the dependum element to the original source element of the means-end link that remained inside the other actor (line 21). The result is the same as that presented in Table 1 for HTR2. The source code in QVTO for HTR2 is presented in Table 2.
TABLE II. HTR2 DESCRIBED IN QVTO

1  actorResultAmount := oriModel.rootObjects()[Model].actors.name->size();
2  while(actorResultAmount > 0){
3    if(self.actors->at(actorResultAmount).type.=(ActorType::ACTORBOUNDARY)) then {
4      atoresBoundary += self.actors->at(actorResultAmount);
5      var meansend := self.actors->at(actorResultAmount).meansEnd->size();
6      var atorDaHora := self.actors->at(actorResultAmount);
7      while(meansend > 0) {
8        var sourceDaHora := atorDaHora.meansEnd->at(meansend).source.actor;
9        var targetDaHora := atorDaHora.meansEnd->at(meansend).target.actor;
10       if(sourceDaHora <> targetDaHora) then {
11         var atoresBoundarySize := atoresBoundary->size();
12         var otherActor : Actor;
13         while(atoresBoundarySize > 0) {
14           if(atoresBoundary->at(atoresBoundarySize).name <> atorDaHora.name) then {
15             otherActor := atoresBoundary->at(atoresBoundarySize);
16           } else {
                 otherActor := atorDaHora;
               } endif;
17           atoresBoundarySize := atoresBoundarySize - 1;
             };
18         self.elements += object Element{
               name := atorDaHora.meansEnd->at(meansend).source.name;
               type := atorDaHora.meansEnd->at(meansend).source.type;
           };
19         atorDaHora.elements += object Element{
               name := atorDaHora.meansEnd->at(meansend).source.name;
               type := atorDaHora.meansEnd->at(meansend).source.type;
               actor := atorDaHora.name;
           };
20         self.links += object DependencyLink {
               source := atorDaHora.elements->last();
               target := self.elements->last();
               name := "M";
               type := DependencyLinkType::COMMITED;
           };
21         self.links += object DependencyLink {
               source := self.elements->last();
               target := otherActor.meansEnd->at(meansend).source;
               name := "M";
               type := DependencyLinkType::COMMITED;
           };
22       } endif;
         meansend := meansend - 1;
       };
     } endif;
23   actorResultAmount := actorResultAmount - 1;
   };
B. HTR3- Move a contribution link across the actor's
boundary
HTR3 copies a softgoal that was moved to another actor back into its source actor, if this softgoal is the target of a contribution link with some element in its initial actor. The target of the contribution link is moved from the softgoal to its copy in the initial actor. This softgoal is also replicated as the dependum of a dependency link from the original softgoal to its copy.
In other words, if an element of some actor has a contribution link with a softgoal that is within the actor that was created or that received elements in HTR1, then this softgoal is copied into the actor that has the element holding the contribution link with it. The target of the contribution link becomes that copy. This softgoal is also copied as the dependum of a softgoal dependency between the original softgoal and its copy.
In order to accomplish this rule, we analyse whether any contribution link has its source and target attributes referring to elements present in different actors. If the actor of the element present in the source or target is different from the actor referenced in the attribute of the element, then this element (softgoal) is copied to the actor, present in the source or target, whose name differs from that of the actor under analysis. The target attribute of the contribution link is changed to refer to this copy. The same softgoal is also copied to the modelling stage (outside the actors' boundaries) and a dependency is created from the original softgoal to the copy, with the softgoal copied to the stage acting as the dependum. The target of this dependency is the copy and the source is the original softgoal.
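Purely as an illustration (this is not the authors' implementation), the core of HTR3 could be sketched in QVTO in the same style as the code of Table 2. The 'contributions' collection, its 'source'/'target' ends and the 'sourceOwner' actor are assumptions made here by analogy with the 'meansEnd' collection and the 'otherActor' lookup used for HTR2.

-- Illustrative sketch only; 'contributions' is an assumed collection, analogous to
-- 'meansEnd' in Table 2, and 'sourceOwner' stands for the actor that owns the source
-- element of the contribution link (resolved as 'otherActor' is resolved in Table 2).
var contribs := atorDaHora.contributions->size();
while(contribs > 0) {
    var link := atorDaHora.contributions->at(contribs);
    if(link.source.actor <> link.target.actor) then {
        -- copy the target softgoal into the actor that owns the source element
        sourceOwner.elements += object Element{
            name := link.target.name;
            type := link.target.type;
        };
        -- replicate the softgoal outside the boundaries as a dependum
        self.elements += object Element{
            name := link.target.name;
            type := link.target.type;
        };
        -- dependency from the original softgoal to its copy, through the dependum
        self.links += object DependencyLink {
            source := link.target;
            target := self.elements->last();
            name := "M";
            type := DependencyLinkType::COMMITED;
        };
        self.links += object DependencyLink {
            source := self.elements->last();
            target := sourceOwner.elements->last();
            name := "M";
            type := DependencyLinkType::COMMITED;
        };
        -- retarget the contribution link to the copy inside sourceOwner
        link.target := sourceOwner.elements->last();
    } endif;
    contribs := contribs - 1;
};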
C. HTR4- Move a task-decomposition link across the
actor's boundary
HTR4 replicates an element that is part of a task decomposition, placing it in the modelling stage as the dependum of a dependency link between the task and that element, and removes the decomposition link.
That is, if some actor's element is a task decomposition within the actor that was created or that received elements in HTR1, then that decomposition link is removed and a copy of this element is created and placed in the modelling stage as the dependum of a dependency between the task in the actor created (or that received elements) in HTR1 and the element, present in the other actor, that was the decomposition of this task. The target of this dependency is the element and the source is the task.
In order to accomplish this rule, we analyse whether any task-decomposition link has source and target attributes referring to elements present in different actors. If so, the decomposed element is copied to the modelling stage and a dependency is created from the element referenced as the source of the decomposition link (the task) to the element referenced as the target, with the element copied to the stage acting as the dependum. The decomposition link is then removed.
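Likewise, and again only as a hedged sketch rather than the authors' code, the essence of HTR4 could be expressed in the same style, assuming a hypothetical 'taskDecompositions' collection whose 'source' is the task and whose 'target' is the decomposed element.

-- Illustrative sketch only; 'taskDecompositions' is an assumed collection,
-- analogous to 'meansEnd' in Table 2.
var decs := atorDaHora.taskDecompositions->size();
while(decs > 0) {
    var link := atorDaHora.taskDecompositions->at(decs);
    if(link.source.actor <> link.target.actor) then {
        -- replicate the decomposed element outside the boundaries as a dependum
        self.elements += object Element{
            name := link.target.name;
            type := link.target.type;
        };
        -- dependency from the task to the decomposed element, through the dependum
        self.links += object DependencyLink {
            source := link.source;
            target := self.elements->last();
            name := "M";
            type := DependencyLinkType::COMMITED;
        };
        self.links += object DependencyLink {
            source := self.elements->last();
            target := link.target;
            name := "M";
            type := DependencyLinkType::COMMITED;
        };
        -- remove the decomposition link that crossed the boundary
        atorDaHora.taskDecompositions :=
            atorDaHora.taskDecompositions->excluding(link);
    } endif;
    decs := decs - 1;
};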
IV. AUTOMATION OF VERTICAL TRANSFORMATIONS
The second STREAM activity (Generate Architectural
Solutions) uses transformation rules to map i* requirements
models into an initial architecture in Acme. As these
transformations have different source and target languages,
they are called vertical transformations.
In order to facilitate the understanding of the VTRs as well as their description, we separate the vertical transformation rules into four rules [14].
Below we detail how each of these VTRs was
implemented.
A. VTR1- Mapping the i* actors in Acme components
In order to describe this first VTR it is necessary to
obtain the quantity of actors present in the artefact developed
in the first activity. From this, we create the same quantity of
Acme components (line 3 of the code present in Figure 1), giving them the same names as the actors. Figure 1 shows an excerpt of the QVTO code for VTR1.
1 while(actorsAmount > 0) {
2   result.acmeElements += object Component{
3     name := self.actors.name->at(actorsAmount);
    };
4   actorsAmount := actorsAmount - 1;
};
Figure 1. Excerpt of the QVTO code for VTR1
The XMI output file will contain the Acme components within the system, represented by the acmeElements tag (line 1 of the code present in Figure 2); an attribute of that tag, xsi:type (line 1), indicates that the element is a component, and the name attribute holds the element name, as depicted in Figure 2. Figure 3 shows a component graphically.
1 <acmeElements xsi:type="Acme:Component"
name="Advice Receiver">
…
2 </acmeElements>
Figure 2. XMI tag of component
Figure 3. Acme components linked by connector
However, an Acme component has other attributes, not
just the name, so it is also necessary to perform the VTR3
and VTR4 rules to obtain the other necessary component
attributes.
B. VTR2- Mapping the i* dependencies in Acme connectors
Each i* dependency creates two XMI tags. One captures
the link from depender to the dependum and the other
defines the link from the dependum to the dependee.
1 while(dependencyAmount > 0) {
2   if(self.links.source->includes(self.links.target->at(dependencyAmount))
    and self.actors.name->excludes(self.links.target->at(dependencyAmount).name)) then {
3     result.acmeElements += object Connector{
4       name := self.links.target.name->at(dependencyAmount);
5       roles += object Role{
6         name := "dependerRole";
        };
7       roles += object Role{
8         name := "dependeeRole";
        };
      };
    } endif;
9   dependencyAmount := dependencyAmount - 1;
};
Figure 4. Excerpt of the QVTO code for VTR2
As seen in Figure 4, for the second vertical rule, which
transforms i* dependencies to Acme connectors (line 3 of
code present in Figure 4), each i* dependency creates two
tags in XMI, one captures the connection from the depender
to the dependum (line 5) and another defines the connection
from the dependee to the dependum (line 7).
In order to map these dependencies into Acme connectors, it is necessary to recover the two dependency tags, observing that they have the same dependum, i.e., the target of one tag must be equal to the source of the other tag, which characterizes the dependum. However, one should not consider actors which play the role of depender (source) in some dependency and of dependee (target) in another. Once this is performed, there are only dependums
(intentional elements) left. For each dependum, one Acme
connector is created (line 1 of code present in Figure 5).
The connector created receives the name of the
intentional element that represents the dependum of the
dependency link. Two roles are created within the connector,
one named dependerRole and another named dependeeRole.
The XMI output file will contain the connectors
represented by tags (see Figure 5).
1 <acmeElements xsi:type="Acme:Connector"
name="Connector">
2
<roles name="dependerRole"/>
3
<roles name="dependeeRole"/>
4 </acmeElements>
Figure 5. Connector in XMI
C. VTR3- Mapping depender actors as required port of
Acme connector
With the VTR3, we map all depender actors (source from
some dependency) into a required port of an Acme
connector. Thus, we list all the model's actors that are the source in some dependency (line 2 of the code present in Figure 6). Furthermore, we create an Acme port for each depender actor (line 3). Each port created has a name and a property (lines 4 and 5); the name is assigned randomly, just to help to control them. The property must have a name and a value; the property name is "Required", since we are creating the required port, as shown in Figure 6.
1 while(dependencyAmount > 0) {
2   if(self.actors.name->includes(self.links.source->at(dependencyAmount).name) and
    self.actors.name->at(actorsAmount).=(self.links.source->at(dependencyAmount).name)) then {
3     ports += object Port{
4       name := "port"+countPort.toString();
5       properties := object Property {
6         name := "Required";
7         value := "true"
        };
      };
    } endif;
8   dependencyAmount := dependencyAmount - 1;
9   countPort := countPort + 1;
};
Figure 6. Excerpt of the QVTO code for VTR3
The XMI output file will contain, within the component tag (line 1 of the code present in Figure 7), the tags of the ports. Inside the port's tag there will be a property tag with the name attribute assigned as Required and the value attribute set to true (lines 2 to 4).
Figure 7 presents an example of a required port in XMI,
while Figure 3 shows the graphic representation of the
required port (white).
1 <acmeElements xsi:type="Acme:Component"
name="Component">
2
<ports name="port8">
3
<properties name="Required" value="true"/>
4
</ports>
5 </acmeElements>
Figure 7. Example of a required port in XMI
D. VTR4- Mapping dependee actors as provided port of
Acme connector
VTR4 is similar to VTR3. We map all dependee actors
(target from some dependency) as a provided Acme port of a
connector. Thus, we list all the model's actors that are the target in some dependency (line 2 of the code present in Figure 8). We create an Acme port for each dependee actor. It has a name and a property; the name is assigned randomly, simply to control them (line 4). The property must have a name and a value. The property name is set to "Provided", since we are creating the provided port. Figure 8 presents a QVTO code excerpt for the provided port.
The XMI output file will contain, within the component, a tag to capture the ports. Inside the port tags that are provided there will be a property tag with the name attribute assigned as Provided (line 3 of the code present in Figure 9), while the value attribute is set to true and the type attribute to boolean.
1 <acmeElements xsi:type="Acme:Component"
name="Advice Giver">
2
<ports name="port17">
3
<properties name="Provided" value="true"
type="boolean"/>
5
</ports>
6 </acmeElements>
Figure 9. Provided port in XMI
Figure 3 shows the graphic representation of the provided
port (black).
V. RUNNING EXAMPLE
BTW (By The Way) [10] is a route planning system that helps users with advice on a specific route searched by
them. The information is posted by other users and can be
filtered for the user to provide only the relevant information
1 while (dependencyAmount > 0) {
2   if (self.actors.name->includes(self.links.target->at(dependencyAmount).name) and
      self.actors.name->at(actorsAmount).=(self.links.target->at(dependencyAmount).name)) then {
3     ports += object Port {
4       name := "port" + countPort.toString();
5       properties := object Property {
6         name := "Provided";
7         value := "true";
8         type := "boolean";
        };
      };
    } endif;
9   countPort := countPort + 1;
  };
Figure 8. Creation of Provided Port
about the place he wants to visit. BTW was an awarded project presented at the ICSE SCORE competition held in 2009 [11].
In order to apply the automated rules to the i* models of this example, it is necessary to perform the following steps:
1. Create the i* requirements model using the iStarTool;
2. Use the three heuristics defined by STREAM to guide
the selection of the sub-graph to be moved from an actor to
another;
3. Manually apply the HTR1, but with support of the
iStarTool. The result is an i* model with syntax errors that
must be corrected using the automated transformation rules;
4. Apply the automated HTR2, HTR3 and HTR4 rules.
After step 1, we identified some elements inside the
BTW software actor that are not entirely related to the
application domain of the software and these elements can be
placed on other (new) software actors. In fact, the sub-graphs
that contain the "Map to be Handled", "User Access be
Controlled", and "Information be published in map"
elements can be viewed as independent of the application
domain. To facilitate the reuse of the first sub-graph, it will
be moved to a new actor named "Handler Mapping". The
second sub-graph will be moved to a new actor named "User
Access Controller", while the third sub-graph will be moved to a new actor called "Information Publisher".
Steps 1 and 2 are performed using the iStarTool. This
tool generates two types of files (extensions): the file
"istar_diagram" has information related to the i*
requirements model; the "istar" file has information related
to the i* modelling language metamodel. Since the file
"istar" is a XMI file, we changed its type (extension) to
"xmi". XMI files are used as input and output files by the
automated rules (HTR2, HTR3 and HTR4). The BTW i*
model and the elements to be moved to other actors are
shown in Figure 10. Figure 11 depicts the BTW i* model
after applying HTR1. Note that there are some task-decomposition and contribution links crossing the actors' boundaries, meaning that the model is syntactically incorrect and must be corrected by the automated HTRs.
In order to apply HTR2, HTR3 and HTR4, we only need to execute a single QVTO file. Thus, with Eclipse configured for QVT transformations, the metamodel of the i* language referenced in the project, and the input files referenced in the run settings, the automated rules are applied in a single execution of the QVTO project.
Figure 10. BTW i* SR diagram and selected elements
After applying the HTRs, a syntactically correct i* model
is produced. In this model, the actors are expanded, but in
order to apply the vertical transformation rules, it is
necessary to contract all the actors (as shown in Figure 12) to
be used as input in the second STREAM activity (Generate
Architectural Solutions).
Moreover, when applying the VTRs, we only need to
execute a single QVTO file. The VTRs are executed
sequentially and the analyst will visualize just the result
model [15].
Figure 13 presents the graphical representation of the
XMI model generated after the application of the VTRs. This
XMI is compatible with the Acme metamodel.
Figure 11. BTW i* model after performing HTR1
Figure 12. BTW i* model after applying all HTRs
Figure 13. BTW Acme Model obtained from the i* model
VI. RELATED WORK
Coelho et al. propose an approach to relate aspect-oriented goal models (described in PL-AOV-Graph) and architectural models (described in PL-AspectualACME) [12]. It defines the mapping process between these models and a set of transformation rules between their elements. The MaRiPLA (Mapping Requirements to Product Line Architecture) tool automates this transformation, which is implemented using the Atlas Transformation Language (ATL).
Medeiros et al. present MaRiSA-MDD, a model-based strategy that integrates aspect-oriented requirements, architectural and detailed design, using the languages AOV-graph, AspectualACME and aSideML, respectively [13]. MaRiSA-MDD defines, for each activity, models (and metamodels) and a number of model transformation rules. These transformations were specified and implemented in ATL. However, none of these works relied on i*, our source language, which has a much larger community of adopters than AOV-Graph.
VII. CONCLUSION
This paper presented the automation of most of the
transformation rules that support the first and second
STREAM activities, namely Refactor Requirements Models
and Derive Architectural Solutions [1].
In order to decrease the time and effort required to
perform these STREAM activities, as well as to minimize
the errors introduced by the manual execution of the rules,
we proposed the use of the QVTO language to automate
the execution of seven out of the eight STREAM
transformation rules.
The input and output models of the Refactor
Requirements Models activity are compatible with the
iStarTool, while the ones generated by the Derive
Architectural Solutions activity are compatible with the
AcmeStudio tool.
The iStarTool was used to create the i* model and to
perform the HTR1 manually. The result is the input file to be
processed by the automated transformation rules (HTR2,
HTR3 and HTR4). Both the input and output files handled
by the transformation process are XMI files.
The STREAM transformation rules were defined in
QVTO and an Eclipse based tool support was provided to
enable their execution. To illustrate their use, the automated transformation rules were applied to the BTW example [10].
The output of the execution of the VTRs is an XMI file
with the initial Acme architectural model. Currently, the
AcmeStudio tool is not capable of reading XMI files, since it was designed to process only files described using the
Acme textual language. As a consequence, the XMI file
produced by the VTRs currently cannot be graphically
displayed. Hence, we still need to define new transformation
rules to generate a description in Acme textual language
from the XMI file already produced.
Moreover, more case studies are still required to assess
the benefits and identify the limitations of our approach. For
example, we plan to run an experiment to compare the time necessary to perform the first two STREAM activities automatically against performing them in an ad hoc way.
VIII. ACKNOWLEDGEMENTS
This work has been supported by CNPq, CAPES and
FACEPE.
REFERENCES
[1] J. Castro, M. Lucena, C. Silva, F. Alencar, E. Santos, J. Pimentel, "Changing Attitudes Towards the Generation of Architectural Models", Journal of Systems and Software, vol. 85, pp. 463-479, March 2012.
[2] Object Management Group. QVT 1.1. Meta Object Facility (MOF) 2.0 Query/View/Transformation Specification, January 2011. Available in: <http://www.omg.org/spec/QVT/1.1/>. Accessed: April 2013.
[3] E. Yu, "Modelling Strategic Relationships for Process Reengineering". PhD Thesis. University of Toronto: Department of Computer Science, 1995.
[4] ACME. Acme - The Acme Architectural Description Language and Design Environment, 2009. Available in: <http://www.cs.cmu.edu/~acme/index.html>. Accessed: April 2013.
[5] OMG. QVT 1.1. Meta Object Facility (MOF) 2.0 Query/View/Transformation Specification, 01 January 2011. Available in: <http://www.omg.org/spec/QVT/1.1/>. Accessed: April 2013.
[6] A. Malta, M. Soares, E. Santos, J. Paes, F. Alencar and J. Castro, "iStarTool: Modeling requirements using the i* framework", iStar 11, August 2011.
[7] ECLIPSE GMF. GMF - Graphical Modelling Framework. Available in: <http://www.eclipse.org/modeling/gmf/>. Accessed: April 2013.
[8] J. Pimentel, M. Lucena, J. Castro, C. Silva, E. Santos, and F. Alencar, "Deriving software architectural models from requirements models for adaptive systems: the STREAM-A approach", Requirements Engineering, vol. 17, no. 4, pp. 259-281, June 2012.
[9] OMG. OCL 2.0. Object Constraint Language: OMG Available Specification, 2001. Available in: <http://www.omg.org/spec/OCL/2.0/>. Accessed: April 2013.
[10] J. Pimentel, C. Borba and L. Xavier, "BTW: if you go, my advice to you Project", 2010. Available in: <http://jaqueira.cin.ufpe.br/jhcp/docs>. Accessed: April 2013.
[11] SCORE 2009. The Student Contest on Software Engineering SCORE 2009, 2009. Available in: <http://score.elet.polimi.it/index.html>. Accessed: April 2013.
[12] K. Coelho, "From Requirements to Architecture for Software Product Lines: a strategy of models transformations" (In Portuguese: Dos Requisitos à Arquitetura em Linhas de Produtos de Software: Uma Estratégia de Transformações entre Modelos). Dissertation (M.Sc.). Centro de Ciências Exatas e da Terra: UFRN, Brazil, 2012.
[13] A. Medeiros, "MARISA-MDD: An Approach to Transformations between Aspect-Oriented Models: from Requirements to Detailed Design" (In Portuguese: MARISA-MDD: Uma Abordagem para Transformações entre Modelos Orientados a Aspectos: dos Requisitos ao Projeto Detalhado). Dissertation (M.Sc.). Center for Science and Earth: UFRN, Brazil, 2008.
[14] M. Soares, "Automatization of the Transformation Rules on the STREAM Process" (In Portuguese: Automatização das Regras de Transformação do Processo STREAM). Dissertation (M.Sc.). Center of Informatics: UFPE, Brazil, 2012.
[15] M. Soares, J. Pimentel, J. Castro, C. Silva, C. Talitha, G. Guedes, D. Dermeval, "Automatic Generation of Architectural Models From Goals Models", SEKE 2012, pp. 444-447.
[16] Eclipse. Available in: <http://eclipse.org/>. Accessed: April 2013.
An automatic approach to detect traceability links using fuzzy logic
Andre Di Thommazo
Instituto Federal de São Paulo, IFSP
Universidade Federal de São Carlos, UFSCar
São Carlos, Brazil
[email protected]
Thiago Ribeiro, Guilherme Olivatto
Instituto Federal de São Paulo, IFSP
São Carlos, Brazil
{guilhermeribeiro.olivatto, thiagoribeiro.d.o}@gmail.com
Vera Werneck
Universidade do Estado do Rio de Janeiro, UERJ
Rio de Janeiro, Brazil
[email protected]
Sandra Fabbri
Universidade Federal de São Carlos, UFSCar
São Carlos, Brazil
[email protected]
Abstract – Background: The Requirements Traceability
Matrix (RTM) is one of the most commonly used ways to
represent requirements traceability. Nevertheless, the
difficulty of manually creating such a matrix motivates
the investigation into alternatives to generate it
automatically. Objective: This article presents one
approach to automatically create the RTM based on fuzzy
logic, called RTM-Fuzzy, which combines two other
approaches, one based on functional requirements' entry
data – called RTM-E – and the other based on natural
language processing – called RTM-NLP. Method: To
create the RTM based on fuzzy logic, the RTM-E and
RTM-NLP approaches were used as entry data for the
fuzzy system rules. Aimed at evaluating these approaches,
an experimental study was conducted where the RTMs
created automatically were compared to the reference
RTM (oracle) created manually based on stakeholder
knowledge. Results: On average the approaches matched
the following results in relation to the reference RTM:
RTM-E achieved 78% effectiveness, RTM-NLP 76% effectiveness and RTM-Fuzzy 83% effectiveness. Conclusions: The results show that using fuzzy logic to combine and generate a new RTM offered enhanced effectiveness for determining the requirements' dependencies and consequently the requirements' traceability links.
Keywords- component; requirements traceability; fuzzy
logic; requirements traceability matrix.
I. INTRODUCTION
Nowadays, the software industry is still challenged to
develop products that meet client expectations and yet
respect delivery schedules, costs and quality criteria.
Studies performed by the Standish Group [1] showed
that only 37% of projects in 2010 finished successfully whilst respecting the schedule and budget and, principally, meeting the client's expectations.
Another study performed previously by the same institute
[2] found that the three most important factors to define
whether a software project was successful or not are: user
specification gaps, incomplete requirements, and constant
changes in requirements. Duly noted, these factors are
directly related to requirements management. According to
Salem [3], the majority of software errors found are
derived from errors in requirements gathering and from failures in keeping pace with their evolution throughout the software
development process.
One of the main features of requirements management
is the requirements traceability matrix (RTM), which is
able to record the existing relationship between the
requirements on a system and, due to its importance, is the
main focus of many studies. Sundaram, Hayes, Dekhtyar
and Holbrook [4], for instance, consider traceability
determination essential in many software engineering
activities. Nevertheless, such determination is a time
consuming and error prone task, which can be facilitated if
computational support is provided. The authors claim that
the use of such automatic tools can significantly reduce the
effort and costs required to elaborate and maintain
requirements traceability and the RTM, and go further to
state that such support is still very limited in existing tools.
Among the ways to automate traceability, Wang, Lai
and Liu [5] highlight that current studies make use of a
spatial vector model, semantic indexing or probability
network models. Regarding spatial vector modeling, the
work of Hayes, Dekhtyar and Osborne [6] can be
highlighted and it is going to be presented in detail in
Section IV-B. Related to semantic indexing, Hayes,
Dekhtyar and Sundaram [7] used the ideas proposed by
Deerwester, Dumais, Furnas, Landauer and Harshman
of Latent Semantic Indexing (LSI) [8] in order to also
automatically identify traceability. When LSI is in use, not
only is the word frequency taken into consideration, but in
addition the meaning and context used in their
construction. With respect to the Network Probability
model, Baeza-Yates and Ribeiro-Neto [9] use
ProbIR (Probabilistic Information Retrieval) to create a
matrix in which the dependency between each term is
mapped in relation to the other document terms. All the
quoted proposals are also detailed by Cleland-Huang,
Gotel, Zisman [10] as possible approaches for traceability
detection.
As traceability determination involves many
uncertainties, this activity is not trivial, not even for the team involved in the requirements gathering.
Therefore, it is possible to achieve better effectiveness in
the traceability link identification if we can use a technique
that can handle uncertainties, like fuzzy logic.
Given the aforementioned context, the focus of this
paper is to present one approach to automatically create the
RTM based on fuzzy logic, called RTM-Fuzzy. This
approach combines two approaches: one based on
functional requirements (FR) entry data – called RTM-E –
that is effective on CRUD FR traceability, and the other based on natural language processing – called RTM-NLP – that is effective on more descriptive FRs. The motivation of RTM-Fuzzy is to join the good features of the two other approaches. The main contribution of this paper is to present the fuzzy logic approach, since it has equal or better effectiveness than either of the other approaches (RTM-E and RTM-NLP) applied individually.
The three proposed approaches were evaluated by an
experimental study to quantify the effectiveness of each. It
is worth mentioning that the RTM-E and RTM-NLP
approaches had already provided satisfactory results in a
previous experimental study [11]. In the re-evaluation in
this paper, RTM-E had similar results and RTM-NLP (that
was modified and improved) had a better effectiveness
than the results of the first evaluation [11]. To make the
experiment possible, the three RTM automatic generation
approaches were implemented in the COCAR tool [12].
This article is organized as follows: in Section II the
requirements management, traceability and the RTM are introduced; Section III presents a brief definition of fuzzy logic theory; in Section IV, the three RTM automatic
creation approaches are presented and exemplified by use
of the COCAR tool; Section V shows the experimental
study performed to evaluate the effectiveness of the
approaches; conclusions and future work are discussed in
Section VI.
II. REQUIREMENTS MANAGEMENT TECHNIQUES
Requirements management is an activity that should be
performed throughout the whole development process,
with the main objective of organizing and storing all
requirements as well as managing any changes to them
[13][14].
As requirements are constantly changing, managing
them usually becomes a laborious and extensive task, thus
making relevant the use of support tools to conduct it [5].
According to the Standish Group [15], only 5% of all
developed software makes use of any requirements
management tool, which can partially explain the huge
problems that large software companies face when
implementing effective requirements management and
maintaining its traceability. Various authors emphasize the
importance of tools for this purpose [13][14][16][17].
Zisman and Spanoudakis [14], for instance, consider the
use of requirements management tools to be the only way
for successful requirements management.
Two important concepts for requirements management
are requirements traceability and a traceability matrix,
which are explained next.
A. Requirements traceability
Requirements traceability concerns the ability to
describe and monitor a requirement throughout its lifecycle
[18]. Such requirement control must cover all its existence
from its source – when the requirement was identified,
specified and validated – through to the project phase,
implementation and ending at the product’s test phase.
Thus traceability is a technique that allows identifying and
visualizing the dependency relationship between the
identified requirements and the other requirements and
artifacts generated throughout the software’s development
process. The dependency concept does not necessarily mean a precedence relationship between
requirements but, instead, how coupled they are to each
other with respect to data, functionality, or any other
perspective. According to Guo, Yang, Wang, Yang and Li
[18], requirements traceability is an important
requirements management activity as it can provide the
basis to requirements evolutional changes, besides directly
acting on the quality assurance of the software
development process.
Zisman and Spanoudakis [14] consider two kinds of
traceability:
• Horizontal: when the relationships occur between
requirements from different artifacts. This kind of
traceability links an FR to a model or a source code, for example.
• Vertical: when the traceability is analyzed within
the same artifact, like the RD for instance. By
analyzing the FRs of this artifact it is possible to
identify their relationships and generate the RTM.
This type of traceability is the focus of this paper.
Munson and Nguyen [19] state that traceability
techniques will only be better when supported by tools that
diminish the effort required to execute them.
B. Requirement Traceability Matrix - RTM
According to Goknil, Kurtev, Van den Berg and
Veldhuis [17], although various existing studies treat traceability between requirements and other artifacts (horizontal traceability), only minor attention is given to the relationship of requirements among themselves, i.e. their vertical traceability. The authors also state that this
relationship influences various activities within the
software development process, such as requirements
consistency verification and change management. A
method of mapping such a relationship between
requirements is RTM creation. In addition, Cuddeback,
Dekhtyar and Hayes [20] state that a RTM supports many
software engineering validation and verification activities,
like change impact analysis, reverse engineering, reuse,
and regression tests. In addition, they state that RTM
generation is laborious and error prone, a fact that means,
in general, it is not generated or updated.
Overall, RTM is constructed as follows: each FR is
represented in the i-th row and in the i-th column
of the RTM, and the dependency between them is recorded
in the cell corresponding to each FR intersection [13].
Several authors [13] [17] [18] [19] [21] debate the
importance and need of the RTM in the software
development process, once such matrix allows predicting
the impact that a change (or the insertion of a new
requirement) has on the system as a whole. Sommerville
[13] emphasizes the difficulty of obtaining such a matrix
and goes further by proposing a way to subjectively
indicate not only whether the requirements are dependent
but how strong such a dependency is.
III. FUZZY LOGIC
Fuzzy logic was developed by Zadeh [22], and
proposes, instead of simply using true or false, the use of a
variation of values between a complete false and an
absolute true statement. In classic set theory there are only
two pertinence possibilities for an element in relation to a
set as a whole: the element pertains or does not pertain to a
set [23]. In fuzzy logic the pertinence is given by a
function whose values pertain to the real closed interval
between 0 and 1. The process of converting a real number
into its fuzzy representation is called "fuzzification".
Another important concept in fuzzy logic is related to the
rules that use linguistic variables in the execution of the
decision support process. The linguistic variables are
identified by names, have a variable content and assume
linguistic values, which are the names of the fuzzy sets
[23]. In the context of this work, the linguistic variables are
the traceability obtained by the three RTM generation
approaches and may assume the values (nebulous sets)
“non-existent”, “weak” or “strong”, which will be
represented by a pertinence function. This process is
detailed in Section IV C.
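As a concrete illustration of fuzzification, the following sketch (ours, not taken from the approach) converts a dependency percentage into membership degrees for the three fuzzy sets used in this work; the triangular break-points are assumptions chosen only for the example.

# Minimal fuzzification sketch. The triangular membership break-points are
# illustrative assumptions, not the exact functions used by the authors.
def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzify(dependency_pct):
    """Map a 0-100% dependency value to degrees of the three fuzzy sets."""
    return {
        "no dependence": triangular(dependency_pct, -1, 0, 45),
        "weak": triangular(dependency_pct, 35, 50, 75),
        "strong": triangular(dependency_pct, 55, 100, 101),
    }

print(fuzzify(42.5))  # a value may belong partially to more than one set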
Fuzzy logic has been used in many software
engineering areas and, specifically in the requirements
engineering area, the work of Ramzan, Jaffar, Iqbal,
Anwar, and Shahid [24] and Yen, Tiao and Yin [25] can be
highlighted. The former conducts requirements
prioritization based on fuzzy logic and the latter uses fuzzy logic to aid the precision analysis of the collected requirements. In the metrics area, MacDonell, Gray and Calvert [26] also
use fuzzy logic to propose metrics to the software
development process and in the reuse area, Sandhu and
Singh [27] likewise use fuzzy logic to analyze the quality
of the created components.
IV. APPROACHES TO RTM GENERATION
The three approaches were developed aiming to
generate the RTM automatically. The first one - RTM-E –
is effective to detect traceability links in FRs that have the
same entry data, specially like CRUDs FRs. The second
one – RTM-NLP – is appropriate to detect traceability
links in FRs that have a lot of knowledge in the text
description. The third one – RTM-Fuzzy – combines the
previous approaches trying to extract the best of each one.
These approaches only take into consideration the
software FRs and establish the relationship degree between
each pair of them.
The RTM names were based on each approach taken.
The approach called RTM-E had its traceability matrix
named RTMe, the RTM-NLP’s RTM was called RTMnlp,
whereas the RTM-Fuzzy’s RTM was called RTMfuzzy.
The quoted approaches are implemented in the
COCAR tool, which uses a template [28] for requirements
data entry. The RD formed in the tool can provide all data
necessary to evaluate the approaches. The main objective
of such a template is to standardize the FR records, thus
avoiding inconsistencies, omissions and ambiguities.
One of the fields found in this template (which makes
the RTM implementation feasible) is called Entry, and it
records in a structured and organized way the data used in
each FR. It is worth noting that entry data should be
included with the full description, i.e. “client name” or
“user name” and not only “name”, avoiding ambiguity.
Worth mentioning here is the work of Kannenberg and
Saiedian [16], which considers the use of a tool to
automate the requirements recording task highly desirable.
Following, the approaches are presented. Aiming to
exemplify the RTM generation according to them, a real
application developed for a private aviation company was
used in a case study. An example of how each approach
calculates the dependence between a pair of FRs is
presented at the end of each sub-session (IV-A, IV-B and
IV-C). The system’s purpose is to automate the company’s
stock control. As the company is located in several cities,
the system manages the various stock locations and the
products being inserted, retrieved or reallocated between
each location. The system has a total of 14 FRs, which are referenced in the first row and first column of the generated RTMs. In Section V the results of this case study are
compared with the results of the experimental study.
A. RMT-E Approach: RTM generation based on input
data
In the RMT-E approach, the dependency relationship
between FRs is determined by the percentage of common
data between FR pairs. This value is obtained through the
Jaccard Index calculation [29], which compares the
similarity and/or diversity degree between the data sets of
each pair. Equation 1 represents this calculation.
$J(A,B) = \frac{n(A \cap B)}{n(A \cup B)}$        (1)
The equation numerator is given by the quantity of data
intersecting both sets (A and B), whereas the denominator
corresponds to the quantity associated to the union
between those sets.
The RTM-E approach defines the dependency between
two FRs, according to the following: considering FRa the
data set entries for a functional requirement A and FRb the
data set entries for a functional requirement B, their
dependency level can be calculated by Equation 2:
$Dep(FR_a, FR_b) = \frac{n(FR_a \cap FR_b)}{n(FR_a \cup FR_b)} \times 100\%$        (2)
Thus according to the RTM-E approach, each position
(i,j) of the traceability matrix RTM(i,j) corresponds to
values from Equation 3:
$RTM(i,j) = Dep(FR_i, FR_j)$        (3)
Positions on the matrix’s main diagonal are not
calculated once they indicate the dependency of the FRs to
themselves. Besides, the generated RTM is symmetrical,
i.e. RTM(i,j) has the same value as RTM(j,i).
Implementing such an approach in COCAR was
possible because the requirements data are stored in an
atomic and separated way, according to the template
mentioned before. Each time entry data is inserted in a
requirement data set it is automatically available and can
be used as entry data for another FR. Such implementation
avoids data ambiguity and data inconsistency.
It is worth noting that initiatives using FR data entries
to automatically determine the RTM were not found in the
literature. Similar initiatives do exist to help determine
traceability links between other artifacts, mainly models
(for UML diagrams) and source codes, like those found in
Cysneiros and Zisman [30].
In Figure 1, the matrix cell highlighted by the rectangle
indicates the level of dependency between FR3 and FR5.
In this case, FR3 is related to products going in and out
from a company’s stock (warehouse) and FR5 is related to
an item transfer from one stock location to another. As
both FRs deal with stock items, it is expected that they
present a relationship.
The input data of FR3 are: Contact, Transaction Date,
Warehouse, Quantity, Unit Price and User. The input data
of FR5 are: Contact, Transaction Date, Warehouse,
Quantity, Unit Price, User, Origin Warehouse, Destination
warehouse and Status. As the quantity of elements of the
intersection between the input data of FR3 and FR5
(n(FR3 ∩ FR5)) is equal to 6, and the quantity of elements
of the union set (n(FR3 ∪ FR5)) is equal to 9, the value
obtained from Equation 4, that establishes the dependency
relationship between FR3 and FR5 is:
$Dep(FR_3, FR_5) = \frac{n(FR_3 \cap FR_5)}{n(FR_3 \cup FR_5)} = \frac{6}{9} = 66.67\%$        (4)
The 66.67% dependency rate is highlighted in Figure 1,
which is the RTM-E built using the aforementioned
approach. It is worth mentioning that the colors indicate
each FR dependency level as follows: green for a weak
dependency level and red for a strong dependency level.
Where there is no relationship between the FRs, no color is
used in the cell.
Figure 2 illustrates the intersection and union sets when
the RTM-E approach is applied to FR3 and FR5 used so
far as example. Also worth mentioning is that the COCAR
tool presents a list of all input data entries already in place,
in order to minimize requirements input errors such as the
same input data entry with different names.
The determination of the dependency levels was
carried out based on the application taken as an example
(stock control), and from two further RDs from a different
scope. Such a process was performed in an interactive and
iterative way, adjusting the values according to the
detected traceability between the three RD. The levels
obtained were: “no dependence” where the calculated
value was 0%; “weak” for values between 0% and 50%;
and “strong” for values above 50%.
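The calculation just described can be summarized in a minimal sketch, assuming each FR is represented by the set of its entry-data names; the FR3 and FR5 sets are the ones listed above and the 0%/50% thresholds are the levels defined for RTM-E.

# Sketch of the RTM-E dependency calculation (Jaccard index over entry data)
# and of the level classification described above (0% / up to 50% / above 50%).
def dependency(entries_a, entries_b):
    """Percentage of shared entry data between two FRs (Equation 2)."""
    a, b = set(entries_a), set(entries_b)
    return 100.0 * len(a & b) / len(a | b)

def level(pct):
    if pct == 0:
        return "no dependence"
    return "weak" if pct <= 50 else "strong"

fr3 = {"Contact", "Transaction Date", "Warehouse", "Quantity", "Unit Price", "User"}
fr5 = fr3 | {"Origin Warehouse", "Destination Warehouse", "Status"}

pct = dependency(fr3, fr5)
print(round(pct, 2), level(pct))  # 66.67 strong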
B. RMT-NLP approach: RTM generation based on
natural language processing
Even though there are many initiatives that make use of
NLP to determine traceability in the software development
process, as mentioned previously few of them consider
traceability inside the same artifact [17]. In addition, the
proposals found in the literature do not use a requirements
description template and do not determine dependency
levels as in this work. According to Deeptimahanti and
Sanyal [31], the use of NLP in requirements engineering
does not aim at text comprehension itself but, instead, at
extracting embedded RD concepts. This way, the second
approach to establish the RTM uses concepts extracted
from FRs using NLP techniques to determine the FR’s
traceability. Initially, NLP was only applied to the field
that describes the processing (actions) of the FR, and such
a proposal was evaluated using an experimental study [11].
It was noted that many times, as there is a field for entry
data input in the template (as already pointed out in the RTM-E proposal), the analysts did not record the entry data once
again in the processing field, thus reducing the similarity
value. With such a fact in mind, this approach has been
improved and all text fields contained in the template were
used. This way, the template forces the requirements
engineer to gather and enter all required data, and all this
information is used by the algorithm that performs the
similarity calculation. As it will be shown in this work’s
sequence, this modification had a positive impact on the
approach’s effectiveness.
To determine the dependency level between the
processing fields of two FRs, the Frequency Vector and
Cosine Similarity methods [32] are used. Such a method is
able to return a similarity percentage between two text
excerpts.
Figure 1 – Resultant RTM generated using the RTM-E approach.
Figure 2 – An example of the union and intersection of entry data between two FRs.
With the intention of improving the process’
efficiency, text pre-processing is performed before
applying the Frequency Vector and Cosine Similarity
methods in order to eliminate all words that might be
considered irrelevant, like articles, prepositions and
conjunctions (also called stopwords) (Figure 3-A). Then, a
process known as stemming (Figure 3-B) is applied to
reduce all words to their originating radicals, leveling their
weights in the text similarity determination. After the two
aforementioned steps, the method calculates the similarity
between two FR texts (Figure 3-C) using the template’s
processing fields, thus identifying, according to the
technique, similar ones.
The first step for Vector Frequency and Cosine
Similarity calculation is to represent each sentence in a
vector with each position containing one word from the
sentence. The cosine calculation between them will
determine the similarity value. As an example, two FRs
(FR1 and FR2) described respectively by sentences S1 and
S2 are taken. The similarity calculation is carried out as
follows:
1) S1 is represented in vector x and S2 in vector y. Each
word will use one position in each vector. If S1 has p
words, vector x will also initially have p positions. In
the same way, if S2 has q words, vector y will also
have q positions.
2) As the vector cannot have repeated words, occurrences are counted to determine each word's frequency to be included in the vector. At the end, the vector should contain a single occurrence of each word followed by the frequency with which that word appears in the text.
3) All vectors are alphabetically reordered.
4) Vectors have their terms searched for matches on the
other and, when the search fails, the word is included
in the “faulting” vector with 0 as its frequency. At the
end of this step, both vectors will have not matched
words included and the same number of positions.
5) With the adjusted vectors, the similarity equation –
sim(x,y) (Equation 5) – must be applied between
vectors x and y, considering n as the number of
positions found in the vectors.
$sim(x,y) = \frac{\sum_{i=1}^{n} x_i\, y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\;\sqrt{\sum_{i=1}^{n} y_i^2}}$        (5)
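A compact sketch of this similarity computation is shown below; the stopword list, the naive suffix-truncation used in place of a real stemmer, and the two sample sentences are illustrative assumptions, not the actual pre-processing resources of the approach.

# Sketch of the frequency-vector / cosine-similarity step (Equation 5).
# The stopword list and the naive "stemmer" are illustrative stand-ins for
# the real pre-processing used in the approach.
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # assumption

def stem(word):
    return word[:6]  # crude truncation, only to illustrate the idea

def terms(text):
    return [stem(w) for w in text.lower().split() if w not in STOPWORDS]

def cosine_similarity(text_a, text_b):
    freq_a, freq_b = Counter(terms(text_a)), Counter(terms(text_b))
    vocabulary = set(freq_a) | set(freq_b)          # steps 3-4: aligned vectors
    dot = sum(freq_a[t] * freq_b[t] for t in vocabulary)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical FR texts, used only to exercise the function.
print(cosine_similarity("register product entry in the warehouse stock",
                        "register product transfer between warehouse stocks"))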
Considering the same example used to illustrate the
RTM-E approach (private aviation stock system) the
RTMnlp was generated (Figure 4), evaluating the
similarity between FR functionalities inserted into
COCAR inside the “processing” attribute of the already
mentioned template. After applying pre-processing
(stopwords removal and stemming), and the steps depicted
earlier for calculating the Frequency Vector and Cosine
Similarity, the textual similarity between FR3 and FR5
(related to product receive in stock and product transfer
between stocks, respectively) was determined as 88.63%
(Figure 3-D). This high value does make sense in this
relationship, once the texts describing both requirements
are indeed very similar.
[Figure 3 layout: (A) entry of the FR3 and FR5 texts; (B) pre-processing: stopword removal (articles, prepositions, conjunctions) and stemming, which reduces words to their radicals; (C) Frequency Vector and Cosine Similarity [32]; (D) resulting dependency between FR3 and FR5 (88.63%).]
Figure 3 – Steps to apply the RTM-NLP approach.
Figure 4 – Resultant RTM generated using the RTM-NLP approach.
As in the RTM-E, the dependency level values had
been chosen in an interactive and iterative way based on
the data provided by the example application (stock
control) and two more RDs from different scopes. The
levels obtained were: “no dependence” where the value
was between 0% and 40%; “weak” for values between
40% and 70%; and “strong” for values above 70%.
C. RTM-Fuzzy approach: RTM generation based on
fuzzy logic
The purpose of this approach is to combine those
detailed previously using fuzzy logic, so that we can
consider both aspects explored so far – the relationship
between the entry data manipulated by the FRs (RTM-E)
and the text informed in the FRs (RTM-NLP) – to create
the RTM.
In the previously presented approaches, the
dependency classification between two FRs of “no
dependence”, “weak”, and “strong” is determined
according to the approach’s generated value related to the
values set for each of the created levels. One of the
problems with this approach is that the difference between
the classification in one level and another can be
miniscule. For instance, if the RTM-NLP approach
generates a value of 39.5% for the dependency between 2
FRs, this would not indicate any dependency between the
FRs, whereas a value of 40.5% would already indicate a
weak dependency. Using the fuzzy logic, this problem is
minimized as it is possible to work with a nebulous level
between those intervals through the use of a pertinence
function.
As seen earlier, this conversion from absolute values to
its fuzzy representation is called fuzzification, used for
creating the pertinence functions. In the pertinence
functions, the X axis represents the dependency percentage
between FRs (from 0% to 100%), and the Y axis
represents the pertinence level, i.e. the probability of
belonging to a certain fuzzy set (“no dependence”, “weak”
or “strong”), which can vary from 0 to 1.
Figure 5 illustrates the fuzzy system adopted, with
RTMe and RTMnlp as the entry data. Figures 6 and 7
present, respectively, the pertinence functions of the RTM-E and RTM-NLP approaches, and the X axis indicates the
dependency percentage calculated in each approach. The Y
axis indicates the pertinence degree, ranging from 0 to 1.
The higher the pertinence value, the bigger the chance of it
being in one of the possible sets (“no dependence”,
"weak", or "strong"). There are ranges of values in which the pertinence degree can be high for one set and low for another (for example the range with a dependence
percentage between 35% and 55% in Figure 6).
Table I indicates the rules created for the fuzzy system.
Such rules are used to calculate the output value, i.e. the
RTMfuzzy determination. These rules were derived from
the authors' experience through an interactive and iterative
process.
Figure 5 – Fuzzy System
Figure 6 – Pertinence function for RTM-E
Figure 8 shows the output pertinence function in the
same way as shown in Figures 6 and 7, where the X axis
indicates the RTMfuzzy dependence percentage and the Y
axis indicates the pertinence degree between 0 and 1.
V. EXPERIMENTAL STUDY
To evaluate the effectiveness of the proposed
approaches, an experimental study has been conducted
following the guidelines below:
- Context: The experiment has been conducted in the
context of the Software Engineering class at UFSCar,
Federal University of São Carlos, as a volunteer extra
activity. The experiment consisted of each pair of students
conducting requirements gathering on a system involving
a real stakeholder. The final RD had to be created in the
COCAR tool.
Figure 7 – Pertinence function for RTM-NLP
TABLE I – RULES USED IN FUZZY SYSTEM
Antecedent                                                   Consequent
if RTM-E = "no dependence" AND RTM-NLP = "no dependence"    then "no dependence"
if RTM-E = "weak" AND RTM-NLP = "weak"                      then "weak dependence"
if RTM-E = "no dependence" AND RTM-NLP = "strong"           then "weak dependence"
if RTM-E = "strong" AND RTM-NLP = "strong"                  then "strong dependence"
if RTM-E = "no dependence" AND RTM-NLP = "weak"             then "no dependence"
if RTM-E = "weak" AND RTM-NLP = "no dependence"             then "weak dependence"
if RTM-E = "no dependence" AND RTM-NLP = "strong"           then "weak dependence"
if RTM-E = "strong" AND RTM-NLP = "weak"                    then "strong dependence"
if RTM-E = "strong" AND RTM-NLP = "no dependence"           then "weak dependence"
To exemplify the RTM-Fuzzy approach, the same aviation company stock system used in the other approaches is considered. The FRs selected for the example are FR3, related to data insertion in a stock (and already used in the other examples), and FR7, related to report generation on the stock. This combination was chosen because they do not have common entry data and, therefore, there is no dependency between them. Despite this, RTM-NLP indicates a strong dependency (75.3%) between these
requirements. This occurs because both FRs deal with the
same set of data (although they do not have common entry
data) and a similar scope, thus explaining their textual
similarity. It can be observed in Figure 8 that RTM-E
shows no dependency, whereas RTM-NLP shows a strong
dependency (treated in the third rule). In the fuzzy logic
processing (presented in Figure 9) and after applying
Mamdani's inference technique, the resulting value for the
entries is 42.5. Looking at Figure 8, it can be concluded
that this value corresponds to a “weak” dependence, with 1
as the pertinence level.
In this way, the cell corresponding to the intersection
of FR3 and FR7 of the RTMfuzzy has as the value “weak”.
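The following sketch illustrates a Mamdani-style combination of the two inputs using the rules of Table I (the duplicated rule is listed once); the membership break-points and the defuzzification grid are assumptions, so the crisp output only approximates the 42.5 reported above.

# Small Mamdani-style sketch of the RTM-Fuzzy combination: fuzzify both
# inputs, fire the rules of Table I with min, aggregate with max and
# defuzzify by centroid. Membership break-points are assumptions, so the
# crisp output differs from the 42.5 obtained with the authors' functions.
def tri(x, a, b, c):
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

SETS = {  # assumed break-points for both inputs and for the output
    "no dependence": (-1, 0, 45),
    "weak": (35, 50, 75),
    "strong": (55, 100, 101),
}

RULES = {  # Table I: (RTM-E set, RTM-NLP set) -> output set
    ("no dependence", "no dependence"): "no dependence",
    ("weak", "weak"): "weak",
    ("no dependence", "strong"): "weak",
    ("strong", "strong"): "strong",
    ("no dependence", "weak"): "no dependence",
    ("weak", "no dependence"): "weak",
    ("strong", "weak"): "strong",
    ("strong", "no dependence"): "weak",
}

def rtm_fuzzy(rtm_e_pct, rtm_nlp_pct):
    degrees = {}
    for (e_set, nlp_set), out_set in RULES.items():
        strength = min(tri(rtm_e_pct, *SETS[e_set]), tri(rtm_nlp_pct, *SETS[nlp_set]))
        degrees[out_set] = max(degrees.get(out_set, 0.0), strength)
    xs = [x / 10 for x in range(0, 1001)]          # output universe 0..100
    agg = [max(min(degrees.get(s, 0.0), tri(x, *SETS[s])) for s in SETS) for x in xs]
    total = sum(agg)
    return sum(x * m for x, m in zip(xs, agg)) / total if total else 0.0

print(rtm_fuzzy(0.0, 75.3))  # FR3 x FR7: no common entries, strong text similarity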
Figure 8 – Pertinence functions for the Fuzzy System output.
Figure 9 – RTM-Fuzzy calculation
- Objective: Evaluate the effectiveness of the RTM-E,
RTM-NLP, and RTM-Fuzzy approaches in comparison to
a reference RTM (called RTM-Ref) and constructed by
the detailed analysis of the RD. The RTM-Ref creation is
detailed next.
- Participants: 28 undergraduate students of the Bachelor in Computer Science course at UFSCar
- Artifacts utilized: RD, with the following characteristics:
• produced by a pair of students on their own;
• related to a real application, with the
participation of a stakeholder with broad
experience in the application domain;
• related to information systems domain with basic
creation, retrieval, updating and deletion of data;
• inspected by a different pair of students in order
to identify and eliminate possible defects;
• included in the COCAR tool after identified
defects are removed.
- RTM-Ref:
• created from RD input into the COCAR tool;
• built based on the detailed reading and analysis of each FR pair, determining the dependency between them as "no dependence", "weak", or "strong";
• recorded in a spreadsheet so that the RTM-Ref created beforehand could be compared to the RTMe, RTMnlp and RTMfuzzy for each system;
• built by this work's authors, who were always in touch with the RD's authors whenever a doubt was found;
• every dependency (data, functionality or predecessor) was considered as a link.
- Metrics: the metric used was the effectiveness of the
three approaches with regard to the coincidental
dependencies found by each approach in relation to the
RTM-Ref. The effectiveness is calculated by the relation
between the quantity of dependencies correctly found in
each approach, against the total of all dependencies that
can be found between the FRs. Considering a system
consisting of n FRs, the total quantity of all possible
dependencies (T) is given by Equation 6:
$T = \frac{n \times (n-1)}{2}$        (6)
Therefore, the effectiveness rate is given by Equation 7:
$\mathit{effectiveness} = \frac{\mathit{correct\ dependencies}}{T} \times 100\%$        (7)
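A small sanity check of Equations 6 and 7, using the "Zoo" row of Table II (19 FRs, 171 possible dependencies, 131 correct links for RTM-E), could look as follows.

# Equations 6 and 7, checked against the "Zoo" row of Table II
# (19 FRs, 171 possible dependencies, 131 correct for RTM-E -> 77%).
def total_possible_dependencies(n_frs):
    return n_frs * (n_frs - 1) // 2          # Equation 6

def effectiveness(correct, n_frs):
    return 100.0 * correct / total_possible_dependencies(n_frs)  # Equation 7

assert total_possible_dependencies(19) == 171
print(round(effectiveness(131, 19)))  # 77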
- Threats to validity: The experimental study conducted poses some threats to validity, mainly in terms of the students' inexperience in identifying the requirements with the stakeholders. In an attempt to minimize this risk, known domain systems were used as well as RD inspection activities. The latter was conducted based on a defect taxonomy commonly adopted in this context and which considers inconsistencies, omissions, and ambiguities, among others. Another risk is the fact that the RTM-Ref had been built by people who did not have direct contact with the stakeholder, and therefore this matrix could be influenced by eventual problems in their RDs. To minimize this risk, whenever any doubts were found when determining whether a relationship occurred or not, the students' help was solicited. In some cases a better comprehension of the requirements along with the stakeholder was necessary, which certainly minimized errors when creating the RTM-Ref.
- Results: The results of the comparison between the data in RTMe, RTMnlp, and RTMfuzzy are presented in Table II. The first column contains the name of the specified system; the second column contains the FR quantity; the third provides the total number of possible dependencies between FRs that may exist (being "strong", "weak" or "no dependence"), computed with Equation 6. The fourth, sixth and eighth columns contain the total number of dependencies in the RTMe, RTMnlp and RTMfuzzy matrices that coincide with the RTM-Ref. Exemplifying: if the RTM-Ref has determined a "strong" dependency in a cell and the RTM-E approach also registered the dependency as "strong" in the same position, a correct relationship is counted. The fifth, seventh and ninth columns present the effectiveness of the RTM-E, RTM-NLP and RTM-Fuzzy approaches, respectively, calculated as the ratio between the quantity of correct dependencies found by the approach and the total number of dependencies that could be found (third column).
TABLE II – EXPERIMENTAL STUDY RESULTS

                                  # of possible    RTM-E             RTM-NLP           RTM-Fuzzy
 #  System             Req Qty   dependencies   correct  effect.   correct  effect.   correct  effect.
 1  Zoo                  19         171           131     77%        138     81%        143     84%
 2  Habitation           24         276           233     84%        205     74%        241     87%
 3  Student Flat         28         378           295     78%        325     86%        342     90%
 4  Taxi                 15         105            82     78%         77     73%         85     81%
 5  Clothing Store       27         351           295     84%        253     72%        296     84%
 6  Freight              16         120            98     82%         85     71%        102     85%
 7  Court                24         276           204     74%        181     66%        212     77%
 8  Financial Control    17         136            94     69%        101     74%        101     74%
 9  Administration       19         171           134     78%        129     75%        137     80%
10  Book Store           19         171           129     75%        145     85%        147     86%
11  Ticket               15         105            88     84%         91     87%         94     90%
12  Movies               16         120            88     73%         82     68%         97     81%
13  Bus                  15         105            72     69%         78     74%         81     77%
14  School               15         105            82     78%         77     73%         84     80%
- Analysis of Results: The statistical analysis has been
conducted using SigmaPlot software. By applying the
Shapiro-Wilk test it could be verified that the data were
following a normal distribution, and the results shown
next are in the format: average ± standard deviation. To
compare the effectiveness of the proposed approaches
(RTM-E, RTM-NLP and RTM-Fuzzy), analysis of variance (ANOVA) for repeated measurements has been used, with the Holm-Sidak method as post-test. The
significance level adopted is 5%. The RTM-Fuzzy
approach was found to be the most effective with (82.57%
± 4.85), whereas the RTM-E approach offered (77.36% ±
5.05) and the RTM-NLP obtained an effectiveness level
of (75.64% ± 6.57). These results are similar to the results
of the real case study presented in Section IV (company’s
stock control). In the case study, the RTM-Fuzzy
effectiveness was 81.69%, the RTM-E effectiveness was
78.06% and the RTM-NLP effectiveness was 71.72%.
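The paper's analysis was done with SigmaPlot; purely as an illustration, a similar pipeline (normality check, repeated-measures ANOVA and Holm-Sidak correction) could be sketched with common Python statistics libraries as below. The data-frame layout and the pairwise p-values passed to the correction are assumptions.

# SigmaPlot was used in the paper; this sketch only shows how a comparable
# analysis could be run with scipy/statsmodels. The data layout is assumed.
import pandas as pd
from scipy.stats import shapiro
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multitest import multipletests

# One effectiveness value per system (subject) and approach; the numbers
# below are an excerpt of Table II, in %.
df = pd.DataFrame({
    "system":   ["Zoo"] * 3 + ["Habitation"] * 3 + ["Student Flat"] * 3,
    "approach": ["RTM-E", "RTM-NLP", "RTM-Fuzzy"] * 3,
    "effect":   [77, 81, 84, 84, 74, 87, 78, 86, 90],
})

for name, group in df.groupby("approach"):
    stat, p = shapiro(group["effect"])          # normality check per approach
    print(name, "Shapiro-Wilk p =", p)

print(AnovaRM(df, depvar="effect", subject="system", within=["approach"]).fit())

# Hypothetical pairwise p-values, adjusted with the Holm-Sidak method.
print(multipletests([0.01, 0.04, 0.20], method="holm-sidak")[1])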
In this experimental study, the results found for the
RTM-E approach were similar to those found in a
previous study [11]. Despite that, in the previous study the
RTM-NLP only presented an effectiveness level of 53%,
which led us to analyze and modify this approach, and the
improvements were already in place when this work was
evaluated. Even with such improvements, the approach
still generates false positive cases, i.e. non-existing
dependencies between FRs. According to Sundaram,
Hayes, Dekhtyar and Holbrook [4], the occurrence of false positives is an NLP characteristic, although this type of
processing can easily retrieve the relationship between
FRs. In the RTM-NLP approach, the reason for it
generating such false positive cases is the fact that, many
times, the same words are used to describe two different
FRs, thus indicating a similarity between the FRs which is
not a dependency between them. Examples of words that
can generate false positives are “set”, “fetch” and “list”.
Solutions to this kind of problem are being studied in order
to improve this approach. One of the alternatives is the use
of a Tagger to classify each natural language term in its
own grammatical class (article, preposition, conjunction,
verb, noun, or adjective). In this way verbs could receive a different weight from nouns in the similarity calculation. A preliminary evaluation of this alternative was
manually executed, generating better effectiveness in true
relationship determination.
In the RTM-E data analysis, false positives did not
occur. The dependencies found, even the weak ones, did
exist. The errors influencing this approach were due to
relationships that should have been counted as “strong”
being counted as “weak”. This occurred because many
times the dependency between two FRs was related to data
manipulated by both, regardless of them being entry or
output data. This way, the RTM-E approach is also being
evaluated with the objective to incorporate improvements
that can make it more effective. As previously mentioned,
if a relation was found as “strong” in RTM-Ref and the
proposed approach indicated that the relation was “weak”,
an error in the experiment’s traceability was counted. In
the case that relationships indicating only whether a traceability link exists or not had been generated, i.e. without using the "weak" or "strong" labels, the determined effectiveness would be higher. In such a case the Precision and Recall [10] metrics could be used, given that such metrics only take into account the fact that a dependency exists and not its level ("weak" or "strong").
In relation to the RTM-Fuzzy approach, the results
generated by it were always the same or higher than the
results found by the RMT-E and RTM-NLP approaches
alone. Nevertheless, with some adjustments in the fuzzy
system pertinence functions, better results could be found.
Such an adjustment is an iterative process, depending on
an evaluation each time a change is done. A more
broadened research could, for instance, introduce
intermediary levels between linguistic variables as a way
to map concepts that are hard to precisely consider in the
RTM. To improve the results, genetic algorithms can make
a more precise determination of the parameters involved in
the pertinence functions.
VI. CONCLUSIONS AND FUTURE WORK
This paper presented an approach based on fuzzy logic
– RTM-Fuzzy – to automatically generate the requirements
traceability matrix. Fuzzy logic was used to treat those
uncertainties that might negatively interfere in the
requirements traceability determination.
The RTM-Fuzzy approach was defined based on two
other approaches also presented in this paper: RTM-E,
which is based on the percentage of entry data that two
FRs have in common, and RTM-NLP, which uses NLP to
determine the level of dependency between requirements.
From the three approaches presented, it is worth
pointing that there are already some reported proposals in
the literature using NLP for traceability link determination,
mainly involving different artifacts (requirements and
models, models and source-code, or requirements and test
cases). Such a situation is not found in RTM-E, for which
no similar attempt was found in the literature.
All approaches were implemented in the COCAR
environment, so that the experimental study could be
performed to evaluate the effectiveness of each approach.
The results showed that RTM-Fuzzy presented a superior
effectiveness compared to the other two. This transpired
because the RTM-Fuzzy uses the results presented in the
other two approaches but adds a diffuse treatment in order
to perform more flexible traceability matrix generation.
Hence, considering that traceability matrix determination is a difficult task, even for specialists, the uncertainty treatment provided by fuzzy logic has shown to be a good solution for automatically determining traceability links with enhanced effectiveness.
The results motivate the continuity of this research, as
well as further investigation into how better to combine the
approaches for RTM creation using fuzzy logic.
The main contributions of this work were incorporated into the COCAR environment and correspond to the automatic determination of relationships between FRs.
This facilitates the evaluation of the impact that a change
in a requirement can generate on the others. New studies
are being conducted to improve the effectiveness of the
approaches. As future work, it is intended to improve the
NLP techniques used by considering the use of a tagger
and the incorporation of a glossary for synonym treatment.
Another investigation to be done regards how an RTM can
aid the software maintenance process, more specifically,
offer support for regression test generation.
REFERENCES
[1] Standish Group, CHAOS Report 2011, 2011. Available at: <http://www1.standishgroup.com/newsroom/chaos_2011.php>. Last accessed March 2012.
[2] Standish Group, CHAOS Report 1994, 1994. Available at: <http://www.standishgroup.com/sample_research/chaos_1994_2.php>. Last accessed February 2007.
[3] A.M. Salem, "Improving Software Quality through Requirements Traceability Models", 4th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2006), Dubai, Sharjah, UAE, 2006.
[4] S.K.A. Sundaram, J.H.B. Hayes, A.C. Dekhtyar, E.A.D. Holbrook, "Assessing Traceability of Software Engineering Artifacts", 18th International IEEE Requirements Engineering Conference, Sydney, Australia, 2010.
[5] X. Wang, G. Lai, C. Liu, "Recovering Relationships between Documentation and Source Code based on the Characteristics of Software Engineering", Electronic Notes in Theoretical Computer Science, 2009.
[6] J.H. Hayes, A. Dekhtyar, J. Osborne, "Improving Requirements Tracing via Information Retrieval", Proceedings of the 11th IEEE International Requirements Engineering Conference, IEEE CS Press, Monterey, CA, 2003, pp. 138-147.
[7] J.H. Hayes, A. Dekhtyar, S. Sundaram, "Advancing Candidate Link Generation for Requirements Tracing: The Study of Methods", IEEE Transactions on Software Engineering, vol. 32, no. 1, January 2006, pp. 4-19.
[8] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman, "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, vol. 41, no. 6, 1990, pp. 391-407.
[9] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval. ACM Press / Addison-Wesley, 1999.
[10] J. Cleland-Huang, O. Gotel, A. Zisman, Software and Systems Traceability. Springer, 2012, 491 p.
[11] A. Di Thommazo, G. Malimpensa, G. Olivatto, T. Ribeiro, S. Fabbri, "Requirements Traceability Matrix: Automatic Generation and Visualization", Proceedings of the 26th Brazilian Symposium on Software Engineering, Natal, Brazil, 2012.
[12] A. Di Thommazo, M.D.C. Martins, S.C.P.F. Fabbri, "Requirements Management in the COCAR Environment" (in Portuguese), WER 07: Workshop de Engenharia de Requisitos, Toronto, Canada, 2007.
[13] I. Sommerville, Software Engineering. 9th ed. New York, Addison Wesley, 2010.
[14] A. Zisman, G. Spanoudakis, "Software Traceability: Past, Present, and Future", Newsletter of the Requirements Engineering Specialist Group of the British Computer Society, September 2004.
[15] Standish Group, CHAOS Report 2005, 2005. Available at: <http://www.standishgroup.com/sample_research/PDFpages/q3spotlight.pdf>. Last accessed February 2007.
[16] A. Kannenberg, H. Saiedian, "Why Software Requirements Traceability Remains a Challenge", CrossTalk: The Journal of Defense Software Engineering, July/August 2009.
[17] A. Goknil, I. Kurtev, K. Van den Berg, J.W. Veldhuis, "Semantics
of Trace Relations in Requirements Models for Consistency
Checking and Inferencing", Software and Systems Modeling, vol.
10, iss. 1, February 2011.
[18] Y. Guo, M. Yang, J. Wang, P. Yang, F. Li, "An Ontology based
Improved Software Requirement Traceability Matrix", 2nd
International Symposium on Knowledge Acquisition and
Modeling, KAM, Wuhan, China, 2009.
[19] E.V. Munson, T.N. Nguyen, "Concordance, Conformance,
Versions, and Traceability", Proceedings of the 3rd International
Workshop on Traceability in Emerging Forms of Software
Engineering, Long Beach, California, 2005.
[20] D. Cuddeback, A. Dekhtyar, J.H. Hayes, "Automated
Requirements Traceability: The Study of Human Analysts",
Proceedings of the 2010 18th IEEE International Requirements
Engineering Conference (RE2010), Sydney, Australia, 2010.
[21] IBM, Ten Steps to Better Requirements Management. Available at: <http://public.dhe.ibm.com/common/ssi/ecm/en/raw14059usen/RAW14059USEN.PDF>. Last accessed March 2012.
[22] L.A. Zadeh, "Fuzzy Sets", Information Control, vol. 8, pp. 338–
353, 1965.
[23] A.O. Artero, "Artificial Intelligence - Theory and Practice",
Livraria Fisica, 2009, 230 p.
[24] M. Ramzan, M.A. Jaffar, M.A. Iqbal, S. Anwar, A.A. Shahid,
"Value based Fuzzy Requirement Prioritization and its Evaluation
Framework", 4th International Conference on Innovative
Computing, Information and Control (ICICIC), 2009.
[25] J. Yen, W.A. Tiao, "Formal Framework for the Impacts of Design
Strategies on Requirements", Proceedings of the Asian Fuzzy
Systems Symposium, 1996.
[26] S.G. MacDonell, A.R. Gray, J.M. Calvert, "FULSOME: A Fuzzy
Logic Modeling Tool for Software Metricians", Annual
Conference of the North American Fuzzy Information Processing
Society (NAFIPS), 1999.
[27] P.S. Sandhu, H. Singh, "A Fuzzy-Inference System based
Approach for the Prediction of Quality of Reusable Software
Components", Proceedings of the 14th International Conference on
Advanced Computing and Communications (ADCOM), 2006.
[28] K.K. Kawai, "Guidelines for Preparation of Requirements
Document with Emphasis on the Functional Requirements" (in
Portuguese), Master Thesis, Universidade Federal de São Carlos,
São Carlos, 2005.
[29] R. Real, J.M. Vargas, “The Probabilistic Basis of Jaccard's Index
of Similarity”, Systematic Biology, vol. 45, no. 3, pp.380-385,
1996.
Avalilable
at:
http://sysbio.oxfordjournals.org/content/45/3/380.full.pdf
Last
accessed November 2012.
[30] G. Cysneiros, A. Zisman, "Traceability and Completeness
Checking for Agent Oriented Systems", Proceedings of the 2008
ACM Symposium on Applied Computing, New York, USA, 2008.
[31] D.K. Deeptimahanti, R. Sanyal, "Semi-automatic Generation of
UML Models from Natural Language Requirements", Proceedings
of the 4th India Software Engineering Conference 2011 (ISEC'11),
Kerala, India, 2011.
[32] G. Salton, J. Allan, "Text Retrieval Using the Vector Processing
Model", 3rd Symposium on Document Analysis and Information
Retrieval, University of Nevada, Las Vegas, 1994.
Determining Integration and Test Orders in the
Presence of Modularization Restrictions
Wesley Klewerton Guez Assunção 1,2, Thelma Elita Colanzi 1,3, Silvia Regina Vergilio 1, Aurora Pozo 1
1 DINF - Federal University of Paraná (UFPR), CP: 19081, CEP: 81.531-980, Curitiba, Brazil
2 COINF - Technological Federal University of Paraná (UTFPR), CEP: 85.902-490, Toledo, Brazil
3 DIN - Computer Science Department - State University of Maringá (UEM), CEP: 87.020-900, Maringá, Brazil
Email: {wesleyk, thelmae, silvia, aurora}@inf.ufpr.br
Abstract—The Integration and Test Order problem is well known in the software testing area. It concerns the determination of a test order of modules that minimizes stub creation effort, and consequently testing costs. A solution approach based on Multi-Objective and Evolutionary Algorithms (MOEAs) has achieved promising results, since these algorithms allow the use of different factors and measures that affect the stubbing process, such as the number of attributes and operations to be simulated by the stub. However, works based on this approach do not consider modularization restrictions related to the software development environment, for example, the fact that some modules may be grouped into clusters to be developed and tested by independent teams. This is a very common practice in most organizations, particularly in those that adopt a distributed development process. Considering this fact, this paper introduces an evolutionary and multi-objective strategy to deal with such restrictions. The strategy was implemented and evaluated with real systems and three MOEAs. The results are analysed in order to compare the algorithms' performance and to better understand the problem in the presence of modularization restrictions. We observe an impact on the costs and a more complex search when restrictions are considered. The obtained solutions are useful and the strategy is applicable in practice.
Index Terms—Software testing; multi-objective evolutionary
algorithms; distributed development.
I. INTRODUCTION
The Integration and Test Order problem concerns the determination of a test sequence of modules that minimizes stubbing costs in integration testing. Testing is generally conducted in different phases. For example, unit testing searches for faults in the smallest part to be tested, the module. In the integration test phase, the goal is to find interaction faults between the units. In many cases there are dependency relations between the modules, that is, to test a module A another module B needs to be available. When dependency cycles among modules exist, it is necessary to break a cycle and to construct a stub for B. However, the stubbing process may be expensive, and several approaches to reduce stubbing costs can be found in the literature. This is an active research topic that was recently addressed in a survey [1].
The most promising results were found by the search-based approach with Multi-Objective and Evolutionary Algorithms (MOEAs) [2]–[7]. These algorithms offer a multi-objective treatment to the problem. They use Pareto dominance concepts to provide the tester with a set of good solutions (orders) that represent the best trade-off among the different factors (objectives) used to measure the stubbing costs, such as the number of operations, attributes, method parameters and returns that are necessary to emulate the stub behaviour.
The use of MOEAs to solve the Integration and Test Order problem in the object-oriented context was introduced in our previous work [2]. After achieving satisfactory results, we applied MOEAs in the aspect-oriented context with different numbers of objectives [4], [6] and using different strategies to test aspects and classes [7] (based on the study of Ré and Masiero [8]). In [3], the approach was generalized and named MOCAITO (Multi-objective Optimization and Coupling-based Approach for the Integration and Test Order problem). MOCAITO solves the referred problem by using MOEAs and coupling measures. It is suitable for any type of unit to be integrated in different contexts, including object-oriented, aspect-oriented, component-driven, software product line, and service-oriented contexts. The units to be tested can be components, classes, aspects, services and so on. The steps include: i) the construction of a model to represent the dependencies between the units; ii) the definition of a cost model related to the fitness functions and objectives; iii) the multi-objective optimization, i.e., the application of the algorithms; and iv) the selection of a test order to be used by the tester. MOCAITO was implemented and evaluated in the object and aspect-oriented contexts and presented better results when compared with other existing approaches.
However, we observe a limitation of MOCAITO and of all approaches found in the literature. In practice there may be different restrictions related to the software development that are not considered by existing approaches. For example, some relevant restrictions are related to software modularization. Modularity is an important design principle that allows the division of the software into modules. Modularity is useful for dealing with complexity, improves comprehension, eases reuse, and reduces development effort. Furthermore, it facilitates management in distributed development [9], [10]. In this kind of development, clusters of related modules are generally developed and tested at separate locations by different teams. In a later stage all the sets are then integrated. In some cases these teams may be members of the same organization; in other cases, collaboration or outsourcing involving different organizations may exist.
The dependencies between modules across different clusters make the integration testing more difficult to perform. Determining an order that implies a minimum cost is in most situations a very hard task for software engineers without an automated strategy. Considering this fact, this paper introduces a strategy to help in this task and to determine the best module orders for the Integration and Test Order problem in the presence of modularization restrictions. The strategy is based on evolutionary optimization algorithms and is implemented in the MOCAITO approach. We implemented the MOEAs NSGA-II, SPEA2 and PAES, traditionally used in related work, and introduced evolutionary operators that consider that some modules are developed and tested together, and thus need to appear as a cluster in the solution (order). Moreover, four measures (objectives) are used.
Determining the orders in the presence of restrictions imposes limitations on the search space, and consequently impacts the performance of the MOEAs. So, it is important to evaluate the impact of modularization restrictions on the integration testing. To evaluate this impact, we conducted experiments by applying two strategies: one with and another without software modularization restrictions. The experiment uses eight real systems and two development contexts: object and aspect-oriented.
The paper is organized as follows. Section II reviews related research, including the MOCAITO approach. Section III introduces the proposed strategy for the integration problem in the presence of modularization restrictions and shows how the algorithms were implemented. Section IV contains the experimental evaluation setting. Section V presents and analyses the obtained results. Finally, Section VI concludes the paper and points out future research.
II. RELATED WORK
The Integration and Test Order problem has been addressed in the literature by many works [1] in different software development contexts: object and aspect-oriented software, component-driven development, software product lines and service-oriented systems. The existing approaches are generally based on graphs where the nodes represent the units to be integrated and the edges the relationships between them [11]. The goal is to find an order for integrating and testing the units that minimizes stubbing costs. To this end, several optimization algorithms have been applied, as well as different cost measures. The so-called traditional approaches [11]–[15] are based on classical algorithms, which provide a deterministic solution, not necessarily optimal. Metaheuristic, search-based techniques, such as Genetic Algorithms (GAs), provide better solutions since they avoid local optima [16]. The multi-objective algorithms offer a better treatment to the problem, which is in fact dependent on different and conflicting measures [2], [4].
However, we observe that all of the existing approaches and studies have a limitation. They do not consider, and were not evaluated with, real restrictions associated with software development, such as modularization restrictions and groups of modules that are developed together. Introducing a strategy that considers the problem in the presence of such restrictions is the goal of this paper. To this end, the strategy was proposed and implemented to be used with the multi-objective approach. This choice is justified by studies conducted in the works described above showing that, independently of the development context, multi-objective approaches present better results. Pareto concepts are used to determine a set of good and non-dominated solutions to the problem. A solution is considered non-dominated, according to its objective values, when there is no other solution that is at least as good in all objectives and strictly better in at least one of them.
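The dominance test described above can be written directly; the following is a minimal sketch, assuming minimization of every objective (all class and method names here are illustrative, not taken from MOCAITO).

    // Minimal sketch of Pareto dominance for cost vectors (minimization assumed).
    final class Dominance {
        /** Returns true if cost vector a dominates cost vector b. */
        static boolean dominates(int[] a, int[] b) {
            boolean strictlyBetterSomewhere = false;
            for (int i = 0; i < a.length; i++) {
                if (a[i] > b[i]) {
                    return false;               // worse in at least one objective
                }
                if (a[i] < b[i]) {
                    strictlyBetterSomewhere = true;
                }
            }
            return strictlyBetterSomewhere;     // no worse anywhere, better somewhere
        }
    }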
The solution that deals with modularization restrictions is thus proposed to be used with MOCAITO [3]. The main reason is that this approach is generic and can be used in different software development contexts. MOCAITO is based on multi-objective optimization of coupling measures, which are used as objectives for the algorithms. The steps of the approach are presented in Figure 1.
First of all, a dependency model that represents the dependency relations among the units to be integrated is built. This allows the application of MOCAITO in different development contexts, with different kinds of units to be integrated. An example of such a model, used in our work, is the ORD (Object Relation Diagram) [11] and its extension for the aspect-oriented context [12]. In these diagrams the vertices represent the modules (classes or aspects), and the edges represent the relations, which can be: association, cross-cutting association, use, aggregation, composition, inheritance, inter-type declarations, and so on.
Another step is the definition of a cost model. This model is generally based on software measures, used as fitness functions (objectives) by the optimization algorithms. Such measures are related to the stubbing process costs. MOCAITO was evaluated with different numbers of objectives, traditionally considered in the literature, and based on four coupling measures. Considering that mi and mj are two coupled modules, the coupling measures are defined as follows:
- Attribute Coupling (A): The maximum number of attributes to be emulated in stubs related to the broken dependencies [16]. A is represented by a matrix AM(i, j), where rows and columns are modules and i depends on j. For a given test order t with n modules and a set of d dependencies to be broken, considering that k is any module included before module i, A is calculated according to:

A(t) = \sum_{i=1}^{n} \sum_{j=1}^{n} AM(i, j), \quad j \neq k

- Operation Coupling (O): The maximum number of operations to be emulated in stubs related to the broken dependencies [16]. O is represented by a matrix OM(i, j), where rows and columns are modules and i depends on j. Then, for a given test order t with n modules and a set of d dependencies to be broken, considering that k is any module included before module i, O is computed as defined by:

O(t) = \sum_{i=1}^{n} \sum_{j=1}^{n} OM(i, j), \quad j \neq k
[Fig. 1. MOCAITO Steps (extracted from [3]): construction of the dependency model, definition of the cost model, multi-objective optimization, and order selection, with their artifacts (dependency information, dependency model, cost information, cost model, constraints, rules, test orders, selected test order).]
- Number of distinct return types (R): Number of distinct return types of the operations locally declared in the module mj that are called by operations of the module mi. Returns of type void are not counted, since they represent the absence of return. Similarly to the previous measures, this measure is given by:

R(t) = \sum_{i=1}^{n} \sum_{j=1}^{n} RM(i, j), \quad j \neq k

- Number of distinct parameter types (P): Number of distinct parameters of the operations locally declared in mj that are called by operations of mi. When there is operation overloading, the number of parameters is equal to the sum of all distinct parameter types among all implementations of each overloaded operation. So, the worst case is considered, represented by situations in which the coupling consists of calls to all implementations of a given operation. This measure is given by:

P(t) = \sum_{i=1}^{n} \sum_{j=1}^{n} PM(i, j), \quad j \neq k

A small sketch of how these four sums can be computed for a given test order is shown below.
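As a rough illustration (not the authors' implementation), the four sums above could be computed as follows for a test order t, assuming that modules are identified by the indices 0..n-1 and that AM, OM, RM and PM are given as integer matrices; a dependency i -> j contributes to the cost only when j appears after i in the order, i.e., when a stub for j is needed.

    // Sketch of the objective computation for one test order (assumptions noted above).
    final class CouplingObjectives {
        static int[] evaluate(int[] order, int[][] am, int[][] om, int[][] rm, int[][] pm) {
            int n = order.length;
            int[] position = new int[n];                // position of each module in the order
            for (int p = 0; p < n; p++) {
                position[order[p]] = p;
            }
            int a = 0, o = 0, r = 0, pSum = 0;
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (position[j] > position[i]) {    // j not yet integrated: broken dependency
                        a += am[i][j];
                        o += om[i][j];
                        r += rm[i][j];
                        pSum += pm[i][j];
                    }
                }
            }
            return new int[] { a, o, r, pSum };         // objectives A, O, R, P
        }
    }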
After this, the multi-objective algorithms are applied. They can work with constraints given by the tester, which can make an order invalid. Some constraints adopted in the approach evaluation [3] are not to break inheritance and inter-type declaration dependencies. These dependencies are complex to simulate, so to deal with these types of constraints the dependent modules must be preceded by the required modules. The treatment involves checking the test order from the first to the last module and, if a precedence constraint is broken, the module in question is placed at the end of the order. As output, the algorithms generate a set of solutions to the problem that have the best trade-off among the objectives. The orders can be ranked according to some priorities (rules) of the tester.
In [3], MOCAITO was evaluated in the object and aspect-oriented contexts with different numbers of objectives and with three MOEAs: NSGA-II [17], SPEA2 [18] and PAES [19]. These three MOEAs were chosen because they are well known, largely applied, and adopt different evolution and diversification strategies [20]. Moreover, knowing which algorithm is more suitable to solve a particular problem is a question that needs to be answered by means of experimental results. The main results found are: there is no difference among the algorithms for simple systems and contexts; SPEA2 was the most expensive, with the greatest runtime; NSGA-II was the most suitable in the general case (considering different quality indicators and all systems); and PAES presented better performance for more complex systems.
However, the approach was not evaluated taking into account some real restrictions generally associated with software development, mentioned in the last section. Most organizations nowadays adopt distributed development but, in spite of this, we observe in the survey of the literature [1] that related restrictions are not considered by studies that deal with the Integration and Test Order problem.
In this sense, the main contribution of this paper is to introduce and evaluate a strategy based on optimization algorithms to solve the referred problem in the presence of modularization restrictions. In fact, in the presence of such restrictions a new problem (a variant of the original one) emerges, which presents new challenges related to task allocation versus the establishment of an integration and test order. For this, a novel solution strategy is necessary and proposed, including a new representation and new genetic operators. This new solution strategy is evaluated with the MOCAITO approach, using its cost and dependency models; however, it could be applied with other approaches. So, the next section describes implementation aspects of the strategy to deal with such modularization restrictions.
III. WORKING WITH MODULARIZATION RESTRICTIONS
A restriction is a condition that a solution is required to satisfy. In the context of this study, restrictions are related to modularization. We consider that some modules are grouped by the software engineer, forming a cluster. Modules in a cluster must be developed and tested together. Figure 2 presents an example of modularization restrictions. Consider a system with twelve modules identified from 1 to 12. Due to a distributed development environment, the software engineer determines three clusters (groups) identified by A1, A2 and A3. Mixing modules of distinct clusters is not valid, as happens in Order C: using Order C, the developers of A1 need to wait for the developers of A3 to finish some modules before they can finish and test their own modules.
Orders A and B are examples of valid orders. These orders allow teams to work independently. In Order A the modules of cluster A1 are developed, integrated and tested first. Once the team responsible for the modules of A1 finishes its work, the development of the modules of cluster A3 can start, with all the modules of A1 available to be used in the integration test. Similarly, when the team responsible for cluster A2 starts its work, the modules of A1 and A3 are already available. The independence of the development of each cluster by different teams also occurs in Order B, since the modules of each cluster are in sequence to be developed, integrated and tested.
Although Figure 2 shows the modules in sequence, when there are no dependencies between some clusters the development may be performed in parallel. In this case, each team could develop and test the modules of its cluster according to the test order, and later the modules would be integrated, also in accordance with the same test order.

[Fig. 2. Example of modularization restrictions: the system modules (1 to 12), the clusters of modules A1, A2 and A3, and the integration and test orders A, B and C.]

A. Problem Representation

To implement a solution to the problem with restrictions, the first point refers to the problem representation. The traditional way to deal with the problem uses as representation an array of integers, where each number in the array corresponds to a module identifier [3]. However, a more elaborate representation is needed to consider the modules grouped into clusters. A class Cluster, as presented in Figure 3, was implemented. An object of the class Cluster is composed of two attributes: (i) a cluster identifier (id) of type integer; and (ii) an array of integers (modules) that represents the cluster modules. An individual is composed of n Cluster objects, where n is the number of clusters.

[Fig. 3. Class Cluster: +id: int; +modules: ArrayList<Integer>]

B. Evolutionary Operators

The traditional way [2]–[5] to apply the evolutionary operators to the Integration and Test Order problem is the same adopted in a permutation problem. However, with the modularization restrictions a new way to generate and deal with the individuals (solutions) is required. Next, we present the implemented crossover and mutation operators.

1) Crossover: The MOCAITO approach applies the two-point crossover [3]. However, a simple random selection of two points to perform the crossover could disperse the modules of a cluster across the order. So, considering the modularization restrictions, two types of crossover were implemented: (i) Inter Cluster; and (ii) Intra Cluster, which are depicted in Figure 4.

The goal of the Inter Cluster crossover is to generate children that receive a full exchange of complete clusters between their parents. As illustrated in the example of Figure 4(a), after the random selection of the cluster to be exchanged, Child1 receives Cluster2 and Cluster3 from Parent1, and Cluster1 from Parent2. In the same way, Child2 receives Cluster2 and Cluster3 from Parent2, and Cluster1 from Parent1.

The Intra Cluster crossover aims at creating new solutions that receive clusters generated by two-point crossover of a specific cluster. After the random selection of one cluster, the traditional two-point crossover for permutation problems is applied. The other clusters, which do not participate in the crossover, are copied from the parents to the children. As illustrated in Figure 4(b), Cluster1 was randomly selected and two crossover points were defined. Cluster2 and Cluster3 from Parent1 are simply copied to Child1 and, in the same way, Cluster2 and Cluster3 from Parent2 are copied to Child2.

[Fig. 4. Crossover Operator: (a) Inter Cluster; (b) Intra Cluster.]

During the crossover, in the evolutionary process, two parents are selected and four children are generated: two using the Inter Cluster crossover and two using the Intra Cluster crossover.

2) Mutation: The MOCAITO approach applies the traditional swap mutation [3], swapping module positions in the order. But, again, the simple application of this mutation could disperse the modules of a cluster across the order. So, two different types of mutation are implemented: (i) Inter Cluster; and (ii) Intra Cluster, which are presented in Figure 5.

[Fig. 5. Mutation Operator: (a) Inter Cluster; (b) Intra Cluster.]

Both types of mutation are simple. While the Inter Cluster mutation swaps cluster positions in the order, the Intra Cluster mutation swaps module positions in a cluster. Figure 5(a) illustrates the Inter Cluster mutation, where the positions of Cluster1 and Cluster3 are swapped. Figure 5(b) presents an example of the Intra Cluster mutation, where, after the random selection
of Cluster1, the positions of Modules 1 and 3 are swapped. During the evolutionary process, both kinds of mutation have a 50% probability of being chosen.
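The following sketch illustrates the representation of Figure 3 and the two mutation operators as described above; the surrounding class and method names are assumptions made only to give a self-contained, runnable example.

    // Illustrative sketch of the cluster-based representation and mutation operators.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    class Cluster {
        int id;                                        // cluster identifier (Figure 3)
        ArrayList<Integer> modules = new ArrayList<>(); // module identifiers of the cluster
    }

    class ClusterMutation {
        private final Random random = new Random();

        /** Inter Cluster: swaps the positions of two clusters in the order. */
        void interCluster(List<Cluster> individual) {
            int a = random.nextInt(individual.size());
            int b = random.nextInt(individual.size());
            Collections.swap(individual, a, b);
        }

        /** Intra Cluster: swaps two module positions inside one randomly chosen cluster. */
        void intraCluster(List<Cluster> individual) {
            Cluster c = individual.get(random.nextInt(individual.size()));
            if (c.modules.size() < 2) return;
            int a = random.nextInt(c.modules.size());
            int b = random.nextInt(c.modules.size());
            Collections.swap(c.modules, a, b);
        }

        /** Each type of mutation is chosen with 50% probability, as described above. */
        void mutate(List<Cluster> individual) {
            if (random.nextBoolean()) interCluster(individual); else intraCluster(individual);
        }
    }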
C. Repairing broken dependencies

There are two types of treatment for repairing orders that break dependency constraints (inheritance and inter-type declarations). In the Intra Cluster treatment, the constraints between modules in the same cluster are checked and the precedence is corrected by placing the corresponding module at the end of the cluster. After all the precedences of modules inside the clusters are correct, the constraints between modules of different clusters are checked during the Inter Cluster treatment. The precedence is corrected by placing the cluster at the end of the order, so the dependent cluster becomes the last one of the order.
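A hedged sketch of this two-step repair is given below; it reuses the Cluster representation sketched earlier and assumes that the inheritance and inter-type declaration constraints are available as (dependent, required) module pairs, which is an assumption and not part of the original description.

    // Sketch of the Intra and Inter Cluster precedence repair (single pass, simplified).
    import java.util.List;
    import java.util.Set;

    class PrecedenceRepair {
        /** Each pair p: p[0] = dependent module, p[1] = required module. */
        static void repair(List<Cluster> order, Set<int[]> precedences) {
            // Intra Cluster treatment: inside each cluster, a dependent module that
            // precedes its required module is moved to the end of the cluster.
            for (Cluster c : order) {
                for (int[] p : precedences) {
                    int dep = c.modules.indexOf(p[0]);
                    int req = c.modules.indexOf(p[1]);
                    if (dep >= 0 && req >= 0 && dep < req) {
                        c.modules.remove(dep);
                        c.modules.add(p[0]);
                    }
                }
            }
            // Inter Cluster treatment: a cluster whose module depends on a module of a
            // later cluster is moved to the end of the order.
            for (int[] p : precedences) {
                int depCluster = indexOfCluster(order, p[0]);
                int reqCluster = indexOfCluster(order, p[1]);
                if (depCluster >= 0 && reqCluster >= 0 && depCluster < reqCluster) {
                    order.add(order.remove(depCluster));
                }
            }
        }

        private static int indexOfCluster(List<Cluster> order, int module) {
            for (int i = 0; i < order.size(); i++) {
                if (order.get(i).modules.contains(module)) return i;
            }
            return -1;
        }
    }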
IV. EXPERIMENTAL SETUP

The goal of the conducted experiment is to evaluate the solution of the Integration and Test Order problem in the presence of modularization restrictions and to answer questions such as: "How does the use of the modularization restrictions impact the performance of the algorithms?" and "How useful and applicable are the solutions obtained by the proposed strategy?". For the first research question we followed the GQM method [21]1. In the case of the second question, a qualitative analysis was performed.
The experiment was conducted using a methodology similar to, and the same systems of, related work [3]. Two strategies were applied and compared: a strategy named here MC, which deals with modularization restrictions using clusters, according to the implementation described in the last section, and a strategy M applied according to [3], without modularization restrictions.
A. Used Systems
The study was conducted with eight real systems, the same ones used in [3]. Table I presents some information about these systems, such as the number of modules (classes for Java programs; classes and aspects for AspectJ programs), dependencies, LOC (lines of code) and clusters.
TABLE I
USED SYSTEMS

System          Language  Modules  Dependencies  LOC    Clusters
BCEL            Java      45       289           2999   3
JBoss           Java      150      367           8434   8
JHotDraw        Java      197      809           20273  13
MyBatis         Java      331      1271          23535  24
AJHotDraw       AspectJ   321      1592          18586  12
AJHSQLDB        AspectJ   301      1338          68550  15
HealthWatcher   AspectJ   117      399           5479   7
Toll System     AspectJ   77       188           2496   7
1 Due to lack of space, the GQM table is available at:
https://dl.dropboxusercontent.com/u/28909248/GQM-Method.pdf.
B. Clusters Definition
To define the clusters of modules, the separation of concerns principle [22] was followed. By this principle, the effort to develop the software, and consequently to test it, becomes smaller. Following the separation of concerns, the modules of a cluster should be interconnected in a relatively simple manner, presenting low coupling to other clusters. Hence, this procedure benefits distributed development, since it decreases the interdependence between the teams.
In this way, each system was divided into clusters according to the concerns that they realize, so each team should develop, integrate and test one cluster that deals with one concern present in the system. Aiming at confirming the interdependencies between the modules of the clusters, we checked this division by constructing directed graphs and considering the inheritance and inter-type declaration dependencies, the ones that should not be broken. The number of clusters for each system is presented in the last column of Table I.
C. Obtaining the Solutions Sets
To analyze the results we use different sets of solutions. These sets are found in different ways. Below, we describe how each solution set was obtained:
• PFapprox: one set PFapprox for a system was obtained in each run of an algorithm. Each MOEA was executed 30 times for each system in order to know the behavior of each algorithm when solving the problem. So, at the end, 30 sets PFapprox were obtained.
• PFknown: this set was obtained for each system through the union of the 30 sets PFapprox, removing dominated and repeated solutions. PFknown represents the best solutions found by each MOEA.
• PFtrue: this set was obtained for each system through the union of the sets PFknown, removing dominated and repeated solutions. PFtrue represents the best solutions known for the problem. This procedure to obtain the best solutions to a problem is recommended when the ideal set of solutions is not known [23]. A sketch of this union-and-filtering procedure is given below.
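The sketch reuses the dominance test sketched in Section II; representing a solution by its vector of objective values is an assumption made only for illustration.

    // Sketch of how PFknown/PFtrue could be assembled: union of fronts, dropping
    // repeated and dominated solutions (minimization assumed).
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    final class ParetoUnion {
        /** Each solution is its vector of objective values (A, O, R, P). */
        static List<int[]> merge(List<List<int[]>> fronts) {
            List<int[]> union = new ArrayList<>();
            for (List<int[]> front : fronts) {
                for (int[] s : front) {
                    if (union.stream().noneMatch(u -> Arrays.equals(u, s))) {
                        union.add(s);                    // drop repeated solutions
                    }
                }
            }
            List<int[]> result = new ArrayList<>();
            for (int[] s : union) {
                boolean dominated = false;
                for (int[] other : union) {
                    if (other != s && Dominance.dominates(other, s)) {
                        dominated = true;                // drop dominated solutions
                        break;
                    }
                }
                if (!dominated) result.add(s);
            }
            return result;
        }
    }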
D. Quality Indicators
To compare the results presented by the MOEAs, we used two quality indicators commonly used in the MOEA literature: (i) Coverage and (ii) Euclidean Distance from an Ideal Solution (ED).
The Coverage (C) [19] calculates the proportion of solutions in one Pareto front that are dominated by another front. The function C(PFa, PFb) maps the ordered pair (PFa, PFb) into the range [0,1] according to the proportion of solutions in PFb that are dominated by PFa. Similarly, we compute C(PFb, PFa) to obtain the proportion of solutions in PFa that are dominated by PFb. A value of 0 for C indicates that the solutions of the former set do not dominate any element of the latter set; on the other hand, a value of 1 indicates that all elements of the latter set are dominated by elements of the former set.
The Euclidean Distance from an Ideal Solution (ED) is used to find the solutions closest to the best objectives. It is based on Compromise Programming [24], a technique used to support the decision maker when a set of good solutions is available. An Ideal Solution has the minimum value of each objective of PFtrue, considering a minimization problem.
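The two indicators could be computed as in the sketch below, which again assumes minimization, reuses the dominance test sketched before, and is not the authors' implementation.

    // Sketch of the Coverage (C) and Euclidean Distance (ED) indicators.
    import java.util.List;

    final class Indicators {
        /** C(a, b): fraction of solutions in b dominated by at least one solution of a. */
        static double coverage(List<int[]> a, List<int[]> b) {
            int covered = 0;
            for (int[] sb : b) {
                for (int[] sa : a) {
                    if (Dominance.dominates(sa, sb)) { covered++; break; }
                }
            }
            return b.isEmpty() ? 0.0 : (double) covered / b.size();
        }

        /** Euclidean distance of a solution to the ideal point (per-objective minima of PFtrue). */
        static double euclideanDistance(int[] solution, int[] ideal) {
            double sum = 0.0;
            for (int i = 0; i < solution.length; i++) {
                double d = solution[i] - ideal[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }
    }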
E. Parameters of the Algorithms
The same methodology adopted in [2]–[4] was adopted to configure the algorithms. The parameters are in Table II. The number of fitness evaluations was used as the stop criterion for the algorithms; this allows comparing the solutions with similar computational cost. Moreover, the algorithms were executed on the same computer and the runtime was recorded.

TABLE II
MOEAS PARAMETERS

Strategy MC
Parameter                      NSGA-II  PAES   SPEA2
Population Size                300      300    300
Fitness Evaluations            60000    60000  60000
Archive Size                   -        250    250
Crossover Rate                 0,95     -      0,95
Inter Cluster Crossover Rate   1,0      -      1,0
Intra Cluster Crossover Rate   1,0      -      1,0
Mutation Rate                  0,02     1      0,02
Inter Cluster Mutation Rate    0,5      0,5    0,5
Intra Cluster Mutation Rate    0,5      0,5    0,5

Strategy M
Parameter                      NSGA-II  PAES   SPEA2
Population Size                300      300    300
Fitness Evaluations            60000    60000  60000
Archive Size                   -        250    250
Crossover Rate                 0,95     -      0,95
Mutation Rate                  0,02     1      0,02

F. Threats to Validity
The main threats to our work are related to the evaluation of the proposed solution. In fact, an ideal evaluation should consider similar strategies and different kinds of algorithms, including traditional ones. However, we have not found a similar strategy in the literature. A random strategy could be used; however, such a strategy is proven to present the worst results in the related literature, and the obtained results would be obvious. Besides, the traditional approaches, not based on evolutionary algorithms, are very difficult (some of them impossible) to adapt to consider the modularization restrictions and different cost measures. Hence, we think that addressing such restrictions with multi-objective and evolutionary approaches is more promising and practical. In addition to this, a comparison with a strategy that does not consider the restrictions can provide insights about the impact of their usage.
Another threat is related to the clusters and systems used. An ideal scenario would consider clusters used in a real context of distributed development. To mitigate this threat we adopted the separation of concerns as the criterion to compose the clusters, which we think is implicitly considered in team allocations. A further threat is the number of systems used, which is small and can influence the generalization of the obtained results. To reduce this influence we selected object and aspect-oriented systems, with different sizes and complexities, given by the number of modules and dependencies.

V. RESULTS AND ANALYSIS

In this section the results are presented and evaluated to answer the research questions. The impact of using restrictions is analysed and the practical use of MC is addressed.

A. On the impact of using modularization restrictions
In this section the impact of the restrictions is analysed in two ways (subsections): (i) evaluating the MOEAs' performance using MC, and (ii) comparing the strategies M and MC. At the end, a synthesis of the impact of modularization restrictions is presented.
1) Performance of the MOEAs using MC: The analysis conducted in this section allows evaluating the performance of the MOEAs when the modularization restrictions are considered. It is based on the quality indicators described previously. Table III presents the values of the indicator C for the sets PFknown of each MOEA. The results show differences for five systems: BCEL, MyBatis, AJHotDraw, AJHSQLDB and Toll System. For BCEL, the NSGA-II solutions dominate 75% of the PAES solutions and around 60% of the SPEA2 solutions. The SPEA2 solutions also dominate 75% of the PAES solutions. For MyBatis, the PAES solutions dominated 100% of the NSGA-II and SPEA2 solutions, and the NSGA-II solutions dominated around 73% of the SPEA2 solutions. For AJHotDraw, PAES was also better, but SPEA2 was better than NSGA-II. For AJHSQLDB, a similar behaviour was observed. For Toll System, the NSGA-II and SPEA2 solutions dominate 50% of the PAES solutions. Hence, NSGA-II and SPEA2 presented the best results.

TABLE III
COVERAGE VALUES - STRATEGY MC
(each cell gives the proportion of the column MOEA's solutions dominated by the row MOEA's solutions)

System          MOEA     NSGA-II   PAES      SPEA2
BCEL            NSGA-II  -         0,75      0,578947
                PAES     0         -         0
                SPEA2    0,181818  0,75      -
JBoss           NSGA-II  -         0         0
                PAES     0         -         0
                SPEA2    0         0         -
JHotDraw        NSGA-II  -         0,027972  0,166667
                PAES     0,427273  -         0,45098
                SPEA2    0,345455  0,020979  -
MyBatis         NSGA-II  -         0         0,729167
                PAES     1         -         1
                SPEA2    0,349515  0         -
AJHotDraw       NSGA-II  -         0         0,16129
                PAES     1         -         1
                SPEA2    0,666667  0         -
AJHSQLDB        NSGA-II  -         0         0,117647
                PAES     1         -         1
                SPEA2    0,540816  0         -
Health Watcher  NSGA-II  -         0,166667  0
                PAES     0         -         0
                SPEA2    0         0,166667  -
Toll System     NSGA-II  -         0,5       0
                PAES     0         -         0
                SPEA2    0         0,5       -
Table IV contains the results obtained for the indicator ED. The second column presents the cost of the ideal solutions. Such costs were obtained considering the lowest values of each objective among all solutions of the PFtrue of each system, independently of which solution they were achieved in.
TABLE IV
COST OF THE IDEAL SOLUTION AND LOWER ED FOUND - STRATEGY MC

                                     NSGA-II                         PAES                            SPEA2
System          Ideal Solution       Lowest ED  Solution Cost        Lowest ED  Solution Cost        Lowest ED  Solution Cost
BCEL            (40,54,33,59)        24,5764    (57,59,50,60)        74,0000    (51,59,34,132)       23,4094    (45,63,52,68)
JBoss           (25,17,4,14)         2,0000     (25,17,6,14)         2,0000     (25,17,6,14)         2,0000     (25,17,6,14)
JHotDraw        (283,258,92,140)     63,2297    (301,274,105,197)    63,2297    (301,274,105,197)    63,2297    (301,274,105,197)
MyBatis         (259,148,57,145)     203,2855   (1709,204,81,191)    147,5263   (282,235,78,260)     221,4746   (386,267,97,276)
AJHotDraw       (190,100,40,62)      51,6817    (196,105,43,113)     49,1325    (197,106,45,110)     49,6488    (200,106,45,110)
AJHSQLDB        (3732,737,312,393)   526,5302   (4217,879,415,499)   167,2692   (3836,810,365,488)   403,7809   (4069,879,403,538)
Health Watcher  (115,149,49,52)      39,7869    (138,166,67,73)      39,7869    (138,166,67,73)      39,7869    (138,166,67,73)
Toll System     (68,41,18,16)        5,4772     (68,42,20,21)        5,4772     (68,42,20,21)        5,4772     (68,42,20,21)
The other columns present, for each MOEA, the solution closest to the ideal solution and its cost in terms of each objective. For the systems JBoss, JHotDraw, Health Watcher and Toll System, all MOEAs found the same solution with the lowest value of ED. For BCEL, SPEA2 found the solution with the lowest ED. Finally, PAES obtained the solutions with the lowest ED for MyBatis, AJHotDraw and AJHSQLDB.
From the results of both indicators it is possible to see that, in the context of our study, PAES is the best MOEA, since it obtained the best results for six systems: JBoss, JHotDraw, MyBatis, AJHotDraw, AJHSQLDB and Health Watcher. Such systems have the greatest numbers of modules and clusters (Table I). NSGA-II is the second best MOEA, since it found the best results for five systems: BCEL, JBoss, JHotDraw, Health Watcher and Toll System. SPEA2 obtained the best results for four systems: BCEL, JBoss, Health Watcher and Toll System. NSGA-II and SPEA2 have similar behavior, presenting satisfactory results for the systems with few modules and few clusters (Table I).
2) Comparing the strategies M and MC: Aiming at analysing the impact of using restrictions, two pieces of information were collected for the strategies M and MC: the number of obtained solutions and the runtime. Such numbers are presented in Table V. The third and the sixth columns contain the cardinality of PFtrue. The fourth and the seventh columns present the mean number of solutions in the sets PFapprox and, between parentheses, the cardinality of PFknown. The fifth and eighth columns present the mean runtime (in seconds) used to obtain each PFapprox and, between parentheses, the standard deviation.
Verifying the number of solutions of PFtrue, it can be noticed that for BCEL and MyBatis the number of solutions found by MC was lower than that found by M. On the other hand, for JBoss and JHotDraw this number was greater in MC than in M. So, it can be observed that the systems with more solutions found by M have fewer solutions found by MC, and vice-versa.
Although the strategies M and MC involve the same effort in terms of the number of fitness evaluations, their runtimes differ greatly (Figure 6 and Table V). For all systems, NSGA-II, PAES and SPEA2 spent more runtime in strategy MC. The single exception was SPEA2, which spent less time with strategy MC for JHotDraw. Among the three MOEAs, SPEA2 spent the greatest runtime. This fact allows us to infer that in the presence of several restrictions in the search space the SPEA2 behavior may become random.
Figure 7 presents the solutions in the objective space. Due to the dimension limitation of the graphics, only three measures are presented in the pictures. In the case of JHotDraw (Figure 7(a)), the solutions of M are closer to the minimum objectives (A=0, O=0, R=0, P=0). These solutions are not feasible for the strategy MC due to the restrictions, which force the MOEAs to find solutions in other places in the search space, where a greater number of solutions are feasible but more expensive. MyBatis illustrates this point well. Figure 7(b) shows that the M solutions for MyBatis are in the same area, next to the minimum objectives. The restrictions force the MOEAs to explore other areas of the search space and, in this case, a lower number of solutions is found. These solutions are more expensive. From the results, it is possible to state that the restrictions imply a more complex search, limiting the search space and imposing a greater stubbing cost.
To better evaluate the impact on the cost of the solutions obtained by both strategies, we use the indicator ED. The solutions closest to the ideal solution are those that have the best trade-off among the objectives and are good candidates to be adopted by the tester. We compare the cost of the ideal solutions with the cost of the solutions obtained by a MOEA. In this comparison we chose the PAES solutions, since this algorithm presented the best performance, with the lowest ED values for six systems. These costs are presented in Table VI.
TABLE VI
COST OF THE SOLUTIONS IN BOTH STRATEGIES

                M                                       MC
System          Ideal Solution     PAES Solution        Ideal Solution      PAES Solution
BCEL            (45,24,0,96)       (64,39,15,111)       (40,54,33,59)       (51,59,34,132)
JBoss           (10,6,2,9)         (10,6,2,9)           (25,17,4,14)        (25,17,6,14)
JHotDraw        (27,10,1,12)       (30,12,1,18)         (283,258,92,140)    (301,274,105,197)
MyBatis         (203,70,13,47)     (265,172,49,184)     (259,148,57,145)    (282,235,78,260)
AJHotDraw       (39,12,0,18)       (46,19,1,34)         (190,100,40,62)     (197,106,45,110)
AJHSQLDB        (1263,203,91,138)  (1314,316,138,236)   (3732,737,312,393)  (3836,810,365,488)
Health Watcher  (9,2,0,1)          (9,2,0,1)            (115,149,49,52)     (138,166,67,73)
Toll System     (0,0,0,0)          (0,0,0,0)            (68,41,18,16)       (68,42,20,21)
We can observe that, except for BCEL, the costs of the MC solutions are notably greater than the costs of the M solutions. In most cases the MC cost is two or three times greater, depending on
the measure. The greatest differences were obtained for the programs Health Watcher, Toll System and JHotDraw. In the first two cases, optimal solutions were found by all the algorithms with the strategy M; these solutions are not feasible when the restrictions are considered.

TABLE V
NUMBER OF SOLUTIONS AND RUNTIME
(Number of Solutions = mean size of PFapprox with |PFknown| between parentheses; Execution Time = mean runtime in seconds with standard deviation between parentheses)

                          M                                               MC
System          MOEA      #PFtrue  Number of Solutions  Execution Time    #PFtrue  Number of Solutions  Execution Time
BCEL            NSGA-II   37       37,43 (37)           5,91 (0,05)       15       7,57 (11)            8,61 (0,11)
                PAES               39,30 (37)           6,58 (1,25)                3,40 (8)             29,89 (22,25)
                SPEA2              36,70 (37)           123,07 (18,84)             8,53 (19)            3786,79 (476,23)
JBoss           NSGA-II   1        1,00 (1)             18,73 (0,20)      2        1,97 (2)             42,50 (0,47)
                PAES               1,13 (1)             10,69 (0,62)               2,87 (2)             56,15 (12,50)
                SPEA2              1,00 (1)             2455,35 (612,18)           2,17 (2)             3536,01 (335,97)
JHotDraw        NSGA-II   11       8,40 (10)            29,85 (0,34)      153      45,80 (110)          71,90 (0,45)
                PAES               10,47 (19)           24,29 (1,50)               85,47 (143)          51,18 (2,82)
                SPEA2              9,63 (9)             922,99 (373,98)            49,17 (102)          532,83 (81,93)
MyBatis         NSGA-II   789      276,37 (941)         74,03 (0,87)      200      72,60 (103)          189,91 (0,83)
                PAES               243,60 (679)         104,30 (7,91)              108,43 (200)         132,37 (3,91)
                SPEA2              248,77 (690)         128,88 (2,65)              64,33 (144)          517,52 (67,52)
AJHotDraw       NSGA-II   94       70,03 (79)           75,05 (0,57)      31       16,30 (36)           194,34 (0,83)
                PAES               40,73 (84)           62,07 (2,16)               26,57 (31)           115,12 (2,82)
                SPEA2              68,87 (78)           195,56 (28,22)             17,53 (31)           1005,36 (268,37)
AJHSQLDB        NSGA-II   266      156,63 (360)         62,34 (0,53)      240      62,07 (196)          160,38 (1,64)
                PAES               145,97 (266)         75,62 (5,27)               122,57 (240)         122,01 (4,92)
                SPEA2              119,10 (52)          104,29 (0,68)              58,30 (170)          505,11 (101,90)
Health Watcher  NSGA-II   1        1,00 (1)             12,72 (0,15)      11       10,70 (11)           27,52 (0,10)
                PAES               1,07 (1)             8,27 (0,58)                7,47 (12)            46,98 (5,34)
                SPEA2              1,00 (1)             2580,39 (596,29)           10,20 (11)           990,19 (95,94)
Toll System     NSGA-II   1        1,00 (1)             7,33 (0,09)       4        4,27 (4)             13,23 (0,09)
                PAES               1,07 (1)             4,10 (0,75)                3,50 (4)             31,13 (16,23)
                SPEA2              1,00 (1)             3516,71 (570,76)           4,00 (4)             2229,26 (271,47)

[Fig. 6. Runtime (s) of the strategies M and MC for each system: (a) NSGA-II; (b) PAES; (c) SPEA2.]

[Fig. 7. PFtrue with and without Modularization Restrictions (objectives A, R and P): (a) JHotDraw; (b) MyBatis.]
3) Summarizing impact results: Based on the results, it is clear that the modularization restrictions increase the integration testing costs. Hence, the strategy MC can also be used in the modularization task as a simulation and decision-support tool. For example, in a distributed software development, the strategy MC can be used to allocate the modules to the different teams so as to ensure lower testing costs.
Furthermore, all the implemented algorithms can be used and present good results, solving the problem efficiently. However, we observe that for the most complex systems PAES is the best choice.
B. Practical Use of MC
This section evaluates, through an example, the usefulness and applicability of the proposed strategy. We showed in the last section that the strategy MC implies greater costs than M. However, the automatic determination of such orders in the presence of restrictions is fundamental, since considering the restrictions requires a huge additional effort. The usefulness of the proposed strategy relies on the infeasibility of manually obtaining a satisfactory solution for the problem. To illustrate this, consider the smallest system used in the experiment, BCEL, with 3 clusters and 45 modules. For it, there are 1.22E+47 possible permutations of the clusters and of the modules inside the clusters to be analysed. For the other systems this effort is even higher. Since the task of determining a test order is delegated to a MOEA, the tester only needs to concentrate his/her effort on choosing one of the orders produced by the algorithm, as explained in the example of how to use the proposed strategy presented next.
1) Example of Practical Use of MC: Table VII presents some solutions from the set of non-dominated solutions achieved by PAES for JHotDraw. The first column presents the cost of each solution (measures A, O, R, P) and the second column presents the order of the modules in the clusters. JHotDraw is the fourth largest system (197 modules) and the third largest considering the clusters (13 clusters). For this system PAES found 143 solutions. Therefore, the software engineer needs to choose which of these orders will be used.
To demonstrate how each solution should be analysed, we use the first solution from the table, whose cost is (A=283, O=292, R=102, P=206). The order shown in the second column, {87, 9, 196, ...}, ..., {..., 120, 194, 141}, indicates the sequence in which the modules must be developed, integrated and tested. Using this order, performing the integration testing of the system will require the construction of stubs to simulate 283 attributes; 292 operations, which may be class methods, aspect methods or aspect advices; 102 distinct return types; and 206 distinct parameter types.
To choose among the solutions presented in Table VII, the rule of the lowest cost for a given measure could be used: the first solution has the lowest cost for measure A, the second solution has the lowest cost for measure O, and so on. The fifth solution provides the best balance of cost among the four measures and was selected based on the indicator ED (Table IV). So, if the system under development has attributes that are complex to construct, the first solution should be used; if the system has operation parameters that are difficult to simulate, the fourth solution should be used. However, if the software tester chooses to balance all of the measures, the fifth solution is the best option, since it is closest to the minimum cost for all of the measures.
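The selection rules discussed above can be expressed as small queries over the returned set of solutions; the sketch below reuses the ED computation sketched in Section IV-D and is only illustrative, not part of the proposed tool.

    // Sketch of order selection by a prioritized measure or by distance to the ideal point.
    import java.util.Comparator;
    import java.util.List;

    final class OrderSelection {
        /** Lowest cost for a single prioritized measure (0=A, 1=O, 2=R, 3=P); assumes a non-empty list. */
        static int[] byMeasure(List<int[]> solutions, int measure) {
            return solutions.stream()
                    .min(Comparator.comparingInt((int[] s) -> s[measure]))
                    .get();
        }

        /** Best balance among all measures: minimum Euclidean distance to the ideal point. */
        static int[] byEuclideanDistance(List<int[]> solutions, int[] ideal) {
            return solutions.stream()
                    .min(Comparator.comparingDouble((int[] s) -> Indicators.euclideanDistance(s, ideal)))
                    .get();
        }
    }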
This diversity of solutions with different trade-offs among the measures is one of the great advantages of using multi-objective optimization, easing the selection of an order of modules that meets the needs of the tester.
VI. CONCLUDING REMARKS
This work described a strategy to solve the Integration and Test Order problem in the presence of modularization restrictions. The strategy is based on multi-objective and evolutionary optimization algorithms and generates orders considering that some modules are grouped and need to be developed and tested together, due, for instance, to a distributed development process.
Specific evolutionary operators were proposed to allow
mutation and crossover inside a cluster of modules. To evaluate
the impact of such restrictions the strategy, named MC,
was applied using three different multi-objective evolutionary
algorithms and eight real systems. During the evaluation the
results obtained from the application of MC were compared
with another strategy without restrictions.
With respect to our first research question, all the MOEAs achieved similar results, so they are able to satisfactorily solve the referred problem. The results point out that the modularization restrictions impact significantly on the optimization process. The search becomes more complex since the restrictions limit the search space, and the stubbing cost increases. Therefore, as the modularization restrictions impact the costs, the proposed strategy can be used as a decision-support tool during the cluster composition task, helping, for example, in the allocation of modules to the different teams in a distributed development, aiming at minimizing integration testing costs.
Regarding the second question, the usefulness of the strategy MC is supported by the difficulty of manually obtaining solutions with a satisfactory trade-off among the objectives. The application of MC provides the tester with a set of solutions, allowing him/her to prioritize some coupling measures, reducing testing efforts and costs.
MOCAITO adopts only coupling measures and, although such measures are the most used in the literature, we are aware that other factors can impact the integration testing cost. This limitation could be eliminated by using other measures during the optimization process, which should be evaluated in future experiments.
Another future work we intend to perform is to conduct experiments involving systems with greater numbers of clusters and dependencies, as well as using other algorithms. In further experiments, we also intend to use a way to group the modules of a system, such as a clustering algorithm. As MOCAITO is a generic approach, it is possible to explore other development contexts and kinds of restrictions besides modularization.
ACKNOWLEDGMENTS
We would like to thank CNPq for financial support.
TABLE VII
SOME SOLUTIONS OF PAES FOR THE SYSTEM JHOTDRAW
(each solution is given as its cost (A, O, R, P) followed by its order of clusters and modules)

Solution 1 - Cost (283,292,102,206); Order:
{87, 9, 196, 187, 67, 185}, {68, 86, 173, 105, 118, 48, 24, 170, 93, 99, 30, 104, 102, 91, 116, 23, 82, 190, 186, 46, 134, 121, 49, 89, 164, 188, 25,
115, 117, 189, 100, 50, 157, 69, 155, 26, 38, 191}, {84, 85, 58, 0, 79, 66, 98, 7, 111}, {96, 180, 123, 114, 101, 53, 12, 75, 36, 140, 107, 144, 145,
47, 62, 11, 34, 77, 97, 146, 57, 3, 73, 103, 152, 28, 158, 2, 179, 83, 122, 16, 129, 182, 27, 45, 176, 159, 5, 31, 52, 156, 165, 166, 32, 4, 150, 192, 54,
110, 137, 1, 8, 151, 113, 65, 95, 135, 132, 130}, {128, 181, 147, 125, 169, 124, 61}, {138, 108, 15, 33, 29, 154, 153}, {56, 6, 139, 42, 39, 41, 44,
40, 172, 43}, {184, 160, 76, 119, 18, 70, 178, 55, 35, 71, 64, 59, 175, 74, 126, 167, 72, 177, 174}, {136, 94, 60, 171}, {133, 51, 14, 109, 21, 148},
{10, 90, 195, 78, 88, 183, 81}, {17, 63, 19, 20, 22, 161, 37, 106, 163, 112, 168, 142, 127, 149, 92, 143, 193, 162, 80}, {131, 13, 120, 194, 141}
Solution 2 - Cost (322,258,103,192); Order:
{87, 9, 196, 187, 67, 185}, {84, 85, 0, 58, 66, 98, 7, 111, 79}, {138, 108, 15, 33, 29, 154, 153}, {68, 86, 173, 105, 118, 48, 24, 170, 93, 99, 30,
104, 102, 91, 116, 23, 82, 190, 186, 46, 134, 121, 49, 89, 164, 188, 25, 115, 117, 189, 100, 50, 157, 69, 155, 26, 38, 191}, {96, 180, 123, 114, 101,
53, 12, 75, 36, 140, 107, 144, 16, 145, 47, 62, 11, 34, 77, 97, 146, 57, 3, 73, 103, 152, 28, 158, 2, 132, 110, 179, 83, 122, 32, 129, 182, 27, 45, 176,
159, 5, 31, 52, 156, 165, 166, 135, 4, 150, 192,54, 137, 1, 8, 151, 113, 130, 65, 95}, {56, 6, 139, 42, 39, 41, 44, 40, 172, 43}, {17, 63, 19, 20, 22,
161, 37, 106, 163, 112, 168, 142, 127, 149, 92, 143, 193, 162, 80}, {184, 160, 76, 119, 18, 70, 178, 55, 35, 71, 64, 59, 175, 74, 126, 167, 72, 177,
174}, {136, 94, 60, 171}, {133, 51, 14, 109, 21, 148}, {128, 181, 147, 125, 169, 124, 61}, {131, 13, 120, 194, 141}, {10, 90, 195, 78, 88, 183, 81}
Solution 3 - Cost (2918,326,92,201); Order:
{96, 180, 123, 114, 101, 53, 12, 75, 36, 140, 107, 144, 145, 47, 62, 11, 34, 77, 97, 146, 57, 3, 73, 103, 152, 28, 158, 2, 179, 83, 122, 16, 129, 182, 27,
45, 176, 159, 5, 31, 52, 156, 165, 166, 32, 4, 150, 192, 54, 110, 137, 1, 8, 151, 113, 65, 95, 135, 132, 130}, {138, 108, 15, 33, 29, 154, 153}, {87,
9, 196, 187, 67, 185}, {84, 85, 58, 0, 79, 66, 98, 7, 111}, {128, 181, 147, 125, 169, 124, 61}, {68, 86, 173, 105, 118, 48, 24, 170, 93, 99, 30, 104,
102, 91, 116, 23, 82, 190, 186, 46, 134, 121, 49, 89, 164, 188, 25, 115, 117, 189, 100, 50, 157, 69, 155, 26, 38, 191}, {56, 6, 139, 42, 39, 41, 44,
40, 172, 43}, {184, 160, 76, 119, 18, 70, 178, 55, 35, 71, 64, 59, 175, 74, 126, 167, 72, 177, 174}, {136, 94, 60, 171}, {133, 51, 14, 109, 21, 148},
{10, 90, 195, 78, 88, 183, 81}, {17, 63, 19, 20, 22, 161, 37, 106, 163, 143, 168, 142, 127, 149, 92, 112, 193, 162, 80}, {131, 13, 120, 194, 141}
Solution 4 - Cost (3423,313,103,140); Order:
{138, 108, 15, 29, 154, 33, 153}, {56, 6, 42, 39, 139, 43, 40, 172, 44, 41}, {96, 103, 114, 123, 180, 101, 165, 12, 97, 47, 34, 36, 146, 140, 107, 77,
32, 45, 53, 75, 57, 145, 83, 11, 16, 156, 3, 95, 73, 152, 158, 192, 4, 28, 113, 144, 166, 110, 137, 27, 5, 159, 52, 62, 54, 2, 182, 179, 122, 31, 129, 150,
135, 132, 130, 1, 8, 65, 151, 176}, {187, 87, 9, 67, 196, 185}, {84, 85, 0, 58, 66, 7, 98, 79, 111}, {128, 181, 124, 147, 169, 61, 125}, {17, 19, 37,
168, 127, 161, 20, 163, 106, 22, 63, 112, 142, 143, 149, 193, 92, 80, 162}, {10, 88, 195, 78, 183, 81, 90}, {136, 171, 94, 60}, {133, 51, 14, 21, 148,
109}, {134, 157, 24, 25, 102, 191, 89, 26, 115, 173, 46, 104, 49, 30, 91, 100, 170, 82, 116, 164, 105, 121, 68, 93, 38, 99, 190, 117, 50, 48, 69, 86,
189, 155, 118, 186, 188, 23}, {131, 141, 13, 194, 120}, {184, 119, 76, 160, 18, 35, 71, 178, 70, 55, 64, 59, 74, 175, 167, 177, 174, 72, 126}
Solution 5 - Cost (301,274,105,197); Order:
{138, 29, 108, 15, 33, 154, 153}, {187, 87, 9, 196, 185, 67}, {24, 116, 93, 82, 25, 190, 49, 91, 30, 99, 170, 104, 105, 26, 115, 173, 191, 164, 121,
86, 50, 189, 46, 69, 186, 134, 38, 48, 155, 102, 100, 188, 117, 89, 23, 118, 157, 68}, {101, 96, 114, 12, 53, 180, 123, 62, 182, 16, 156, 140, 107, 103,
145, 45, 75, 144, 34, 36, 146, 11, 97, 3, 152, 158, 95, 73, 2, 122, 179, 176, 47, 28, 27, 31, 5, 165, 54, 77, 4, 57, 159, 113, 150, 52, 129, 166, 192,
83, 110, 32, 137, 135, 1, 8, 151, 65, 132, 130}, {128, 147, 124, 125, 169, 181, 61}, {131, 13, 141, 120, 194}, {58, 84, 66, 0, 7, 98, 111, 85, 79},
{10, 81, 78, 88, 195, 90, 183}, {133, 51, 14, 148, 109, 21}, {56, 6, 139, 42, 40, 39, 44, 43, 172, 41}, {136, 60, 94, 171}, {184, 76, 160, 119, 18,
64, 55, 74, 59, 35, 70, 126, 178, 175, 177, 71, 174, 167, 72}, {17, 161, 163, 20, 63, 142, 168, 19, 149, 106, 112, 193, 22, 143, 37, 127, 92, 162, 80}
REFERENCES
[1] Z. Wang, B. Li, L. Wang, and Q. Li, “A brief survey on automatic integration test order generation,” in Software Engineering and Knowledge
Engineering Conference (SEKE), 2011, pp. 254–257.
[2] W. K. G. Assunção, T. E. Colanzi, A. T. R. Pozo, and S. R. Vergilio,
“Establishing integration test orders of classes with several coupling
measures,” in 13th Genetic and Evolutionary Computation Conference
(GECCO), 2011, pp. 1867–1874.
[3] W. K. G. Assunção, T. E. Colanzi, S. R. Vergilio, and A. T. R. Pozo, “A
multi-objective optimization approach for the integration and test order
problem,” Information Sciences, 2012, submitted.
[4] T. E. Colanzi, W. K. G. Assunção, A. T. R. Pozo, and S. R. Vergilio,
“Integration testing of classes and aspects with a multi-evolutionary and
coupling-based approach,” in 3rd International Symposium on Search
Based Software Engineering (SSBSE). Springer Verlag, 2011, pp. 188–
203.
[5] S. Vergilio, A. Pozo, J. Árias, R. Cabral, and T. Nobre, “Multiobjective optimization algorithms applied to the class integration and test
order problem,” International Journal on Software Tools for Technology
Transfer, vol. 14, no. 4, pp. 461–475, 2012.
[6] T. E. Colanzi, W. K. G. Assunção, S. R. Vergilio, and A. T. R.
Pozo, “Generating integration test orders for aspect oriented software
with multi-objective algorithms,” in Proceedings of the Latin-American
Workshop on Aspect Oriented Software (LA-WASP), 2011.
[7] W. Assunção, T. Colanzi, S. Vergilio, and A. Pozo, “Evaluating different
strategies for integration testing of aspect-oriented programs,” in Proceedings of the Latin-American Workshop on Aspect Oriented Software
(LA-WASP), 2012.
[8] R. Ré and P. C. Masiero, “Integration testing of aspect-oriented programs: a characterization study to evaluate how to minimize the number
of stubs,” in Brazilian Symposium on Software Engineering (SBES),
2007, pp. 411–426.
[9] E. Carmel and R. Agarwal, “Tactical approaches for alleviating distance
in global software development,” Software, IEEE, vol. 18, no. 2, pp. 22
–29, mar/apr 2001.
[10] J. Noll, S. Beecham, and I. Richardson, “Global software development
and collaboration: barriers and solutions,” ACM Inroads, vol. 1, no. 3,
pp. 66–78, Sep. 2011.
[11] D. C. Kung, J. Gao, P. Hsia, J. Lin, and Y. Toyoshima, “Class firewall,
test order and regression testing of object-oriented programs,” Journal
of Object-Oriented Program, vol. 8, no. 2, pp. 51–65, 1995.
[12] R. Ré, O. A. L. Lemos, and P. C. Masiero, “Minimizing stub creation
during integration test of aspect-oriented programs,” in 3rd Workshop
on Testing Aspect-Oriented Programs (WTAOP), Vancouver, British
Columbia, Canada, 2007, pp. 1–6.
[13] L. C. Briand, Y. Labiche, and Y. Wang, “An investigation of graph-based
class integration test order strategies,” IEEE Transactions on Software
Engineering, vol. 29, no. 7, pp. 594–607, 2003.
[14] Y. L. Traon, T. Jéron, J.-M. Jézéquel, and P. Morel, “Efficient object-oriented integration and regression testing,” IEEE Transactions on
Reliability, pp. 12–25, 2000.
[15] A. Abdurazik and J. Offutt, “Coupling-based class integration and test
order,” in International Workshop on Automation of Software Test (AST).
Shanghai, China: ACM, 2006.
[16] L. C. Briand, J. Feng, and Y. Labiche, “Using genetic algorithms and
coupling measures to devise optimal integration test orders,” in Software
Engineering and Knowledge Engineering Conference (SEKE), 2002.
[17] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist
multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on
Evolutionary Computation, vol. 6, no. 2, pp. 182 –197, 2002.
[18] E. Zitzler, M. Laumanns, and L. Thiele, “SPEA2: Improving the Strength
Pareto Evolutionary Algorithm,” Swiss Federal Institute of Technology
(ETH) Zurich, Gloriastrasse 35, CH-8092 Zurich, Switzerland, Tech.
Rep. 103, 2001.
[19] J. D. Knowles and D. W. Corne, “Approximating the nondominated front
using the pareto archived evolution strategy,” Evolutionary Computation,
vol. 8, pp. 149–172, 2000.
[20] C. A. C. Coello, G. B. Lamont, and D. A. van Veldhuizen, Evolutionary
Algorithms for Solving Multi-Objective Problems (Genetic and Evolutionary Computation). Secaucus, NJ, USA: Springer-Verlag New York,
Inc., 2006.
[21] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and
A. Wesslén, Experimentation in Software Engineering: An Introduction.
Kluwer Academic Publishers, Norwell, MA, USA, 2000.
[22] R. S. Pressman, Software Engineering : A Practitioner’s Approach. NY:
McGraw Hill, 2001.
[23] E. Zitzler, L. Thiele, M. Laumanns, C. M. Fonseca, and V. G. da Fonseca, “Performance assessment of multiobjective optimizers: An analysis
and review,” IEEE Transactions on Evolutionary Computation, vol. 7,
pp. 117–132, 2003.
[24] J. L. Cochrane and M. Zeleny, Multiple Criteria Decision Making.
University of South Carolina Press, Columbia, 1973.
Functional Validation Driven by Automated Tests
Validação Funcional Dirigida por Testes Automatizados
Thiago Delgado Pinto
Departamento de Informática
Centro Federal de Educação Tecnológica, CEFET/RJ
Nova Friburgo, Brasil
[email protected]
Arndt von Staa
Departamento de Informática
Pontifícia Universidade Católica, PUC-Rio
Rio de Janeiro, Brasil
[email protected]
Resumo—A qualidade funcional de um software pode ser
avaliada por quão bem ele atende aos seus requisitos funcionais.
Estes requisitos são muitas vezes descritos por intermédio de
casos de uso e verificados por testes funcionais que checam sua
correspondência com as funcionalidades observadas pela
interface com o usuário. Porém, a criação, a manutenção e a
execução destes testes são trabalhosas e caras, enfatizando a
necessidade de ferramentas que as apoiem e realizem esta forma
de controle de qualidade. Neste contexto, o presente artigo
apresenta uma abordagem totalmente automatizada para a
geração, execução e análise de testes funcionais, a partir da
descrição textual de casos de uso. A ferramenta construída para
comprovar sua viabilidade, chamada de FunTester, é capaz de
gerar casos de teste valorados junto com os correspondentes
oráculos, transformá-los em código-fonte, executá-los, coletar os
resultados e analisar se o software está de acordo com os
requisitos funcionais definidos. Avaliações preliminares
demonstraram que a ferramenta é capaz de eficazmente detectar
desvios de implementação e descobrir defeitos no software sob
teste.
Abstract – The functional quality of any software system can
be evaluated by how well it conforms to its functional
requirements. These requirements are often described as use
cases and are verified by functional tests that check whether the
system under test (SUT) runs as specified. There is a need for
software tools to make these tests less laborious and more
economical to create, maintain and execute. This paper presents
a fully automated process for the generation, execution, and
analysis of functional tests based on use cases within software
systems. A software tool called FunTester has been created to perform this process and to detect discrepancies between the SUT and its functional requirements. While performing this process, it also generates the conditions that cause failures, which can then be analyzed and fixed.
Keywords – functional validation; automated functional tests;
use cases; business rules; test data generation; test oracle
generation; test case generation and execution;
I. INTRODUÇÃO
A fase de testes é sabidamente uma das mais caras da
construção de um software, correspondendo a 35 a 50% de seu
custo total quando feito da forma tradicional [1] e de 15 a 25%
quando desenvolvido com uso de técnicas formais leves [2].
Quando feita de forma manual, a atividade de teste se torna ineficiente e tediosa [3], usualmente apoiada em práticas ad
hoc e dependente da habilidade de seus criadores. Assim,
torna-se valioso o uso de ferramentas que possam automatizar
esta atividade, diminuindo os custos envolvidos e aumentando
as chances de se entregar um software com menor quantidade
de defeitos remanescentes.
Em geral, é entendido que um software de qualidade atende
exatamente aos requisitos definidos em sua especificação [4].
Para verificar este atendimento, geralmente são realizados
testes funcionais que observam a interface (gráfica) do
software visando determinar se este realmente executa tal como
especificado. Evidentemente supõe-se que os requisitos
estejam de acordo com as necessidades e expectativas dos
usuários. Como isso nem sempre é verdade, torna-se necessária
a possibilidade de redefinir a baixo custo os testes a serem
realizados.
Para simular a execução destes testes, é possível imitar a
operação de um usuário sobre a interface, entrando com ações
e dados, e verificar se o software se comporta da maneira
especificada. Esta simulação pode ser realizada via código,
com a combinação de arcabouços de teste unitário e
arcabouços de teste de interface com o usuário. Entretanto, para
gerar o código de teste de forma automática, é preciso que a
especificação do software seja descrita com mais formalidade e
de maneira estruturada ou, pelo menos, semiestruturada. Como
casos de uso são largamente utilizados para documentar
requisitos de um software, torna-se interessante adaptar sua
descrição textual para este fim.
A descrição textual de casos de uso, num estilo similar ao usado por Cockburn [5], pode ser escrita numa linguagem restrita e semiestruturada, como a adotada por Díaz, Losavio, Matteo e Pastor [6] para a língua espanhola. Esta formalidade reduz o número de variações na interpretação da descrição, facilitando a sua transformação em testes.
Trabalhos como [7, 8, 9, 10, 11, 12] construíram soluções
para apoiar processos automatizados ou semiautomatizados
para a geração dos casos de teste. Entretanto, alguns aspectos
importantes não foram abordados, deixando de lado, por
exemplo, a geração dos valores utilizados nos testes, a geração
dos oráculos e a combinação de cenários entre múltiplos casos
de uso, que são essenciais para sua efetiva aplicação prática.
No processo de geração de casos de teste automatizados criam-se primeiro os casos de teste abstratos. Estes determinam as condições que cada caso de teste deve satisfazer (por exemplo, os caminhos a serem percorridos). A partir deles determinam-se os casos de teste semânticos, isto é, casos de teste independentes de arcabouço de testes. Estes levam em conta as condições que os dados de entrada devem satisfazer de modo que os testes abstratos sejam realizados. A seguir selecionam-se os valores dos dados de entrada, gerando os casos de teste valorados. Aplicando a especificação aos casos de teste valorados determinam-se os oráculos, obtendo-se assim os casos de teste úteis. Estes, finalmente, são traduzidos para scripts ou código a ser usado por ferramentas ou arcabouços de teste automatizado.
Este artigo descreve um processo totalmente automatizado que trata muitos dos problemas não resolvidos por trabalhos anteriores (como a geração automática de dados de teste, oráculos e cenários que combinam mais de um caso de uso) e introduz novas abordagens para aumentar sua aplicação prática e realizar uma validação funcional de alta eficácia.
As próximas seções são organizadas da seguinte forma: A Seção II apresenta trabalhos correlatos. A Seção III detalha o novo processo definido. A Seção IV expõe brevemente a arquitetura da solução. A Seção V retrata uma avaliação preliminar da ferramenta. Por fim, a Seção VI apresenta as conclusões do trabalho.
II. TRABALHOS CORRELATOS
Esta seção realiza uma avaliação de alguns trabalhos correlatos, com foco na descrição textual de casos de uso como principal fonte para a geração dos testes.
A Tabela I apresenta um panorama sobre os trabalhos que construíram ferramentas para este propósito, incluindo a ferramenta que materializa a abordagem discutida neste artigo, chamada de FunTester (acrônimo para Funcional Tester). Nela, é possível observar que FunTester apresenta uma solução mais completa, implementando avanços que permitem sua aplicação prática em projetos reais.
TABELA I. PANORAMA SOBRE AS FERRAMENTAS
# | Questão | [9] | [11] | [12] | [7] | [10] | FunTester
1 | Usa somente casos de uso como fonte para os testes? | sim | sim | sim | sim | sim | sim1
2 | Qual a forma de documentação dos casos de uso? | PRS | IRS | IRS | IRS | UCML | VRS
3 | Controla a declaração de casos de uso? | sim | não | não | não | não | sim
4 | Dispensa a declaração de fluxos alternativos que avaliam regras de negócio? | não | não | não | não | não | sim
5 | Gera cenários automaticamente? | sim | sim | sim | sim | sim | sim
6 | Há um cenário cobrindo cada fluxo? | sim | sim | sim | sim | sim | sim
7 | Há cenários que verifiquem regras de negócio para um mesmo fluxo? | não | sim | não | não | não | sim
8 | Há cenários que combinam fluxos? | não | sim | sim | sim | sim | sim
9 | Há cenários que incluem mais de um caso de uso? | não | não | não | não | não | sim
10 | Há métricas para cobertura dos cenários? | não | não | não | sim | sim | sim
11 | Gera casos de teste semânticos? | não | sim | não | não | sim | sim
12 | Gera valores para os casos de teste automaticamente? | não | não | não | não | não | sim
13 | Gera oráculos automaticamente? | não | não | não | não | não | sim
14 | Casos de teste são gerados para um formato independente de linguagem ou framework? | não | sim | não | não | sim | sim
15 | Gera código de teste? | sim | não | sim | sim | não | sim
16 | Os resultados da execução do código gerado são rastreados? | sim | N/A | não | não | N/A | sim
a. N/A=Não se aplica; PRS=Português Restrito Semiestruturado; IRS=Inglês Restrito Semiestruturado; UCML=Use Case Markup Language; VRS=Vocabulário Restrito Semiestruturado independente de idioma.
1 As regras de negócio, descritas adicionalmente em relação às outras soluções, ainda pertencem aos casos de uso.
III. PROCESSO
A Figura 1 apresenta o processo realizado na abordagem apresentada e seguido pela ferramenta construída.
Fig. 1. Processo seguido pela ferramenta
Neste processo, o usuário participa apenas das etapas de
descrição dos casos de uso e de suas regras de negócio, sendo
as demais totalmente automatizadas. As etapas 1, 2, 3, 4, 5 e 9
são realizadas pela ferramenta em si, enquanto as etapas 6, 7 e
8 são realizadas por extensões da ferramenta, para a linguagem
e arcabouço de testes alvo. A seguir, será realizada uma
descrição de cada uma delas.
A. Descrição textual de casos de uso (Etapa 1)
Nesta etapa, o usuário realiza a especificação do software
através de casos de uso, auxiliado pela ferramenta. A descrição
textual segue um modelo similar ao de Cockburn [5]. A
ferramenta usa somente alguns dos campos desta descrição
para a geração de testes. As pré-condições e pós-condições são
usadas para estabelecer as dependências entre casos de uso,
numa espécie de máquina de estados. Dos fluxos (disparador,
principal e alternativos) são obtidos os cenários de execução.
De seus passos são extraídas as ações executadas pelo ator e
pelo sistema, que junto a outras informações do caso de uso
(como a indicação se ele pode ser disparado somente através de
outro caso de uso), são usadas para a geração dos testes úteis,
na etapa 5.
A ferramenta permite definir um vocabulário composto
pelos termos esperados por sua extensão para a transformação
dos testes úteis em código-fonte e pelos termos
correspondentes, usados na descrição textual dos casos de uso.
Isto permite tanto documentar o software usando palavras ou
até idiomas diferentes do vocabulário usado para geração dos
testes quanto adaptar esse último para diferentes arcabouços de
teste.
A sintaxe de um passo numa gramática livre de contexto
(GLC) similar à Backus-Naur Form (BNF) é descrita a seguir:
<passo>         ::= <disparador> <ação> <alvo>+
                  | <disparador> <documentação>
<disparador>    ::= "ator" | "sistema"
<alvo>          ::= <elemento> | <caso-de-uso>
<elemento>      ::= <widget> | <URL> | <comando> | <tecla> | <tempo>
<ação>          ::= string
<documentação>  ::= string
<caso-de-uso>   ::= string
<widget>        ::= string
<URL>           ::= string
<comando>       ::= string
<tecla>         ::= string
<tempo>         ::= integer
O ator ou o sistema dispara uma ação sobre um ou mais
alvos ou sobre uma documentação. Cada alvo pode ser um
elemento ou um caso de uso. Um elemento pode ser um
widget, uma URL, um comando, uma tecla ou um tempo (em
milissegundos, que é geralmente usado para aguardar um
processamento).
O tipo de alvo e o número de alvos possíveis para uma ação podem variar conforme a configuração do vocabulário usado. O tipo de elemento também pode variar conforme a ação escolhida.
B. Detalhamento das regras de negócio (Etapa 2)
Esta etapa é um dos importantes diferenciais da ferramenta e permite que o usuário detalhe os elementos descritos ao preencher os passos dos fluxos. Este detalhamento possibilita saber o que representa cada elemento, extrair as informações necessárias para sua conversão em widgets, inferir seus possíveis valores e formatos e extrair as informações necessárias para a geração de oráculos. A introdução das regras de negócio também permite reduzir o número de fluxos alternativos necessários para tratar erros de uso, uma vez que a ferramenta gerará automaticamente casos de teste para isto. Isto permite reduzir consideravelmente o número de caminhos no uso do software (introduzidos pela combinação dos fluxos), diminuindo o número de cenários, casos de teste e consequentemente o tempo de execução dos testes.
A sintaxe definida para as regras de negócio permite determinar regras tanto para dados contínuos quanto para dados discretos. O detalhamento de um elemento e de suas regras é exposto a seguir (em GLC):
<elemento>      ::= <nome><tipo><nome-interno>
                  | <nome><tipo><nome-interno><regra>+
<tipo>          ::= "widget" | "url" | "comando" | "teclas" | "tempo"
<nome>          ::= string
<nome-interno>  ::= string
<regra>         ::= <tipo-dado><espec-valor>+
<tipo-dado>     ::= "string" | "integer" | "double" | "date" | "time" | "datetime"
<espec-valor>   ::= <tipo-espec><mensagem>
<tipo-espec>    ::= "valor-min" <ref-valor>
                  | "valor-max" <ref-valor>
                  | "comprimento-min" <ref-valor>
                  | "comprimento-max" <ref-valor>
                  | "formato" <ref-valor>
                  | "igual-a" <ref-valor>+
                  | "diferente-de" <ref-valor>+
<mensagem>      ::= string
<ref-valor>     ::= <valor>+ | <elemento>
Um elemento – que seria equivalente ao widget da descrição do passo – possui um nome (que é o exibido para o usuário na documentação), um tipo (ex.: uma janela, um botão, uma caixa de texto, etc.) e um nome interno (que é usado internamente para identificar o widget no SST). Se for definido como editável (se recebe entradas de dados do usuário), pode conter uma ou mais regras de negócio. Cada regra define o tipo de dado admitido pelo elemento e uma ou mais especificações de valor, que permitem definir valores limítrofes, formatos, lista de valores admissíveis ou não admissíveis, além (opcionalmente) da mensagem esperada do SST caso alguma destas definições seja violada (ex.: valor acima do limítrofe). Cada especificação de valor pode ser proveniente de definição manual, de definição a partir de outro elemento ou a partir de consulta parametrizável a um banco de dados (obtendo seus parâmetros de outra definição, se preciso), fornecendo flexibilidade para a construção das regras.
A definição de regras com valores obtidos através de consulta a banco de dados permite utilizar dados criados com o propósito de teste. Estes dados podem conter valores idênticos aos esperados pelo sistema, permitindo simular condições de uso real, o que é desejável em ferramentas de teste.
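A título de ilustração, segue um esboço mínimo em Java da estrutura sugerida pela GLC acima. Os nomes de classes e campos são hipotéticos e não correspondem necessariamente ao código da ferramenta:

import java.util.ArrayList;
import java.util.List;

public class Elemento {
    enum Tipo { WIDGET, URL, COMANDO, TECLAS, TEMPO }
    enum TipoDado { STRING, INTEGER, DOUBLE, DATE, TIME, DATETIME }
    enum TipoEspec { VALOR_MIN, VALOR_MAX, COMPRIMENTO_MIN, COMPRIMENTO_MAX,
                     FORMATO, IGUAL_A, DIFERENTE_DE }

    static class EspecValor {
        TipoEspec tipo;
        List<Object> valoresDeReferencia = new ArrayList<>(); // valor manual, outro elemento ou consulta a banco
        String mensagemEsperada;                              // mensagem do SST quando a especificação é violada
    }

    static class Regra {
        TipoDado tipoDado;
        List<EspecValor> especificacoes = new ArrayList<>();
    }

    String nome;         // nome exibido na documentação
    Tipo tipo;
    String nomeInterno;  // identifica o widget no SST
    List<Regra> regras = new ArrayList<>(); // presentes apenas em elementos editáveis
}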
C. Geração de cenários para cada caso de uso (Etapa 3)
Nesta etapa, a ferramenta combina os fluxos de cada caso
de uso, gerando cenários. Cada cenário parte do fluxo
principal, possivelmente passando por fluxos alternativos,
retornando ao fluxo principal ou caindo em recursão (repetindo
a passagem por um ou mais fluxos).
Como os casos de recursão potencializam o número de
combinações entre fluxos, o número de recursões deve ser
mantido baixo, para não inviabilizar a aplicação prática da
geração de cenários. Para isso, a ferramenta permite
parametrizar o número máximo de recursões, limitando a
quantidade de cenários gerados.
A geração de cenários realizada cobre a passagem por todos
os fluxos do caso de uso, bem como a combinação entre todos
eles, pelo menos uma vez. Em cada fluxo, todos os passos são
cobertos. Isto garante observar defeitos relacionados à
passagem por determinados fluxos ou ocasionados por sua
combinação. Segundo o critério de cálculo de cobertura de
caso de uso adotado por Hassan e Yousif [13], que divide o
total de passos cobertos pelo teste pelo total de passos do caso
de uso, a cobertura atingida é de 100%.
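Um esboço simples, em Java e com estruturas hipotéticas, de como os fluxos de um caso de uso poderiam ser combinados em cenários respeitando um limite de recursões (uma simplificação da ideia descrita acima, não o algoritmo exato da ferramenta):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class GeradorDeCenarios {
    /** proximos.get(f) são os fluxos alcançáveis a partir do fluxo f. */
    static List<List<String>> gerar(String fluxoPrincipal,
                                    Map<String, List<String>> proximos,
                                    int maxRecursoes) {
        List<List<String>> cenarios = new ArrayList<>();
        explorar(fluxoPrincipal, proximos, maxRecursoes, new ArrayList<>(), new HashMap<>(), cenarios);
        return cenarios;
    }

    private static void explorar(String fluxo, Map<String, List<String>> proximos, int maxRecursoes,
                                 List<String> caminho, Map<String, Integer> visitas,
                                 List<List<String>> cenarios) {
        int repeticoes = visitas.getOrDefault(fluxo, 0);
        if (repeticoes > maxRecursoes) return;          // limita a recursão para conter a explosão de combinações
        visitas.put(fluxo, repeticoes + 1);
        caminho.add(fluxo);
        List<String> destinos = proximos.getOrDefault(fluxo, List.of());
        if (destinos.isEmpty()) {
            cenarios.add(new ArrayList<>(caminho));     // caminho terminou: registra o cenário
        } else {
            for (String destino : destinos) {
                explorar(destino, proximos, maxRecursoes, caminho, visitas, cenarios);
            }
        }
        caminho.remove(caminho.size() - 1);
        visitas.put(fluxo, repeticoes);
    }
}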
D. Combinação de cenários entre casos de uso (Etapa 4)
Esta etapa realiza a combinação entre cenários, levando em
conta os estados definidos nas pré-condições e pós-condições,
bem como as chamadas a casos de uso, que podem ocorrer em
passos de certos fluxos. Quando uma pré-condição referencia
uma pós-condição gerada por outro caso de uso, é estabelecida
uma relação de dependência de estado. Logo, os cenários do
caso de uso do qual se depende devem ser gerados primeiro,
para então realizar a combinação. O mesmo ocorre quando
existe uma chamada para outro caso de uso.
Para representar a rede de dependências entre os casos de
uso do SST e obter a ordem correta para a geração dos
cenários, é gerado um grafo acíclico dirigido dos casos de uso
e então aplicada uma ordenação topológica [14]. Antes de
combinar dois cenários, entretanto, é preciso verificar se os
fluxos de um geram os estados esperados pelo outro. Caso não
gerem, os cenários não são combinados, uma vez que a
combinação poderá gerar um novo cenário incompleto ou
incorreto, podendo impedir que a execução alcance o caso de
uso alvo do teste. Assim, na versão atual da ferramenta, para combinar dois casos de uso, A e B, onde A depende de B, são selecionados de B somente os cenários que terminem com sucesso, isto é, que não impeçam a execução do caso de uso A conforme previsto.
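A ordenação topológica citada pode ser obtida, por exemplo, com o algoritmo de Kahn [14]. Segue um esboço ilustrativo em Java, com estruturas e nomes hipotéticos:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class OrdenacaoTopologica {
    /** arestas.get(b) contém os casos de uso que dependem de b (logo, b deve vir antes deles). */
    static List<String> ordenar(List<String> casosDeUso, Map<String, List<String>> arestas) {
        Map<String, Integer> grauDeEntrada = new HashMap<>();
        for (String caso : casosDeUso) grauDeEntrada.put(caso, 0);
        for (List<String> dependentes : arestas.values())
            for (String d : dependentes) grauDeEntrada.merge(d, 1, Integer::sum);

        Deque<String> prontos = new ArrayDeque<>();
        for (String caso : casosDeUso)
            if (grauDeEntrada.get(caso) == 0) prontos.add(caso);

        List<String> ordem = new ArrayList<>();
        while (!prontos.isEmpty()) {
            String atual = prontos.remove();
            ordem.add(atual);
            for (String proximo : arestas.getOrDefault(atual, List.of()))
                if (grauDeEntrada.merge(proximo, -1, Integer::sum) == 0) prontos.add(proximo);
        }
        // Se ordem.size() < casosDeUso.size(), há ciclo e o grafo não é um DAG.
        return ordem;
    }
}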
Para garantir a correta combinação dos cenários de um caso
de uso, sem que se gerem cenários incompletos ou incorretos,
realiza-se primeiro a combinação com os cenários de casos de
uso chamados em passos; depois com os cenários de fluxos
disparadores do próprio caso de uso; e só então com cenários
de casos de uso de pré-condições.
Dados A e B dois casos de uso quaisquer do conjunto de casos de uso do software, seja NC(A) o número de cenários do caso de uso A, NC(B) o número de cenários do caso de uso B e NS(B) o número de cenários de sucesso de B. Se A depende de B, então a cobertura dos cenários de A, Cob(A), pode ser calculada como:
Cob(A) = (NC(A) × NS(B)) / (NC(A) × NC(B))
Se, por exemplo, um caso de uso B tiver 5 cenários, sendo 3 de sucesso, e outro caso de uso, A, que depende de B, tiver 8 cenários, a cobertura total seria de 40 combinações, enquanto a cobertura alcançada seria de 24 combinações, ou 60% do total.
Apesar de esta cobertura não ser total (100%), acredita-se que
ela seja eficaz para testes de casos de uso, uma vez que a
geração de cenários incorretos ou incompletos pode impedir o
teste do caso de uso alvo.
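Um esboço ilustrativo, em Java e com nomes hipotéticos, do cálculo de cobertura descrito acima:

public final class CoberturaCenarios {
    /** Fração de combinações efetivamente geradas quando A depende de B: (ncA * nsB) / (ncA * ncB). */
    static double cobertura(int ncA, int ncB, int nsB) {
        return (double) (ncA * nsB) / (ncA * ncB);
    }

    public static void main(String[] args) {
        // Exemplo do texto: B tem 5 cenários (3 de sucesso) e A, que depende de B, tem 8 cenários.
        System.out.println(cobertura(8, 5, 3)); // imprime 0.6, ou seja, 24 das 40 combinações
    }
}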
É importante notar que a combinação entre cenários é
multiplicativa, ou seja, a cada caso de uso adicionado, seu
conjunto de cenários é multiplicado pelos atuais. Foram
consideradas algumas soluções para este problema, cuja
implementação está entre os projetos futuros, discutidos na Seção VI.
E. Geração de casos de teste úteis (Etapa 5)
Nesta etapa, a ferramenta gera os casos de teste úteis
(formados por comandos, dados valorados e oráculos)
utilizando os cenários e as regras de negócio. Estas regras
permitem inferir os valores válidos e não válidos para cada
elemento e gerá-los automaticamente, de acordo com o tipo de
teste. Os oráculos são gerados com uso da definição das
mensagens esperadas para quando são fornecidos valores não
válidos, de acordo com o tipo de verificação desejada para o
teste.
Como cada descrição de expectativa de comportamento do
sistema é transformada em comandos semânticos e estes
(posteriormente) em comandos na linguagem usada pelo
arcabouço de testes, quando um comando não é executado
corretamente por motivo do SST não corresponder à sua
expectativa, o teste automaticamente falhará. Assim, não é
necessário haver oráculos que verifiquem a existência de
elementos de interface, exceto a exibição de mensagens.
Segundo Myers [15], o teste de software torna-se mais eficaz
se os valores de teste são gerados baseados na análise das
"condições de contorno" ou "condições limite". Ao utilizar
valores acima e abaixo deste limite, os casos de teste exploram
as condições que aumentam a chance de encontrar defeitos. De
acordo com o British Standard 7925-1 [16], o teste de software
torna-se mais eficaz se os valores são particionados ou
divididos de alguma maneira, como, por exemplo, ao meio.
Além disto, é interessante a inclusão do valor zero (0), que
permite testar casos de divisão por zero, bem como o uso de
valores aleatórios, que podem fazer o software atingir
condições imprevistas. Portanto, levando em consideração as
regras de negócio definidas, para a geração de valores
considerados válidos, independentes do tipo, são adotados os
critérios de: (a) valor mínimo; (b) valor imediatamente
posterior ao mínimo; (c) valor máximo; (d) valor
imediatamente anterior ao máximo; (e) valor intermediário; (f)
zero, se dentro da faixa permitida; (g) valor aleatório, dentro da
faixa permitida. E para a geração de valores considerados não
válidos, os critérios de: (a) valor imediatamente anterior ao
mínimo; (b) valor aleatório anterior ao mínimo; (c) valor
imediatamente posterior ao máximo; (d) valor aleatório
posterior ao máximo; (e) formato de valor incorreto.
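Como ilustração dos critérios acima, segue um esboço em Java (hipotético, não o código da ferramenta) da geração de valores válidos e não válidos para uma regra numérica com faixa [min, max]:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public final class GeradorDeValores {
    static List<Long> valoresValidos(long min, long max, Random aleatorio) {
        List<Long> valores = new ArrayList<>();
        valores.add(min);                                   // (a) valor mínimo
        valores.add(min + 1);                               // (b) imediatamente posterior ao mínimo
        valores.add(max);                                   // (c) valor máximo
        valores.add(max - 1);                               // (d) imediatamente anterior ao máximo
        valores.add(min + (max - min) / 2);                 // (e) valor intermediário
        if (min <= 0 && 0 <= max) {
            valores.add(0L);                                // (f) zero, se dentro da faixa
        }
        valores.add(min + (long) (aleatorio.nextDouble() * (max - min))); // (g) aleatório dentro da faixa
        return valores;
    }

    static List<Long> valoresInvalidos(long min, long max, Random aleatorio) {
        List<Long> valores = new ArrayList<>();
        valores.add(min - 1);                               // (a) imediatamente anterior ao mínimo
        valores.add(min - 1 - aleatorio.nextInt(1000));     // (b) aleatório anterior ao mínimo
        valores.add(max + 1);                               // (c) imediatamente posterior ao máximo
        valores.add(max + 1 + aleatorio.nextInt(1000));     // (d) aleatório posterior ao máximo
        return valores;                                     // (e) formato incorreto aplica-se a dados com formato definido
    }
}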
A Tabela II exibe os tipos de teste gerados na versão atual
da ferramenta, que visam cobrir as regras de negócio definidas.
Baseado neles, podemos calcular a quantidade mínima de casos de teste úteis gerados por cenário, QM, como:
QM = 8 + O + 4 × E + F
onde E é o número de elementos editáveis, O é o número de elementos editáveis obrigatórios e F é o número de elementos editáveis com formato definido. Para cada cenário de um caso de uso simples, por exemplo, com 5 elementos editáveis, sendo 3 obrigatórios e 1 com formatação, existirão 8 + 3 + 4 × 5 + 1 = 32 casos de teste.
TABELA II. TIPOS DE TESTE GERADOS
Descrição | Conclui caso de uso | Testes
Somente obrigatórios | Sim | 1
Todos os obrigatórios exceto um | Não | 1 por elemento editável obrigatório
Todos com valor/tamanho mínimo | Sim | 1
Todos com valor/tamanho posterior ao mínimo | Sim | 1
Todos com valor/tamanho máximo | Sim | 1
Todos com valor/tamanho anterior ao máximo | Sim | 1
Todos com o valor intermediário, dentro da faixa | Sim | 1
Todos com zero, ou um valor aleatório dentro da faixa, se zero não for permitido | Sim | 1
Todos com valores aleatórios dentro da faixa | Sim | 1
Todos com valores aleatórios dentro da faixa, exceto um, com valor imediatamente anterior ao mínimo | Não | 1 por elemento editável
Todos com valores aleatórios dentro da faixa, exceto um, com valor aleatório anterior ao mínimo | Não | 1 por elemento editável
Todos com valores aleatórios dentro da faixa, exceto um, com valor imediatamente posterior ao máximo | Não | 1 por elemento editável
Todos com valores aleatórios dentro da faixa, exceto um, com valor aleatório posterior ao máximo | Não | 1 por elemento editável
Todos com formato permitido, exceto um | Não | 1 por elemento com formato definido
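Como ilustração, um esboço em Java do cálculo de QM a partir das quantidades da Tabela II (nomes hipotéticos):

public final class QuantidadeMinimaDeTestes {
    /**
     * Quantidade mínima de casos de teste úteis por cenário, conforme a Tabela II:
     * 8 tipos geram 1 teste cada, 1 tipo gera O testes, 4 tipos geram E testes cada
     * e 1 tipo gera F testes, onde E, O e F são, respectivamente, o número de
     * elementos editáveis, editáveis obrigatórios e editáveis com formato definido.
     */
    static int qm(int e, int o, int f) {
        return 8 + o + 4 * e + f;
    }

    public static void main(String[] args) {
        System.out.println(qm(5, 3, 1)); // exemplo do texto: 32 casos de teste
    }
}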
Os testes gerados cobrem todas as regras de negócio definidas, explorando seus valores limítrofes e outros. Como o total de testes possíveis para um caso de uso não é conhecido, espera-se que a cobertura acumulada, que une todas as coberturas descritas até aqui, exercite consideravelmente o SST na busca por defeitos, obtendo alta eficácia.
Atualmente os casos de teste úteis são exportados para a
JavaScript Object Notation (JSON), por ser compacta,
independente de linguagem de programação e fácil de analisar
gramaticalmente.
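A título de ilustração, segue um esboço hipotético de como um caso de teste útil poderia ser exportado para JSON usando, por exemplo, a biblioteca Gson; a estrutura e o mecanismo de serialização reais da ferramenta podem diferir:

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import java.util.List;

public class ExportadorJson {
    static class CasoDeTesteUtil {
        String cenario;
        List<String> comandos;   // passos semânticos
        List<String> valores;    // dados de entrada valorados
        List<String> oraculos;   // mensagens ou estados esperados
        CasoDeTesteUtil(String cenario, List<String> comandos, List<String> valores, List<String> oraculos) {
            this.cenario = cenario;
            this.comandos = comandos;
            this.valores = valores;
            this.oraculos = oraculos;
        }
    }

    public static void main(String[] args) {
        Gson gson = new GsonBuilder().setPrettyPrinting().create();
        CasoDeTesteUtil caso = new CasoDeTesteUtil(
                "Cadastro com valor mínimo",
                List.of("ator preenche salario", "ator aciona salvar"),
                List.of("0,01"),
                List.of("mensagem: Registro salvo com sucesso"));
        System.out.println(gson.toJson(caso)); // imprime o caso de teste em JSON
    }
}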
F. Transformação em código-fonte (Etapa 6)
Esta etapa e as duas seguintes utilizam a extensão da ferramenta escolhida pelo usuário, de acordo com o software a ser testado.
A extensão da ferramenta lê o arquivo JSON contendo os testes úteis e os transforma em código-fonte, para a linguagem e arcabouços de teste disponibilizados por ela. Atualmente, há uma extensão que os transforma em código Java baseado nos arcabouços TestNG2 e FEST3, visando o teste de aplicações com interface gráfica Swing.
Para facilitar o rastreamento de falhas ou erros nos testes, a
extensão construída realiza a instrumentação do código-fonte
gerado, indicando, com comentários de linha, o passo
semântico correspondente à linha de código. Esta
instrumentação será utilizada na Etapa 8, para pré-análise dos
resultados.
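Um exemplo ilustrativo (hipotético, com nomes de janela, campos, mensagens e formato de instrumentação fictícios) de como poderia ser um teste gerado com TestNG e FEST, instrumentado com comentários de linha que identificam o passo semântico:

import org.fest.swing.fixture.FrameFixture;
import org.testng.annotations.AfterMethod;
import org.testng.annotations.BeforeMethod;
import org.testng.annotations.Test;

public class CadastroClienteValorMinimoTest {
    // Janela hipotética, apenas para o esboço compilar; o SST real fornece a janela com os componentes nomeados.
    static class JanelaDeCadastro extends javax.swing.JFrame { }

    private FrameFixture janela;

    @BeforeMethod
    public void iniciar() {
        janela = new FrameFixture(new JanelaDeCadastro()); // passo 1: sistema exibe a janela de cadastro
        janela.show();
    }

    @Test
    public void todosComValorMinimo() {
        janela.textBox("salario").enterText("0,01");       // passo 2: ator informa o salário (valor mínimo)
        janela.button("salvar").click();                   // passo 3: ator aciona Salvar
        janela.optionPane().requireMessage("Registro salvo com sucesso"); // oráculo: mensagem esperada do SST
    }

    @AfterMethod
    public void encerrar() {
        janela.cleanUp();
    }
}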
G. Execução do código-fonte (Etapa 7)
Nesta etapa, a extensão da ferramenta executa o código-fonte de testes gerado. Para isto, ela usa linhas de comando
configuradas pelo usuário, que podem incluir a chamada a um
compilador, ligador (linker), interpretador, navegador web, ou
qualquer outra aplicação ou arquivo de script que dispare a
execução dos testes.
Durante a execução, o arcabouço de testes utilizado gera
um arquivo com o log de execução dos testes. Este arquivo será
lido e analisado na próxima etapa do processo.
H. Conversão e pré-análise dos resultados de execução
(Etapa 8)
Nesta etapa, a extensão da ferramenta lê o log de execução
dos testes e analisa os testes que falharam ou obtiveram erro,
investigando: (a) a mensagem de exceção gerada, para detectar
o tipo de problema ocorrido; (b) o rastro da pilha de execução,
para detectar o arquivo e a linha do código-fonte onde a
exceção ocorreu e obter a identificação do passo semântico
correspondente (definida pela instrumentação realizada na
Etapa 6), possibilitando rastrear o passo, fluxo e cenário
correspondentes; (c) comparar o resultado esperado pelo teste
semântico com o obtido.
O log de execução e as informações obtidas da pré-análise
são convertidos para um formato independente de arcabouço de
testes e exportados para um arquivo JSON, que será lido e
analisado pela ferramenta na próxima etapa.
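Um esboço ilustrativo (hipotético) dessa pré-análise, que extrai do rastro da pilha o arquivo e a linha da falha e recupera o comentário de instrumentação correspondente no código gerado:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreAnaliseDeFalhas {
    // Ex.: "at CadastroClienteValorMinimoTest.todosComValorMinimo(CadastroClienteValorMinimoTest.java:23)"
    private static final Pattern QUADRO_DA_PILHA =
            Pattern.compile("at \\S+\\((\\S+\\.java):(\\d+)\\)");

    /**
     * Devolve o comentário de instrumentação ("// passo ...") da linha apontada pelo
     * primeiro quadro da pilha. Supõe-se que o arquivo gerado esteja diretamente no
     * diretório informado e que o formato do comentário seja o exemplificado acima.
     */
    static String passoSemantico(String rastroDaPilha, Path diretorioDosTestes) throws IOException {
        Matcher m = QUADRO_DA_PILHA.matcher(rastroDaPilha);
        if (!m.find()) return null;
        Path arquivo = diretorioDosTestes.resolve(m.group(1));
        int linha = Integer.parseInt(m.group(2));
        List<String> linhas = Files.readAllLines(arquivo);
        String codigo = linhas.get(linha - 1);
        int posicao = codigo.indexOf("// passo");
        return posicao >= 0 ? codigo.substring(posicao + 3).trim() : null;
    }
}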
I. Análise e apresentação dos resultados (Etapa 9)
Por fim, a ferramenta realiza a leitura do arquivo com os
resultados da execução dos testes e procura analisá-los para
rastrear as causas de cada problema encontrado (se houver).
Nesta etapa, o resultado da execução é confrontado com a
especificação do software, visando identificar possíveis
problemas.
2 http://testng.org
3 http://fest.easytesting.org
Fig. 2. Arquitetura da solução
IV. ARQUITETURA
A Figura 2 apresenta a arquitetura da solução construída,
indicando seus componentes e a ordem de interação entre os
mesmos, de forma a fornecer uma visão geral sobre como o
processo descrito é praticado.
É interessante observar que, em geral, o código gerado fará
uso de dois arcabouços de teste: um para automatizar a
execução dos testes e outro para testar especificamente o tipo
de interface (com o usuário) desejada. É esse arcabouço de automação dos testes que irá gerar os resultados da execução dos testes lidos pela extensão da ferramenta.
V. AVALIAÇÃO
Uma avaliação preliminar da eficácia da ferramenta foi
realizada com um software construído por terceiros, coletado
da Internet. O software avaliado contém uma especificação de
requisitos por descrição textual de casos de uso, realiza acesso
a um banco de dados MySQL4 e possui interface Swing.
A especificação encontrada no software avaliado estava
incompleta, faltando, por exemplo, alguns fluxos alternativos e
regras de negócio. Quando isso ocorre, dá-se margem para que
a equipe de desenvolvedores do software complete a
especificação conforme sua intuição (e criatividade), muitas
vezes divergindo da intenção original do projetista, que tenta
mapear o problema real. Como isto acabou ocorrendo no
software avaliado, optou-se por coletar os detalhes não
presentes na especificação a partir da implementação do software. Desta forma, são aumentadas as chances da especificação estar próxima da implementação, acusando menos defeitos deste tipo.
4 http://www.mysql.com/
Para testar a eficácia da ferramenta, foram geradas: (a) uma
versão da especificação funcional do SST com divergências;
(b) duas versões modificadas do SST, com emprego de
mutantes.
Para gerar as versões com empregos de mutantes, levou-se
em consideração, assim como no trabalho de Gutiérrez et al.
[7], o modelo de defeitos de caso de uso introduzido por Binder
[17], que define operadores mutantes para casos de uso. O uso
desses operadores é considerado mais apropriado para testes
funcionais (do que os operadores "clássicos"), uma vez que seu
objetivo não é testar a cobertura do código do SST em si, mas
gerar mudanças no comportamento do SST que possam ser
observadas por testes funcionais. Nesse contexto, o termo
"mutante" possui uma conotação diferente do mesmo termo
aplicado a código. Um mutante da especificação funcional tem
o poder de gerar uma variedade de casos de teste que
reportarão falhas. Da mesma forma, um mutante de código tem
o poder de gerar falhas em uma variedade de casos de teste
gerados a partir da especificação funcional.
Na versão da especificação do SST com divergências,
foram introduzidas novas regras de negócio, visando verificar
se os testes gerados pela ferramenta seriam capazes de
identificar as respectivas diferenças em relação ao SST. Na
primeira versão modificada do SST, foi usado o operador
mutante para casos de uso "substituição de regras de validação
ou da admissão de um dado como correto" (SRV). Com ele,
operadores condicionais usados no código de validação de
dados do SST foram invertidos, de forma a não serem
admitidos como corretos. E na segunda versão do SST, foi utilizado o operador mutante de casos de uso "informação
incompleta ou incorreta mostrada pelo sistema" (I3). Com ele,
as mensagens mostradas pelo sistema quando detectado um
dado inválido foram modificadas, de forma a divergirem da
esperada pela especificação.
A Tabela III apresenta a quantidade de modificações realizadas nos três principais cenários analisados e a Tabela IV, o número de testes que obtiveram falha, aplicando-se estas modificações.
Com o total de 7 modificações na especificação original,
mais 9 testes obtiveram falha em relação ao resultado original.
Com o emprego de 22 mutações com o primeiro mutante
(SRV), mais 51 testes obtiveram falha. E com o emprego de 10
mutações com o segundo mutante (I3), mais 12 testes falharam.
TABELA III. MODIFICAÇÕES NOS TRÊS PRINCIPAIS CENÁRIOS
Cenário | Modificações na especificação funcional | Mutações com mutante SRV | Mutações com mutante I3
Cenário 1 | 2 | 2 | 4
Cenário 2 | 3 | 14 | 4
Cenário 3 | 2 | 6 | 2
TOTAL | 7 | 22 | 10
TABELA IV. NÚMERO DE TESTES QUE FALHARAM
Cenário | SST original | SST frente à especificação funcional com divergências | SST com mutante SRV | SST com mutante I3
Cenário 1 | 4 | 10 | 8 | 10
Cenário 2 | 35 | 40 | 59 | 37
Cenário 3 | 37 | 35 | 60 | 41
TOTAL | 76 | 85 | 127 | 88
Assim, além de corretamente detectar defeitos na versão original, gerados por entradas de dados não válidos e por
diferenças na especificação, os testes gerados pela ferramenta
foram capazes de observar as mudanças realizadas, tanto em
relação à especificação quanto em relação à implementação do
SST.
VI. CONCLUSÕES
O presente artigo apresentou uma nova abordagem para a
geração e execução automática de testes funcionais, baseada na
especificação de requisitos através da descrição textual de
casos de uso e do detalhamento de suas regras de negócio.
Os resultados preliminares obtidos com o emprego da
ferramenta foram bastante promissores, uma vez que se pôde
perceber que ela é capaz de atingir alta eficácia em seus testes,
encontrando corretamente diferenças entre a especificação e a
implementação do SST.
Além disso, seu uso permite estabelecer um meio de aplicar
Test-Driven Development no nível de especificação, de forma
que o software seja construído, incrementalmente, para passar
nos testes gerados pela especificação construída. Ou seja, uma
equipe de desenvolvimento pode criar a especificação
funcional de uma parte do sistema, gerar os testes funcionais a
partir desta especificação, implementar a funcionalidade
correspondente e executar os testes gerados para verificar se a
funcionalidade corresponde à especificação. Isto, inclusive,
pode motivar a criação de especificações mais completas e
corretas, uma vez que compensará fazê-lo.
A. Principais contribuições
Dentre as principais contribuições da abordagem construída
destacam-se: (a) Apresentação de um processo completo e
totalmente automatizado; (b) Uso da descrição de regras de
negócio na especificação dos casos de uso, permitindo gerar os
valores e oráculos dos testes, bem como tornar desnecessário
descrever uma parcela significativa dos fluxos alternativos; (c)
Uso de fontes de dados externas (ex.: banco de dados) na
composição das regras de negócio, permitindo a simulação de
condições reais de uso; (d) Geração de cenários que envolvem
repetições de fluxos (loops) com número de repetições
parametrizável; (e) Geração de cenários que envolvem mais de
um caso de uso; (f) Geração de testes semânticos com nomes
correlatos ao tipo de verificação a ser realizada, permitindo que
o desenvolvedor entenda o que cada teste verifica, facilitando
sua manutenção; e (g) Uso de vocabulário configurável,
permitindo realizar a descrição textual em qualquer idioma e
diferentes arcabouços de teste.
B. Principais restrições
As principais restrições da abordagem apresentada
atualmente são: (a) Não simulação de testes para fluxos que
tratam exceções (ex.: falha de comunicação via rede, falha da
mídia de armazenamento, etc.), uma vez que exceções tendem
a ser de difícil (senão impossível) simulação através de testes
funcionais realizados através da interface com o usuário; (b) As
regras de negócio atualmente não suportam o uso de
expressões que envolvam cálculos, fórmulas matemáticas ou o
uso de expressões condicionais (ex.: if-then-else); (c) A
abrangência dos tipos de interface gráfica passíveis de teste
pela ferramenta é proporcional aos arcabouços de testes de
interface utilizados. Assim, é possível que determinados tipos
de interface com o usuário, como as criadas para games ou
aplicações multimídia, possam não ser testadas por completo,
se o arcabouço de testes escolhido não suportá-los, ou não
suportar determinadas operações necessárias para seu teste.
C. Trabalhos em andamento
Atualmente outros tipos de teste estão sendo acrescentados
à ferramenta, aumentando seu potencial de verificação. A
capacidade de análise automática dos problemas ocorridos, de
acordo com os resultados fornecidos pelo arcabouço de testes
alvo, também está sendo ampliada.
Testes mais rigorosos da ferramenta estão sendo elaborados
para verificar a flexibilidade e eficácia da ferramenta em
diferentes situações de uso prático.
D. Trabalhos futuros
A atual geração de cenários combina todos os fluxos do
caso de uso pelo menos uma vez, garantindo um bom nível de
cobertura para a geração de testes. Isto é um atributo desejável
para verificar o SST antes da liberação de uma versão para o
usuário final. Durante seu desenvolvimento, entretanto, pode
ser interessante diminuir a cobertura realizada, para que o
processo de execução de testes ocorra em menor tempo. Para
isto, propõe-se o uso de duas técnicas: (a) atribuir um valor de
importância para cada fluxo, que servirá como filtro para a seleção dos cenários desejados para teste; e (b) realizar a indicação da não influência de certos fluxos em outros, evitando gerar cenários que os combinem. Esta última técnica deve ser empregada com cuidado, para evitar a indicação de falsos negativos (fluxos que se pensa não influenciarem o estado de outros, mas que na verdade influenciam).
Para diminuir a cobertura realizada pela combinação de
cenários, com intuito de acelerar a geração dos cenários para
fins de testes rápidos, pode-se empregar a técnica de uso do
valor de importância, descrita anteriormente, além de uma
seleção aleatória e baseada em histórico. Nesta seleção, o
combinador de cenários realizaria a combinação entre dois
casos de uso escolhendo, pseudoaleatoriamente, um cenário
compatível (de acordo com as regras de combinação descritas
na Etapa 4) de cada um. Para esta seleção não se repetir, seria
guardado um histórico das combinações anteriores. Dessa
forma, a cobertura seria atingida gradualmente, alcançando a
cobertura completa ao longo do tempo.
Para reduzir o número de testes gerados, também se pode
atribuir um valor de importância para as regras de negócio, de
forma a gerar testes apenas para as mais relevantes. Também se
poderia adotar a técnica de seleção gradual aleatória (descrita
anteriormente) para as demais regras, a fim de que a cobertura
total das regras de negócio fosse atingida ao longo do tempo.
Por fim, pretende-se criar versões dos algoritmos construídos (não discutidos neste artigo) que executem em paralelo, acelerando o processo. Obviamente, também serão
criadas extensões para outros arcabouços de teste, como
Selenium5 ou JWebUnit6 (que visam aplicações web),
aumentando a abrangência do uso da ferramenta.
5 http://seleniumhq.org
6 http://jwebunit.sourceforge.net/
REFERÊNCIAS
[1] MILLER, Keith W., MORELL, Larry J., NOONAN, Robert E., PARK, Stephen K., NICOL, David M., MURRILL, Branson W., VOAS, Jeffrey M., "Estimating the probability of failure when testing reveals no failures", IEEE Transactions on Software Engineering, vol. 18, 1992, pp. 33–43.
[2] BENDER, Richard, "Proposed software evaluation and test KPA", n. 4, 1996. Disponível em: http://www.uml.org.cn/test/12/softwaretestingmaturitymodel.pdf
[3] MAGALHÃES, João Alfredo P., "Recovery oriented software", Tese de Doutorado, PUC-Rio, Rio de Janeiro, 2009.
[4] CROSBY, Philip B., "Quality is free", McGraw-Hill, New York, 1979.
[5] COCKBURN, Alistair, "Writing effective use cases", Addison-Wesley, 2000.
[6] DÍAZ, Isabel, LOSAVIO, Francisca, MATTEO, Alfredo, PASTOR, Oscar, "A specification pattern for use cases", Information & Management, n. 41, 2004, pp. 961–975.
[7] GUTIÉRREZ, Javier J., ESCALONA, Maria J., MEJÍAS, Manuel, TORRES, Jesús, CENTENO, Arturo H., "A case study for generating test cases from use cases", University of Sevilla, Sevilla, Spain, 2008.
[8] CALDEIRA, Luiz Rodolfo N., "Geração semi-automática de massas de testes funcionais a partir da composição de casos de uso e tabelas de decisão", Dissertação de Mestrado, PUC-Rio, Rio de Janeiro, 2010.
[9] PESSOA, Marcos B., "Geração e execução automática de scripts de teste para aplicações web a partir de casos de uso direcionados por comportamento", Dissertação de Mestrado, PUC-Rio, Rio de Janeiro, 2011.
[10] KASSEL, Neil W., "An approach to automate test case generation from structured use cases", Tese de Doutorado, Clemson University, 2006.
[11] JIANG, Mingyue, DING, Zuohua, "Automation of test case generation from textual use cases", Zhejiang Sci-Tech University, Hangzhou, China, 2011.
[12] BERTOLINI, Cristiano, MOTA, Alexandre, "A framework for GUI testing based on use case design", Universidade Federal de Pernambuco, Recife, Brazil, 2010.
[13] HASSAN, Hesham A., YOUSIF, Zahraa E., "Generating test cases for platform independent model by using use case model", International Journal of Engineering Science and Technology, vol. 2, 2010.
[14] KAHN, Arthur B., "Topological sorting of large networks", Communications of the ACM, vol. 5, 1962, pp. 558–562.
[15] MYERS, Glenford J., "The art of software testing", John Wiley & Sons, New York, 1979.
[16] BRITISH STANDARD 7925-1, "Software testing: vocabulary", 1998.
[17] BINDER, Robert V., "Testing object-oriented systems: models, patterns and tools", Addison-Wesley, 2000.
Visualization, Analysis, and Testing of Java and
AspectJ Programs with Multi-Level System Graphs
Otávio Augusto Lazzarini Lemos∗ , Felipe Capodifoglio Zanichelli∗ , Robson Rigatto∗ ,
Fabiano Ferrari† , and Sudipto Ghosh‡
∗ Science and Technology Department – Federal University of São Paulo at S. J. dos Campos – Brazil
{otavio.lemos, felipe.zanichelli, robson.rigatto}@unifesp.br
† Computing Department – Federal University of Sao Carlos – Brazil
[email protected]
‡ Department of Computer Science, Colorado State University, Fort Collins, CO, USA
[email protected]
Abstract—Several software development techniques involve the
generation of graph-based representations of a program created
via static analysis. Some tasks, such as integration testing, require
the creation of models that represent several parts of the system,
and not just a single component or unit (e.g., unit testing).
Besides being a basis for testing and other analysis techniques,
an interesting feature of these models is that they can be used
for visual navigation and understanding of the software system.
However, the generation of such models – henceforth called
system graphs – is usually costly, because it involves the reverse
engineering and analysis of the whole system, many times done
upfront. A possible solution for such a problem is to generate the
graph on demand, that is, to postpone detailed analyses to when
the user really needs it. The main idea is to start from the package
structure of the system, representing dependencies at a high level,
and to make control flow and other detailed analysis interactively
and on demand. In this paper we use this idea to define a model
for the visualization, analysis, and structural testing of object-oriented (OO) and aspect-oriented (AO) programs. The model is
called Multi-Level System Graph (MLSG), and is implemented in
a prototype tool based on Java and AspectJ named SysGraph4AJ
(for Multi-Level System Graphs for AspectJ). To evaluate the
applicability of SysGraph4AJ with respect to performance, we
performed a study with three AspectJ programs, comparing
SysGraph4AJ with a similar tool. Results indicate the feasibility
of the approach, and its potential in helping developers better
understand and test OO and AO Java programs. In particular,
SysGraph4AJ performed around an order of magnitude faster
than the other tool.
I. INTRODUCTION
Several software engineering tasks require the representation of source code in models suitable for analysis, visualization, and testing [1]. For instance, control flow graphs can be
used for structural testing [2], and call graphs can be used
for compiler optimization [3]. Some techniques require the
generation of graphs that represent the structure of multiple
modules or whole systems. For instance, structural integration
testing may require the generation of control flow graphs for
several units that interact with each other in a program [2, 4].
Most testing tools focus only on the representation of local
parts of the systems, outside their contexts, or do not even support the visualization of the underlying models. For instance,
JaBUTi, a family of tools for testing Java Object Oriented
(OO) and Aspect-Oriented (AO) programs, supports the visualization of models of single units [5], pairs of units [4], subsets
of units [2], or, at most, a cluster of units of the system [6].
On the other hand, tools like Cobertura [7] and EMMA [8],
which support branch and statement coverage analysis of
Java programs, do not support visualization of the underlying
control flow models. Being able to view these models in a
broader context is important to improve understanding of the
system as a whole, especially for testers and white-box testing
researchers and educators.
Nevertheless, the generation of such system models is
usually costly because it requires the reverse engineering and
analysis of whole systems. Such a task may affect the performance of tools, a feature very much valued by developers
nowadays. For instance, recently Eclipse, the leading Java
IDE, has been criticized for performance issues [9]. To make
the construction of these models less expensive, analyses can
be made at different levels of abstraction, on demand, and
interactively. This strategy also supports a visual navigation
of the system, where parts considered more relevant can be
targeted. The visualization itself might also help discovering
faults, since it is a form of inspection, but more visually
appealing.
The strategy of analyzing systems incrementally is also
important because studies indicate that the distribution of faults
in software systems follows the Pareto principle; that is, a small
number of modules tend to present the majority of faults [10].
In this way, it makes more sense to be able to analyze systems
in parts (but also within their contexts), providing a focused
testing activity and thus saving time.
In this paper we apply this strategy for the representation
of Java and AspectJ programs. The type of model we propose
– called Multi-Level System Graph (MLSG) – supports the
visualization, analysis, and structural testing of systems written
in those languages. Since researchers commonly argue that
Aspect-Oriented Programming (AOP) introduces uncertainty
about module interactions [11], we believe the MLSG is
particularly useful in this context.
The first level of the MLSG shows the system’s package
structure. As the user chooses which packages to explore,
classes are then analyzed and shown in a second level, and
methods, fields, pieces of advice, and pointcuts can be viewed
in a third level. From then on, the user can explore low level
control-flow and call chain graphs built from specific units
(methods or pieces of advice). At this level, dynamic coverage
analysis can also be performed, by supporting the execution of
test cases. It is important to note that the analysis itself is only
done when the user selects to explore a particular structure of
the system; that is, it is not only a matter of expanding or
collapsing the view.
To automate the visualization, analysis, and testing based on
MLSGs, we implemented a tool called SysGraph4AJ (Multi-Level System Graphs for AspectJ). The tool implements the
MLSG model and supports its visualization and navigation.
Currently the tool also supports statement, branch, and cyclomatic complexity coverage analysis at unit level within
MLSGs. We have also conducted an initial evaluation of the
tool with three large AspectJ systems. Results show evidence
of the feasibility of the tool for the analysis and testing of OO
and AO programs.
The remainder of this paper is structured as follows. Section II discusses background concepts related to the presented
approach and Section III presents the proposed approach. Section IV introduces the SysGraph4AJ tool. Section V discusses
an initial evaluation of the approach and Section VI presents
related work. Finally, Section VII presents the conclusions and
future work.
II. BACKGROUND
To help the reader understand the approach proposed in this paper, we briefly introduce key concepts related to software testing and AOP.
A. Software testing and White-box testing
Software testing can be defined as the execution of a program against test cases with the intent of revealing faults [12].
The different testing techniques are defined based on the
artifact used to derive test cases. Two of the most well-known techniques are functional – or black-box – testing, and
structural – or white-box – testing. Black-box testing derives
test cases from the specification or description of a program,
while white-box testing derives test cases from the internal
representation of a program [12]. Some of the well-known
structural testing criteria are statement, branch, or def-use [13]
coverage, which require that all commands, decisions, or pairs
of assignment and use locations of a variable be covered by
test cases. In this paper we consider method and advice (see
next section) as the smallest units to be tested, i.e., the units
targeted by unit testing.
In white-box testing, the control-flow graph (CFG) is used
to represent the flow of control of a program, where nodes
represent a block of statements executed sequentially, and
edges represent the flow of control from one block to another [13]. White-box testing is usually supported by tools,
as manually deriving CFGs and applying testing criteria is
unreliable and uneconomical [14]. However, most open and
professional coverage analysis tools do not support the visualization of the CFGs (such as Cobertura, commented in
Section I, Emma1 , and Clover2 ). Some prototypical academic
tools do support the visualization of CFG, but mostly separated
from its context (i.e., users select to view the model from
code, and only the CFG for that unit is shown, apart from
any other representation). For instance, JaBUTi supports the
visualization of CFGs or CFG clusters apart from the system
overall structure.
In this paper we propose a multi-level model for visualization, analysis, and testing of OO and AO programs that is
built interactively. CFGs are shown within the larger model,
so testing can be done with a broader understanding of the
whole system.
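As an illustration (a sketch of our own, not the code of any of the tools mentioned), a CFG can be represented as blocks linked by flow-of-control edges, with statement coverage computed from the blocks reached by the test cases:

import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class ControlFlowGraph {
    static final class Block {
        final int id;
        final Set<Block> successors = new LinkedHashSet<>(); // flow-of-control edges
        Block(int id) { this.id = id; }
    }

    final Set<Block> blocks = new LinkedHashSet<>();

    /** All-nodes (statement) coverage: fraction of blocks exercised by the executed test cases. */
    double statementCoverage(Set<Integer> executedBlockIds) {
        if (blocks.isEmpty()) return 1.0;
        Set<Integer> covered = new HashSet<>();
        for (Block block : blocks) {
            if (executedBlockIds.contains(block.id)) covered.add(block.id);
        }
        return (double) covered.size() / blocks.size();
    }
}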
B. AOP and AspectJ
AOP supports the implementation of separate modules
called aspects that contribute to the implementation of other
modules of the system. General-purpose AOP languages define
four features: (1) a join point model that describes hooks
in the program where additional behavior may be defined;
(2) a mechanism for identifying these join points; (3) modules
that encapsulate both join point specifications and behavior
enhancement; and (4) a weaving process to combine both base
code and aspects [15].
AspectJ [16], the most popular AOP language to date, is
an extension of the Java language to support general-purpose
AOP. In AspectJ, aspects are modules that combine join
point specifications – pointcuts or, more precisely, pointcut
designators (PCDs3 ); pieces of advice, which implement the
desired behavior to be added at join points; and regular OO
structures such as methods and fields.
Advice can be executed before, after, or around join points
selected by the corresponding pointcut, and are implemented
as method-like constructs. Advice can also pick context information from the join point that caused them to execute.
Aspects can also declare members – fields and methods – to be
owned by other types, i.e., inter-type declarations. AspectJ also
supports declarations of warnings and errors that arise when
certain join points are identified at compile time, or reached
at execution.
Consider the partial AspectJ implementation of an Online
Music Service shown in Figure 1, where songs from a database
can be played and have their information shown to the user
(adapted from an example presented by Bodkin and Laddad
[17]). Each user has an account with credit to access songs
available on the database. At a certain price, given for each
song, users can play songs. The song price is debited from
the user account whenever the user plays it. Reading lyrics
of a song should be available to users at no charge. If a user tries to play a song without enough credits, the system yields an adequate failure. The system also manages the user access sessions. In particular, Figure 1 shows the Song class that represents songs that can be played; the BillingPolicy aspect, that implements the billing policy of the system; and the AccountSuspension aspect, which implements the account suspension behavior of the system. Note that the after returning advice of the BillingPolicy aspect and the before advice of the AccountSuspension aspect affect the execution of some of the system's methods, according to the topLevelUseTitle pointcut.
1 http://emma.sourceforge.net/ - 01/14/2013.
2 http://www.atlassian.com/software/clover/overview - 01/14/2013.
3 A pointcut is the set of selected join points itself and the PCD is usually a language construct that defines pointcuts. For simplicity, we use these terms interchangeably.
public class Song implements Playable {
private String name;
public Song(String name) { ... }
public String getName() { ... }
public void play() { ... }
public void showLyrics() { ... }
public boolean equals(Object o) { ... }
public int hashCode() { ... }
}
public aspect BillingPolicy {
public pointcut useTitle() :
execution(* Playable.play(..)) ||
execution(* Song.showLyrics(..));
public pointcut topLevelUseTitle():
useTitle() && !cflowbelow(useTitle());
after(Playable playable) returning :
topLevelUseTitle() && this(playable) {
User user =
(User)Session.instance().getValue("currentUser");
int amount = playable.getName().length();
user.getAccount().bill(amount);
System.out.println("Charge: " + user + " " + amount);
}
}
public aspect AccountSuspension {
private boolean Account.suspended = false;
public boolean Account.isSuspended() { ... }
after(Account account) returning: set(int Account.owed)
&& this(account) { ... }
before() : BillingPolicy.topLevelUseTitle() {
User user =
(User) Session.instance().getValue("currentUser");
if (user.getAccount().isSuspended()) {
throw new IllegalArgumentException();
}
}
}
Fig. 1.
Partial source code of the Online Music Service [17].
In the AO code example presented in Figure 1 there are
two faults. The first is related to the useTitle pointcut,
which selects the execution of the showLyrics method as a
join point. This makes the user be charged when accessing
songs’ lyrics. However, according to the specification of the
program, reading lyrics should not be charged. As commented
in Section I, AOP tends to introduce uncertainty about module
interactions, when only the code is inspected. In particular, for
this example scenario, the explicit visualization of where the
pieces of advice affect the system is important to discover the
fault4 . We believe the model we propose would help the tester
in such a task.
The second fault present in the AO code
example in Figure 1 is the throwing of an
inadequate exception in AccountSuspension’s before
advice
–
IllegalArgumentException
instead
of
AccountSuspensionException. Such fault would not be
easily spotted via inspection: only by executing the true part
of the branch, would the fault most probably be revealed. For
this case, structural testing using branch coverage analysis
would be more adequate than the system’s inspection5 .
Based on this motivating example, we believe the interplay
between visualization and structural testing is a promising
approach, specially for AO programs. Therefore, in this section
we define a model to support such approach, also keeping in
mind the performance issue when dealing with large models.
The incremental strategy is intended to keep an adequate
performance while deriving the model.
A. The Multi-Level System Graph
user tries to play a song without enough credits, the system
yields an adequate failure. The system also manages the user
access sessions. In particular, Figure 1 shows the Song class
that represents songs that can be played; the BillingPolicy
aspect, that implements the billing policy of the system; and
the AccountSuspension aspect, which implements the account
suspension behavior of the system. Note that the after returning
advice of the BillingPolicy aspect and the before advice of
the AccountSuspension aspect affect the execution of some
of the system’s methods, according to the topLevelUseTitle
pointcut.
III. I NCREMENTAL V ISUALIZATION , A NALYSIS , AND
T ESTING OF OO AND AO PROGRAMS
To support the visualization, analysis, and structural testing
of OO and AO programs, an interesting approach is to derive
the underlying model by levels and interactively. Such a model
could then be used as a means to navigate through the system
incrementally and apply structural testing criteria to test the
program as it is analyzed. The visualization of the system
using this model would also serve itself as a visually appealing
inspection of the system’s structure. This type of inspection
could help discovering faults statically, while the structural
testing functionality would support the dynamic verification
of the system.
A. The Multi-Level System Graph
The model we propose is called Multi-Level System Graph
(MLSG), and it represents the high-level package structure of
the system all the way down to the control flow of its units.
The MLSG is a composition of known models – such as call
and control flow graphs – that nevertheless, to the best of our
knowledge, have not been combined in the way we propose
in this paper.
The MLSG can be formally defined as a directed graph MLSG = (N, E), where:
• N = P ∪ C ∪ A ∪ M ∪ Ad ∪ F ∪ Pc ∪ Fl, where:
– P is the set of package nodes, C is the set of class nodes, A is the set of aspect nodes, M is the set of method nodes, Ad is the set of advice nodes, F is the set of field nodes, Pc is the set of pointcut nodes, and Fl is the set of control flow nodes that represent blocks of code statements.
4 There are some features of modern IDEs that also help in discovering such a fault. For instance, Eclipse AJDT [18] shows where each advice affects the system. This could help the developer notice that the showLyrics method should not be affected. However, we believe the model we present makes the join points more explicit, further facilitating the inspection of such types of fault.
5 The example we present is only an illustration. It is clear that the application of graph inspections and structural testing could help reveal other types of faults. However, we believe the example serves as a motivation for the approach we propose in this paper, since it shows two instances of faults that could be found using an approach that supports both techniques.
Fig. 2. An example MLSG for the Music Online program. (The figure shows the MLSG organized into four levels (package, class/aspect, method/advice, and control flow) for the billing, main, model, and repository packages. Legend: P = package node, C = class node, A = aspect node, m = method node, f = field node, a = advice node, pc = pointcut node, fl = control flow node, + = unanalyzed node; edge types: contains, call/interception, and control flow edges.)
• E = Co ∪ Ca ∪ I ∪ Fe, where:
– Co is the set of contains edges. A contains edge (N1, N2) represents that the structure represented by node N1 contains the structure represented by node N2; Ca is the set of call edges (N1, N2), N1 ∈ M, N2 ∈ M, which represent that the method represented by N1 calls the method represented by N2; I is the set of interception edges. An interception edge (N1, N2), N1 ∈ Ad, N2 ∈ (M ∪ Ad), represents that the method or advice represented by N2 is intercepted by the advice represented by N1; Fe is the set of flow edges (N1, N2), N1 ∈ Fl, N2 ∈ Fl, which represent that control may flow from the block of code represented by N1 to the block of code represented by N2. The edges' types are defined by the types of their source and target nodes (e.g., interception edges must have advice nodes as source nodes and advice or method nodes as target nodes).
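To make the definition more concrete, the sketch below shows one possible in-memory representation of the MLSG node and edge sets. It is only an illustration written by us; it is not the model package of SysGraph4AJ, and all type and field names are assumptions.

import java.util.HashSet;
import java.util.Set;

// Node kinds: package, class, aspect, method, advice, field, pointcut, control flow.
enum NodeKind { P, C, A, M, AD, F, PC, FL }

// Edge kinds: contains, call, interception, control flow.
enum EdgeKind { CONTAINS, CALL, INTERCEPTION, FLOW }

final class Node {
    final NodeKind kind;
    final String name;
    boolean analyzed; // "+" nodes in Figure 2 are unanalyzed
    Node(NodeKind kind, String name) { this.kind = kind; this.name = name; }
}

final class Edge {
    final EdgeKind kind;
    final Node source, target;
    Edge(EdgeKind kind, Node source, Node target) {
        // The edge kind is constrained by the node kinds, as in the definition:
        // e.g., interception edges go from advice nodes to method or advice nodes.
        this.kind = kind; this.source = source; this.target = target;
    }
}

final class Mlsg {
    final Set<Node> nodes = new HashSet<>();
    final Set<Edge> edges = new HashSet<>();
}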
An example of a partial MLSG of the example AO system
discussed in Section II is presented in Figure 2. Note that
there are parts of the system that were not fully analyzed
(Unanalyzed Nodes). This is because the model shows a
scenario where the user chose to expand some of the packages, modules (classes/aspects), and units (methods/pieces of
advice), but not all. By looking at the model we can quickly
see that the BillingPolicy after returning advice and the
AccountSuspension before advice are affecting the Song’s
play and showLyrics methods. As commented before, the
interception of the showLyrics method shown by crossed
interception edges is a fault, since only the play operation
should be charged. This is an example of how the inspection of the MLSG can support discovering faults. In the
same model we can also see the control-flow graph of the
AccountSuspension before advice (note that the control flow
graphs of the other units could also be expanded once the user
selects this operation).
Coverage analysis can be performed by using the MLSG.
For instance, by executing test cases against the program, with
the model we could analyze how many of the statement blocks
or branches of the AccountSuspension before advice would
be covered. While doing this the fault present in this unit could
be uncovered.
Another type of model that is interesting for the visualization and testing of systems is the call chain graph (CCG),
obtained from the analysis of the call hierarchy (such as
done in [19]). The CCG shows the sequence of calls (and,
in our case, advice interactions as well) that happen from a
given unit. The same information is available at the MLSG;
however, the CCG shows a more vertical view of the system,
according to the call hierarchy, while the MLSG shows a more
horizontal view of the system, according to its structure, with
methods and pieces of advice at the same level. Figure 3 shows
an example CCG built from the Playlist’s play method.
Numbers indicate the order in which the methods are called
or in which control flow is passed to pieces of advice.
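Since every call and interception is already recorded in the MLSG, a CCG can be derived by traversing those edges from the selected unit. The sketch below, which reuses the illustrative types introduced earlier, shows one possible traversal; it is not the construction implemented in SysGraph4AJ, and it simplifies the ordering of advice executions.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class CallChainBuilder {
    // Collects edges reachable from a unit by following call edges forward and
    // interception edges that target the current unit (advice affecting it).
    static List<Edge> callChainFrom(Mlsg mlsg, Node start) {
        List<Edge> chain = new ArrayList<>();
        walk(mlsg, start, new LinkedHashSet<Node>(), chain);
        return chain;
    }

    private static void walk(Mlsg mlsg, Node current, Set<Node> visited, List<Edge> chain) {
        if (!visited.add(current)) {
            return; // avoid looping on recursive call structures
        }
        for (Edge e : mlsg.edges) {
            if (e.kind == EdgeKind.CALL && e.source == current) {
                chain.add(e);
                walk(mlsg, e.target, visited, chain);
            } else if (e.kind == EdgeKind.INTERCEPTION && e.target == current) {
                chain.add(e);
                walk(mlsg, e.source, visited, chain);
            }
        }
    }
}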
IV. IMPLEMENTATION: SYSGRAPH4AJ
We developed a prototype tool named SysGraph4AJ (from
Multi-Level System Graphs for AspectJ) that implements the
proposed MLSG model. The tool is a standalone application
written in Java. The main features we have kept in mind while developing SysGraph4AJ are its visualization capability (we want the models shown by the tool to be intuitive
Fig. 3. An example CCG for the Music Online program, built from the Playlist's play method. (The figure shows play calling getName and being intercepted by the before and afterReturning advice; numbers indicate the order in which control flows.)
and useful at the same time) and its performance. In particular,
we started developing SysGraph4AJ also because of some
performance issues we have observed in other testing tools
(such as JaBUTi). We believe this feature is much valued by
developers nowadays, as commented in Section I.
The architecture of the tool is divided into the following five
packages: model, analysis, gui, visualization, and graph.
The model package contains classes that represent the MLSG
constituents (each node and edge type); the analysis package
is responsible for analyzing the object code so that it can be represented by an MLSG; the gui package is responsible for
the Graphical User Interface; the visualization package is
responsible for implementing the visualization of the MLSGs;
and the graph package is responsible for the control flow graph
implementation (we decided to separate it from the model
package because this part of the MLSG is more complex than
others).
For the analysis part of the tool, we used both the Java
API (to load classes, for instance), and the Apache Byte Code
Engineering Library (BCEL6 ). BCEL was used to analyze
methods (for instance, to check their visibility, parameter and
return types) and aspects. We decided to make the analysis
using bytecode for three reasons: (1) AspectJ classes and
aspects are compiled into common Java bytecode and can
be analyzed with BCEL, so no weaving of the source code
is required (advice interactions can be more easily identified
due to implementation conventions adopted by AspectJ [20]);
(2) analysis can be made even with the absence of the source
code; and (3) in comparison with source code, the bytecode
represents more faithfully the interactions that actually occur
in the system (so the model is more realistic). The analysis
package also contains a class responsible for the coverage
analysis. Currently we are using the Java Code Coverage
Library (JaCoCo7 ) to implement class instrumentation and
statement and branch coverage analysis. We decided to use
JaCoCo because it is free and provided good performance in
our initial experiments.
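As a rough illustration of this kind of bytecode-level analysis, the sketch below uses the BCEL class file API to list the methods of a compiled class together with their visibility, return types, and number of parameters. The class file path is only an example, and the fragment is a minimal sketch of the general idea rather than the code of the analysis package.

import org.apache.bcel.classfile.ClassParser;
import org.apache.bcel.classfile.JavaClass;
import org.apache.bcel.classfile.Method;

public class BytecodeInspection {
    public static void main(String[] args) throws Exception {
        // Parse a compiled class produced by the AspectJ compiler (path is illustrative).
        JavaClass clazz = new ClassParser("bin/billing/BillingPolicy.class").parse();
        for (Method method : clazz.getMethods()) {
            System.out.printf("%s %s %s (%d args)%n",
                    method.isPublic() ? "public" : "non-public",
                    method.getReturnType(),
                    method.getName(),
                    method.getArgumentTypes().length);
        }
    }
}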
For the visualization functionality, we used the Java Universal Network/Graph Framework (JUNG8 ). We decided to
use JUNG because it is open source, provides adequate documentation and is straightforward to use. Moreover, we made
performance tests with JUNG and other graph libraries and
noted that JUNG was the fastest. The visualization package
is the one that actually uses the JUNG API. This package is
6 http://commons.apache.org/bcel/ - 01/15/2013.
7 http://www.eclemma.org/jacoco/trunk/index.html - 01/30/2013.
8 http://jung.sourceforge.net/ - 01/31/2013.
responsible for converting the underlying model represented in
our system – built using the analysis package – to a graph in
JUNG representation. This package also implements the creation of the panel that will show the graphical representation
of the graphs, besides coloring and mouse functionalities. To
convert our model to a JUNG graph we begin by laying it
out as a tree, where the root is the “bin” folder of the project
being analyzed, and the methods, pieces of advice, fields and
pointcuts are the leaves. This strategy makes the graph present
the layered structure that we desire, such as in the example
presented in Figure 2. Since the model is initially a tree, there
are no call, interception, or flow edges, only contains edges.
Information about the additional edges is stored in a table, and
later added to the model.
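The conversion to JUNG essentially mirrors the MLSG structure. A minimal sketch of that idea, using the JUNG 2 graph API and the illustrative types from the earlier sketch, could look as follows; the actual visualization package also handles layout, coloring, and mouse interaction, which are omitted here.

import edu.uci.ics.jung.graph.DirectedSparseGraph;

final class JungConversion {
    // Copies every MLSG node and edge into a JUNG directed graph.
    static DirectedSparseGraph<Node, Edge> toJungGraph(Mlsg mlsg) {
        DirectedSparseGraph<Node, Edge> graph = new DirectedSparseGraph<>();
        for (Node node : mlsg.nodes) {
            graph.addVertex(node);
        }
        // Contains edges give the initial tree; call, interception, and flow
        // edges are the "additional edges" added afterwards.
        for (Edge edge : mlsg.edges) {
            graph.addEdge(edge, edge.source, edge.target);
        }
        return graph;
    }
}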
The visualization package also contains a class to construct CCGs from MLSGs. The construction of the CCG is
straightforward because all call and interception information
is available at the MLSG. Control flow graphs are implemented
inside the graph package. It contains a model subpackage that
implements the underlying model, and classes to generate the
graphs from a unit’s bytecode.
A. Tool’s Usage
When the user starts SysGraph4AJ, he must first decide
which Java project to analyze. Currently this is done by
selecting a “bin” directory which contains all bytecode classes
of the system. Initially the tool shows the root package (which
represents the “bin” directory) and all system packages. Each
package is analyzed until a class or aspect is found (that is, if
there are packages and subpackages with no classes, the lower
level packages are analyzed until a module is reached). From
there on, the user can double-click on the desired classes or
aspects to see their structure constituents (methods, pieces of
advice, fields, and pointcuts).
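For illustration, the sketch below shows one straightforward way a "bin" directory could be scanned and its compiled classes loaded for later analysis. It conveys only the general idea described above; the directory layout is an assumption, and SysGraph4AJ may implement this step differently.

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.net.URLClassLoader;
import java.util.stream.Stream;

public class BinDirectoryScanner {
    public static void main(String[] args) throws Exception {
        Path bin = Paths.get("bin"); // the directory selected by the user
        try (URLClassLoader loader = new URLClassLoader(new java.net.URL[] { bin.toUri().toURL() });
             Stream<Path> files = Files.walk(bin)) {
            files.filter(p -> p.toString().endsWith(".class"))
                 .forEach(p -> {
                     // Turn bin/foo/Bar.class into the binary name foo.Bar.
                     String name = bin.relativize(p).toString()
                             .replace(File.separatorChar, '.')
                             .replaceAll("\\.class$", "");
                     try {
                         Class<?> c = loader.loadClass(name);
                         System.out.println("Loaded " + c.getName());
                     } catch (Throwable t) {
                         System.err.println("Could not load " + name + ": " + t);
                     }
                 });
        }
    }
}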
When a user double-clicks a method, dependence analysis
is performed, and call and interception edges are added to
the MLSG, according to the system's interactions. Figure 4
shows a screenshot of SysGraph4AJ with a MLSG similar to
the one presented in Figure 2, for the same example system.
In SysGraph4AJ, we decided to use colors instead of letter
labels to differentiate node types. We can see, for instance, that
aspects are represented by pink nodes, methods are represented
by light blue nodes, and pieces of advice are represented
by white nodes. Contains edges are represented by dashed
lines, call/interception edges are represented by dashed lines
with larger dashes, and control-flow edges are represented
by solid lines. Within CFGs, yellow nodes represent regular
blocks while black nodes represent blocks that contain return
statements (exit nodes).
Control flow graphs can be shown by left-clicking on a
unit and choosing the “View control flow graph” option. Call
chain graphs can also be shown by choosing the “View call
chain” option, or the “View recursive call chain”. The second
option does the analysis recursively until the lowest level
calls and advice interceptions, and shows the corresponding
graph. The first option shows only the units that were already
Fig. 4. A screenshot of the SysGraph4AJ tool.
analyzed in the corresponding MLSG. The CCG is shown in
a separate window, as its view is essentially different from
that of the MLSG (as commented in Section III). Figure 5
shows a screenshot of SysGraph4AJ with a CCG similar to
the one presented in Figure 3. The sequence of calls skips
number 2 because it refers to a library or synthetic method
call. We decided to exclude those to provide a leaner model
containing only interactions within the system. Such calls are
not represented in the MLSG either.
Fig. 5. An example CCG for the Music Online program, built from the Playlist's play method, generated with SysGraph4AJ.
Code coverage is performed by importing JUnit test cases.
Currently, this is done via a menu option on SysGraph4AJ’s
main window menu bar. The user chooses which JUnit test
class to execute, and the tool automatically runs the test cases
and calculates instruction and branch coverage. Coverage is
shown in a separate window, but we are currently implementing the visualization of coverage on the MLSG itself. For
instance, when a JUnit test class is run, we want to show
on the MLSG which classes were covered and the coverage
percentages close to the corresponding class and units.
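Running an imported JUnit test class programmatically, as described above, can be done with the standard JUnit 4 runner API. The sketch below shows the basic idea; the coverage bookkeeping performed with JaCoCo is omitted, and the test class name simply reuses the earlier illustrative test.

import org.junit.runner.JUnitCore;
import org.junit.runner.Result;
import org.junit.runner.notification.Failure;

public class TestRunnerExample {
    public static void main(String[] args) {
        // Execute the user-selected JUnit test class against the instrumented code.
        Result result = JUnitCore.runClasses(AccountSuspensionTest.class);
        for (Failure failure : result.getFailures()) {
            System.out.println(failure.toString());
        }
        System.out.printf("Ran %d tests, %d failures, %d ms%n",
                result.getRunCount(), result.getFailureCount(), result.getRunTime());
    }
}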
V. EVALUATION
As an initial evaluation of the SysGraph4AJ tool, we
decided to measure its performance while analyzing realistic
software systems. It is important to evaluate performance
because this was one of our main expected quality attributes
while developing the tool.
For this study, we selected three medium-sized AspectJ
applications. The first is an AspectJ version of a Java-based
object-relational data mapping framework called iBATIS9 .
The second system is an aspect-oriented version of HealthWatcher [21], a typical Java web-based information system.
The third target application is a software product line for
mobile devices, called MobileMedia [22]. The three systems
were used in several evaluation studies [2, 11, 23, 24]. To
have an idea of size, the iBATIS version used in our study
has approximately 15 KLOC within 220 modules (classes
and aspects) and 1330 units (methods and pieces of advice);
HealthWatcher, 5 KLOC within 115 modules and 512 units;
and MobileMedia, 3 KLOC within 60 modules and 280 units.
Besides measuring the performance of our tool by itself
while analyzing the three systems, we also wanted to have
a basis for comparison. In our case, we believe the JaBUTi
tool [5] is the most similar in functionality to SysGraph4AJ. In
particular, it also applies control-flow testing criteria and supports the visualization of CFGs (although not interactively and
within a multi-level system graph such as in SysGraph4AJ).
Moreover, JaBUTi is an open source tool, so we have access to
its code. Therefore, we also evaluated JaBUTi while making
similar analysis of the same systems and selected modules.
Since JaBUTi also performs data-flow analysis, to make a fair
comparison, we removed this functionality from the tool
9 http://ibatis.apache.org/ - 02/05/2013.
TABLE I
RESULTS OF THE EXPLORATORY EVALUATION (TIMES IN MS).

System          Class – Method/Advice                                        SG4AJ – u   SG4AJ – C   JBT – C
iBATIS          DynamicTagHandler - doStartFragment                               24          25        150
                ScriptRunner - runScript                                          22          53        895
                BeanProbe - getReadablePropertyNames                              24          56        766
HealthWatcher   HWServerDistribution - around execution(...)                     109         113        489
                HWTimestamp - updateTimestamp                                     63         155        711
                SearchComplaintData - executeCommand                              55          56        395
MobileMedia     MediaUtil - readMediaAsByteArray                                  36          52        216
                UnavailablePhotoAlbumException - getCause                         18          19        139
                PersisteFavoritesAspect - around getBytesFromImageInfo(...)       19          20        324
Avg.                                                                            41.12       61.00     453.89

Legend: SG4AJ – SysGraph4AJ; JBT – JaBUTi; u – analysis of a single unit; C – analysis of whole classes.
before running the experiment. The null hypothesis H1_0 of our experiment is that there is no difference in performance between SysGraph4AJ and JaBUTi while performing analyses of methods; and the alternative hypothesis H1_a is that
SysGraph4AJ presents better performance than JaBUTi while
performing analyses of methods.
We randomly selected three units (method/advice) inside
three modules (class/aspect) of each of the target systems.
The idea was to simulate a situation where the tester would
choose a single unit to be tested. We made sure that the
analyzed structures were among the largest. The time taken
to analyze and instrument each unit – including generating
its CFG – was measured in milliseconds (ms). Since the
generation of the model in SysGraph4AJ is interactive, to
measure only the analysis and instrumentation performance,
we registered the time taken by the specific operations that
perform these functions (i.e., we recorded the time before and
after each operation and summarized the results). With respect
to JaBUTi, in order to test a method with this tool, the class
that contains the method must be selected. When this is done,
all methods contained in the class are analyzed. Therefore,
in this evaluation we were in fact comparing the interactive
analysis strategy adopted by SysGraph4AJ against the upfront
analysis strategy adopted by JaBUTi. Table I shows the results
of our study.
Note that the analysis of methods and pieces of advice in
SysGraph4AJ (column SG4AJ – u) is very fast (it takes much
less than a second, on average, to analyze and instrument each
unit and generate the corresponding model). In JaBUTi, on the
other hand, the performance is around 10 times worse. This is
somehow expected, since JaBUTi analyzes all methods within
the class. To check whether the difference was statistically significant, we ran a Wilcoxon/Mann-Whitney paired test, since
the observations did not seem to follow a normal distribution
(according to a Shapiro-Wilk normality test). The statistical
test revealed a significant difference at 95% confidence level
(p-value = 0.0004094; with Bonferroni correction since we are
performing two tests with the same data).
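For readers who want to reproduce this kind of comparison, the sketch below shows how a paired nonparametric test could be run over the per-module timings using Apache Commons Math. The arrays hold the SG4AJ – C and JBT – C values from Table I; the library choice and the explicit Bonferroni adjustment are ours and not necessarily what was used in the study.

import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

public class TimingComparison {
    public static void main(String[] args) {
        // Per-module analysis times in ms (SG4AJ – C vs. JBT – C, Table I).
        double[] sysGraph4aj = { 25, 53, 56, 113, 155, 56, 52, 19, 20 };
        double[] jabuti = { 150, 895, 766, 489, 711, 395, 216, 139, 324 };

        WilcoxonSignedRankTest test = new WilcoxonSignedRankTest();
        double pValue = test.wilcoxonSignedRankTest(sysGraph4aj, jabuti, true);

        // Bonferroni correction for two tests run on the same data.
        System.out.println("corrected p-value: " + Math.min(1.0, 2 * pValue));
    }
}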
Note that this evaluation is in fact measuring how much faster analyzing methods interactively is than analyzing them upfront. To
compare the performance of SysGraph4AJ with JaBUTi only
with respect to their instrumentation and analysis methods,
regardless of the adopted strategy, we also measured how long
SysGraph4AJ took to analyze all methods within the target
classes. Column SG4AJ – C of Table I shows these observations. The difference was again statistically significant (p-value
= 0.0007874, with Bonferroni correction). Both statistical tests
support the alternative hypothesis that SysGraph4AJ is faster
than JaBUTi in the analyses of methods.
Although SysGraph4AJ appears to perform better than
JaBUTi while analyzing and instrumenting modules and units,
the figures shown even for JaBUTi are still small (i.e., waiting
half a second for an analysis to be done might not affect the
user’s experience). However, we must note that these figures
are for the analysis of single modules, and therefore only a
part of the startup operations performed by JaBUTi. To have
an idea of how long it would take for JaBUTi to analyze
a whole system, we can estimate the total time to analyze
iBATIS, the largest system in our sample. Note that even if
the tester was interested in testing a single method from each
class, the whole system would have to be analyzed, because
of the strategy adopted by the tool.
iBATIS contains around 220 modules (classes + aspects).
Therefore, we could multiply the number of modules by the
average time JaBUTi took to analyze the nine target modules
(453.89 ms). This amounts to 99,855.80 ms, more than 1.5 minutes. Having to wait more than a minute and a half
before starting to use the tool might annoy users, considering,
for instance, the recent performance critiques the Eclipse IDE
has received [9]. Also note that iBATIS is a medium-sized
system, for larger systems the startup time could be even
greater.
It is important to note, however, that JaBUTi implements
many other features that are not addressed by SysGraph4AJ
(e.g., data flow-based testing and metrics gathering). This
might also explain why it performs worse than SysGraph4AJ:
the design goals of the developers were broader than ours.
Moreover, we believe that the use of consolidated libraries in
the implementation of core functions of SysGraph4AJ, such as
BCEL for analysis, JaCoCo for instrumentation and coverage
analysis, and JUNG for visualization, helped improve its
performance. JaBUTi’s implementation also relies on some
libraries (such as BCEL), but many other parts of the tool were
implemented by the original developers themselves (such as
class instrumentation), which might explain in part its weaker
performance (i.e., it is hard to be an expert in the implementation of a diversity of features, and preserve their good
performance, especially when the developers are academics
more interested in proof-of-concept prototypes).
With respect to the coverage analysis performance, we
have not yet been able to measure it for the target systems,
because they require a configured environment unavailable at
the time. However, since the observed execution time overhead
for instrumented applications with JaCoCo is typically less
than 10% [25], we believe the coverage analysis performance
will also be adequate. In any case, to have an initial idea of
performance for the coverage analysis part of SysGraph4AJ,
we recorded the time taken to execute 12 test cases against
the Music Online example application shown in Section II. It
took only 71 ms to run the tests and collect coverage analysis
information for the BillingPolicy aspect.
VI. RELATED WORK
Research related to this work addresses: (i) challenges and approaches for testing AO programs, with a focus on structural testing; and (ii) tools to support structural testing of Java and AspectJ programs. Both categories are described next.
A. Challenges and Approaches for AO Testing
Alexander et al. [26] were the first to discuss the challenges
for testing AO programs. They described potential sources
of faults and fault types which are directly related to the
control and data flow of such programs (e.g. incorrect aspect
precedence, incorrect focus on control flow, and incorrect
changes in control dependencies). Ever since, several refinements and amendments to Alexander et al.’s taxonomy have
been proposed [24, 27, 28, 29].
To deal with challenges such as the ones described by
Alexander et al., researchers have been investigating varied testing approaches. With respect to structural testing,
Lemos et al. [30] devised a graph-based representation, named
AODU, which includes crosscutting nodes, that is, nodes that
represent control flow and data flow information about the
advised join points in the base code. Evolutions of Lemos
et al.'s work comprise the integration of unit graphs to support
pairwise [4], pointcut-based [2] and multi-level integration
testing of Java and AspectJ programs [6]. These approaches
are supported by the JaBUTi family of tools, which is described in the next section. The main difference between the
AODU graph and the MLSG introduced in this paper is that the
former is constructed only for a single unit and is displayed out
of its context. The MLSG shows the CFGs of units within the
system's context. On the other hand, the AODU contains data-flow information, which is not yet present in our approach.
Other structural testing approaches for AO programs at
the integration level can also be found in the literature. For
example, Zhao [31] proposes integrated graphs that include
particular groups of communicating class methods and aspect
advices. Another example is the approach of Wedyan and
Ghosh [32], who represent a whole AO system through a data
flow graph, named ICFG, upon which test requirements are
derived. Our approach differs from Zhao’s and Wedyan and
Ghosh’s approaches again with respect to the broader system
context in which the MLSG is built. Moreover, to the best
of our knowledge, none of the described related approaches
were implemented.
B. Tools
JaBUTi [5] was developed to support unit-level, control flow
and data flow-based testing of Java programs. The tool is capable of deriving and tracking the coverage of test requirements
for single units (i.e., class methods) based on bytecode-level
instrumentation. In subsequent versions, JaBUTi was extended
to support the testing of AO programs written in AspectJ at
the unit [30] and integration levels [2, 4, 6].
The main difference between the JaBUTi tools and SysGraph4AJ (described in this paper) lies in the flexibility
the latter offers to the user. SysGraph4AJ enables the user to
expand the program graph up to a full view of the system in
terms of packages, modules, units and internal unit CFGs. That
is, SysGraph4AJ provides a comprehensive representation of
the system, and derives test requirements from this overall
representation. JaBUTi members, on the other hand, provide
more restricted views of the system structure. For example,
the latest JaBUTi/AJ version automates multi-level integration
testing [6]. In this case, the tester selects a unit from a
preselected set of modules, then an integrated CFG is built
up to a prespecified depth level. Once the CFG is built, the
test requirements are derived and the tester cannot modify the
set of modules and units under testing.
DCT-AJ [32] is another recent tool that automates data
flow-based criteria for Java and AspectJ programs. Differently
from JaBUTi/AJ and SysGraph4AJ, DCT-AJ builds an integrated CFG to represent all interacting system modules, which
however is only used as the underlying model to derive test
requirements. That is, the CFG created by DCT-AJ cannot be
visualized by the user.
Other open and professional coverage analysis tools such
as Cobertura [7], EMMA [8] and Clover [33] do not support
the visualization of the CFGs. They automate control flow-based criteria like statement and branch coverage, and create
highlighted code coverage views the user can browse through.
Finally, widely used IDEs such as Eclipse10 and NetBeans11
offer facilities related to method call hierarchy browsing. This
enables the programmer to visualize method call chains in a
tree-based representation that can be expanded or collapsed
through mouse clicks. However, these IDEs neither provide
native coverage analysis nor a program graph representation
as rich in detail as the MLSG model.
10 http://www.eclipse.org/ - 17/04/2013.
11 http://netbeans.org/ - 17/04/2013.
VII. CONCLUSIONS
In this paper we have presented an approach for visualization, analysis, and structural testing of Java and AspectJ
programs. We have defined a model called Multi-Level System
Graph (MLSG) that represents the structure of a system, and
can be constructed interactively. We have also implemented
the proposed approach in a tool, and provided initial evidence
of its good performance.
Currently, the tool supports visualization of the system’s
structure and structural testing at unit level. However, we
intend to make SysGraph4AJ a basic framework for implementing other structural testing approaches, such as integration
testing.
Since, in general, most professional developers do not have
time to invest in understanding whole systems with the type
of approach presented in this paper, we believe the MLSG can
be especially useful for testers at this moment. However, we
also believe that if the MLSG could be seamlessly integrated
with development environments, the approach would also be
interesting for other types of users. For instance, by providing
direct links from the MLSG nodes to the source code of the
related structures, users could navigate through the system and
also easily edit its code. In the future we intend to extend our
tool to provide such type of functionality.
ACKNOWLEDGEMENTS
The authors would like to thank FAPESP (Otavio Lemos, grant 2010/15540-2) for financial support.
REFERENCES
[1] E. S. F. Najumudheen, R. Mall, and D. Samanta, “A
dependence representation for coverage testing of objectoriented programs.” Journal of Object Technology, vol. 9,
no. 4, pp. 1–23, 2010.
[2] O. A. L. Lemos and P. C. Masiero, “A pointcutbased coverage analysis approach for aspect-oriented
programs,” Inf. Sci., vol. 181, no. 13, pp. 2721–2746,
Jul. 2011.
[3] D. Grove, G. DeFouw, J. Dean, and C. Chambers,
“Call graph construction in object-oriented languages,”
in Proc. of the 12th ACM SIGPLAN conference on
Object-oriented programming, systems, languages, and
applications, ser. OOPSLA ’97. New York, NY, USA:
ACM, 1997, pp. 108–124.
[4] O. A. L. Lemos, I. G. Franchin, and P. C. Masiero, “Integration testing of object-oriented and aspect-oriented
programs: A structural pairwise approach for java,” Sci.
Comput. Program., vol. 74, no. 10, pp. 861–878, Aug.
2009.
[5] A. M. R. Vincenzi, J. C. Maldonado, W. E. Wong, and
M. E. Delamaro, “Coverage testing of java programs
and components,” Science of Computer Programming,
vol. 56, no. 1-2, pp. 211–230, Apr. 2005.
[6] B. B. P. Cafeo and P. C. Masiero, “Contextual integration
testing of object-oriented and aspect-oriented programs:
A structural approach for java and aspectj,” in Proc. of
the 2011 25th Brazilian Symposium on Software Engineering, ser. SBES ’11. Washington, DC, USA: IEEE
Computer Society, 2011, pp. 214–223.
[7] M. Doliner, "Cobertura tool website," Online, 2006, http://cobertura.sourceforge.net/index.html - last accessed on 06/02/2013.
[8] V. Roubtsov, "EMMA: A free Java code coverage tool," Online, 2005, http://emma.sourceforge.net/ - last accessed on 06/02/2013.
[9] The H Open, "Weak performance of Eclipse 4.2 criticised," Online, 2013, http://www.h-online.com/open/news/item/Weak-performance-of-Eclipse-4-2-criticised-1702921.html - last accessed on 19/04/2013.
[10] C. Andersson and P. Runeson, "A replicated quantitative analysis of fault distributions in complex software systems," IEEE Trans. Softw. Eng., vol. 33, no. 5, pp. 273–286, May 2007.
[11] F. Ferrari, R. Burrows, O. Lemos, A. Garcia, E. Figueiredo, N. Cacho, F. Lopes, N. Temudo, L. Silva, S. Soares, A. Rashid, P. Masiero, T. Batista, and J. Maldonado, "An exploratory study of fault-proneness in evolving aspect-oriented programs," in Proc. of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ser. ICSE '10. New York, NY, USA: ACM, 2010, pp. 65–74.
[12] G. J. Myers, C. Sandler, T. Badgett, and T. M. Thomas, The Art of Software Testing, 2nd ed. John Wiley & Sons, 2004.
[13] S. Rapps and E. J. Weyuker, "Selecting software test data using data flow information," IEEE Trans. Softw. Eng., vol. 11, no. 4, pp. 367–375, 1985.
[14] IEEE, "IEEE standard for software unit testing," Institute of Electrical and Electronics Engineers, Standard 1008-1987, 1987.
[15] T. Elrad, R. E. Filman, and A. Bader, "Aspect-oriented programming: Introduction," Communications of the ACM, vol. 44, no. 10, pp. 29–32, 2001.
[16] G. Kiczales, J. Irwin, J. Lamping, J.-M. Loingtier, C. Lopes, C. Maeda, and A. Menhdhekar, "Aspect-oriented programming," in Proc. of the European Conference on Object-Oriented Programming, M. Akşit and S. Matsuoka, Eds., vol. 1241. Berlin, Heidelberg, and New York: Springer-Verlag, 1997, pp. 220–242.
[17] R. Bodkin and R. Laddad, "Enterprise AspectJ tutorial using Eclipse," Online, 2005, EclipseCon 2005. Available from: http://www.eclipsecon.org/2005/presentations/EclipseCon2005 EnterpriseAspectJTutorial9.pdf (accessed 12/3/2007).
[18] The Eclipse Foundation, "AJDT: AspectJ Development Tools," Online, 2013, http://www.eclipse.org/ajdt/ - last accessed on 19/04/2013.
[19] A. Rountev, S. Kagan, and M. Gibas, "Static and dynamic analysis of call chains in Java," in Proc. of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA '04. New York, NY, USA: ACM, 2004, pp. 1–11.
[20] E. Hilsdale and J. Hugunin, “Advice weaving in aspectj,”
in Proceedings of the 3rd international conference on
Aspect-oriented software development, ser. AOSD ’04.
New York, NY, USA: ACM, 2004, pp. 26–35.
[21] P. Greenwood, T. Bartolomei, E. Figueiredo, M. Dosea,
A. Garcia, N. Cacho, C. Sant’Anna, S. Soares, P. Borba,
U. Kulesza, and A. Rashid, “On the impact of aspectual
decompositions on design stability: an empirical study,”
in Proc. of the 21st European conference on Object-Oriented Programming, ser. ECOOP'07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 176–200.
[22] E. Figueiredo, N. Cacho, C. Sant’Anna, M. Monteiro,
U. Kulesza, A. Garcia, S. Soares, F. Ferrari, S. Khan,
F. Castor Filho, and F. Dantas, “Evolving software product lines with aspects: an empirical study on design
stability,” in Proc. of the 30th international conference
on Software engineering, ser. ICSE ’08. New York, NY,
USA: ACM, 2008, pp. 261–270.
[23] F. C. Filho, N. Cacho, E. Figueiredo, R. Maranhão,
A. Garcia, and C. M. F. Rubira, “Exceptions and aspects:
the devil is in the details,” in Proc. of the 14th ACM
SIGSOFT international symposium on Foundations of
software engineering, ser. SIGSOFT ’06/FSE-14. New
York, NY, USA: ACM, 2006, pp. 152–162.
[24] F. C. Ferrari, J. C. Maldonado, and A. Rashid, “Mutation testing for aspect-oriented programs,” in Proc. of
the 2008 International Conference on Software Testing,
Verification, and Validation, ser. ICST ’08. Washington,
DC, USA: IEEE Computer Society, 2008, pp. 52–61.
[25] Mountainminds GmbH & Co. KG and Contributors,
“Control flow analysis for java methods,” Online,
2013, available from: http://www.eclemma.org/jacoco/
trunk/doc/flow.html (accessed 02/04/2013).
[26] R. T. Alexander, J. M. Bieman, and A. A. Andrews,
“Towards the systematic testing of aspect-oriented programs,” Dept. of Computer Science, Colorado State University, Tech. Report CS-04-105, 2004.
[27] M. Ceccato, P. Tonella, and F. Ricca, “Is AOP code easier
or harder to test than OOP code?” in Proceedings of
the 1st Workshop on Testing Aspect Oriented Programs
(WTAOP) - held in conjunction with AOSD, Chicago/IL
- USA, 2005.
[28] A. van Deursen, M. Marin, and L. Moonen, “A systematic aspect-oriented refactoring and testing strategy,
and its application to JHotDraw,” Stichting Centrum
voor Wiskunde en Informatica, Tech. Report SEN-R0507,
2005.
[29] O. A. L. Lemos, F. C. Ferrari, P. C. Masiero, and C. V.
Lopes, “Testing aspect-oriented programming pointcut
descriptors,” in Proceedings of the 2nd Workshop on
Testing Aspect Oriented Programs (WTAOP). Portland/Maine - USA: ACM Press, 2006, pp. 33–38.
[30] O. A. L. Lemos, A. M. R. Vincenzi, J. C. Maldonado, and
P. C. Masiero, “Control and data flow structural testing
criteria for aspect-oriented programs,” The Journal of
Systems and Software, vol. 80, no. 6, pp. 862–882, 2007.
[31] J. Zhao, “Data-flow-based unit testing of aspect-oriented
programs,” in Proceedings of the 27th Annual IEEE
International Computer Software and Applications Conference (COMPSAC). IEEE Computer Society, 2003,
pp. 188–197.
[32] F. Wedyan and S. Ghosh, “A dataflow testing approach
for aspect-oriented programs,” in Proceedings of the 12th
IEEE International High Assurance Systems Engineering
Symposium (HASE). San Jose/CA - USA: IEEE Computer Society, 2010, pp. 64–73.
[33] Atlassian, Inc., “Clover: Java and Groovy code coverage,” Online, http://www.atlassian.com/software/clover/
overview - last accessed on 06/02/2013.
A Method for Model Checking Context-Aware
Exception Handling
Lincoln S. Rocha
Grupo de Pesquisa GREat
UFC, Quixadá-CE, Brasil
Email: [email protected]

Rossana M. C. Andrade
Grupo de Pesquisa GREat
UFC, Fortaleza-CE, Brasil
Email: [email protected]

Alessandro F. Garcia
Grupo de Pesquisa OPUS
PUC-Rio, Rio de Janeiro-RJ, Brasil
Email: [email protected]
Resumo—O tratamento de exceção sensível ao contexto
(TESC) é uma técnica de recuperação de erros empregada
na melhoria da robustez de sistemas ubíquos. No projeto do
TESC, os projetistas especificam condições de contexto que
são utilizadas para caracterizar situações de anormalidade e
estabelecem critérios para a seleção das medidas de tratamento.
A especificação errônea dessas condições representa faltas de
projeto críticas. Elas podem fazer com que o mecanismo de
TESC, em tempo de execução, não seja capaz de identificar
as situações de anormalidade desejadas ou reagir de forma
adequada quando estas são detectadas. Desse modo, para que a
confiabilidade do TESC não seja comprometida, é necessário que
estas faltas de projeto sejam rigorosamente identificadas e removidas em estágios iniciais do desenvolvimento. Contudo, embora
existam abordagens formais para verificação de sistemas ubíquos
sensíveis ao contexto, estas não proveem suporte apropriado para
a verificação do TESC. Nesse cenário, este trabalho propõe um
método para verificação de modelos do TESC. As abstrações
propostas pelo método permitem aos projetistas modelarem
aspectos comportamentais do TESC e, a partir de um conjunto de
propriedades pré-estabelecidas, identificar a existência de faltas
de projeto. Com o objetivo de avaliar a viabilidade do método:
(i) uma ferramenta de suporte à verificação foi desenvolvida;
e (ii) cenários recorrentes de faltas em TESC foram injetados
em modelos de um sistema de forma a serem analisados com a
abordagem de verificação proposta.
Index Terms—Sistemas Ubíquos, Tratamento de Exceção, Verificação de Modelos
Abstract—Context-aware exception handling (CAEH) is an error recovery technique employed to improve the robustness of ubiquitous software. In the design of CAEH, context conditions are specified to characterize abnormal situations and used to select the proper handlers. The erroneous specification of such conditions represents a critical design fault that can lead the CAEH mechanism to behave erroneously or improperly at runtime (e.g., abnormal situations may not be recognized and the system's reaction may deviate from what is expected). Thus, in order to improve CAEH reliability, this kind of design fault must be rigorously identified and eliminated from the design in the early stages of development. However, despite the existence of formal approaches to verify context-aware ubiquitous systems, such approaches lack specific support to verify CAEH behavior. This work proposes a method for model checking CAEH. The method provides a set of modeling abstractions and three formally defined properties that can be used to identify existing design faults in a CAEH design. In order to assess the method's feasibility: (i) a support tool was developed; and (ii) fault scenarios that are recurring in CAEH were injected into a correct model and verified using the proposed approach.
Index Terms—Ubiquitous Systems, Exception Handling, Model Checking
I. INTRODUÇÃO
Os sistemas ubíquos sensíveis ao contexto são sistemas capazes de observar o contexto em que estão inseridos e reagir de
forma apropriada, adaptando sua estrutura e comportamento ou
executando tarefas de forma automática [1]. Nesses sistemas,
o contexto representa um conjunto de informações sobre o
ambiente (físico ou lógico, incluindo os usuários e o próprio
sistema), que pode ser usado com o propósito de melhorar
a interação entre o usuário e o sistema ou manter a sua
execução de forma correta, estável e otimizada [2]. Devido ao
seu amplo domínio de aplicação (e.g., casas inteligentes, guias
móveis de visitação, jogos e saúde) e por tomarem decisões
de forma autônoma no lugar das pessoas, os sistemas ubíquos
sensíveis ao contexto precisam ser confiáveis para cumprir com a
sua função. Tal confiabilidade requer que esses sejam robustos
(i.e., capazes de lidar com situações anormais) [3].
O tratamento de exceção sensível ao contexto (TESC)
é uma abordagem utilizada para recuperação de erros que
vem sendo explorada como uma alternativa para melhorar os
níveis de robustez desse tipo de sistema [4][3][5][6][7][8].
No TESC, o contexto é usado para caracterizar situações
anormais no sistemas (e.g., uma falha de software/hardware
ou a quebra de algum invariante do sistema), denominadas de
exceções contextuais, e estruturar as atividades de tratamento,
estabelecendo critérios para a seleção e execução de tratadores.
De um modo geral, a ocorrência de uma exceção contextual
requer que o fluxo normal do sistema seja desviado para que o
tratamento apropriado seja conduzido. Entretanto, dependendo
da situação e do mecanismo de TESC adotado, o fluxo de
controle pode ser retomado, ou não, após o término do tratamento.
No projeto de sistemas ubíquos sensíveis ao contexto, o
uso de formalismos e abstrações apropriados, faz-se necessário, para lidar com questões complexas inerentes a esses sistemas (e.g., abertura, incerteza, adaptabilidade, volatilidade e gerenciamento de contexto) em tempo de projeto [9][10][11][12][13][14]. Em particular, o projetista de
TESC é responsável por especificar as condições de contexto
utilizadas [5][3]: (i) na definição das exceções contextuais;
e (ii) na seleção das medidas de tratamento. No caso (i),
as condições de contexto especificadas são usadas pelo mecanismo de TESC para detectar a ocorrência de exceções
contextuais em tempo de execução. Por outro lado, no caso
(ii), as condições de contexto funcionam como pré-condições
que são estabelecidas para cada possível tratador de uma
exceção contextual. Assim, quando uma exceção é detectada, o mecanismo seleciona, dentre os possíveis tratadores
daquela exceção, aqueles cujas pré-condições são satisfeitas
no contexto corrente do sistema. Entretanto, a falibilidade
humana e o conhecimento parcial sobre a forma como o
contexto do sistema evolui, podem levar os projetistas a
cometerem erros de especificação, denominadas de faltas de
projeto (design faults). Por exemplo, devido a negligência ou
lapsos de atenção, contradições podem ser facilmente inseridas
pelos projetistas na especificação das condições de contexto
ou, mesmo não existindo contradições, essas condições podem
representar situações de contexto que nunca ocorrem em
tempo de execução, devido a forma como o contexto evolui.
Nessa perspectiva, a inserção de faltas de projeto, e a
sua eventual propagação até a fase de codificação, podem
fazer com que o mecanismo de TESC seja configurado de
maneira imprópria, comprometendo a sua confiabilidade em
detectar exceções contextuais ou selecionar os tratadores apropriados. Estudos recentes relatam que a validação do código
de tratamento de exceção de sistemas que utilizam recursos
externos não confiáveis, a exemplo dos sistemas ubíquos
sensíveis ao contexto, é uma atividade difícil e extremamente
desafiadora [15]. Isso decorre do fato de que para testar todo o
espaço de exceções levantadas no sistema, é necessário gerar,
de forma sistemática, todo o contexto que desencadeia essas
exceções. Essa atividade, além de complexa, pode se tornar
proibitiva em alguns casos devido aos altos custos associados.
Desse modo, uma análise rigorosa do projeto do TESC, buscando identificar e remover faltas de projeto, podem contribuir
para a melhoria dos níveis de confiabilidade do TESC e para a
diminuição dos custos associados à identificação e correção de
defeitos decorrentes da inserção de faltas de projeto. Contudo,
embora existam abordagens formais baseadas em modelos
voltadas para a análise do comportamento de sistemas ubíquos sensíveis ao contexto [16][11][17], essas estão voltadas
somente para o comportamento adaptativo. Elas não provêem
abstrações e suporte apropriado para modelagem e análise do
comportamento do TESC, tornando essa atividade ainda mais
complexa e sujeita a introdução de faltas.
Nesse cenário, este trabalho propõe um método baseado
em verificação de modelos para apoiar a identificação automática de faltas de projeto no TESC (Seção IV). O método
proposto provê um conjunto de abstrações que permitem aos
projetistas modelarem aspectos comportamentais do TESC e
mapeá-los para um modelo formal de comportamento (estrutura de Kripke) compatível com a técnica de verificação de
modelos [18]. O formalismo adotado é baseado em estados,
transições e lógica temporal devido as necessidades peculiares
de projeto e verificação de modelos de TESC (Seção III).
Um conjunto de propriedades comportamentais é estabelecido
e formalmente definido com lógica temporal no intuito de
auxiliar os projetistas na identificação de determinados tipos
de faltas de projeto (Seção II). Além disso, com o objetivo
de avaliar a viabilidade do método (Seção V): (i) uma ferra-
menta de suporte à verificação foi desenvolvida e (ii) cenários
recorrentes de faltas em TESC foram injetados em modelos
de um sistema de forma a serem analisados com a abordagem
de verificação proposta. Ao final, a Seção VI descreve os
trabalhos relacionados e a Seção VII conclui o artigo.
II. TRATAMENTO DE EXCEÇÃO SENSÍVEL AO CONTEXTO
No tratamento de exceção sensível ao contexto (TESC),
o contexto e a sensibilidade ao contexto são utilizados pelo
mecanismo de tratamento de exceção para definir, detectar e
tratar condições anormais em sistemas ubíquos, chamadas de
exceções contextuais. Na Seção II-A desta seção são descritos
os principais tipos de exceções contextuais. Além disso, uma
discussão sobre onde e como faltas1 de projeto podem ser
cometidas no projeto do TESC é oferecida na Seção II-B.
A. Tipos de Exceções Contextuais
As exceções contextuais representam situações anormais
que requerem que um desvio no fluxo de execução seja feito
para que o tratamento da excepcionalidade seja conduzido.
A detecção da ocorrência de uma exceção contextual pode
indicar uma eventual falha em algum dos elementos (hardware
ou software) que compõem o sistema ou que alguma invariante
de contexto, necessária a execução de alguma atividade do sistema, tenha sido violada. Neste trabalho, as exceções contextuais foram agrupadas em 3 (três) categorias: infraestrutura,
invalidação de contexto e segurança. Elas são descritas nas
próximas subseções.
1) Exceções Contextuais de Infraestrutura: Esse tipo de exceção contextual está relacionada com a detecção de situações
de contexto que indicam que alguma falha de hardware ou
software ocorreu em algum dos elementos que constituem o
sistema ubíquo. Um exemplo desse tipo de exceção contextual
é descrito em [4] no escopo de um sistema de aquecimento
“inteligente”. A função principal daquele sistema é ajustar a
temperatura do ambiente às preferências dos usuários. Naquele
sistema, uma situação de excepcionalidade é caracterizada
quando a temperatura do ambiente atinge um valor acima do
limite estabelecido pelas preferencias do usuário. Esse tipo
de exceção contextual ajuda a identificar, de forma indireta,
a ocorrência de falhas cuja origem pode ser o sistema que
controla o equipamento de aquecimento (falha de software) ou
o próprio equipamento (falha de hardware). O mau funcionamento do sistema de aquecimento é considerado uma situação
anormal, pois pode colocar em risco a saúde dos usuários.
Observe que para detectar essa exceção contextual é necessário
ter acesso à informações de contexto sobre a temperatura do
ambiente e as preferências dos usuários.
2) Exceções Contextuais de Invalidação de Contexto: Esse
tipo de exceção contextual está relacionada com a violação
de determinadas condições de contexto durante a execução
de alguma tarefa do sistema. Essas condições de contexto
funcionam como invariantes da tarefa e, quando violadas,
1 Este trabalho adota a nomenclatura de [19], na qual uma falta (fault)
é a causa física ou algorítmica de um erro (error), que, se propagado até a
interface de serviço do componente ou sistema, caracteriza uma falha (failure).
caracterizam uma situação de anormalidade. Por exemplo,
os autores de [3] descrevem esse tipo de exceção em uma
aplicação de leitor de música sensível ao contexto. O leitor de
música executa no dispositivo móvel do usuário, enviando um
fluxo continuo de som para a saída de áudio do dispositivo.
Entretanto, quando o usuário entra em uma sala vazia, o
aplicativo busca por algum dispositivo de áudio disponível no
ambiente e transfere o fluxo de som para aquele dispositivo.
Nessa aplicação, é estabelecido como contexto invariante a
necessidade do usuário estar sozinho dentro da sala. Para
os autores de [3], a violação desse invariante é considerado
uma situação excepcional, pois o seu não cumprimento pode
trazer desconforto ou aborrecimento para as demais pessoas
presentes na sala. Note que a detecção dessa exceção depende
de informações de contexto sobre a localização do usuário e
o número de pessoas que estão na mesma sala que ele.
3) Exceções Contextuais de Segurança: Esse tipo de exceção está relacionada com situações de contexto que ajudam
a identificar a violação de políticas de segurança (e.g., autenticação, autorização e privacidade) e demais situações que
podem colocar em risco a integridade física ou financeira dos
usuários do sistema. Por exemplo, o sistema de registro médico
sensível ao contexto apresentado em [3] descreve esse tipo de
exceção. Nesse sistema existem três usuários envolvidos: os
pacientes, os enfermeiros e os médicos. Os médicos podem
fazer registros sobre seus pacientes e os enfermeiros podem
ler e atualizar esses registros ao tempo em que assistem aos
pacientes. Entretanto, os enfermeiros só podem ter acesso aos
registros se estiverem dentro da enfermaria em que o paciente
se encontra e se o médico responsável estiver presente. Naquele sistema, quando um enfermeiro tenta acessar os registros
do paciente, porém não se encontra na mesma enfermaria que
este paciente ou encontra-se na enfermaria, mas o médico
responsável não está presente, caracteriza-se uma situação
excepcional. Perceba que a detecção desse tipo de exceção
depende das informações de contexto sobre a localização e o
perfil do paciente, do enfermeiro e do médico.
B. Propensão a Faltas de Projeto
Com base em trabalhos existentes na literatura [4][3][5][6][7][8], é possível dividir o projeto do
TESC em duas grandes atividades: (i) especificação do
contexto excepcional; e (ii) especificação do tratamento
sensível ao contexto. Na atividade (i), os projetistas
especificam as condições de contexto que caracterizam
as situações de anormalidade identificadas no sistema.
Dessa forma, em tempo de execução, quando uma dessas
situações são detectadas pelo mecanismo de TESC, diz-se
que uma ocorrência da exceção contextual associada foi
identificada. Por outro lado, na atividade (ii), os projetistas
especificam as ações de tratamento a serem executadas para
mitigar a situação de excepcionalidade detectada. Entretanto,
dependendo do contexto corrente do sistema quando a
exceção é detectada, um conjunto de ações de tratamento
podem ser mais apropriado do que outro para tratar aquela
determinada ocorrência excepcional. Desse modo, faz parte
do trabalho do projetista na atividade (ii), agrupar as ações de
tratamento e estabelecer condições de contexto que ajudem o
mecanismo de TESC, em tempo de execução, a selecionar o
conjunto de ações de tratamento apropriado para lidar com
uma ocorrência excepcional em função do contexto corrente.
A falibilidade dos projetistas, o conhecimento parcial sobre
a forma como o contexto do sistema evolui em tempo de
execução, a inexistência de uma notação apropriada e a falta de
suporte ferramental, tornam o projeto do TESC uma atividade
extremamente propensa a faltas de projeto. Por exemplo,
devido a negligência ou lapsos de atenção, contradições podem
ser facilmente inseridas pelos projetistas durante a especificação das condições de contexto construídas nas atividades
(i) e (ii) do projeto do TESC. Além disso, mesmo que os
projetistas criem especificações livres de contradições, essas
podem representar situações de contexto que nunca ocorrerão
em tempo de execução devido a forma como o sistema e o
seu contexto evoluem. Faltas de projeto como estas podem
fazer com que o mecanismo de TESC seja mal configurado,
comprometendo a sua confiabilidade em detectar as situações
de anormalidade desejadas e selecionar as ações de tratamento
adequadas para lidar com ocorrências excepcionais específicas.
Adicionalmente, existe outro tipo de falta de projeto que pode
ser facilmente cometida por projetistas. Por exemplo, considere o projeto do TESC para uma exceção contextual em
que as especificações das condições de contexto construídas
nas atividades (i) e (ii) estejam livres de faltas de projeto
como as descritas anteriormente. Perceba que, mesmo nesse
caso, pode ocorrer do projetista especificar a condição de
contexto que caracteriza a situação de anormalidade e as
condições de seleção das ações de tratamento de tal forma
que estas nunca sejam satisfeitas, simultaneamente, em tempo
de execução. Isso pode acontecer nos casos em que essas
condições de contexto sejam contraditórias entre si ou que
não seja possível o sistema atingir um estado em que seu
contexto satisfaça a ambas ao mesmo tempo. Desse modo,
face a propensão à falta de projetos, uma abordagem rigorosa
deve ser empregada pelos projetistas para que faltas de projeto
sejam identificadas e removidas, antes que sejam propagadas
até a fase de codificação.
III. VERIFICAÇÃO DE MODELOS
A verificação de modelos é um método formal empregado
na verificação automática de sistemas reativos concorrentes
com número finito de estados [18]. Nessa abordagem, o
comportamento do sistema é modelado através de algum
formalismo baseado em estados e transições e as propriedades
a serem verificadas são especificadas usando lógicas temporais. A verificação das propriedades comportamentais é dada
através de uma exaustiva enumeração (implícita ou explícita)
de todos os estados alcançáveis do modelo do sistema. A
estrutura de Kripke (Definição 1) é um formalismo para
modelagem de comportamento, onde os estados são rotulados
ao invés das transições. Na estrutura de Kripke, cada rótulo
representa um instantâneo (estado) da execução do sistema.
Essa característica foi preponderante para sua escolha neste
trabalho, uma vez que os aspectos comportamentais do projeto
TESC que se deseja analisar são influenciados pela observação
do estado do contexto do sistema, e não pelas ações que o
levaram a alcançar um estado de contexto em particular.
Definição 1 (Estruturas de Kripke). Uma estrutura de Kripke K = ⟨S, I, L, →⟩ sobre um conjunto finito de proposições atômicas AP é dada por um conjunto finito de estados S, um conjunto de estados iniciais I ⊆ S, uma função de rótulos L : S → 2^AP, a qual mapeia cada estado em um conjunto de proposições atômicas que são verdadeiras naquele estado, e uma relação de transição total → ⊆ S × S, isto é, que satisfaz a restrição ∀s ∈ S, ∃s′ ∈ S tal que (s, s′) ∈ →.
Usualmente, as propriedades que se deseja verificar são divididas em dois tipos: (i) de segurança (safety), que buscam
expressar que “nada ruim acontecerá” durante a execução do
sistema; e (ii) de progresso (liveness), que buscam expressar
que, eventualmente, “algo bom acontecerá” durante a execução
do sistema. Essas propriedades são expressas usando lógicas
temporais que são interpretadas sobre uma estrutura de Kripke.
Dessa forma, dada uma estrutura de Kripke K e uma fórmula
temporal ϕ, uma formulação geral para o problema de verificação de modelos consiste em verificar se ϕ é satisfeita na
estrutura K, formalmente K |= ϕ. Nesse caso, K representa o
modelo do sistema e ϕ a propriedade que se deseja verificar.
Neste trabalho, devido a sua expressividade e ao verificador
de modelos utilizado, a lógica temporal CTL (Computation
Tree Logic) foi escolhida para a especificação de propriedades
sobre o comportamento do TESC. CTL é uma lógica temporal
de tempo ramificado que permite expressar propriedades sobre
estados. As fórmulas de CTL são construídas sobre proposições atômicas utilizando operadores proposicionais (¬, ∧, ∨,
→ e ↔) e operadores temporais (EX, EF, EG, EU, AX, AF, AG e
AU). Sejam φ e ϕ fórmulas CTL, a intuição para os operadores
temporais é dada na Tabela I. Para obter mais detalhes sobre
CTL e verificação de modelos, consulte [18].
Tabela I
INTUIÇÃO PARA OS OPERADORES TEMPORAIS DE CTL.

EXφ        "existe um caminho tal que no próximo estado φ é verdadeira."
EFφ        "existe um caminho tal que no futuro φ será verdadeira."
EGφ        "existe um caminho tal que φ é sempre verdadeira."
EU(φ, ϕ)   "existe um caminho tal que φ é verdadeira até que ϕ passe a ser."
AXφ        "para todo caminho, no próximo estado φ é verdadeira."
AFφ        "para todo caminho, φ é verdadeira no futuro."
AGφ        "para todo caminho, φ é sempre verdadeira."
AU(φ, ϕ)   "para todo caminho, φ é verdadeira até que ϕ passe a ser."
IV. O MÉTODO PROPOSTO
Nesta seção é apresentado o método proposto para a verificação de modelos do TESC. Uma visão geral do método
é oferecida na Seção IV-A. A Seção IV-B aborda a forma
como o espaço de estados a ser explorado é derivado. Além
disso, a atividade de modelagem (Seção IV-C), a derivação da
estrutura de Kripke (Seção IV-D) e a atividade de especificação
(Seção IV-E) do método são detalhadas.
A. Visão Geral
O método proposto provê um conjunto de abstrações e convenções que permitem aos projetistas expressarem de forma
rigorosa o comportamento excepcional sensível ao contexto e
mapeá-lo para uma estrutura de Kripke particular, formalismo
apresentado na Seção III que serve de base para a técnica de
verificação de modelos. Além disso, o método oferece uma
lista de propriedades comportamentais, a serem verificadas
sobre o comportamento excepcional sensível ao contexto,
com o intuito de auxiliar os projetistas na descoberta de
determinados tipos de faltas de projeto no TESC.
O método é decomposto em duas atividades: modelagem
e especificação. Na atividade de modelagem (Seção IV-C), o
comportamento do TESC é modelado utilizando um conjunto
de construtores próprios que ajudam a definir as exceções
contextuais e estruturar as ações de tratamento. Na atividade de
especificação (Seção IV-E), um conjunto de propriedades que
permite identificar um conjunto bem definido de faltas de
projeto no TESC é apresentado e formalizado utilizando a
lógica CTL. Entretanto, o fato de o método conseguir representar
o modelo de comportamento do TESC como uma estrutura de
Kripke permite que outros tipos de propriedades comportamentais sejam definidos pelos projetistas utilizando CTL.
B. Determinando o Espaço de Estados
Nos estágios iniciais do projeto do TESC, um dos principais
esforços está na identificação das informações de contexto que
podem ser úteis para projetar o TESC. Nesses estágios, por
não haver um conhecimento detalhado sobre o tipo, a origem
e a estrutura dessas informações, é pertinente abstrair esses
detalhes e buscar lidar com informações de contexto mais alto
nível. Essas informações de contexto de alto nível, podem ser
vistas como proposições sobre o contexto do sistema, que recebem, em tempo de execução, uma interpretação, verdadeira
ou falsa, de acordo com a valoração assumida pelas variáveis
de contexto de mais baixo nível observadas pelo sistema.
No método proposto, essas proposições são denominadas de
proposições contextuais e compõem o conjunto CP, que
representa a base de conhecimento utilizada para caracterizar
situações contextuais relevantes para o projeto do TESC.
Nesse cenário, para que o espaço de estados a ser explorado
seja obtido é preciso criar uma função de valoração que atribua
valores para as proposições contextuais em CP. Entretanto,
construir essa função não é uma atividade trivial, pois essas
proposições contextuais representam informações de contexto
de baixo nível que assumem uma valoração de forma não
determinística, seguindo leis que extrapolam a fronteira e
o controle do sistema (e.g., tempo, condições climáticas e
mobilidade) e que podem estar relacionadas entre si de forma
dependente ou conflitante. Desse modo, embora as proposições
contextuais permitam abstrair os detalhes das variáveis de
contexto de baixo nível, elas trazem consigo problemas de
dependência semântica que dificultam a construção de uma
função de valoração. Para lidar com essa questão, o método
proposto adota a técnica de programação por restrições [20]
como função de valoração para as proposições contextuais.
Essa técnica permite ao projetista estabelecer restrições semânticas (Definição 2) sobre CP garantindo que todas as soluções
geradas (o espaço de estados a ser explorado) satisfazem as
restrições estabelecidas. Por convenção, a função csp(CP, C)
será utilizada para designar o espaço de estados derivado a
partir do conjunto C de restrições definido sobre CP.
Definição 2 (Restrição). Uma restrição é definida como
uma fórmula lógica sobre o conjunto CP de proposições
contextuais tal como descrito na gramática em (1), onde
p ∈ CP é uma proposição contextual, φ e ϕ são fórmulas
lógicas e ¬ (negação), ∧ (conjunção), ∨ (disjunção), ⊕
(disjunção exclusiva), → (implicação) e ↔ (dupla implicação)
são operadores lógicos.
φ, ϕ ::= p | ¬φ | φ ∧ ϕ | φ ∨ ϕ | φ ⊕ ϕ | φ → ϕ | φ ↔ ϕ   (1)
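Apenas para ilustrar a semântica de csp(CP, C) (e não a implementação baseada em programação por restrições adotada pela ferramenta), o esboço Java abaixo, com nomes hipotéticos, enumera por força bruta todas as valorações de CP e mantém aquelas que satisfazem todas as restrições, aqui representadas como predicados sobre a valoração:

import java.util.*;
import java.util.function.Predicate;

// Esboço ilustrativo da semântica de csp(CP, C): enumeração exponencial (2^|CP|)
// de todas as valorações, filtrando as que satisfazem o conjunto C de restrições.
final class StateSpace {
    static List<Map<String, Boolean>> csp(List<String> cp,
                                          List<Predicate<Map<String, Boolean>>> c) {
        List<Map<String, Boolean>> states = new ArrayList<>();
        int n = cp.size();
        for (long mask = 0; mask < (1L << n); mask++) {
            Map<String, Boolean> val = new HashMap<>();
            for (int i = 0; i < n; i++) {
                val.put(cp.get(i), ((mask >> i) & 1L) == 1L);
            }
            // A valoração só vira estado se todas as restrições forem satisfeitas.
            if (c.stream().allMatch(r -> r.test(val))) states.add(val);
        }
        return states;
    }
}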
C. Atividade de Modelagem
Durante a modelagem do comportamento do TESC, algumas questões de projeto relacionadas com a definição e a
detecção de exceções contextuais, com o agrupamento, seleção
e execução das medidas de tratamento precisam ser pensadas.
No método proposto, a atividade de modelagem tem como
objetivo tratar essas questões e possibilitar o mapeamento do
modelo de comportamento do TESC para uma estrutura de
Kripke. Para isso, são propostas as abstrações de exceções
contextuais, casos de tratamento e escopos de tratamento.
1) Exceções Contextuais: No método proposto, uma exceção contextual é definida por um nome e uma fórmula
lógica utilizada para caracterizar o seu contexto excepcional
(Definição 3). Uma exceção contextual é detectada quando a
fórmula ecs é satisfeita em algum dado estado de contexto
do sistema. Nesse momento, diz-se que a exceção contextual
foi levantada. Por convenção, dada uma exceção contextual
e = ⟨name, ecs⟩, a função ecs(e) é definida para recuperar a
especificação de contexto excepcional (ecs) da exceção e.
Definição 3 (Exceção Contextual). Dado um conjunto de proposições contextuais CP, uma exceção contextual é definida
pela tupla ⟨name, ecs⟩, onde name é o nome da exceção
contextual e ecs é uma fórmula lógica definida sobre CP que
especifica o contexto excepcional de detecção.
2) Casos de Tratamento: Como discutido anteriormente,
uma exceção contextual pode ser tratada de formas diferentes
dependendo do contexto em que o sistema se encontra. Os
casos de tratamento (Definição 4) definem as diferentes
estratégias que podem ser empregadas para tratar uma exceção
contextual em função do contexto corrente do sistema. Um
caso de tratamento é composto por uma condição de seleção
e um conjunto de fórmulas lógicas que são utilizadas para
descrever a situação de contexto esperada após a execução
de cada ação (ou bloco de ações) de tratamento de forma
sequencial. Por convenção, os constituintes de um caso de
tratamento serão referenciados de agora em diante como
condição de seleção e conjunto de medidas de tratamento,
respectivamente.
Definição 4 (Caso de Tratamento). Dado um conjunto de
proposições contextuais CP, um caso de tratamento é definido
como uma tupla hcase = ⟨α, H⟩, onde α é uma fórmula
lógica definida sobre CP e H é um conjunto ordenado de
fórmulas lógicas definidas sobre CP.
3) Escopos de Tratamento: Tipicamente, os tratadores de
exceção encontram-se vinculados a áreas específicas do código
do sistema onde exceções podem ocorrer. Essa estratégia ajuda
a delimitar o escopo de atuação de um tratador durante a
atividade de tratamento. No método proposto, o conceito de
escopos de tratamento (Definição 5) é criado para delimitar a
atuação dos casos de tratamento e estabelecer uma relação de
precedência entre estes. Essa relação de precedência é essencial para resolver situações de sobreposição entre condições
de seleção de casos de tratamento (i.e., situações em que mais
de um caso de tratamento pode ser selecionado num mesmo
estado do contexto). Dessa forma, no método proposto, o caso
de tratamento de maior precedência é avaliado primeiro, se
este não tiver a sua condição de seleção satisfeita, o próximo
caso de tratamento com maior precedência é avaliado, e assim
por diante.
Definição 5 (Escopo de Tratamento). Dado um conjunto de
proposições contextuais CP, um escopo de tratamento é definido pela tupla ⟨e, HCASE⟩, onde e é uma exceção contextual
e HCASE é um conjunto ordenado de casos de tratamento.
A noção de conjunto ordenado, mencionado na Definição 5,
está relacionada com a existência de uma relação de ordem
entre os casos de tratamento. Essa relação permite estabelecer
a ordem de precedência em que cada caso de tratamento
será avaliado quando uma exceção contextual for levantada.
No método proposto, a ordem de avaliação utilizada leva
em consideração a posição ocupada por cada caso de tratamento dentro do conjunto HCASE. Portanto, para os casos
de tratamento hcase_i e hcase_j, se i < j, então hcase_i
tem precedência sobre hcase_j (i.e., hcase_i ≺ hcase_j). No
entanto, essa relação de ordem não é fixa, embora seja obrigatória,
podendo ser alterada pelo projetista com o propósito de obter
algum tipo de benefício.
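A título de ilustração (esboço nosso, com nomes hipotéticos, sem compromisso com a API da ferramenta), as abstrações das Definições 3, 4 e 5 poderiam ser representadas assim em Java, com a precedência dos casos de tratamento dada pela posição na lista:

import java.util.List;

// Esboço ilustrativo das abstrações da atividade de modelagem.
// As fórmulas lógicas são mantidas como texto apenas para simplificar o exemplo.
record ContextualException(String name, String ecs) {}                    // Definição 3
record HandlingCase(String selectionCondition, List<String> measures) {}  // Definição 4
record HandlingScope(ContextualException exception, List<HandlingCase> cases) { // Definição 5
    // A posição na lista define a precedência: cases.get(0) precede cases.get(1), e assim por diante.
    HandlingCase highestPrecedence() { return cases.get(0); }
}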
D. Derivando a Estrutura de Kripke
Como apresentado na Seção III, uma estrutura de Kripke
é uma tupla K = ⟨S, I, L, →⟩ definida sobre um conjunto
finito de proposições atômicas AP. Desse modo, o processo de
derivação de uma estrutura de Kripke consiste em estabelecer
os elementos que a constituem, observando todas as restrições
impostas pela sua definição, quais sejam: (i) o conjunto S de
estados deve ser finito; e (ii) a relação de transição → deve
ser total. Ao longo desta seção são descritos os procedimentos
adotados pelo método para obter cada um dos constituintes da
estrutura de Kripke que representa o TESC, chamada de EK.
Inicialmente, de forma direta, o conjunto AP de proposições atômicas sobre o qual EK é definida, é formado pelo
conjunto CP de proposições contextuais (i.e., AP = CP).
Além disso, o conjunto S de estados de EK é obtido a partir
dos conjuntos CP de proposições contextuais e G de restrições
estabelecidas sobre CP por meio da função S = csp(CP, G).
Já o conjunto I de estados iniciais é estabelecido como segue:
I = {s|s ∈ S, e ∈ E, val(s) |= ecs(e)}, onde S é o conjunto
de estados, E é o conjunto de todas as exceções contextuais
modeladas no sistema e val(s) significa a valoração das
proposições contextuais no estado s. Desse modo, os estados
iniciais são os estados em que a valoração das proposições
contextuais (val(s)) satisfaz (|=) as especificações de contexto excepcional das exceções modeladas (ecs(e)), i.e., o
conjunto I é composto pelos estados excepcionais do modelo.
Já a função de rótulos L é composta pela valoração de todos
os estados do sistema: L = {val(s) | s ∈ S}.
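Como ilustração (esboço nosso, com nomes hipotéticos), a obtenção do conjunto I de estados iniciais descrita acima poderia ser esboçada em Java como segue, assumindo um predicado sat que decide val(s) |= ecs(e):

import java.util.*;
import java.util.function.BiPredicate;

// Esboço ilustrativo de I: estados cuja valoração satisfaz o contexto excepcional
// (ecs) de alguma exceção contextual modelada no sistema.
final class InitialStates {
    static Set<Integer> compute(Map<Integer, Set<String>> labels,       // val(s) por estado
                                Collection<String> ecsFormulas,         // ecs(e) para cada e em E
                                BiPredicate<Set<String>, String> sat) { // val(s) |= fórmula (hipotético)
        Set<Integer> initial = new HashSet<>();
        for (Map.Entry<Integer, Set<String>> s : labels.entrySet()) {
            for (String ecs : ecsFormulas) {
                if (sat.test(s.getValue(), ecs)) { initial.add(s.getKey()); break; }
            }
        }
        return initial;
    }
}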
Antes de apresentar a forma como a relação de transição →
de EK é derivada, duas funções auxiliares são introduzidas. O
objetivo dessas funções é construir um conjunto de transições
entre pares de estados. Em (2a), as transições são construídas
a partir de um dado estado s e uma fórmula lógica φ, onde o
estado de partida é o estado s e os estados de destino são todos
aqueles cujo rótulo satisfaz (|=) a fórmula φ. Já em (2b), as
transições são construídas a partir de um par de fórmulas, φ e
ϕ, onde os estados de partida são todos aqueles que satisfazem
φ e os de destino são os que satisfazem ϕ.
ST(s, φ, S) = {(s, r) | r ∈ S, L(r) |= φ}                         (2a)
FT(φ, ϕ, S) = {(s, r) | s, r ∈ S, L(s) |= φ, L(r) |= ϕ}           (2b)
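A título de ilustração (esboço nosso, com nomes hipotéticos), as funções (2a) e (2b) poderiam ser implementadas como segue, assumindo que os rótulos dos estados são dados por um mapa e que um predicado sat decide se um rótulo satisfaz uma fórmula:

import java.util.*;
import java.util.function.BiPredicate;

// Esboço ilustrativo das funções auxiliares (2a) e (2b).
final class TransitionBuilders {
    // ST(s, φ, S): transições de s para todo estado cujo rótulo satisfaz φ.
    static Set<List<Integer>> st(int s, String phi, Map<Integer, Set<String>> labels,
                                 BiPredicate<Set<String>, String> sat) {
        Set<List<Integer>> pairs = new HashSet<>();
        for (Map.Entry<Integer, Set<String>> r : labels.entrySet()) {
            if (sat.test(r.getValue(), phi)) pairs.add(List.of(s, r.getKey()));
        }
        return pairs;
    }

    // FT(φ, ϕ, S): transições de todo estado que satisfaz φ para todo estado que satisfaz ϕ.
    static Set<List<Integer>> ft(String phi, String psi, Map<Integer, Set<String>> labels,
                                 BiPredicate<Set<String>, String> sat) {
        Set<List<Integer>> pairs = new HashSet<>();
        for (Map.Entry<Integer, Set<String>> sEntry : labels.entrySet()) {
            if (!sat.test(sEntry.getValue(), phi)) continue;
            for (Map.Entry<Integer, Set<String>> rEntry : labels.entrySet()) {
                if (sat.test(rEntry.getValue(), psi))
                    pairs.add(List.of(sEntry.getKey(), rEntry.getKey()));
            }
        }
        return pairs;
    }
}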
A relação de transição → de EK representa a sequência de
ações realizadas durante a atividade de tratamento para cada
exceção contextual detectada e tratada pelo mecanismo de
TESC. Essas transições entre estados iniciam em um estado
excepcional e terminam em um estado caracterizado pela última medida de tratamento do caso de tratamento selecionado
para tratar aquela exceção. O Algoritmo 1 descreve como as
transições em EK são geradas, recebendo como entrada os
conjuntos: Γ, de escopos de tratamento; I, de estados iniciais
(excepcionais); e S de todos os estados. Desse modo, para
cada escopo de tratamento ⟨e, HCASE⟩ (linha 4) e para cada
estado em I (linha 5), verifica-se se a exceção e do escopo de
tratamento corrente pode ser levantada no estado s (linha 6).
Em caso afirmativo, para cada caso de tratamento ⟨α, H⟩
(linha 7), verifica-se se este pode ser selecionado (linha 8).
As transições entre o estado excepcional e os estados que satisfazem a primeira medida de tratamento do caso de tratamento
são construídas por meio de uma chamada à função ST(s, H(0), S)
(linha 9), sendo armazenadas em um conjunto auxiliar (AUX).
Caso esse conjunto auxiliar seja não vazio (linha 10), essas transições são guardadas no conjunto de retorno TR
(linha 11). Adicionalmente, o mesmo é feito para cada par
de medidas de tratamento por meio de chamadas à função
FT(H(i − 1), H(i), S) (linhas 13 e 15). Perceba que os
laços mais interno e intermediário são interrompidos (linhas 17
e 20) quando não é possível realizar transições entre estados.
Além disso, o comando break (linha 22) garante que apenas
um caso de tratamento é selecionado para tratar a exceção
e, levando em consideração a relação de ordem baseada nos
índices. Por fim, antes de retornar o conjunto final de relações
de transição (linha 36), o fragmento de código compreendido
entre as linhas 29 e 35 adiciona uma auto-transição (transição
de loop) nos estados terminais (i.e., nos estados que não
possuem sucessores) para garantir a restrição de totalidade
imposta pela definição de estrutura de Kripke.
Algoritmo 1 Geração da Relação de Transição → de EK.
 1: function TRANSICAOEK(Γ, I, S)
 2:   TR = ∅
 3:   AUX = ∅
 4:   for all ⟨e, HCASE⟩ ∈ Γ do
 5:     for all s ∈ I do
 6:       if L(s) |= ecs(e) then
 7:         for all ⟨α, H⟩ ∈ HCASE do
 8:           if L(s) |= α then
 9:             AUX = ST(s, H(0), S)
10:             if AUX ≠ ∅ then
11:               TR = TR ∪ AUX
12:               for i = 1, |H| do
13:                 AUX = FT(H(i − 1), H(i), S)
14:                 if AUX ≠ ∅ then
15:                   TR = TR ∪ AUX
16:                 else
17:                   break
18:                 end if
19:               end for
20:               break
21:             else
22:               break
23:             end if
24:           end if
25:         end for
26:       end if
27:     end for
28:   end for
29:   if TR ≠ ∅ then
30:     for all s ∈ S do
31:       if ∄ t ∈ S, (s, t) ∈ TR then
32:         TR = TR ∪ {(s, s)}
33:       end if
34:     end for
35:   end if
36:   return TR
37: end function
E. Atividade de Especificação
A atividade de especificação consiste na determinação de
propriedades sobre o comportamento do TESC com o intuito
de encontrar faltas de projeto. Neste trabalho foram catalogadas 3 propriedades comportamentais que, se violadas, indicam
a existência de faltas de projeto no TESC; são elas: progresso
de detecção, progresso de captura e progresso de tratador.
Cada uma dessas propriedades é apresentada a seguir.
1) Progresso de Detecção: Essa propriedade determina que
para cada estado da estrutura de Kripke do contexto, deve
existir pelo menos um estado onde cada exceção contextual é
detectada. A violação dessa propriedade indica a existência
de exceções contextuais que não são detectadas. Esse tipo
de falta de projeto é denominada de exceção morta. Essa
propriedade deve ser verificada para cada uma das exceções
contextuais modeladas no sistema. Desse modo, seja e uma
exceção contextual, a fórmula (3), escrita em CTL, especifica
formalmente essa propriedade.
EF(ecs(e))   (3)
2) Progresso de Captura: Essa propriedade estabelece que
para cada exceção de contexto levantada, deve existir, pelo
menos, um caso de tratamento habilitado a capturar aquela
exceção. A violação dessa propriedade indica que existem
estados do contexto onde exceções contextuais são levantadas,
mas não podem ser capturadas e, consequentemente, tratadas.
Esse tipo de falta de projeto é denominada de tratamento
nulo. É importante observar que, mesmo existindo situações
de contexto onde o sistema não pode tratar aquela exceção, é
importante que o projetista esteja ciente de que esse fenômeno
ocorre no seu modelo. Sendo assim, seja ⟨e, HCASE⟩ um
escopo de tratamento com ⟨α_0, H_0⟩, ⟨α_1, H_1⟩, . . . , ⟨α_n, H_n⟩
∈ HCASE casos de tratamento, a fórmula (4), escrita em CTL,
especifica formalmente essa propriedade.
EF( ecs(e) ∧ ⋁_{0 ≤ i < |HCASE|} α_i )   (4)
3) Progresso de Tratador: Essa propriedade determina que
para cada estado do contexto onde uma exceção contextual é
levantada, deve existir pelo menos um destes estados onde
cada caso de tratamento é selecionado para tratar aquela
exceção. A violação dessa propriedade indica que existem
casos de tratamento, definidos em um escopo de tratamento
de uma exceção contextual particular, que nunca serão selecionados. Esse tipo de falta de projeto é denominada de
tratador morto. Desse modo, seja ⟨e, HCASE⟩ um escopo de
tratamento com ⟨α_0, H_0⟩, ⟨α_1, H_1⟩, . . . , ⟨α_n, H_n⟩ ∈ HCASE
casos de tratamento, a fórmula (5), escrita em CTL, especifica
formalmente essa propriedade.
⋀_{0 ≤ i < |HCASE|} EF( ecs(e) ∧ α_i )   (5)
V. AVALIAÇÃO
Nesta seção é feita uma avaliação do método proposto.
Na Seção V-A é apresentada a ferramenta desenvolvida de
suporte ao método. A Seção V-B descreve o sistema exemplo
utilizado na avaliação. Na Seção V-C o projeto do TESC de
duas exceções do sistema exemplo é detalhado. Por fim, na
Seção V-D os cenários de injeção de faltas são descritos e um
sumário dos resultados é oferecido na Seção V-E.
A. A Ferramenta
A ferramenta2 foi implementada na plataforma Java e provê
uma API para que o projetista especifique as proposições de
contexto (Seção IV-B), as restrições semânticas (Seção IV-B),
as exceções contextuais (Seção IV-C1), os casos de tratamento
(Seção IV-C2) e os escopos de tratamento (Seção IV-C3).
Essas especificações são enviadas ao módulo conversor que
gera os estados do contexto e constrói o modelo de comportamento do TESC e o conjunto de propriedades descritas
pelo método. É importante mencionar que o projetista pode
informar propriedades adicionais a serem verificadas sobre
o modelo, além daquelas já predefinidas por nosso método
(Seção IV-E). De posse do modelo de comportamento e das
propriedades, a ferramenta submete estas entradas ao módulo
de verificação de modelos, o qual executa o processo de
verificação e gera um relatório de saída contendo os resultados
da verificação. Para fazer a geração dos estados do contexto, a
ferramenta faz uso do Choco Solver3 , uma das implementações
de referência da JSR 331: Constraint Programming API 4 . Já
no processo de verificação, a ferramenta utiliza o verificador de
modelos MCiE desenvolvido no projeto The Model Checking
in Education5 . Esse verificador foi escolhido, principalmente,
pelo fato de ser implementado na plataforma Java, o que
facilitou a sua integração com a ferramenta desenvolvida.
B. A Aplicação UbiParking
O UbiParking é uma aplicação baseada no conceito de “Cidades Ubíquas”. A ideia por trás desse conceito é o provimento
de serviços ubíquos relacionados com o cotidiano das cidades
e das pessoas que nelas habitam, com o propósito de melhorar
a convivência urbana sob diversos aspectos, tais como trânsito,
segurança e atendimento aos cidadãos. O objetivo do UbiParking
é auxiliar motoristas na atividade de estacionar seus veículos.
Nesse sentido, o UbiParking disponibiliza um mapa plotado
com todas as vagas de estacionamento disponíveis por região.
Este mapa de vagas livres é atualizado com base em informações coletadas por meio de sensores implantados nos acostamentos das vias e nos estacionamentos públicos. Os sensores
detectam quando uma vaga de estacionamento está ocupada ou
livre, enviando esta informação para o sistema. Desse modo,
utilizando o UbiParking em seus dispositivos móveis ou no
computador de bordo dos seus veículos, os cidadãos podem
obter informações sobre a distribuição das vagas livres por
região, podendo reservar uma vaga e solicitar ao sistema uma
rota mais apropriada com base em algum critério de sua preferência (e.g., menor distância, maior número de vagas livres
ou menor preço). Chegando ao estacionamento escolhido, o
UbiParking conduz o motorista até a vaga reservada ou à
vaga livre mais próxima, considerando os casos onde a vaga
reservada é ocupada de forma imprevisível por outro veículo.
Do mesmo modo, quando o motorista retorna ao seu veículo,
o UbiParking o conduz até a saída mais próxima, poupando-lhe tempo. Os estacionamentos do UbiParking possuem uma
disposição espacial composta por entradas, pátio de vagas e
saídas. Além disso, o estacionamento ubíquo é equipado com
sensores de temperatura, detectores de fumaça e aspersores
controlados automaticamente, para o caso de incêndio.
C. Projeto do TESC para o UbiParking
Nesta seção é descrita a utilização do método no projeto do
TESC de duas exceções contextuais da aplicação UbiParking,
a exceção de incêndio e a exceção de vaga indisponível. A
exceção de incêndio modela uma condição de incêndio dentro
2 http://www.great.ufc.br/~lincoln/JCAEHV/JCAEHV.zip
3 http://www.emn.fr/z-info/choco-solver
4 http://jcp.org/en/jsr/detail?id=331
5 http://www2.cs.uni-paderborn.de/cs/kindler/Lehre/MCiE/
do estacionamento. Por meio das informações de contexto coletadas pelos sensores de fumaça e temperatura, o UbiParking
consegue detectar a ocorrência desse tipo de exceção contextual dentro do estacionamento. Para lidar com essa exceção, o
sistema aciona os aspersores e conduz os motoristas até o lado
de fora do estacionamento. Já a exceção de vaga indisponível,
modela uma situação em que o veículo está em movimento
dentro do pátio de vagas indo em direção à sua vaga reservada.
Porém, outro veículo ocupa aquela vaga. Nesse caso, se a
vaga for a última vaga livre disponível no estacionamento,
fica caracterizada a situação de anormalidade. Essa exceção
contextual é detectada pelo sistema através de informações de
contexto sobre as reservas de vagas, a localização do veículo e
os dados que vêm dos sensores de detecção de vaga ocupada.
Como forma de tratar essa exceção contextual, o UbiParking
conduz o veículo até o lado de fora do estacionamento, onde
outra vaga livre em outro estacionamento pode ser reservada.
Com base nesses dois cenários de exceção, as proposições
descritas na Tabela II foram estabelecidas.
Tabela II
PROPOSIÇÕES CONTEXTUAIS DO UbiParking.
Proposição         Significado
inMovement         “O veículo está em movimento?”
atParkEntrance     “O veículo está na entrada do estacionamento?”
atParkPlace        “O veículo está no pátio de vagas do estacionamento?”
atParkExit         “O veículo está na saída do estacionamento?”
hasSpace           “Há vaga livre no estacionamento?”
isHot              “Está quente no estacionamento?”
hasSmoke           “Há fumaça no estacionamento?”
isSprinklerOn      “Os aspersores estão ligados?”
Perceba que as proposições contextuais atParkPlace,
atParkExit e atParkEntrance (Tabela II) possuem uma
relação semântica particular. No UbiParking, o veículo, do
ponto de vista espaço-temporal, só pode estar fora ou dentro
do estacionamento em um dado instante. Caso ele esteja
fora do estacionamento, as três proposições devem assumir
valor verdade falso. Por outro lado, se o veículo estiver no
estacionamento, ele só poderá estar em um dos seguintes
lugares: na entrada, no pátio de vagas ou na saída do estacionamento, mas não em mais de um local simultaneamente.
Esse tipo de relação semântica entre proposições contextuais
deve ser levado em consideração no momento da modelagem.
Desse modo, a seguinte restrição deve ser derivada durante a
modelagem do UbiParking para garantir a consistência semântica: (atParkEntrance ⊕ atParkPlace ⊕ atParkExit) ∨
(¬atParkEntrance ∧ ¬atParkPlace ∧ ¬atParkExit).
A exceção contextual de incêndio é descrita pela tupla:
⟨“Fire”, hasSmoke ∧ isHot⟩. Dois casos de tratamento foram identificados para tratar essa exceção contextual em
função do contexto do veículo. Para a situação de contexto
em que o veículo encontra-se na entrada do estacionamento,
o seguinte caso de tratamento pode ser formulado: hcase_0^r =
⟨α_0^r, H_0^r⟩, onde α_0^r = inMovement ∧ atParkEntrance e H_0^r =
{isSprinklerOn ∧ (¬atParkEntrance ∧ ¬atParkPlace
∧ ¬atParkExit)}. O caso de tratamento hcase_0^r é selecionado quando o veículo encontra-se entrando no estacionamento. Dessa forma, se ele é selecionado, o efeito esperado
após a execução do tratamento (H_0^r) é que o sistema atinja
um estado em que os aspersores estejam ligados e o veículo
esteja fora do estacionamento.
Por outro lado, na situação em que o veículo encontra-se
dentro do pátio de vagas do estacionamento, outro caso de
tratamento pode ser derivado: hcase_1^r = ⟨α_1^r, H_1^r⟩, onde α_1^r
= inMovement ∧ atParkPlace e H_1^r = {isSprinklerOn
∧ atParkExit, isSprinklerOn ∧ (¬atParkEntrance ∧
¬atParkPlace ∧ ¬atParkExit)}. No hcase_1^r, o veículo
encontra-se em movimento dentro do pátio de vagas do
estacionamento. Nesse caso de tratamento, espera-se que duas medidas
de tratamento ocorram de forma sequencial
(H_1^r). A primeira consiste em levar o sistema a um estado
em que o veículo esteja na saída do estacionamento e os
aspersores encontrem-se ligados. Já a segunda consiste em
levar o sistema a um estado no qual os aspersores continuem ligados e o veículo encontre-se fora do estacionamento. O escopo de tratamento dessa exceção é dado por:
⟨“Fire”, {hcase_0^r, hcase_1^r}⟩.
A exceção contextual de vaga indisponível é descrita
pela tupla: ⟨“NoFreeSpace”, inMovement ∧ atParkPlace
∧ (¬hasSpace)⟩. Para essa exceção, apenas um caso de
tratamento foi definido: hcase_0^n = ⟨α_0^n, H_0^n⟩, onde α_0^n =
inMovement ∧ atParkPlace e H_0^n = {inMovement ∧
atParkExit}. A condição desse caso de tratamento estabelece que ele só é selecionado se o veículo estiver em
movimento dentro do pátio de vagas. A medida de tratamento
associada a esse caso de tratamento define que, após o tratamento, o veículo deve encontrar-se em movimento na saída
do estacionamento. O escopo de tratamento dessa exceção é
dado por: ⟨“NoFreeSpace”, {hcase_0^n}⟩.
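A título de ilustração (exemplo nosso, derivado diretamente das definições acima), a instanciação das propriedades da Seção IV-E para o escopo da exceção “Fire” produz as seguintes fórmulas CTL:
Progresso de detecção (3): EF(hasSmoke ∧ isHot).
Progresso de captura (4): EF((hasSmoke ∧ isHot) ∧ ((inMovement ∧ atParkEntrance) ∨ (inMovement ∧ atParkPlace))).
Progresso de tratador (5): EF((hasSmoke ∧ isHot) ∧ (inMovement ∧ atParkEntrance)) ∧ EF((hasSmoke ∧ isHot) ∧ (inMovement ∧ atParkPlace)).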
D. Cenários de Injeção de Faltas
A injeção de faltas (fault injection) é uma técnica empregada na avaliação da confiabilidade de sistemas computacionais [21]. Ela consiste na inserção controlada de faltas em
um modelo ou sistema computacional com o propósito de
avaliar aspectos de robustez e dependabilidade. Essa técnica
foi utilizada neste trabalho como forma de avaliar a eficácia do
método proposto. Para isso, o projeto do TESC das exceções
de incêndio e vaga indisponível (Seção V-C) do UbiParking foi
modelado utilizando a ferramenta de suporte ao método. Esse
projeto foi submetido ao verificador de modelos da ferramenta
e nenhuma das faltas de projeto estabelecidas pelo método
foi encontrada (i.e., exceção morta, tratamento nulo e tratador
morto), portanto, trata-se de um modelo correto. A partir desse
modelo correto, para cada propriedade que se deseja verificar
(i.e., progresso de detecção, progresso de captura e progresso
de tratador), foi feita uma alteração deliberada no modelo com
o propósito de violá-la. Essas alterações representam faltas de
projeto similares àquelas descritas na Seção II-B, as quais os
projetistas estão sujeitos a cometer.
1) Cenário 1: Violando o Progresso de Detecção: Essa
propriedade é violada quando não existe, pelo menos, um
estado de contexto do sistema onde a exceção em questão pode
ser detectada. Isso pode ocorrer quando o projetista: (i) insere
uma contradição na especificação do contexto excepcional;
ou (ii) a especificação representa uma situação de contexto
que nunca ocorrerá em tempo de execução. Embora essas
duas faltas de projeto sejam diferentes do ponto de vista de
significado, elas representam a mesma situação para o modelo
de comportamento do TESC: uma expressão que não pode
ser satisfeita dentro do modelo do sistema. Dessa forma, para
injetar uma falta de projeto que viole essa propriedade é
suficiente garantir que a especificação do contexto excepcional
seja insatisfazível no modelo do TESC, independente de ser
provocada por uma falta do tipo (i) ou (ii). Neste trabalho,
optou-se por utilizar faltas do tipo (i). Desse modo, as contradições foram construídas a partir da conjunção da especificação
do contexto excepcional e a sua negação.
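Por exemplo (ilustração nossa), para a exceção “Fire”, cuja especificação de contexto excepcional é hasSmoke ∧ isHot, a falta injetada corresponde a substituí-la por (hasSmoke ∧ isHot) ∧ ¬(hasSmoke ∧ isHot), que é insatisfazível em qualquer estado do modelo.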
2) Cenário 2: Violando o Progresso de Captura: Essa
propriedade é violada quando não é possível selecionar pelo
menos um caso de tratamento quando uma exceção é detectada. Isso pode ocorrer quando a condição de seleção dos
casos de tratamento representa: (i) uma contradição; (ii)
uma situação de contexto que nunca ocorre; ou (iii) uma
contradição entre a condição de seleção e a especificação do
contexto excepcional da exceção contextual associada. Embora
essas faltas de projeto sejam diferentes do ponto de vista de
significado, elas representam a mesma situação para o modelo
de comportamento do TESC: uma expressão que não pode ser
satisfeita dentro do modelo do sistema ou quando uma exceção
contextual é detectada. Assim, para injetar uma falta de projeto
que viole essa propriedade é suficiente garantir que a condição
de seleção dos casos de tratamento seja insatisfazível no
modelo do TESC, independente de ser provocada por uma
falta do tipo (i), (ii) ou (iii). Neste trabalho, foram utilizadas faltas
do tipo (i), construídas a partir da conjunção de cada condição
de seleção dos casos de tratamento e a sua negação.
3) Cenário 3: Violando o Progresso de Tratador: Essa
propriedade é violada quando existe pelo menos um caso
de tratamento que nunca é selecionado quando uma exceção é
detectada. As situações onde isso pode ocorrer são exatamente
as mesmas descritas para a propriedade do Cenário V-D2. A
diferença é que para violar a propriedade de progresso de
tratador basta que apenas um caso de tratamento seja mal
projetado (i.e., contenha uma falta de projeto), enquanto que
para violar a propriedade de progresso de captura, descrita no
Cenário V-D2, existe a necessidade de que todos os casos de
tratamento sejam mal projetados. Desse modo, optou-se por
utilizar o mesmo tipo de falta de projeto do Cenário V-D2.
E. Sumário dos Resultados
Cada cenário foi executado individualmente e foram considerados 3 (três) tipos de permutações, denominados de
rodadas: (i) a injeção de falta apenas na exceção “Fire”; (ii)
a injeção de falta apenas na exceção “NoFreeSpace”; e (iii)
a injeção de falta em ambas as exceções de forma simultânea.
Na primeira rodada do Cenário V-D1, como esperado, a falta
injetada foi detectada através da identificação de uma falta
de projeto de exceção morta no projeto da exceção de incêndio. Além dessa, outras faltas de projeto foram detectadas:
tratamento nulo e tratador morto. O fato dessas outras duas
faltas serem detectadas no projeto da exceção de incêndio
é compreensível, uma vez que só se pode selecionar um
caso de tratamento para tratar uma exceção quando esta é
detectada. O mesmo resultado ocorreu na segunda rodada
do Cenário V-D1, porém, com respeito a exceção de vaga
indisponível. Por outro lado, na terceira rodada, nenhuma falta
de projeto foi identificada. Nessa rodada, como as faltas foram
inseridas em ambas as exceções, nenhum estado excepcional foi
gerado; consequentemente, o modelo de comportamento não pôde
ser derivado e a sua verificação não pôde ser conduzida.
Na primeira rodada do Cenário V-D2, a falta injetada foi
detectada através da identificação das faltas de projeto de
tratamento nulo e tratador morto. Nenhuma falta de exceção
morta foi identificada, uma vez que exceções foram detectadas
no modelo. Na segunda rodada do Cenário V-D2, o mesmo
resultado foi encontrado para a exceção de vaga indisponível.
Por fim, na terceira rodada, como esperado, um par de faltas
de projeto de tratamento nulo e tratador morto foi identificado
para cada exceção contextual. Com respeito ao Cenário V-D3,
na primeira rodada, como esperado, a falta injetada foi detectada
através da identificação da falta de projeto de tratador morto.
Na segunda rodada do Cenário V-D3, o mesmo resultado foi
encontrado para a exceção de vaga indisponível. Por fim,
na terceira rodada, como esperado, uma falta de projeto de
tratador morto foi identificada para cada exceção contextual.
VI. T RABALHOS R ELACIONADOS
No escopo da revisão bibliográfica realizada, não foram
encontrados trabalhos que abordam a mesma problemática
endereçada neste artigo. Porém, os trabalhos [22][16][11][17]
possuem uma relação próxima à solução proposta neste artigo.
Particularmente, [16][11][17] estão relacionados à descoberta
de faltas de projeto no mecanismo de adaptação de sistemas
ubíquos sensíveis ao contexto. Essa problemática consiste na
má especificação das regras de adaptação em tempo de projeto.
Essas regras são compostas por uma condição de guarda (antecedente) e um conjunto de ações associadas (consequente).
A condição de guarda descreve situações de contexto às quais
o sistema deve reagir. Já as ações, caracterizam a reação do
sistema ao contexto detectado. Dessa forma, a especificação
errônea das condições de guarda pode levar o sistema a uma
configuração imprópria e, posteriormente, a uma falha.
Em [22] é proposta uma forma de especificar a semântica
do comportamento adaptativo por meio de fórmulas lógicas
temporais, entretanto, não provê suporte ferramental para
a verificação de propriedades. Já os trabalhos [16][11][17],
buscam representar o comportamento adaptativo por meio de
algum formalismo baseado em estados e transições. De posse
desse modelo, técnicas formais de análise (e.g., algoritmos
simbólicos e verificadores de modelos) são empregadas com
o intuito de identificar faltas de projeto. Em [16] o foco é
dado ao domínio de aplicações sensíveis ao contexto formadas
por composições de Web Services. Nesse trabalho o objetivo é
encontrar inconsistências na composição dos serviços e nas
suas interações. Para isso, é proposto um mapeamento da
especificação da aplicação baseada em BPEL para um modelo
formal utilizado para fazer as análises e verificações, chamado
de CA-STS (Context-Aware Symbolic Transition System). Por
outro lado, [11] busca identificar problemas específicos de má
especificação das regras de adaptação. Para isso, eles propõem
um formalismo baseado em máquina de estados, chamado A-FSM (Adaptation Finite-State Machine). Esse formalismo é
usado para modelar o comportamento adaptativo e servir como
base para a verificação de propriedades e detecção de faltas de
projeto. Em [17] é feita uma extensão de [11], onde é proposto
um método para melhorar a efetividade da A-FSM por meio
de técnicas de programação por restrições, mineração de dados
e casamento de padrões. Entretanto, é importante mencionar
que todos os trabalhos, exceto [22], são limitados com relação
ao tipo de propriedades a serem verificadas. Por proporem
seus próprios formalismos e implementarem ferramentas que
analisam apenas um conjunto particular de propriedades, a sua
extensão acaba sendo limitada, diferentemente do método proposto,
que permite que novas propriedades sejam incorporadas.
VII. C ONCLUSÕES E T RABALHOS F UTUROS
Este trabalho apresentou um método para a verificação
do projeto do TESC. As abstrações do método permitem
que aspectos importantes do comportamento do TESC sejam
modelados e mapeados para uma estrutura de Kripke, permitindo que
sejam analisados por um verificador de modelos. Adicionalmente,
um conjunto de propriedades que capturam a semântica de
determinados tipos de faltas de projeto foi formalmente especificado e apresentado como forma de auxiliar os projetistas
na identificação de faltas de projeto. Além disso, a ferramenta
de suporte e os cenários de injeção de faltas, utilizados para
avaliar o método, apresentam resultados que demonstram a
viabilidade da proposta. Como trabalhos futuros, pretende-se tratar questões relacionadas ao tratamento de exceções
concorrentes no modelo com a definição de uma função de
resolução que permita selecionar as medidas de tratamento
mais adequadas face ao conjunto de exceções levantadas. Além
disso, outro direcionamento para trabalhos futuros consiste
na extensão do modelo para que seja possível representar
a composição dos comportamentos adaptativo e excepcional,
com o objetivo de analisar a influência de um sobre o outro.
Por fim, outra linha a ser investigada é a criação de uma DSL
para o projeto do TESC para que um experimento envolvendo
usuários possa ser conduzido.
R EFERÊNCIAS
[1] S. W. Loke, “Building taskable spaces over ubiquitous services,” IEEE
Pervasive Computing, vol. 8, no. 4, pp. 72–78, oct.-dec. 2009.
[2] A. K. Dey, “Understanding and using context,” Personal Ubiquitous
Computing, vol. 5, no. 1, pp. 4–7, 2001.
[3] D. Kulkarni and A. Tripathi, “A framework for programming robust
context-aware applications,” IEEE Trans. Softw. Eng., vol. 36, no. 2, pp.
184–197, 2010.
[4] K. Damasceno, N. Cacho, A. Garcia, A. Romanovsky, and C. Lucena,
“Context-aware exception handling in mobile agent systems: The moca
case,” in Proceedings of the 2006 international workshop on Software
Engineering for Large-Scale Multi-Agent Systems, ser. SELMAS’06.
New York, NY, USA: ACM, 2006, pp. 37–44.
[5] J. Mercadal, Q. Enard, C. Consel, and N. Loriant, “A domain-specific
approach to architecturing error handling in pervasive computing,” in
Proceedings of the ACM international conference on Object oriented
programming systems languages and applications, ser. OOPSLA ’10.
New York, NY, USA: ACM, 2010, pp. 47–61.
[6] D. M. Beder and R. B. de Araujo, “Towards the definition of a
context-aware exception handling mechanism,” in Fifth Latin-American
Symposium on Dependable Computing Workshops, 2011, pp. 25–28.
[7] L. Rocha and R. Andrade, “Towards a formal model to reason about
context-aware exception handling,” in 5th International Workshop on
Exception Handling (WEH) at ICSE’2012, 2012, pp. 27–33.
[8] E.-S. Cho and S. Helal, “Toward efficient detection of semantic exceptions in context-aware systems,” in 9th International Conference on
Ubiquitous Intelligence Computing and 9th International Conference on
Autonomic Trusted Computing (UIC/ATC), sept. 2012, pp. 826 –831.
[9] J. Whittle, P. Sawyer, N. Bencomo, B. H. C. Cheng, and J.-M. Bruel,
“Relax: Incorporating uncertainty into the specification of self-adaptive
systems,” in Proceedings of the 2009 17th IEEE International Requirements Engineering Conference, RE, ser. RE ’09. Washington, DC,
USA: IEEE Computer Society, 2009, pp. 79–88.
[10] D. Cassou, B. Bertran, N. Loriant, and C. Consel, “A generative
programming approach to developing pervasive computing systems,”
in Proceedings of the 8th International Conference on Generative
Programming and Component Engineering, ser. GPCE’09. ACM, 2009,
pp. 137–146.
[11] M. Sama, S. Elbaum, F. Raimondi, D. Rosenblum, and Z. Wang,
“Context-aware adaptive applications: Fault patterns and their automated
identification,” IEEE Trans. Softw. Eng., vol. 36, no. 5, pp. 644–661,
2010.
[12] C. Bettini, O. Brdiczka, K. Henricksen, J. Indulska, D. Nicklas, A. Ranganathan, and D. Riboni, “A survey of context modelling and reasoning
techniques,” Pervasive Mob. Comput., vol. 6, pp. 161–180, April 2010.
[13] A. Coronato and G. De Pietro, “Formal specification and verification of
ubiquitous and pervasive systems,” ACM Transactions on Autonomous
and Adaptive Systems, vol. 6, no. 1, pp. 9:1–9:6, Feb. 2011.
[14] F. Siewe, H. Zedan, and A. Cau, “The calculus of context-aware
ambients,” J. Comput. Syst. Sci., vol. 77, pp. 597–620, Jul. 2011.
[15] P. Zhang and S. Elbaum, “Amplifying tests to validate exception
handling code,” in Proceedings of the 2012 International Conference
on Software Engineering, ser. ICSE 2012. Piscataway, NJ, USA: IEEE
Press, 2012, pp. 595–605.
[16] J. Cubo, M. Sama, F. Raimondi, and D. Rosenblum, “A model to design
and verify context-aware adaptive service composition,” in Proceedings
of the 2009 IEEE International Conference on Services Computing, ser.
SCC ’09. Washington, DC, USA: IEEE, 2009, pp. 184–191.
[17] Y. Liu, C. Xu, and S. C. Cheung, “Afchecker: Effective model checking for context-aware adaptive applications,” Journal of Systems and
Software, vol. 86, no. 3, pp. 854–867, 2013.
[18] E. M. Clarke, Jr., O. Grumberg, and D. A. Peled, Model Checking.
Cambridge, MA, USA: MIT Press, 1999.
[19] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, “Basic concepts
and taxonomy of dependable and secure computing,” IEEE Transactions
on Dependable and Secure Computing, vol. 1, no. 1, pp. 11–33, 2004.
[20] P. Van Hentenryck and V. Saraswat, “Strategic directions in constraint
programming,” ACM Comput. Surv., vol. 28, no. 4, pp. 701–726, 1996.
[21] J. Ezekiel and A. Lomuscio, “Combining fault injection and model checking to verify fault tolerance in multi-agent systems,” in Proceedings of
The 8th International Conference on Autonomous Agents and Multiagent
Systems - Volume 1, Richland, SC, 2009, pp. 113–120.
[22] J. Zhang and B. Cheng, “Using temporal logic to specify adaptive
program semantics,” J. Syst. Software, vol. 79, no. 10, pp. 1361–1369,
2006.
Prioritization of Code Anomalies based on
Architecture Sensitiveness
Roberta Arcoverde, Everton Guimarães, Isela Macía,
Alessandro Garcia
Informatics Department, PUC-Rio
Rio de Janeiro, Brazil
{rarcoverde, eguimaraes, ibertran, afgarcia}@inf.puc-rio.br
Yuanfang Cai
Department of Computer Science
Drexel University
Philadelphia, USA
[email protected]
Abstract— Code anomalies are symptoms of software maintainability problems, particularly harmful when contributing to architectural degradation. Despite the existence
of many automated techniques for code anomaly detection,
identifying the code anomalies that are more likely to cause
architecture problems remains a challenging task. Even when
there is tool support for detecting code anomalies, developers
often invest a considerable amount of time refactoring those that
are not related to architectural problems. In this paper we
present and evaluate four different heuristics for helping
developers to prioritize code anomalies, based on their potential
contribution to the software architecture degradation. Those
heuristics exploit different characteristics of a software project,
such as change-density and error-density, for automatically
ranking code elements that should be refactored more promptly
according to their potential architectural relevance. Our
evaluation revealed that software maintainers could benefit from
the recommended rankings for identifying which code anomalies
are harming architecture the most, helping them to invest their
refactoring efforts into solving architecturally relevant problems.
Keywords— Code anomalies, Architecture degradation and
Refactoring.
I. INTRODUCTION
Code anomalies, commonly referred to as code smells [9],
are symptoms in the source code that may indicate deeper
maintainability problems. The presence of code anomalies
often represents structural problems, which make code harder
to read and maintain. Those anomalies can be even more
harmful when they impact negatively on the software
architecture design [4]. When that happens, we call those
anomalies architecturally relevant, as they represent symptoms
of architecture problems. Moreover, previous studies [14][35]
have confirmed that the progressive manifestation of code
anomalies is a key symptom of architecture degradation [14].
The term architecture degradation is used to refer to the
continuous quality decay of architecture design when evolving
software systems. Thus, as the software architecture degrades,
the maintainability of software systems can be compromised
irreversibly. As examples of architectural problems, we can
mention Ambiguous Interface and Component Envy [11], as
well as cyclic dependencies between software modules [27].
In order to prevent architecture degradation, software
development teams should progressively improve the system
maintainability by detecting and removing architecturally
relevant code anomalies [13][36]. Such improvement is
commonly achieved through refactoring [6][13], a widely
adopted practice [36] with well-known benefits [29]. However,
developers often focus on removing, or prioritizing, a
limited subset of anomalies that affect their projects [1][16].
Furthermore, most of the remaining anomalies are
architecturally relevant [20]. Thus, when it is not possible to
distinguish which code anomalies are architecturally relevant,
developers can spend more time fixing problems that are not
harmful to the architecture design. This problem occurs even in
situations where refactoring needs to be applied in order to
improve the adherence of the source code to the intended
architecture [1][19][20].
Several code analysis techniques have been proposed for
automatically detecting code anomalies [18][25][28][32].
However, none of them help developers to prioritize anomalies
with respect to their architectural relevance, as they present the
following limitations: first, most of these techniques focus on
the extraction and combination of static code measures. The
analysis of the source code structure alone is often not enough
to reveal whether an anomalous code element is related to the
architecture decomposition [19][20]. Second, they do not
provide means to support the prioritization or ranking of code
anomalies. Finally, most of them disregard: (i) the exploitation
of software project factors (i.e. frequency of changes and
number of errors) that may be an indicator of the architectural
relevance of a module, and (ii) the role that code elements play
in the architectural design.
In this context, this paper proposes four prioritization
heuristics to help identifying and ranking architecturally
relevant code anomalies. Moreover, we assessed the accuracy
of the proposed heuristics when ranking code anomalies based
on their architecture relevance. The assessment was carried out
in the context of four target systems from heterogeneous
domains, developed by different teams using different
programming languages. Our results show that the proposed
heuristics were able to accurately prioritize the most relevant
code anomalies of the target systems, mainly for scenarios
where: (i) there were architecture problems involving groups of
classes that changed together; (ii) changes were not
predominantly perfective; (iii) there were code elements
infected by multiple anomalies; and (iv) the architecture roles
are well-defined and have distinct architectural relevance.
The remainder of this paper is organized as follows. Section
II introduces the concepts involved in this work, as well as the
related work. Section III introduces the study settings. Section
IV describes the prioritization heuristics proposed in this paper.
Section V presents the evaluation of the proposed heuristics,
and Section VI the evaluation results. Finally, Section VII
presents the threats to validity, while Section VIII discusses the
final remarks and future work.
II. BACKGROUND AND RELATED WORK
This section introduces basic concepts related to software
architecture degradation and code anomalies (Section II.A). It
also discusses research that investigates the interplay between
code anomalies and architectural problems (Section II.B).
Finally, the section introduces existing ranking systems for
code anomalies (Section II.C).
A. Basic Concepts
One of the causes for architecture degradation [14] on
software projects is the continuous occurrence of code
anomalies. The term code anomaly or code smell [9] is used to
define structures on the source code that usually indicate
maintenance problems. As examples of code anomalies we
can mention Long Methods and God Classes [9]. In this work,
we consider a code anomaly as being architecturally relevant
when it has a negative impact in the system architectural
design. That is, the anomaly is considered relevant when it is
harmful or related to problems in the architecture design.
Therefore, the occurrence of an architecturally relevant code
anomaly can be observed if the anomalous code structure is
directly realizing an architectural element exhibiting an
architecture problem [19-21].
Once a code anomaly is identified, the corresponding code
may suffer some refactoring operations, so the code anomaly
is correctly removed. When those code anomalies are not
correctly detected, prioritized and removed in the early stage
of software development, the ability of these systems to
evolve can be compromised. In some cases, the architecture
has to be completely restructured. For this reason, the
effectiveness of automatically detecting code anomalies using
detection strategies has been studied under different perspectives
[16][18][19][26][31]. However, most techniques and tools
disregard software project factors that might indicate the
relevance of an anomaly in terms of its architecture design,
number of errors and frequency of changes. Moreover, those
techniques do not help developers to distinguish which
anomalous elements are architecturally harmful, since they do not
consider the architectural role that a given code element
plays in the architectural design.
B. Harmful Impact and Detection of Code Anomalies
The negative impact of code anomalies on the system
architecture has been investigated by many studies in the state of the art. For instance, the study developed in [23] reported that
the Mozilla browser code was overly complex and tightly
coupled, therefore hindering its maintainability and ability to
evolve. This problem was the main cause of its complete
reengineering, and developers took about five years to rewrite
over 7 thousand source files and 2 million lines of code [12].
Another study [7] showed how the architecture design of a
large telecommunication system degraded in 7 years.
Particularly, the relationship between the system modules
increased over time. This was the main reason why the
system modules were not independent anymore and, as a
consequence, further changes were not possible. Finally, a
study performed in [14] investigated the main causes for
architecture degradation. As a result, the study indicated that
refactoring specific code anomalies could help to avoid it.
Another study [35] has identified that duplicated code was
related to design violations.
In this sense, several detection strategies have been
proposed in order to provide means for the automatic detection of
code anomalies [18][25][28]. However, most of them are based
on source code information and rely on a combination of
static code metrics and thresholds into logical expressions. This
is the main limitation of those detection strategies, since they
disregard architecture information that could be exploited to
reveal architecturally relevant code anomalies. In addition,
current detection strategies only consider individual
occurrences of code anomalies, instead of analyzing the
relationships between them. Such limitations are the main
reasons why the current detection strategies are not able to
support the detection of code anomalies responsible for
inserting architectural problems [19]. Finally, a recent study
[19] investigated to what extent the architecture sensitive
detection strategies can better identify code anomalies related
to architectural problems [22].
C. Ranking Systems for Code Anomalies
As previously mentioned, many tools and techniques
provide support for automatically detecting code anomalies.
However, the number of anomalies tends to increase as the
system grows and, in some cases, the high number of
anomalies can be unmanageable. Moreover, maintainers are
expected to choose which code anomalies should be
prioritized. Some of the reasons why this is necessary are (i)
time constraints and (ii) attempts to find the correct solution
when restructuring a large system in order to perform refactoring
operations to solve those code anomalies. The problem is that
the existing detection strategies do not focus on ranking or
prioritizing code anomalies. Nevertheless, there are two tools
that provide ranking capabilities for different development
platforms: Code Metrics and InFusion.
The first tool is a .NET based add-in for the Visual Studio
development environment and it is able to calculate a limited
set of metrics. Once the metrics are calculated, the tool assigns
a “maintainability index” score to each of the analyzed code
elements. This score is based on the combination of the
metrics for each code element. The second tool is the
InFusion, which can be used for analyzing Java, C and C++
systems. Moreover, it allows calculating more than 60 metrics.
Besides the statistical analysis for calculating code metrics, it
also provides numerical scores in order to detect the code
anomalies. Those scores provide means to measure the
negative impact of code anomalies in the software system.
When combining the scores, a deficit index is calculated for
the entire system. The index takes into consideration size,
encapsulation, complexity, coupling and cohesion metrics.
However, the main concern of using these tools is that the
techniques they implement have some limitations: (i) usually
they only consider the source code structure as input for
detecting code anomalies; (ii) the ranking system disregards
the architecture role of the code elements; and (iii) the user
cannot define or customize their own criteria for prioritizing
code anomalies. In this sense, our study proposes prioritization
heuristics for ranking and prioritizing code anomalies.
Moreover, our heuristics are not only based on source code
information for detecting code anomalies. They also consider
information about the architectural relevance of the detected code
anomalies. For this, we analyze different properties of the
source code they affect, such as information about changes on
software modules, bugs observed during the system evolution
and the responsibility of each module in the system architecture.
III. STUDY SETTINGS
This section describes our study hypotheses and variables selection, as well as the target systems used to evaluate the accuracy of the proposed heuristics. The main goal of this study is to evaluate whether the proposed heuristics for prioritization of architecturally relevant code anomalies can help developers on the ranking and prioritization process. It is important to emphasize that the analysis of the proposed heuristics is carried out in terms of accuracy. Also, Table I defines our study using the GQM format [34].
TABLE I. STUDY DEFINITION USING GQM FORMAT
GQM (Goal, Question, Metric)
Analyze:                The proposed set of prioritization heuristics.
For the purpose of:     Understanding their accuracy for ranking code anomalies based on their architecture relevance.
With respect to:        Rankings previously defined by developers or maintainers of each analyzed system.
From the viewpoint of:  Researchers, developers and architects.
In the context of:      Four software systems from different domains with different architectural designs.
Our study was basically performed in three phases: (i) as we have proposed prioritization heuristics for ranking code anomalies, in the first phase we performed the detection and classification of code anomalies according to their architecture relevance for each of the target systems. For such detection, we used a semi-automatic process based on strategies and thresholds, which has been broadly used on previous studies [2][11][16][20]; (ii) in the second phase, we applied the proposed heuristics and computed their scores for each detected code anomaly. The output of this phase is an ordered list with the high-priority anomalies; finally, (iii) in the third phase, we compared the heuristics results with rankings previously defined by developers or maintainers of each target system. The ranking lists provided by developers represent the "ground truth" data in our analysis, and were produced manually.
A. Hypotheses
In this section, we describe the study hypotheses in order to test the accuracy of the proposed heuristics for ranking code anomalies based on their relevance. First, we have defined some thresholds of what we consider as an acceptable accuracy: (i) low accuracy, 0-40%; (ii) acceptable accuracy, 40-80%; and (iii) high accuracy, 80-100%. These thresholds are based on the ranges defined in [37], where the values are applied in statistical tests (e.g. Pearson's correlation). We adapted these values in order to better evaluate our heuristics, since we are only interested in values that indicate a high correlation.
Moreover, we analyzed the three levels of accuracy in order to investigate to what extent the prioritization heuristics would be helpful. For instance, a heuristic with an accuracy level of 50% means the ranking produced by the heuristic should be able to identify at least half of the architecturally relevant code anomalies. In order to test the accuracy of the prioritization heuristics, we have defined 4 hypotheses (see Table II).
TABLE II. STUDY HYPOTHESES
    Hypothesis  Description
H1  H1.0        The change-density heuristic cannot accurately identify architecturally relevant code anomalies ranked as top ten.
    H1.1        The change-density heuristic can accurately identify architecturally relevant code anomalies ranked as top ten.
H2  H2.0        The error-density heuristic cannot accurately identify architecturally relevant code anomalies ranked as top ten.
    H2.1        The error-density heuristic can accurately identify architecturally relevant code anomalies ranked as top ten.
H3  H3.0        The anomaly density heuristic cannot accurately identify architecturally relevant code anomalies ranked as top ten.
    H3.1        The anomaly density heuristic can accurately identify architecturally relevant code anomalies ranked as top ten.
H4  H4.0        The architecture role heuristic cannot accurately identify architecturally relevant code anomalies ranked as top ten.
    H4.1        The architecture role heuristic can accurately identify architecturally relevant code anomalies ranked as top ten.
B. Target Systems
In order to test the study hypotheses, we selected 4 target
systems from different domains: (i) MIDAS [24], a lightweight
middleware for distributed sensors application; (ii) Health
Watcher [13], a web system used for registering complaints
about health issues in public institutions; (iii) PDP, a web
application for managing scenographic sets in television
productions; and (iv) Mobile Media [6], a software product line
that manages different types of media in mobile devices. All
the selected target systems have been previously analyzed in
other studies that address problems such as architectural
degradation and refactoring [11][20].
The target systems were selected based on 4 criteria: (i) the
availability of either architecture specification or original
developers. The architectural information is essential to the
application of the architecture role heuristic, which directly
depends on architecture information to compute the ranks of
code anomalies; (ii) availability of the source version control
systems of the selected applications; the information from the
version control system provides input for the change-density
heuristic; (iii) availability of an issue tracking system.
Although this is not a mandatory criterion, it is highly
recommended for providing input for the error-density
heuristic; and (iv) the applications should present different
design and architectural structures. This restriction allows us to
better understand the impact of the proposed heuristics for a
diverse set of code anomalies, emerging from different
architectural designs.
IV. PRIORITIZATION OF CODE ANOMALIES
In this section, we describe the 4 prioritization heuristics
proposed in this work. These heuristics are intentionally
simple, in order to be feasible for most software projects. Their
main goal is to help developers identify and rank
architecturally relevant code anomalies.
A. Change Density Heuristic
This heuristic is based on the idea that anomalies infecting
unstable code elements are more likely to be architecturally
relevant. An unstable element can be defined as a code
element that suffers from multiple changes during the system
evolution [15]. In some cases, for instance, changes occur in
cascade and affect the same code elements. Those cases are a
sign that such changes are violating the "open-closed
principle", which according to [27] is the main principle for
the preservation of architecture throughout the system
evolution. In this sense, the change-density heuristic calculates
the ranking results based on the number of changes performed
on the anomalous code element. The change-density heuristic
is defined as follows: given a code element c, the heuristic looks
for every revision in the software evolution path in which c was
modified; the number of such revisions represents the number of
changes performed on the element. Thus, the higher the number
of changes, the higher the element's priority.
The only input required for this heuristic is the set of changes
that occurred during the system evolution. This change set is
composed of the list of existing revisions and the code elements
that were modified in each revision. For this heuristic, we are
only able to calculate the changes performed on an entire file.
With this scoring mechanism, all code anomalies present in the
same file receive the same score. We adopted this strategy in our
heuristics because none of the studied code anomalies emerged
as the best indicator of architectural problems across the systems
[20]. However, it is possible to differentiate between two classes
by ranking the one that has changed most as high-priority. In
order to calculate the score for each code anomaly, the heuristic
assigns to it the number of changes performed on the infected
class. Once the number of changes was computed, we ordered
the list of resources by their respective number of changes, thus
producing our final ranking. This information was obtained by
extracting the change log from the version control system of
each of the target applications.
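As an illustration of this scoring mechanism, the sketch below (in Python, which is not used by the authors) counts, for each file, the number of revisions in which it was modified and propagates that count to the anomalies the file contains; the change-set format and the anomalies_per_file mapping are hypothetical inputs assumed here for illustration only.

    from collections import Counter

    def change_density_scores(change_sets):
        """change_sets: iterable of (revision_id, modified_files) pairs
        extracted from the version control system (assumed format)."""
        changes = Counter()
        for _, modified_files in change_sets:
            for path in set(modified_files):  # count each file once per revision
                changes[path] += 1
        return changes

    def rank_by_change_density(change_sets, anomalies_per_file):
        """Every anomaly inherits the score of the file containing it, since
        changes are only tracked at file granularity."""
        scores = change_density_scores(change_sets)
        ranking = [(anomaly, scores.get(path, 0))
                   for path, anomalies in anomalies_per_file.items()
                   for anomaly in anomalies]
        return sorted(ranking, key=lambda pair: pair[1], reverse=True)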
B. Error-Density Heuristic
This heuristic is based on the idea that code elements that
have a high number of errors observed during the system
evolution might be considered high-priority. The error-density
heuristic is defined as follows: given a resolved bug b, the
heuristic will look for code elements c that was modified in
order to solve b. Thus, the higher the number of errors solved
as a consequence of changes applied to c, the higher is the
position in the prioritization ranking.
This heuristic requires two different inputs: (i) change log
inspection – our first analysis was based on change log
inspection, looking for common terms like bug or fix. Once
those terms are found on commit messages, we incremented
the scores for the classes involved in a given change. This
technique has been successfully applied in other relevant
studies [17]; and (ii) bug detection tool – as we could not rely
on change log inspection for all systems, we decided to use a
bug detection tool, namely FindBugs, to automatically detect
blocks of code that could be related to bugs. Once possible bugs
are identified, we collect the code elements causing them and
increment their scores. Basically, the heuristic works as follows:
(i) first, the information about fixed bugs is retrieved from the
revisions; (ii) then, the heuristic algorithm iterates over all
classes changed in those revisions and the score is incremented
for each anomaly that infects those classes. In summary, when a
given class is related to several bug fixes, its code anomalies
will receive a high score.
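A minimal sketch of the change-log-based part of this heuristic is shown below, assuming a hypothetical revision format with a commit message and the list of changed classes; FindBugs-based scoring would simply feed the same score table through a different source.

    BUG_KEYWORDS = ("bug", "fix")

    def error_density_ranking(revisions, anomalies_per_class):
        """revisions: iterable of dicts with 'message' and 'classes' keys
        (assumed format). A class scores one point for every bug-fix
        revision that touches it; its anomalies inherit that score."""
        scores = {cls: 0 for cls in anomalies_per_class}
        for rev in revisions:
            message = rev["message"].lower()
            if any(keyword in message for keyword in BUG_KEYWORDS):
                for cls in rev["classes"]:
                    if cls in scores:
                        scores[cls] += 1
        ranking = [(anomaly, scores[cls])
                   for cls, anomalies in anomalies_per_class.items()
                   for anomaly in anomalies]
        return sorted(ranking, key=lambda pair: pair[1], reverse=True)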
C. Anomaly Density Heuristic
This heuristic is based on the idea that each code element
can be affected by many anomalies. Moreover, a high number
of anomalous elements concentrated in a single component
indicate a deeper maintainability problem. In this sense, the
classes internal to a component with a high number of
anomalies should be prioritized. Furthermore, it is known that
developers seem to care less about classes that present too
many code anomalies [27], when they need to modify them.
Thus, anomalous classes tend to remain anomalous or get
worse as the systems evolve. Thus, prioritizing classes with
many anomalies should avoid the propagation of problems.
This heuristic might also be worthy when classes have become
brittle and hard to maintain due to the number of anomalies
infecting them.
Computing the scores for this heuristic was rather
straightforward. Basically, it calculates the number of
anomalies found per code element. Thus, we consider that
elements with a high number of anomalies are high-priority
targets for refactoring. The anomaly density heuristic is defined
as follows: given a code element c, the heuristic looks at the
number of code anomalies that c contains. Thus, the higher the
number of anomalies found in c, the higher its position in the
resulting ranking. This heuristic requires only one input: the set
of detected code anomalies for each code element in the system.
Moreover, the heuristic can be customized to consider only
architecturally relevant anomalies, instead of the set of all
anomalies infecting the system. In order to define whether an
anomaly is relevant or not, our work relies on the detection
mechanisms provided by the SCOOP tool [21].
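A minimal sketch of this counting step, assuming a hypothetical list of detected anomalies annotated with their architectural relevance:

    from collections import Counter

    def anomaly_density_ranking(detected_anomalies, relevant_only=False):
        """detected_anomalies: iterable of (code_element, anomaly_kind,
        is_arch_relevant) tuples (assumed format). Returns elements ordered
        by the number of anomalies they contain, highest first."""
        counts = Counter()
        for element, _, is_relevant in detected_anomalies:
            if relevant_only and not is_relevant:
                continue
            counts[element] += 1
        return counts.most_common()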
D. Architecture Role Heuristic
Finally, this heuristic proposes a ranking mechanism based
on the architectural role a given class plays in the system. The
fact is that, when the architecture information is available, the
architectural role influences the priority level. The architecture
role heuristic is defined as follows: given a code element c, this
heuristic will examine the architectural role r performed by c.
The relevance of the architectural role in the system represents
the rank of c. In other words, if r is defined as a relevant
architecture role and it is performed by c, the code element c
will be ranked as high priority.
The architecture role heuristic depends on two kinds of
information, regarding the system’s design: (i) which roles
each class plays in the architecture; and (ii) how relevant those
roles are towards architecture maintainability. For this study
setting, we first had to leverage architecture design information
in order to map code elements to their architecture roles. Part
of this information extraction had already been performed on
our previous studies [19][20]. Then, we asked the original
architects to assign different levels of importance to those
roles, according to the architecture patterns implemented.
Moreover, we defined score levels to each architecture role.
For doing so, we considered the number of roles identified by
the architects, and distributed them according to a fixed
interval from 0 to 10. Code anomalies that infected elements
playing critical architecture roles were assigned to the highest
score. On the other hand, when the code anomaly affected
elements related to less critical architecture roles, they would
be assigned to lower scores, according to the number
architecture roles provided by the original architects.
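The scoring step can be sketched as follows; the role names, the mapping of elements to roles and the concrete class names are illustrative assumptions (the score values mirror the levels later reported for PDP in Table XI):

    def architecture_role_ranking(role_of_element, role_scores):
        """role_of_element: maps each code element to the architectural role
        it plays; role_scores: relevance score assigned by the original
        architects to each role. Both mappings are hypothetical inputs."""
        ranking = [(element, role_scores.get(role, 0))
                   for element, role in role_of_element.items()]
        return sorted(ranking, key=lambda pair: pair[1], reverse=True)

    # Illustrative scores mirroring the levels reported for PDP (Table XI).
    role_scores = {"utility": 1, "data_access": 2, "business": 4, "facade": 8}
    role_of_element = {"StringUtils": "utility", "SetDAO": "data_access",
                       "SetManager": "business", "IServerFacade": "facade"}
    print(architecture_role_ranking(role_of_element, role_scores))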
V. EVALUATION
This section describes the main steps for evaluating the
proposed heuristics, as well as testing the study hypotheses.
The evaluation is organized into three main activities: (i) detection
of code anomalies; (ii) identification of the rankings representing
the ground truth; and (iii) collection of scores for each anomaly
under the perspective of the prioritization heuristics.
A. Detecting Code Anomalies
The first step was the automatic identification of code
anomalies for each of the 4 target systems by using well-known
detection strategies and thresholds [16][31]. These detection
strategies and thresholds have been used previously in other
studies [6][19][20]. The metrics required by
the detection strategies are mostly collected with current tools
[30][33]. After that, the list of code anomalies is checked and
refined by original developers and architects of each target
system. Through this validation we can make sure that results
produced by the detection tools do not include false positives
[19].
We have also defined a ground truth ranking in order to
compare the analysis provided by the software architects with
the ranking produced by each of the
proposed heuristics. The ground truth ranking is a list of
anomalous elements in the source code ordered by their
architecture relevance, defined by the original architects of
each target application. Basically, the architects were asked to
provide an ordered list of the top ten classes that, in their
beliefs, represented the main sources of maintainability
problems of those systems. Besides providing a list of the high
priority code elements, the architects were also asked to
provide information regarding the architectural design of each
target system. That is, they should provide a list of architectural
roles presented in each target system ordered by their relevance
from the architecture perspective.
B. Analysis Method
After applying the heuristics, we compared the rankings
produced by each of them with the ground truth ranking. We
decided to analyze only the top ten code elements ranked, for
three main reasons: (i) it would be unfeasible to ask developers
to rank an extensive list of elements; (ii) we
wanted to evaluate our prioritization heuristics mainly for their
abilities to improve refactoring effectiveness. Thus, the top ten
anomalous code elements represent a significant sample of
elements that could possibly cause architecture problems; and
(iii) we focused on analyzing the top 10 code elements for
assessing whether they represent a useful subset of sources of
architecturally relevant anomalies.
In order to analyze the rankings provided by the heuristics,
we have considered three measures: (i) Size of overlap –
measures the number of elements that appear both in the
ground truth ranking and in the heuristic ranking. It is fairly
simple to calculate and tells us whether the prioritization
heuristics are accurately distinguishing the top k items from
the others; (ii) Spearman’s footrule [5] – it is a well-known
metric for permutations. It measures the distance between two
ranked lists by computing the differences in the rankings of
each item; and (iii) Fagin’s extension to the Spearman’s
footrule [8] – it is an extension to Spearman’s footrule for top
k lists. Fagin extended Spearman’s footrule by assigning an
arbitrary placement to elements that belong to one of the lists
but not to the other. Such placement represents the position in
the resulting ranking for all of the items that do not overlap
when comparing both lists.
It is important to notice the main differences between the
three measures: the number of overlaps indicates how
effectively our prioritization heuristics are capable of
identifying a set of k relevant code elements, disregarding the
differences between them. This measure becomes more
important as the number of elements under analysis grows.
Thus, the number of overlaps might give us a good hint of the
heuristics' capability for identifying good refactoring
candidates, disregarding the differences between them. The
purpose of the other two measures is to analyze the similarity
between two rankings. Unlike the number of overlaps, they
take into consideration the positions each item has in the
compared rankings. It is important to mention the main
differences between those two measures: when calculating
Spearman’s footrule, we are only considering the overlapping
items. When the lists are disjoint, the original ranks are lost,
and a new ranking is produced. On the other hand, Fagin’s
measure takes into consideration the positions of the
overlapping elements in the original lists. Finally, we used the
measures results to calculate the similarity accuracy – as
defined in our hypotheses.
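The three measures, and the way the reported accuracies appear to be derived from them, can be sketched as follows; the normalization constants are assumptions and may differ from the exact ones used by the authors:

    def overlap(top_a, top_b):
        """Number of elements appearing in both top-k lists."""
        return len(set(top_a) & set(top_b))

    def spearman_footrule(top_a, top_b):
        """Footrule distance restricted to the overlapping items, which are
        re-ranked among themselves (one possible reading of the paper)."""
        common = set(top_a) & set(top_b)
        order_a = [x for x in top_a if x in common]
        order_b = [x for x in top_b if x in common]
        rank_a = {x: i for i, x in enumerate(order_a)}
        rank_b = {x: i for i, x in enumerate(order_b)}
        return sum(abs(rank_a[x] - rank_b[x]) for x in common)

    def fagin_footrule(top_a, top_b, k):
        """Fagin's top-k extension: items missing from one of the lists are
        placed at position k + 1."""
        pos_a = {x: i + 1 for i, x in enumerate(top_a)}
        pos_b = {x: i + 1 for i, x in enumerate(top_b)}
        union = set(top_a) | set(top_b)
        return sum(abs(pos_a.get(x, k + 1) - pos_b.get(x, k + 1)) for x in union)

    def accuracy(distance, max_distance):
        """The accuracies reported in Tables IV-X look consistent with
        1 - (distance / max_distance), but the normalization actually used
        by the authors is an assumption here."""
        return 1.0 - distance / max_distance if max_distance else 1.0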
VI. EVALUATING THE PROPOSED PRIORITIZATION HEURISTICS
The evaluation of the proposed heuristics involved two
separate activities: (i) a quantitative analysis of the similarity
results; and (ii) a quantitative evaluation of the results regarding
their relation to actual architecture problems.
A. Change-Density Heuristic
Evaluation. This heuristic was applied in 3 out of the 4 target
applications selected in our study. Our analysis was based on
different versions of Health Watcher (10 versions), Mobile
Media (8 versions) and PDP (409 versions). Our goal was to
check whether the prioritization heuristics performed well or
not on systems with shorter and longer longevity.
Additionally, it was not a requirement to only embrace
projects with long histories, since we also wanted to evaluate
whether the heuristics would be more effective in preliminary
versions of a software system. Table III shows the evolution
characteristics analyzed for each system.
TABLE III. CHANGE CHARACTERISTICS FOR EACH TARGET APPLICATION
Name           | CE  | N-Revisions | M-Revisions | AVG
Health Watcher | 137 | 10          | 9           | 1.5
Mobile Media   | 82  | 9           | 8           | 2.6
PDP            | 97  | 409         | 74          | 8.8
As we can observe, Mobile Media and Health Watcher
presented similar evolution behaviors. As the maximum
number of revisions (M-Revisions) was limited to the total
number of revisions for a system (AVG), neither Health
Watcher nor Mobile Media could have 10 or more versions of
a code element (CE). We can observe that Health Watcher had
more revisions than Mobile Media. However, those changes
were scattered across more files. Due to the reduced number
of revisions available for both systems, we established a
criterion for selecting items when there were ties in the top 10
rankings. For instance, we use alphabetical order when the
elements in the ground truth are ranked as equally harmful.
TABLE IV. RESULTS FOR THE CHANGE-DENSITY HEURISTIC
Name | Overlap (Value / Accuracy) | NSF (Value / Accuracy) | NF (Value / Accuracy)
HW   | 8 / 57%                    | 0.62 / 38%             | 0.87 / 13%
MM   | 5 / 50%                    | 1 / 0%                 | 0.89 / 11%
PDP  | 6 / 60%                    | 0.44 / 56%             | 0.54 / 46%
Table IV shows the results observed when analyzing the
change-density heuristic. As we can observe, the highest
absolute overlap value was obtained for Health Watcher. It
can be explained by the fact that the Health Watcher system
has many files with the same number of changes. In this sense,
when computing the scores we did not consider only the 10
most changed files, as that approach would discard files with
as many changes as the ones selected. So, we decided to select
14 files, where the last 5 presented exactly the same number of
changes. Moreover, the Health Watcher presented the highest
number of code elements, having a total of 137 items (see
Table III) that could appear on the ranking produced by
applying the heuristic. Another interesting finding was
observed in the Mobile Media system. Although the change-density
heuristic detected 5 overlaps, all of them were shifted
exactly two positions, thus resulting in a value of 1 for the NSF
measure. On the other hand, when we considered the
non-overlaps, the position for one item matched. Moreover, the
results show us that the NSF measure is not adequate when the
number of overlaps is small.
When we compare the results of Mobile Media and Health
Watcher to those obtained by PDP, we observed a significant
difference. All PDP measures performed above our acceptable
similarity threshold, which means a similarity value higher
than 45%. For this case, we observed that the similarity was
related to a set of classes that were deeply coupled: an
interface acting as a Facade and three realizations of this
interface, implementing a server module, a client module and
a proxy. When performing changes on the interface, many
other changes were triggered in those three classes. For this
reason, they have suffered many other modifications during
the system evolution. Moreover, the nature of the changes that
the target applications underwent is different. For instance, in
Health Watcher most of the changes were perfective (changes
made to improve the overall structure of the application). In
Mobile Media, on the other hand, most of the changes were
related to the addition of new functionalities, which was also
the case for PDP. However, we observed that Mobile Media
also had low accuracy rates.
In summary, the results on applying the change-density
heuristic showed us that it could be useful for detecting and
prioritizing architecturally relevant anomalies in the following
scenarios: (i) there are architecture problems involving groups
of classes changing together; (ii) there are problems in the
architecture related to Facade or communication classes; and
(iii) changes were predominantly perfective. In this sense, from
the results observed in the analysis, we can reject the null
hypothesis H1.0, since the change-density heuristic was able to
produce rankings for PDP with at least acceptable accuracy in
all the analyzed measures.
Correlation with Architectural Problems. Based on the
results produced by the change-density heuristic, we also
needed to evaluate whether there is a correlation between the
rankings and architectural problems. In this sense, we
performed the analysis by observing which ranked elements
are related to actual architectural problems (see Table V). We
can observe that elements containing architecturally relevant
anomalies (Arch-Relevant) were likely to be change-prone.
For the PDP system, all of the top 10 most changed elements
were related to architectural problems. Also, if we consider
that PDP has 97 code elements, and 37 of them are related to
architectural problems, the results give us a hint that
change-density is a good heuristic for detecting them.
TABLE V. RESULTS FOR THE CHANGE-DENSITY HEURISTIC VS. ARCHITECTURAL PROBLEMS
Name | N-ranked CE | Arch-Relevant | % of Arch-Relevant
HW   | 14          | 10            | 71%
MM   | 10          | 7             | 70%
PDP  | 10          | 10            | 100%
B. Error-Density Heuristic
Evaluation. This heuristic is based on the assessment of bugs
that are introduced by a code element. So, the higher the
number of bugs observed in a code element, the higher is its
priority. Thus, in order to correctly evaluate the results
produced by the error-density heuristic, a reliable set of
detected bugs should be available for each target system. This
was the case for the PDP system, where the set of bugs was
well documented. On the other hand, for Mobile Media and
Health Watcher, where the documentation of bugs was not
available, we relied on the analysis of bug detection tools. The
results of applying the error-density heuristic are shown in
Table VI. It is important to highlight that for Health Watcher
there were 14 ranked items, due to ties between some of them.
Nevertheless, Health Watcher presented the highest overlap
measures. That happens because the detected bugs were
related to the behavior observed in every class implementing
the Command pattern. Furthermore, each of the classes
implementing this pattern was listed as high-priority in the
ground-truth ranking.
TABLE VI. RESULTS FOR THE ERROR-DENSITY HEURISTIC
Name | Overlap (Value / Accuracy) | NSF (Value / Accuracy) | NF (Value / Accuracy)
HW   | 10 / 71%                   | 0 / 100%               | 0.74 / 26%
MM   | 3 / 30%                    | 0 / 100%               | 0.76 / 24%
PDP  | 5 / 30%                    | 0.83 / 17%             | 0.74 / 26%
Another interesting finding we observed was that the priority
order for the overlapping elements was exactly the same as the
one pointed out in the ground truth. However, the 4
non-overlapping elements were precisely the top 4 elements of
the ground truth ranking. The fact that these top 4 elements are
not listed in the ranking produced by the heuristic resulted in a
low accuracy for the NF measure. For Mobile Media, we applied
the same strategy, but all the measures also presented low
accuracies. Due to the small number of overlaps, the results for
NSF may not confidently represent the heuristic's accuracy.
Finally, for PDP the results were evaluated from a different
perspective, since we considered the manually detected bugs.
That is, the bugs were collected through its issue tracking
system, instead of using automatic bug detection tools.
However, even when we performed the analysis using a reliable
set of bugs, the overall results presented low accuracy. That is,
from the 5 non-overlapping items, 4 of them were related to
bugs in utility classes. As those classes were neither related to
any particular architectural role nor implementing an
architecture component, they were not considered
architecturally relevant.
Correlation with Architectural Problems. Based on the results
produced by the error-density heuristic, we could investigate the
correlation between the rankings and actual architectural
problems. That is, we could analyze whether the error-density
heuristic presented better results towards detecting
architecturally relevant anomalies. Table VII presents the results
of applying this heuristic. As we can see, at least 80% of the
ranked elements were related to architecture problems for all the
analyzed systems. Moreover, the Health Watcher system reached
the most significant results, with 85% of the ranked elements
related to architectural problems. When we take into
consideration that the ranking for Health Watcher was composed
of 14 code elements (instead of 10), this result is even more
significant. As mentioned before, the rankings for Health
Watcher and Mobile Media were built over automatically
detected bugs. It means that even when formal bug reports are
not available, the use of static analysis tools [3] for predicting
possible bugs might be useful.
TABLE VII. RESULTS FOR THE ERROR-DENSITY HEURISTIC VS. ACTUAL ARCHITECTURAL PROBLEMS
Name | N-ranked CE | Arch-Relevant | % of Arch-Relevant
HW   | 14          | 10            | 85%
MM   | 10          | 8             | 80%
PDP  | 10          | 8             | 80%
On the other hand, for the PDP system, where we considered
actual bug reports, the results were also promising. From the top
10 ranked elements, 8 were related to architecture problems.
When we consider that the PDP system has 97 code elements,
with 37 of them related to architecture problems, it means that
the remaining 29 were distributed among the 87 bottom-ranked
elements. Moreover, if we extend the analysis to the top 20
elements, we observe a better correlation factor. That is, in this
case the correlation showed us that around 85% of the top 20
most error-prone elements were related to architecture problems.
C. Anomaly Density Heuristic
Evaluation. The anomaly density heuristic was applied to the
4 target systems selected in our study. We have observed good
results in terms of accuracy on ranking the architecturally
relevant anomalies. As we can see in Table VIII, good results
were obtained not only on ranking the top 10 anomalies, but
also on defining its positions. We observed that only 2 of 8
measures had low accuracy when compared to the thresholds
defined in our work. Furthermore, the number of overlaps
achieved by this heuristic can be considered highly accurate in
3 of the 4 target systems. This indicates that code elements
affected by multiple code anomalies are often perceived as
high priority. It did not occur only in the case of Health
Watcher, where we observed only 5 overlaps. When analyzing
the number of anomalies for each element on the ranking
representing ground truth, we could observe that many of
them had exactly the same number of code anomalies, namely
8. Also, it is important to mention that for this heuristic, in
contrast to the change-density and error-density heuristics, we
only considered the top 10 elements for the Health Watcher
system, since there were no ties to be taken into consideration.
When analyzing the MIDAS system, we could not consider the
number of overlaps significant, even though 9 out of 10
elements appeared in both rankings; this was expected, as the
system is composed of only 21 code elements.
Nevertheless, we observed that both NSF and NF presented a
high accuracy, which means that the rankings were similarly
ordered. Moreover, the NF measure presented a better result,
which was influenced by the fact that the only mismatched
element was ranked in the 10th position. On the other hand,
when analyzing Mobile Media we observed discrepant results
regarding the two ranking measures. We found 59% accuracy
for the NSF measure, and 30% for the NF measure. This
difference is also related to the positions of the non-overlapping
elements in the ranking generated by the heuristic. The ranks
for those elements were assigned to k+1 in the developers' list,
which resulted in a large distance from their original positions.
It is also important to mention that those elements comprised a
data model class, a utility class and a base class for controllers.
TABLE VIII. RESULTS FOR THE ANOMALY DENSITY HEURISTIC
Name  | Overlap (Value / Accuracy) | NSF (Value / Accuracy) | NF (Value / Accuracy)
HW    | 5 / 50%                    | 0.66 / 34%             | 0.54 / 46%
MM    | 7 / 70%                    | 0.41 / 59%             | 0.7 / 30%
PDP   | 8 / 80%                    | 0.37 / 63%             | 0.36 / 64%
MIDAS | 9 / 90%                    | 0.4 / 60%              | 0.20 / 80%
By analyzing the results for this heuristic, we observed that
code elements infected by multiple code anomalies are often
perceived as high priority. We also identified that many false
positives could arise from utility classes, as those classes are
often large and not cohesive. Finally, the results obtained in
this analysis also helped us reject the null hypothesis H3.0, as
the anomaly density heuristic was able to produce rankings
with at least acceptable accuracy, for at least one measure, in
all of the systems we analyzed. Furthermore, we obtained a
high accuracy rate for the MIDAS system in 2 out of 3
measures: 90% for the overlaps and 80% for NF.
Correlation with Architectural Problems. We also performed
an analysis in order to evaluate whether the rankings produced
by the anomaly density heuristic are related to actual
architectural problems. However, when evaluating the results
produced by this heuristic, we observed that they were not
consistent when compared with the architecturally relevant
anomalies. This conclusion is valid for all target systems.
TABLE IX. RESULTS FOR THE ANOMALY DENSITY HEURISTIC VS. ACTUAL ARCHITECTURAL PROBLEMS
Name  | N-ranked CE | Arch-Relevant | % of Arch-Relevant
HW    | 10          | 5             | 50%
MM    | 10          | 9             | 90%
PDP   | 10          | 8             | 80%
MIDAS | 10          | 6*            | 60%*
For instance, Table IX shows that for the Health Watcher
system only 5 out of the top 10 ranked elements were related
to architectural problems. The 5 code elements related to
architectural problems are exactly the same overlapping items
between the compared rankings. It happens due to the high
number of anomalies concentrated in a small number of
elements that are not architecturally relevant. Moreover, all
the 5 non-architecturally relevant elements were data access
classes responsible for communicating with the database. For
the MIDAS system, we observed that from the top 10 code
elements with the highest number of anomalies, 6 were
architecturally relevant. In addition, the MIDAS system has
exactly 6 elements that contribute to the occurrence of
architecture problems. So, we can say that the anomaly density
heuristic correctly placed all of them in the top 10 ranking.
D. Architecture Role Heuristic
Evaluation. We analyzed 3 of the 4 systems in order to
evaluate the architecture role heuristic. As we can observe in
Table X, PDP achieved the most consistent results regarding
the three similarity measures. The heuristic achieved around
60% accuracy when comparing the similarity between the
rankings. Also, PDP is the only system where
it was possible to divide classes and interfaces in more than
three levels when analyzing the architectural roles. For
instance, Table XI illustrates the four different architectural
roles defined on the PDP system.
TABLE X. RESULTS FOR THE ARCHITECTURE ROLE HEURISTIC
Name | Overlap (Value / Accuracy) | NSF (Value / Accuracy) | NF (Value / Accuracy)
HW   | 4 / 40%                    | 0.5 / 50%              | 0.72 / 28%
MM   | 6 / 60%                    | 0.22 / 78%             | 0.41 / 59%
PDP  | 6 / 60%                    | 0.33 / 67%             | 0.41 / 59%

TABLE XI. ARCHITECTURE ROLES IN PDP
Architecture Roles                                | Score | # of CE
Utility and Internal Classes                      | 1     | 23
Presentation and Data access classes              | 2     | 28
Domain Model, Business classes                    | 4     | 24
Public Interfaces, Communication classes, Facades | 8     | 6
Based on the classification provided in Table XI, we can
draw the architecture role heuristic ranking for PDP. As we
can see, the ranking contains all of the 6 code elements (# of
CE) from the highest category and 4 elements from the
domain model and business classes. We ordered the elements
alphabetically for breaking ties. Therefore, although 23 classes
obtained the same score, we are only comparing 4 of them.
However, it is important to mention that some of the elements
ranked by the original architects belonged to the group of
discarded elements. Had we chosen a different approach, such
as considering all tied elements as one item, we would have
turned our top ten ranking into a list of 30 items and obtained a
100% overlap rate. On the other hand, we decided to follow a
different scoring approach for Mobile Media and Health
Watcher, by consulting the original architects of each of the
target applications. The architects provided us with the
architecture roles and their relevance in the system architecture.
Once we
identified which classes were directly implementing which
roles, we were able to produce the rankings for this heuristic.
The worst results were observed in the Health Watcher
system, where almost 20 elements were tied with the same
score. So, we first selected the top 10 elements and broke the
ties in alphabetical order. This led us to an unrealistically low
number of overlaps, as some of the discarded items were
present in the ground truth ranking. In fact, due to the low
number of overlaps, it would not be fair to evaluate the NSF measure
as well. Thus, we performed a second analysis, considering
the top 20 items instead of the top 10, for analyzing the whole
set of elements that had the same score. In this second
analysis, we observed the number of overlaps went up to 6,
but the accuracy for the NSF measure decreased to 17% which indicates a larger distance between the compared
rankings. In addition, this also shows us that the 50% accuracy
for NSF obtained in the first comparison round was
misleading, as expected, due the low number of overlaps. For
the Mobile Media system, we observed high accuracy rates for
both NSF and NF measures. Furthermore, we observed that
several elements of the Mobile Media were documented as
being of high priority on the implementation of architectural
components. More specifically, there were 8 architecture
components described in that document directly related to 9
out of the top 10 high priority classes.
It is important to notice that the results for this heuristic are
dependent on the quality of the architecture roles defined by
the software architect. Moreover, we observed that PDP
system achieved the best results, even with multiple
architecture roles defined, as well as different levels of
relevance. Finally, we conclude that the results of applying the
architecture role heuristic helped to reject the null hypothesis
H4.0. In other words, the heuristic was able to produce rankings
with at least acceptable accuracy in all of the target
applications.
Correlation with Architectural Problems. Similarly to the
other heuristics, we have also evaluated whether the rankings
produced by the architecture role heuristic are related to
actual architectural problems for each of the target
applications (see Table XII). As we can observe, the results
are discrepant between Health Watcher and the other two
systems. However, the problem in this analysis is related to the
analyzed data. We identified two different groups of
architecture roles among the top 10 elements for Health
Watcher, ranked as equally relevant. That is, 6 of the related
elements were playing the role of repository interfaces. The 4
remaining elements were Facades [10], or elements
responsible for communicating different architecture
components. We then asked the original architects to elaborate
on the relevance of those roles, as we suspected they were
unequal. They decided to differentiate the relevance between
them, and considered the repository role as less relevant. This
refinement led to a completely different ranking, which went
up from 4 to 7 elements related to architecture problems.
TABLE XII. ARCHITECTURE ROLE HEURISTIC AND ACTUAL ARCHITECTURAL PROBLEMS
Name | # of ranked CE | Arch-Relevant | % of Arch-Relevant
HW   | 10             | 4             | 40%
MM   | 10             | 9             | 90%
PDP  | 10             | 10            | 100%
The results obtained for Health Watcher show us the
importance of correctly identifying the architecture roles and
their relevance for improving the accuracy of this heuristic.
When that information is accurate, the results for this heuristic
are highly positive. Furthermore, the other proposed
prioritization heuristics could benefit from information
regarding architecture roles in order to minimize the number
of false positives, like utility classes. This indicates the need to
further analyze different combinations of prioritization
heuristics.
VII. THREATS TO VALIDITY
This section describes some threats to validity observed in
our study. The first threat is related to possible errors in the
anomaly detection performed in each of the selected target
systems. As the proposed heuristics consist of ranking
previously detected code anomalies, the method for detecting
these anomalies must be trustworthy. Although there are several
kinds of detection strategies in the state of the art, many studies
have shown that they are inefficient for detecting architecturally
relevant code anomalies [19]. In order to reduce the risk of
imprecision when detecting code anomalies: (i) the original
developers and architects were involved in this process; and (ii)
we used well-known metrics and thresholds for constructing our
detection strategies [16][31]. The second threat is related to how
we identified errors in the software systems in order to apply the
error-density heuristic. Firstly, we relied on commit messages
for identifying classes related to bug fixes, which implies that
some errors might have been missed. In order to mitigate this
threat, we also
investigated issue-tracking systems. Basically, we looked for
error reports and traces between these errors and the code
changed to fix them. Furthermore, we investigated test reports
in order to identify the causes for eventual broken tests.
Finally, for some cases where the information is not available,
we relied on the use of static analysis methods for identifying
bugs [3].
The third threat is related to the identification of the
architectural roles for each of the target systems. The
architecture role heuristic is based on identifying the relevance
of code elements regarding the system architectural design.
Thus, in order to compute the scores for this heuristic, we
needed to assess the roles that each code element plays in the
system architecture. In this sense, we considered the
identification of architectural roles as being a threat to
construct validity because the information regarding the
architectural roles was extracted differently depending on the
target system. Furthermore, we understand that the absence of
architecture documentation reflects a common situation that
might be inevitable when analyzing real-world systems.
Finally, the fourth threat to validity is an external threat and it
is related to the choice of the target systems. The problem here
is that our results are limited to the scope of the 4 target
systems. In order to minimize this threat, we selected systems
developed by different programmers, with different domains,
programming languages, environments and architectural
styles. In order to generalize our results, further
empirical investigation is still required. In this sense, our study
should be replicated with other applications, from different
domains.
VIII. FINAL REMARKS AND FUTURE WORK
The presence of architecturally relevant code anomalies
often leads to the decline of the software architecture quality.
Furthermore, the removal of those critical anomalies is not
properly prioritized, mainly due to the inability of current tools
to identify and rank architecturally relevant code anomalies.
Moreover, there is not sufficient empirical knowledge about
factors that could make the prioritization process easier. In
this sense, our work has shown that developers can be guided
through the prioritization of code anomalies according to
architectural relevance. The main contributions of this work
are: (i) four prioritization heuristics based on the architecture
relevance and (ii) the evaluation of the proposed heuristics on
four different software systems.
In addition, during the evaluation of the proposed
heuristics, we found that they were mostly useful in scenarios
where: (i) there are architectural problems involving groups of
classes that change together; (ii) there are architecture
problems related to Facades or classes responsible for
communicating different modules; (iii) changes are not
predominantly perfective; (iv) there are architecture roles
infected by multiple anomalies; and (v) the architecture roles
are well defined in the software system and have distinct
architecture relevance. Finally, in this work we evaluated the
proposed heuristics individually. Thus, we have not evaluated
how their combinations could benefit the prioritization results.
In that sense, as a future work, we aim to investigate whether
the combination of two or more heuristics would improve the
efficiency of the ranking results. We also intend to apply
different weights when combining the heuristics, enriching the
possible results and looking for an optimal combination.
REFERENCES
[1] R. Arcoverde, A. Garcia and E. Figueiredo, “Understanding the Longevity of Code Smells – Preliminary Results of a Survey,” in Proc. of 4th Int’l Workshop on Refactoring Tools, May 2011.
[2] R. Arcoverde et al., “Automatically Detecting Architecturally-Relevant Code Anomalies,” 3rd Int’l Workshop on Recommendation Systems for Soft. Eng., June 2012.
[3] N. Ayewah et al., “Using Static Analysis to Find Bugs,” IEEE Software, Vol. 25, Issue 5, pp. 22-29, September 2008.
[4] L. Bass, P. Clements and R. Kazman, “Software Architecture in Practice”, Second Edition, Addison-Wesley Professional, 2003.
[5] P. Diaconis and R. Graham, “Spearman’s Footrule as a Measure of Disarray”, in Journal of the Royal Statistical Society, Series B, Vol. 39, pp. 262-268, 1977.
[6] E. Figueiredo et al., “Evolving Software Product Lines with Aspects: An Empirical Study on Design Stability,” in Proc. of 30th Int’l Conf. on Software Engineering, New York, USA, 2008.
[7] S. Eick, T. Graves and A. Karr, “Does Code Decay? Assessing the Evidence from Change Management Data”, IEEE Transactions on Soft. Eng., Vol. 27, Issue 1, pp. 1-12, 2001.
[8] R. Fagin, R. Kumar and D. Sivakumar, “Comparing Top K Lists”, in Proc. of 14th Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, pp. 28-36, USA, 2003.
[9] M. Fowler, “Refactoring: Improving the Design of Existing Code,” Addison-Wesley, 1999.
[10] E. Gamma et al., “Design Patterns: Elements of Reusable Object-Oriented Software”, Addison-Wesley, Boston, USA, 1995.
[11] J. Garcia, D. Popescu, G. Edwards and N. Medvidovic, “Identifying Architectural Bad Smells,” in Proc. of CSMR, Washington, USA, 2009.
[12] M. Godfrey and E. Lee, “Secrets from the Monster: Extracting Mozilla’s Software Architecture”, in Proc. of 2nd Int’l Symp. on Constructing Software Engineering Tools, 2000.
[13] P. Greenwood et al., “On the Impact of Aspectual Decompositions on Design Stability: An Empirical Study,” in Proc. of 21st Conf. of Object-Oriented Programming, Springer, pp. 176-200, 2007.
[14] L. Hochstein and M. Lindvall, “Combating Architectural Degeneration: A Survey,” Information and Software Technology, Vol. 47, July 2005.
[15] D. Kelly, “A Study of Design Characteristics in Evolving Software Using Stability as a Criterion,” IEEE Transactions on Software Engineering, Vol. 32, Issue 5, pp. 315-329, 2006.
[16] F. Khomh, M. Di Penta and Y. Guéhéneuc, “An Exploratory Study of the Impact of Code Smells on Software Change-Proneness,” in Proc. of 16th Working Conf. on Reverse Eng., pp. 75-84, 2009.
[17] M. Kim, D. Cai and S. Kim, “An Empirical Investigation into the Role of API-Level Refactorings during Software Evolution,” in Proc. of 33rd Int’l Conf. on Software Engineering, USA, 2011.
[18] M. Lanza and R. Marinescu, “Object-Oriented Metrics in Practice,”
Springer-Verlag, New York, USA 2006
[19] I. Macia et al., “Are Automatically-Detected Code Anomalies Relevant
to Architectural Modularity? An Exploratory Analysis of Evolving
Systems,” in Proc. of 11th AOSD, pp. 167-178, Germany, 2012.
[20] I. Macia et al., “On the Relevance of Code Anomalies for Identifying
Architecture Degradation Symptoms”, in Proc. of 16th CSMR, Hungary,
March 2012.
[21] I. Macia et al., “Supporting the Identification of Architecturally-Relevant Code Anomalies”, in Proc. of 28th IEEE Int’l Conf on Soft.
Maint., Italy, 2012.
[22] I. Macia et al., “Enhancing the Detection of Code Anomalies with
Architecture-Sensitive Strategies”, in Proc. of the 17th CSMR, Italy,
March 2013.
[23] A. MacCormack, J. Rusnak and C. Baldwin, “Exploring the Structure of
Complex Software Design: An Empirical Study of Open Source and
Proprietary Code”, in Management Science, Vol. 52, Issue 7, pp. 1015-1030, 2006.
[24] S. Malek et al., “Reconceptualizing a Family of Heterogeneous
Embedded Systems via Explicit Architectural Support”, in Proc. of the
29th Int’l Conf on Soft. Eng., IEEE Computing Society, USA 2007.
[25] M. Mantyla and C. Lassenius, “Subjective Evaluation of Software Evolvability using Code Smells: An Empirical Study,” Empirical Software Engineering, Vol. 11, pp. 395-431, 2006.
[26] R. Marinescu, “Detection Strategies: Metrics-Based Rules for Detecting
Design Flaws,” in Proc. Int’l Conf. on Soft. Maint., pp. 350-359, 2004.
[27] R. Martin, “Agile Software Development: Principles, Patterns, and Practices,” Prentice Hall, 2002.
[28] M. J. Munro, “Product Metrics for Automatic Identification of Bad
Smells Design Problems in Java Source-Code”, in Proc. of 11th Int’l
Symposium on Soft. Metrics, pp. 15, September 2005.
[29] E. Murphy-Hill, C. Parnin and A. Black, “How We Refactor and How We Know It,” in Proc. of 31st Int’l Conf. on Software Engineering, 2009.
[30] NDepend. Available at http://www.ndepend.com. 2013.
[31] S. Olbrich, D. Cruzes and D. Sjoberg, “Are Code Smells Harmful? A
Study of God Class and Brain Class in the Evolution of Three Open
Source Systems,” in Proc. of 26th Int’l Conf. on Soft. Maint., 2010.
[32] J. Ratzinger, M. Fischer and H. Gall, “Improving Evolvability through
Refactoring,” in Proc. of 2nd Int’l Workshop on Mining Soft.
Repositories, ACM Press, pp. 69-73, New York, 2005.
[33] Understand, 2013. Available at: http://www.scitools.com/
[34] C. Wohlin, et al., “Experimentation in Software Engineering – An
Introduction”, Kluwer Academic Publisher, 2000.
[35] S. Wong, Y. Cai and M. Dalton, “Detecting Design Defects Caused by
Design Rule Violations,” in Proc. of 18th ESEC/ Foundations on
Software Engineering, 2010.
[36] Z. Xing and E. Stroulia, “Refactoring Practice: How it is and How it
should be Supported: An Eclipse Study,” in Proc. of 22nd IEEE Int’l
Conf. on Software Maintenance, pp. 458-468, 2000.
[37] D. Sheskin, “Handbook of Parametric and Nonparametric Statistical
Procedures”, Chapman & Hall, 4th Edition, 2007.
Are domain-specific detection strategies for code
anomalies reusable? An industry multi-project study
Reuso de Estratégias Sensíveis a Domínio para Detecção de Anomalias de Código:
Um Estudo de Múltiplos Casos
Alexandre Leite Silva, Alessandro Garcia, Elder José Reioli, Carlos José Pereira de Lucena
Opus Group, Laboratório de Engenharia de Software
Pontifícia Universidade Católica do Rio de Janeiro (PUC - Rio)
Rio de Janeiro/RJ, Brasil
{aleite, afgarcia, ecirilo, lucena}@inf.puc-rio.br
Resumo— To promote the longevity of software systems, detection
strategies are reused to identify anomalies related to maintenance
problems, such as large classes, long methods or scattered changes. A
detection strategy is a heuristic composed of software metrics and
thresholds, combined by logical operators, whose goal is to detect one
type of anomaly. Pre-defined strategies are usually applied globally to
the program in an attempt to reveal where the critical maintenance
problems are located. The efficiency of a detection strategy is related
to its reuse across the set of projects of an organization. If thresholds
and metrics need to be defined for each project, the use of the
strategies will be time-consuming and will be neglected. Recent
studies suggest that the reuse of conventional detection strategies is
not usually possible when they are applied universally to programs
from different domains. We therefore conducted an exploratory study
on several projects of a common domain to evaluate the reuse of
detection strategies. We also evaluated the reuse of known strategies,
with initial calibration of thresholds based on the knowledge and
analysis of domain specialists. The study revealed that, even though
the reuse of strategies increases when they are defined and applied for
a specific domain, in some cases reuse is limited by the variation of
the characteristics of the elements identified by a detection strategy.
However, the study also revealed that reuse can be significantly
improved when the strategies consider peculiarities of the recurring
concerns of the domain instead of being applied to the program as a
whole.
Abstract— To prevent quality decay, detection strategies are reused
to identify symptoms of maintainability problems in the entire
program. A detection strategy is a heuristic composed of the
following elements: software metrics, thresholds, and logical
operators combining them. The adoption of detection strategies is
largely dependent on their reuse across the portfolio of the
organization's software projects. If developers need to define or
tailor those strategy elements to each project, their use will become
time-consuming and neglected. Nevertheless, there is no evidence
about efficient reuse of detection strategies across multiple software
projects. Therefore, we conducted an industry multi-project study to
evaluate the reusability of detection strategies in a critical domain.
We assessed the degree of accurate reuse of previously-proposed
detection strategies based on the judgment of domain specialists.
The study revealed that even though the reuse of strategies in a
specific domain should be encouraged, their accuracy is still limited
when holistically applied to all the modules of a program. However,
the accuracy and reuse were both significantly improved when the
metrics, thresholds and logical operators were tailored to each
recurring concern of the domain.
Keywords— anomalies; detection; reuse; accuracy

I. INTRODUCTION
As software systems are changed, unplanned modifications may
introduce structural problems into the source code. These problems
represent symptoms of poor program maintainability and, therefore,
may hinder subsequent program maintenance and evolution activities
[1]. Such problems are called code anomalies or, popularly, bad
smells [1]. According to empirical studies, program modules with
recurring anomalies, such as long methods [1] and scattered changes
[1], are usually related to the introduction of faults [17][25][26] and
to symptoms of design degeneration [17][20][27]. When such
anomalies are not identified and removed, partial or total degradation
of the system frequently occurs [21]. As a system grows, identifying
code anomalies manually becomes even harder or prohibitive.
The automation of the anomaly detection process in programs is
usually supported by metrics [2][11]. Each metric quantifies an
attribute of source code elements, such as coupling [23], cohesion
[24] and cyclomatic complexity [22]. From the metrics it is possible
to identify a relation between attribute values and a symptom of a
problem in the code. Through this relation it is possible to define a
detection strategy to support the automatic discovery of anomalies
[1][2]. A detection strategy is a condition composed of metrics and
thresholds, combined through logical operators. Through this
condition it is possible to filter a specific set of program elements.
This set of elements represents candidates for code anomalies that
are harmful to the maintainability of the system [2]. Even so, not
every symptom necessarily represents a relevant problem for the
system developer [8].
To facilitate the identification of anomalies, several tools have
been proposed based on known detection strategies: [3], [4], [5], [6]
and [7]. Even with tool support, detecting anomalies is difficult and
costly [8]. Moreover, the efficiency of a detection strategy is related
to how easily it can be reused across the set of projects of an
organization. In the worst case, developers would need to define a
detection strategy for each possible type of anomaly, for each
project. For that, it would be necessary to revise the appropriate
metrics and thresholds, as well as the occurrences identified by the
tools that do not necessarily represent problems in the code. This
task, when performed specifically for each project, costs a lot of
time and will inevitably be neglected. Furthermore, there is
empirical evidence that the reuse of detection strategies is not
possible when they are applied to several software projects from
totally different domains.
In order to investigate the reuse of detection strategies across
several software projects of the same domain, this paper presents an
industrial multiple-case study. The study investigated the reuse of
seven detection strategies, related to three anomalies, in six projects
of a specific domain. The reuse of the strategies was evaluated
based on the percentage of false positives, classified according to
the analysis of three domain specialists, among the occurrences
found by the anomaly detection strategies. Based on the degree of
reuse of the strategies, we investigated the situations in which it
would be possible to increase that degree of reuse, considering the
systems chosen for the study.
The study revealed that, even though the reuse of detection
strategies within a specific domain should be encouraged, in some
cases reuse is limited due to the variation of the characteristics of
the elements identified by a detection strategy. However, accuracy
and reuse were both significantly improved when the thresholds
were tailored to certain recurring concerns of the domain. In this
sense, we observed that the best results were obtained for the
concerns in which the characteristics of the elements varied the
least. Thus, the present work initiates the study of detection
strategies targeted at sets of elements with well-defined
responsibilities.
The paper is structured as follows. Section II presents the
terminology related to this work. Section III presents the definition
of the case study. Section IV presents the results and discussion
and, in Section V, the conclusions.
II. TERMINOLOGY
This section presents concepts associated with code anomalies
(Section II.A) and anomaly detection strategies (Section II.B).
A. Code anomalies
According to Fowler, a code anomaly is a symptom of poor
program maintainability that may hinder future correction and
evolution activities on the source code [1]. For example, one
symptom that needs to be avoided is the existence of classes that
centralize too much knowledge about the system's functionalities.
This symptom is widely known as God Class and has a great
potential for negative impact on the proper understanding of the
system [2]. Another symptom that should be avoided is Long
Method [1]: the longer a method is, the harder it is to understand
what it intends to do. Programs with short methods are thus
expected to have greater longevity [1]. These anomalies are related,
in one way or another, to facts about a single code element.
On the other hand, certain anomalies seek to correlate facts
about several code elements with possible maintenance problems,
as is the case of Shotgun Surgery. This anomaly identifies methods
that may cause many cascading changes, that is, maintenance tasks
in the code that lead to several small changes in other classes.
When those changes are scattered throughout the code, they are
hard to find, and it is also easy in this case for the developer to
forget some important change [1]. To support the discovery of
anomalies, Fowler proposed 22 metaphors of symptoms that
indicate problems in the code, where each metaphor is related to a
code anomaly [1][10].
B. Detection strategies
The detection of anomalies offers developers the opportunity to
restructure the code into a new structure that facilitates future
maintenance. A widely used mechanism for detecting anomalies is
describing them through the composition of metrics associated with
attributes of code elements [2]. A composition of metrics describes
a detection strategy. From a detection strategy it is then possible to
filter a specific set of program elements. This set of elements
represents potential candidates for code anomalies [2][12][13].
Fig. 1 briefly describes the process of forming a detection
strategy, according to [2], for reuse in different systems. First, a set
of metrics related to symptoms that indicate a given problem is
identified (Fig. 1–a). In a second step, the identified metrics are
associated with thresholds, so that it is possible to filter the code
elements. A metric associated with a given threshold eliminates
elements for which the metric values exceed the thresholds
(Fig. 1–b). For the final formation of a detection strategy, the
metrics and thresholds are combined with each other through
logical operators (e.g., AND, OR) (Fig. 1–c and d).
[Fig. 1. Process of forming a detection strategy [2] and its use in several systems – adapted from [25]. Legend: Mi: result of metric i; Li: threshold associated with metric i; ED: detection strategy.]
As can be observed, a detection strategy encodes knowledge
about the characteristics of a given anomaly. Hence, choosing
appropriate metrics and thresholds is decisive for the success of the
strategy in supporting the discovery of symptoms of problems in the
code [2][8][14]. The main intention of this approach is thus to allow
a detection strategy to be later applied to several systems
(Fig. 1–e), that is, the characteristics of an anomaly are expected to
hold across different systems. However, it is observed that, in
certain contexts in which the strategies are applied, some
occurrences are not necessarily symptoms of problems, that is, they
actually indicate false positives [8][9].
III. DEFINIÇÃO DO ESTUDO
O estudo objetiva investigar a viabilidade de reuso de
estratégias de detecção de anomalias em vários sistemas do
mesmo domínio. Portanto, a seção III.A descreve o objetivo do
estudo em mais detalhes. A seção III.B descreve o contexto em
que o estudo foi conduzido. A seção III.C descreve o projeto
do estudo.
A. Objetivo do estudo
De acordo com o formato proposto por Wohlin (1999), o
objetivo deste trabalho pode ser caracterizado da seguinte
forma: O objetivo é analisar a generalidade das estratégias de
detecção de anomalias de código para o propósito de reuso
das mesmas com respeito à diminuição da ocorrência de falsos
positivos do ponto de vista de mantenedores de software no
contexto de sistemas web de apoio à tomada de decisão. O
contexto desse estudo é formado por seis sistemas web de
apoio à tomada de decisão. Esse conjunto de sistemas opera em
um domínio crítico, pois realiza a análise de indicadores para o
mercado financeiro (seção III.B).
Em uma primeira etapa, busca-se calibrar ou definir
estratégias de detecção para sistemas desse domínio, a partir de
características conhecidas e observadas pelos desenvolvedores
de um subconjunto de sistemas neste domínio. Esta fase tem o
objetivo de calibrar estratégias existentes ou definir novas
estratégias para serem utilizadas em sistemas do domínio alvo.
Portanto, o conhecimento dos especialistas do domínio sobre o
código fonte foi utilizado primeiro para calibrar os limiares de
métricas usadas em estratégias convencionais existentes (ex.
[2][13][16]). Tal conhecimento do especialista sobre o código
foi também usado para definir novas estratégias com métricas
não exploradas em tais estratégias convencionais. Em uma
segunda etapa, avalia-se o reuso e a acurácia das estratégias em
uma família de outros sistemas do mesmo domínio. Além do
grau de reuso, a acurácia das estratégias é avaliada através da
quantidade de falsos positivos encontrados. Falsos positivos
são indicações errôneas de anomalias detectadas pela
estratégia.
Nossa pressuposição é que o reuso das estratégias aumenta
na medida em que as mesmas são definidas em função de
características de sistemas do mesmo domínio. Além disso,
certos conjuntos recorrentes de classes, que implementam um
mesmo interesse (responsabilidade) bem definido, em sistemas
de um mesmo domínio, tendem a possuir características
estruturais semelhantes. De fato, em sistemas web de apoio à
tomada de decisão, foco do nosso estudo (seção III.B), alguns
conjuntos de classes possuem responsabilidades semelhantes e
bem definidas. Portanto, também estudamos se o reuso e a
eficácia poderiam ser melhorados se estratégias fossem
aplicadas a classes com uma responsabilidade recorrente do
domínio.
Por exemplo, um conjunto de classes desse domínio é
formado por classes que recebem as requisições do usuário e
iniciam a geração de indicadores financeiros. Essas classes
recebem os parâmetros necessários, calculam uma grande
quantidade de informações e geram os resultados para serem
exibidos na interface. Além disso, essas classes desempenham
o papel de intermediário entre a interface do usuário e as
classes de negócio. Mesmo assim, é preciso evitar que fiquem
muito grandes. Além disso, é preciso evitar que o acoplamento
dessas classes fique muito disperso com relação a outras
classes da aplicação.
Outro conjunto desse domínio é formado por classes
responsáveis pela persistência dos dados. Assim, essas classes
são formadas por muitos métodos de atribuição e leitura de
valores de atributos (getters e setters). As classes de
persistência devem evitar métodos muito longos que possam
incorporar também a lógica da aplicação de forma indesejável.
Uma classe da camada de persistência com essas características
pode indicar um sintoma de problemas para a compreensão dos
métodos, bem como acoplamentos nocivos à manutenção do
programa.
Nesse sentido, esse trabalho visa responder às seguintes
questões de pesquisa:
(1) É possível reusar estratégias de detecção de anomalias
de forma eficaz em um conjunto de sistemas de um mesmo
domínio? A partir de estratégias calibradas ou definidas com o
apoio do especialista do domínio, faz-se necessário avaliar o
reuso das estratégias em outros sistemas do mesmo domínio.
Entretanto, o reuso de cada estratégia só é eficaz se a mesma é
aplicada em um novo programa do mesmo domínio com baixa
incidência de falsos positivos. Em nosso estudo, consideramos
que a estratégia foi eficaz se o uso desta não resulta em mais
que 33% de falsos positivos. Mais à frente justificamos o uso
deste procedimento.
(2) É possível diminuir a ocorrência de falsos positivos ao
considerar as características de classes com responsabilidade
bem definida do domínio? Como justificamos com os
exemplos acima, observa-se que certos elementos do programa
implementam um interesse recorrente de um domínio de
aplicações; estes elementos podem apresentar características
estruturais parecidas, que não são aplicáveis aos outros
elementos do programa como um todo. Portanto, também verificamos se estratégias específicas, associadas a classes de um mesmo interesse, seriam mais reutilizáveis do que estratégias definidas para o programa como um todo.
Para responder essas questões, foi conduzido um estudo
com múltiplos casos de programas do mesmo domínio. Esse
estudo foi realizado para avaliar o reuso de sete estratégias de
detecção, relacionadas a três tipos de anomalias recorrentes em
um domínio específico.
B. Contexto de aplicação do estudo
O presente estudo foi conduzido em uma empresa de
consultoria e desenvolvimento em sistemas de missão-crítica.
A empresa é dirigida por doutores e mestres em informática, e
foi fundada em 2000. Em 2010, a empresa absorveu um
conjunto de sistemas web de apoio à tomada de decisão,
originalmente desenvolvidos por outra empresa. Esse conjunto
de sistemas opera em um domínio crítico, pois realiza a análise
de indicadores para o mercado financeiro. O tempo de resposta
e a precisão dos dados são importantes, pois a apresentação de
uma análise errada pode gerar uma decisão errada e a
consequente perda de valores financeiros. De forma a propiciar
a confiabilidade deste sistema em longo prazo, o mesmo
também deve permanecer manutenível. Caso contrário, as
dificuldades de manutenção facilitarão a introdução de faltas
nos programas ao longo do histórico do projeto. Além disso, a
baixa manutenibilidade dificulta que a empresa se adapte a
mudanças nas regras de negócio ou incorpore inovações,
perdendo, assim, competitividade no mercado. A seguir,
apresentamos várias características destes programas, algumas
delas sinalizando a importância de manter a manutenibilidade
dos mesmos através, por exemplo, de detecção de anomalias de
código.
Os seis sistemas escolhidos para o estudo estão divididos
entre três equipes distintas. Segundo a Tabela I, cada equipe é
responsável por dois sistemas e é representada por um líder.
Cada líder participa do estudo como especialista do domínio
(E1, E2 e E3). Além disso, oito programadores distintos
compõem as três equipes que mantêm os seis sistemas. Os
sistemas que fazem parte desse estudo possuem uma estrutura
direcionada à operação de grande quantidade de dados. A partir
desses dados é possível gerar indicadores para a tomada de
decisão no mercado financeiro. Os dados estão relacionados,
por exemplo, com informações históricas de ativos financeiros
e informações relacionadas à configuração e armazenamento
de estruturas utilizadas pelos usuários. A partir das estruturas
utilizadas pelos usuários é possível controlar: carteiras de
ativos financeiros, tipos de relatório, variáveis utilizadas nos
cálculos de indicadores, modos de interpolação de dados, entre
outras informações.
TABELA I. COMPOSIÇÃO DAS EQUIPES QUE MANTÊM OS SISTEMAS USADOS NO ESTUDO
Sistemas | Especialistas | Programadores
A e B    | E1            | P1, P2 e P3
C e D    | E2            | P4 e P5
E e F    | E3            | P6, P7 e P8
Nesses sistemas, como a interface do usuário é bastante
rica, também existem muitas classes que compõem os
elementos visuais. Esses elementos recebem as requisições do
usuário e dão início à geração de informações e processamento
de dados. Ao final das operações necessárias, os dados são
mostrados na interface e o usuário pode analisá-los através de
gráficos e relatórios em diferentes formatos.
O tempo de resposta das solicitações é fundamental para a
tomada de decisões. Desse modo, algumas operações
realizadas por esses sistemas utilizam tecnologias assíncronas e
client-side – operações executadas diretamente no navegador
do cliente como, por exemplo, JavaScript e jQuery. A
manutenibilidade das classes destes programas também é
importante para não acarretar potenciais efeitos colaterais ao
desempenho.
Ainda, existe um conjunto de classes que garante o controle
de acesso às informações por meio de autenticação. A
autenticação é necessária, pois existem restrições para os
diferentes perfis de usuários. Além disso, um grande conjunto
de classes é usado para refletir o modelo do banco de dados.
Da mesma forma que em vários outros sistemas existentes,
essas classes são necessárias para garantir a integridade das
informações.
Ainda, nesses sistemas, é importante garantir a frequente
comunicação com serviços de terceiros. Esses serviços
fornecem dados provenientes de algumas fontes de dados financeiros como, por exemplo, Bloomberg (www.bloomberg.com).
Outro ponto importante para a escolha destes sistemas é a
recorrência de conjuntos de elementos com responsabilidades
bem definidas. Dessa forma, é possível garantir a proximidade
estrutural dos conjuntos de classes dos sistemas em estudo, o
que é fundamental para avaliar o reuso das estratégias e
responder nossas duas questões de pesquisa (seção III.A).
Além disso, através da recorrência desses conjuntos de
elementos é possível avaliar o percentual de falsos positivos
das estratégias, considerando as características específicas dos
conjuntos de elementos.
C. Projeto do estudo
Segundo [13], um bom índice de acurácia de uma estratégia
de detecção deveria estar acima dos 60%. De qualquer forma, o
índice usado nesse estudo foi um pouco mais rigoroso e está
um pouco acima deste índice sugerido na literatura: 66%, isto
é, dois terços de acertos nas detecções feitas por cada
estratégia. A escolha do índice de acurácia de 66% também se
deu pelo fato de que é possível garantir que, a cada três
ocorrências identificadas pelas estratégias de detecção, apenas
uma é classificada como um falso positivo. Se o desenvolvedor
encontra um número de erros (falso positivos) maior que dois
terços, este será desencorajado a reusar a estratégia em outro
programa. Sendo assim, para avaliar se as estratégias de
detecção de anomalias escolhidas podem ser reusadas com, no
máximo, 33% de ocorrências de falsos positivos, foram
definidas três etapas.
O objetivo da primeira etapa, chamada de etapa de ajustes,
é definir estratégias de detecção de anomalias que tenham
percentual de falsos positivos abaixo de 33% para duas
aplicações do domínio em estudo. A segunda etapa, chamada
de etapa de reuso, tem por objetivo avaliar se as estratégias
definidas na etapa de ajustes podem ser reusadas em outros
quatro sistemas do mesmo domínio, com o resultado de falsos
positivos ainda abaixo de 33%. Finalmente, a última etapa é
chamada de etapa de análise por interesses do domínio. Esta
tem como objetivo verificar se o percentual de falsos positivos
das estratégias pode ser melhorado tendo em vista a aplicação
das estratégias apenas em classes de um mesmo interesse
recorrente nos programas do mesmo domínio.
Nesse estudo, o percentual de falsos positivos é definido
através da avaliação do especialista do domínio. Essa avaliação
é realizada durante uma “sessão de investigação”. Em cada
sessão de investigação realizada, as estratégias de detecção de
anomalias são aplicadas a um dos sistemas do domínio. Assim,
a partir de cada ocorrência indicada pela ferramenta de
detecção, o especialista faz uma avaliação qualitativa, para
indicar se a ocorrência é um falso positivo ou se realmente é
um sintoma de problema para o domínio das aplicações em
estudo. Dessa forma, o percentual de falsos positivos de uma
estratégia de detecção é definido pelo nº de ocorrências
classificadas como falso positivo pelo especialista do domínio,
em relação ao nº de ocorrências identificadas pela ferramenta
de detecção.
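Em notação simples, e apenas como ilustração da definição acima, o percentual de falsos positivos de uma estratégia pode ser expresso como segue (o exemplo numérico usa valores que aparecem adiante na Tabela VI):

\%FP = \frac{n_{\text{falsos positivos}}}{n_{\text{ocorrências detectadas}}} \times 100\%, \qquad \text{por exemplo, } \frac{12}{61} \times 100\% \approx 20\%.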
Etapa de Ajustes. Na etapa de ajustes, os especialistas do
domínio apoiaram as atividades de: (i) definição do domínio
em estudo, para que fosse possível caracterizar os sistemas para
os quais faria sentido avaliar o reuso de estratégias; (ii) escolha
dos sistemas que caracterizam o domínio, para que fosse
possível considerar sistemas que representam o domínio em
estudo; e (iii) identificação dos interesses (responsabilidades)
recorrentes do domínio, bem como do conjunto de classes que
contribuem para a implementação de cada interesse. Em
seguida, nessa mesma etapa, as definições de anomalias que
são recorrentes na literatura [19] foram apresentadas aos
especialistas do domínio. Isso foi feito para que fosse possível
avaliar as anomalias que seriam interessantes investigar no
domínio alvo, do ponto de vista dos especialistas.
A partir da escolha das anomalias, foram definidas as
estratégias de detecção de anomalias que seriam utilizadas.
Conforme mencionado anteriormente, foram utilizadas
estratégias definidas a partir da sugestão dos especialistas do
domínio, além de estratégias conhecidas da literatura
[2][13][16]. Neste último caso, os especialistas sugeriram
refinamentos de limiares de acordo com experiências e
observações feitas na etapa de ajustes. Ainda nesta etapa de
ajustes, foi escolhida uma ferramenta de detecção de anomalias
de código em que fosse possível avaliar as ocorrências
identificadas pelas estratégias, tendo em vista o mapeamento
das classes que implementavam cada interesse do domínio
(conforme discutido acima).
Também na etapa de ajustes, foram realizadas duas sessões
de investigação, com a participação do especialista do domínio,
para os dois sistemas escolhidos para essa etapa. A partir da
classificação do especialista, foi possível definir o percentual
de falsos positivos para cada uma das estratégias escolhidas,
para cada um dos dois sistemas. Finalizando a etapa de ajustes,
verificamos as estratégias que resultaram em no máximo 33%
de falsos positivos (na média). Dessa forma, as estratégias que
não excederam esse limiar para os dois sistemas foram
aplicadas na etapa seguinte, chamada etapa de reuso.
Etapa de Reuso. Como mencionamos, o objetivo da etapa de
reuso é avaliar se as estratégias definidas na etapa de ajustes
podem ser reusadas em outros quatro sistemas do mesmo
domínio, com o resultado de falsos positivos ainda abaixo de
33%. Nessa etapa, o reuso das estratégias é definido através
dos seguintes critérios:
• Reuso total: a estratégia foi aplicada nos sistemas do
domínio e resultou diretamente em no máximo 33% de
falsos positivos, em todos os sistemas;
• Reuso parcial: a estratégia foi aplicada nos sistemas do
domínio, porém o percentual de falsos positivos
excedeu 33% em um ou dois sistemas; isto é, as
estratégias foram reusadas de forma eficaz em, pelo
menos, a metade dos programas;
• Nenhum reuso: nesse caso, a estratégia foi aplicada nos
sistemas do domínio e o percentual de falsos positivos
excedeu 33% em mais de dois sistemas; isto é, as
estratégias foram reusadas de forma eficaz em menos da
metade dos programas.
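Apenas como esboço ilustrativo (nomes hipotéticos, que não fazem parte do estudo), os critérios acima podem ser resumidos no seguinte trecho de Java, que classifica o reuso de uma estratégia a partir dos percentuais de falsos positivos obtidos em cada sistema:

import java.util.List;

// Esboço ilustrativo dos critérios de reuso descritos acima.
public class ClassificacaoReuso {

    enum Reuso { TOTAL, PARCIAL, NENHUM }

    // Recebe o percentual de falsos positivos da estratégia em cada sistema avaliado.
    static Reuso classificar(List<Double> percentuaisFp) {
        long acimaDoLimiar = percentuaisFp.stream().filter(fp -> fp > 33.0).count();
        if (acimaDoLimiar == 0) return Reuso.TOTAL;    // no máximo 33% em todos os sistemas
        if (acimaDoLimiar <= 2) return Reuso.PARCIAL;  // excedeu 33% em um ou dois sistemas
        return Reuso.NENHUM;                           // excedeu 33% em mais de dois sistemas
    }

    public static void main(String[] args) {
        // Exemplo: quatro sistemas, apenas um excede 33% -> reuso parcial
        System.out.println(classificar(List.of(22.0, 41.0, 16.0, 0.0)));
    }
}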
Da mesma forma, na etapa de reuso, o percentual de falsos
positivos para todas as estratégias de detecção é determinado
pela avaliação qualitativa do especialista do domínio. Dessa
forma, o percentual de falsos positivos para cada estratégia de
detecção é definido pelo nº de ocorrências classificadas como
falso positivo pelo especialista do domínio, em relação ao nº
total de ocorrências identificadas pela ferramenta de detecção.
Assim, foram realizadas quatro sessões de investigação,
sendo uma para cada um dos quatro sistemas escolhidos para
essa etapa. A partir dos resultados das quatro sessões de
investigação, procurou-se identificar quais estratégias tiveram
um reuso total. Dessa forma, essa etapa procura indícios das
estratégias que tiveram bons resultados, considerando o
domínio das aplicações em estudo. Depois, em um segundo
momento, foram investigados os casos em que o percentual de
falsos positivos esteve acima de 33%. Nesses casos,
procuramos entender quais fatores influenciaram a alta
ocorrência de falsos positivos. Nesse sentido, foi realizada uma
investigação nos valores das métricas dos elementos
identificados, para observar quais fatores desmotivaram o reuso
da estratégia para os seis sistemas do domínio em estudo.
Etapa de Interesses do Domínio. Para finalizar o estudo, a
etapa chamada de etapa de interesses tem como objetivo
verificar se o percentual de falsos positivos das estratégias
diminui ao considerar a aplicação das estratégias apenas em
elementos de cada interesse recorrente nos sistemas do mesmo
domínio. Através da última etapa, investigamos se seria
possível diminuir o percentual de falsos positivos ao aplicar as
estratégias de detecção a um conjunto de elementos com
responsabilidades bem definidas.
Anomalias Investigadas. Nesse estudo as anomalias
investigadas foram definidas juntamente com o especialista do
domínio. Dessa forma, é possível investigar anomalias que são
interessantes do ponto de vista de quem acompanha o dia a dia
do desenvolvimento dos sistemas do domínio. Nesse estudo
foram investigadas: uma anomalia em nível de classe, uma
anomalia em nível de método e uma anomalia relacionada a
mudanças. São elas, nessa ordem:
1) God Class (seção II.A): com o passar do tempo, é mais cômodo colocar apenas um método em uma classe que já existe, a criar uma classe nova. Dessa forma, é preciso evitar classes que concentram muito conhecimento, isto é, classes com várias responsabilidades distintas, chamadas de God Classes;

2) Long Method (seção II.A): da mesma forma que a anomalia anterior, existem métodos que acabam concentrando muita lógica do domínio. Assim, é importante identificar métodos que concentram muito conhecimento e dificultam a compreensão e manutenção do programa. Ocorrências de anomalias como estas são chamadas de Long Methods;

3) Shotgun Surgery (seção II.A): para prevenir que uma mudança em um método possa gerar várias pequenas mudanças em outros elementos do código, é preciso evitar que um método possua relação com vários outros métodos dispersos na aplicação. Caso esse relacionamento disperso ocorra, podemos ter ocorrências de anomalias chamadas de Shotgun Surgeries;

Estratégias de Detecção Escolhidas. A partir das anomalias escolhidas, as estratégias de detecção definidas para o estudo foram concebidas em conjunto com o especialista. A partir da discussão com o especialista, foram escolhidas e calibradas estratégias de detecção conhecidas da literatura (Seção III.A). Foram formadas também novas estratégias de detecção, tendo como orientação o processo de formação de estratégias de detecção, proposto por [2] (seção II.B).

Para definir as estratégias em conjunto com os especialistas, foi necessário decidir quais métricas identificam os sintomas que devem ser evitados, tendo em vista as características do domínio em estudo. Dessa forma, para definir as estratégias que avaliam God Class, os especialistas sugeriram uma métrica relacionada ao tamanho e uma métrica relacionada ao acoplamento. Além disso, os especialistas sugeriram que fosse possível variar a métrica de tamanho para avaliar qual estratégia poderia apresentar melhores resultados, tendo em vista os sistemas do domínio em estudo. Depois, para identificar Long Method, os especialistas do domínio sugeriram que fossem usadas uma métrica de tamanho e uma métrica de complexidade. Por último, para identificar Shotgun Surgery, os especialistas sugeriram uma métrica de complexidade e uma métrica de acoplamento.

Depois de definir estratégias de detecção em conjunto com os especialistas do domínio, foram escolhidas três estratégias de detecção, a partir da literatura. Dessa forma, cada uma das estratégias da literatura está relacionada a uma das anomalias escolhidas para o estudo. Além disso, os limiares usados para todas as estratégias escolhidas para o estudo foram definidos segundo as opiniões dos três especialistas. Assim, as estratégias escolhidas na fase de ajustes, para detecção de anomalias definidas pelos especialistas, para o domínio em estudo, são apresentadas nas Tabelas II e III.

Ferramenta de Detecção Escolhida. Entre as ferramentas disponíveis para a detecção de anomalias de código, diversas são baseadas nas estratégias de detecção propostas em [2]. Mesmo assim, para que fosse possível realizar um estudo de estratégias através do mapeamento de interesses, foi necessário escolher uma ferramenta que possibilitasse essa análise. Assim, a ferramenta escolhida foi SCOOP (Smells Co-Occurrences Pattern Analyzer) [15]. Além disso, SCOOP já foi usada com sucesso em estudos empíricos anteriores, tais como aqueles reportados em [16][17].

TABELA II. ESTRATÉGIAS DE DETECÇÃO SUGERIDAS INTEIRAMENTE PELOS ESPECIALISTAS
Anomalia             | Estratégia
God Class EspLoc     | (LOC > 150) and (CBO > 6)
God Class EspNom     | (NOM > 15) and (CBO > 6)
Long Method Esp      | (LOC > 50) and (CC > 5)
Shotgun Surgery Esp  | (CC > 7) and (AM > 7)
a. Accessed Methods (AM) representa a quantidade de métodos externos utilizados por um método [2].

TABELA III. ESTRATÉGIAS DE DETECÇÃO SUGERIDAS NA LITERATURA COM LIMIARES AJUSTADOS PELOS ESPECIALISTAS
Anomalia            | Estratégia
God Class Lit       | (ATFD > 5) and (WMC > 46) and (TCC < 33)
Long Method Lit     | (LOC > 50) and (CC > 6) and (MaxNesting > 5) and (NOAV > 3)
Shotgun Surgery Lit | (FanOut > 16)
b. Access to Foreign Data (ATFD) representa o nº de atributos de classes externas, que são acessados diretamente ou através de métodos de acesso [12].
c. Weighted Method Count (WMC) representa a soma da complexidade ciclomática de todos os métodos de uma classe [22][23].
d. Tight Class Cohesion (TCC) representa o nº relativo de pares de métodos de uma classe que acessam em comum pelo menos um atributo da classe avaliada [24].
e. Number of Accessed Variables (NOAV) representa o nº total de variáveis acessadas diretamente pelo método avaliado [2].
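Apenas para ilustrar a definição da nota c, o esboço abaixo (hipotético, não faz parte da ferramenta SCOOP) agrega o WMC de uma classe como a soma da complexidade ciclomática de seus métodos:

import java.util.Map;

// Esboço ilustrativo: WMC como a soma da complexidade ciclomática (CC)
// de todos os métodos de uma classe.
public class CalculoWmc {

    static int wmc(Map<String, Integer> ccPorMetodo) {
        int soma = 0;
        for (int cc : ccPorMetodo.values()) {
            soma += cc;
        }
        return soma;
    }

    public static void main(String[] args) {
        // Exemplo: três métodos com CC 3, 12 e 7 resultam em WMC = 22
        Map<String, Integer> cc = Map.of("salvar", 3, "calcularIndicador", 12, "validar", 7);
        System.out.println("WMC = " + wmc(cc));
    }
}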
TABELA IV. RELAÇÃO DOS INTERESSES MAPEADOS DOS SISTEMAS USADOS NO ESTUDO
(linhas: sistemas A–F; colunas: interesses mapeados a–j, marcados com "x" quando o interesse ocorre no sistema)

TABELA V. DESCRIÇÃO DO TAMANHO DOS SISTEMAS USADOS NO ESTUDO
Sistema | NLOC  | Nº de classes
A       | 21599 | 161
B       | 10011 | 81
C       | 12504 | 130
D       | 5935  | 41
E       | 31766 | 150
F       | 21602 | 149
Interesses Mapeados nos Sistemas em Estudo. Para avaliar
se é possível diminuir a ocorrência de falsos positivos, ao
considerar as características de conjuntos de elementos com
responsabilidades bem definidas, foi necessário realizar o
mapeamento dos interesses em classes dos seis sistemas,
através do acompanhamento dos especialistas do domínio.
Durante o mapeamento dos interesses, primeiro foram
observados os interesses mais gerais, como, por exemplo,
interface, persistência e recursos auxiliares. Em seguida, foi
realizado o mapeamento dos interesses relacionados
especificamente ao domínio das aplicações. Através do
acompanhamento dos especialistas do domínio pode-se garantir
a identificação dos elementos de código para cada um dos
interesses mapeados para o domínio.
Segundo os especialistas do domínio, existia, de fato, um
conjunto razoável de interesses recorrentes do domínio. A
partir da Tabela IV é possível observar o grau de recorrência
dos interesses mapeados nos sistemas escolhidos para o estudo.
Os interesses escolhidos são representados por letras minúsculas na tabela e são nomeados nas próximas subseções do artigo. A Tabela V descreve o tamanho dos sistemas
escolhidos para o estudo, em número de linhas de código
(NLOC) e número de classes. Mesmo que os sistemas variem
em tamanho, segundo os especialistas do domínio, a
proximidade estrutural dos sistemas é observada nos conjuntos
de classes mapeadas para interesses recorrentes nos sistemas.
IV. RESULTADOS E DISCUSSÕES
Essa seção apresenta os resultados do estudo. A seção IV.A
apresenta os resultados da etapa de ajustes. A seção IV.B
apresenta os resultados sobre o reuso das estratégias e a seção
IV.C apresenta os resultados sobre o percentual de falsos
positivos ao considerar o mapeamento de interesses.
Nas tabelas a seguir, a coluna “NO/FP” indica os valores
de: nº de ocorrências de anomalias encontradas pelas
estratégias / nº de falsos positivos identificados pelo
especialista do domínio. A coluna “%FP” indica o percentual
de falsos positivos identificados pelo especialista do domínio,
em relação ao total de ocorrências de anomalias encontradas
pelas estratégias. Destacamos em negrito os percentuais de
falsos positivos acima de 33% para facilitar a identificação dos
casos em que o resultado da estratégia não foi eficaz.
A. Resultado da fase de ajustes
Como mencionado, nesta fase avaliamos o percentual de
falsos positivos de cada estratégia, de acordo com o julgamento
do especialista. A Tabela VI apresenta o número e o percentual
de falsos positivos indicados pelo especialista para cada
estratégia, quando aplicadas aos sistemas A e B. A escolha
desses sistemas para a fase de ajustes se deve especificamente à
disponibilidade imediata do especialista E1 (Tabela I).
Através da Tabela VI é possível perceber que apenas uma
das estratégias (God Class Lit) excedeu, na média dos dois sistemas, o percentual de falsos positivos (33%) proposto para o estudo. Isso significa que, embora as métricas utilizadas sejam
recorrentes da literatura, os limiares propostos pelos
especialistas não foram muito bons.
TABELA VI. RESULTADO DA FASE DE AJUSTES.
Estratégia          | Sistema A NO/FP | %FP | Sistema B NO/FP | %FP
God Class EspLoc    | 27/6  | 22% | 17/7 | 41%
God Class EspNom    | 15/3  | 20% | 5/1  | 20%
God Class Lit       | 4/3   | 75% | 2/0  | 0%
Long Method Esp     | 30/0  | 0%  | 19/3 | 16%
Long Method Lit     | 1/0   | 0%  | 0/0  | 0%
Shotgun Surgery Lit | 61/12 | 20% | 25/6 | 24%
Shotgun Surgery Esp | 21/0  | 0%  | 7/1  | 14%
TABELA VII. ESTRATÉGIAS DE DETECÇÃO DA FASE DE REUSO
Anomalia            | Estratégia
God Class EspLoc    | (LOC > 150) and (CBO > 6)
God Class EspNom    | (NOM > 15) and (CBO > 6)
God Class Lit       | (ATFD > 6) and (WMC > 46) and (TCC < 11)
Long Method Esp     | (LOC > 50) and (CC > 5)
Long Method Lit     | (LOC > 50) and (CC > 6) and (MaxNesting > 5) and (NOAV > 3)
Shotgun Surgery Lit | (FanOut > 16)
Shotgun Surgery Esp | (CC > 7) and (AM > 7)
Como um exercício, porém, investigamos se seria possível
reduzir o percentual de falsos positivos para a estratégia God
Class Lit apenas alterando os limiares dos seus componentes,
preferencialmente sem criar novos falsos negativos. Neste
caso, observamos que sim, realmente foi possível reduzir para
0% o percentual de falsos positivos da anomalia God Class Lit.
Na fase de reuso, como veremos a seguir, as estratégias foram
então reaplicadas a outros quatro sistemas do mesmo domínio.
Objetivamos na fase de reuso observar se é possível reusar as
estratégias diretamente (sem modificação nos limiares) em
outros sistemas do mesmo domínio.
B. Resultado da etapa de reuso
Na segunda fase, as estratégias apresentadas anteriormente
(Tabela VII) foram aplicadas aos sistemas C, D, E e F. As
estratégias God Class EspLoc e God Class EspNom, quando
aplicadas ao sistema D, resultaram em um percentual de falsos
positivos de 80%. A estratégia Shotgun Surgery Lit, quando
aplicada ao sistema C, resultou em 76% de falsos positivos.
Mesmo assim, nenhuma das estratégias definidas para a
segunda fase resultou em mais do que 30% de falsos positivos,
quando aplicadas aos sistemas A e E.
A partir da Tabela VIII, é importante observar então que
God Class Lit e Long Method Lit mantiveram os resultados
abaixo de 33% para todos os sistemas avaliados. As estratégias
que não sofreram qualquer adaptação, por outro lado, variaram
um pouco em termos do percentual de falsos positivos. De
forma geral, é possível perceber que houve um reuso
satisfatório (83%) tanto das estratégias definidas em conjunto
com os especialistas (God Class EspLoc e EspNom, Long
Method Esp e Shotgun Surgery Esp) quanto das estratégias
com limiares definidos na literatura (God Class Lit, Long
Method Lit e Shotgun Surgery Lit). Pode-se concluir pelos
resultados desta fase de análise do reuso que existe certa
tendência de comportamento padrão entre sistemas de um
mesmo domínio, apesar de uns poucos casos peculiares que
encorajaram e desencorajaram futuras adaptações nos limiares.
TABELA VIII. OCORRÊNCIAS DE FALSOS POSITIVOS E ANOMALIAS NA SEGUNDA FASE.
Estratégia          | A NO/FP | A %FP | B NO/FP | B %FP | C NO/FP | C %FP | D NO/FP | D %FP | E NO/FP | E %FP | F NO/FP | F %FP
God Class EspLoc    | 27/6    | 22%   | 17/7    | 41%   | 17/7    | 41%   | 10/8    | 80%   | 30/5    | 17%   | 24/7    | 29%
God Class EspNom    | 15/3    | 20%   | 5/1     | 20%   | 6/2     | 33%   | 5/4     | 80%   | 10/3    | 30%   | 8/4     | 50%
God Class Lit       | 4/3     | 0%    | 2/0     | 0%    | 0/0     | 0%    | 0/0     | 0%    | 3/0     | 0%    | 2/0     | 0%
Long Method Esp     | 30/0    | 0%    | 19/3    | 16%   | 6/1     | 17%   | 5/2     | 40%   | 40/2    | 5%    | 26/3    | 12%
Long Method Lit     | 1/0     | 0%    | 0/0     | 0%    | 0/0     | 0%    | 0/0     | 0%    | 4/0     | 0%    | 0/0     | 0%
Shotgun Surgery Lit | 61/12   | 20%   | 25/6    | 24%   | 17/13   | 76%   | 13/1    | 8%    | 48/1    | 2%    | 44/2    | 5%
Shotgun Surgery Esp | 21/0    | 0%    | 7/2     | 28%   | 0/0     | 0%    | 1/0     | 0%    | 12/0    | 0%    | 9/0     | 0%
Aplicando novas adaptações nos limiares, observamos que certas características comuns entre os sistemas podem influenciar positivamente o grau de reuso das estratégias. Por exemplo, a estratégia God Class EspNom, quando aplicada ao sistema F, gerou falsos positivos em que, em 75% dos casos, o valor do componente CBO é igual a 10. Neste caso, alterando o limiar do componente CBO para 10, o percentual de falsos positivos cai para 20% no sistema F e aumenta para 27% no sistema A. Já para o sistema B o percentual cai para 0%, para o sistema E para 12% e para o sistema D para 50%. Mesmo com uma piora no caso do sistema C, para 40%, um pequeno ajuste nos limiares mostrou um melhor equilíbrio entre os sistemas de um mesmo domínio.

Outro exemplo permite um ajuste mais criterioso. A estratégia God Class EspNom, quando aplicada ao sistema D, gerou falsos positivos em que, em 100% dos casos, o valor do componente CBO é menor que 18. Neste caso, alterando o limiar do componente CBO para 18, o percentual de falsos positivos cai para 0% nos sistemas B, D, E e F. Mesmo assim, o percentual de falsos positivos se mantém no sistema C e aumenta para 25% no sistema A. Dessa forma, com um ajuste mais criterioso é possível diminuir para 0% o percentual de falsos positivos em quatro dos seis sistemas.

Em um segundo caso, analisando o resultado da aplicação da estratégia God Class EspLoc nos sistemas C e D, constatamos um número de falsos positivos relativamente maior do que para os demais sistemas (E e F), nos quais ela apresentou percentuais de falsos positivos menores que 33%.
C. Resultados da etapa de interesses
Ao avaliar a segunda questão de pesquisa, investigou-se a
possibilidade de diminuir a ocorrência de falsos positivos das
estratégias de detecção. Observamos se tal diminuição pode
ocorrer caso fossem definidas estratégias para as classes de
cada interesse do domínio. Com este propósito, nós aplicamos
cada uma das estratégias de detecção, apresentadas
anteriormente, em classes de cada interesse. As mesmas
métricas e limiares foram mantidos. Desta forma, conseguimos
observar se: (i) haveria potencial benefício em utilizar estratégias de detecção específicas para cada interesse do domínio, caso observado quando as estratégias tiveram um percentual de falsos positivos maior do que 33%; ou (ii) seria suficiente o uso de estratégias no programa como um todo, caso observado quando as estratégias tiveram um percentual de falsos positivos menor do que 33%.
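O procedimento descrito acima pode ser esboçado como segue (nomes e estruturas hipotéticos, apenas para ilustrar a ideia de aplicar a estratégia somente às classes mapeadas para um interesse e calcular o percentual de falsos positivos a partir do julgamento do especialista):

import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Esboço ilustrativo da análise por interesse do domínio.
public class AnalisePorInteresse {

    record Classe(String nome, String interesse, Map<String, Double> metricas) {}

    static double percentualFp(List<Classe> classes,
                               String interesse,
                               Predicate<Map<String, Double>> estrategia,
                               Predicate<Classe> julgadaFalsoPositivoPeloEspecialista) {
        List<Classe> ocorrencias = classes.stream()
                .filter(c -> c.interesse().equals(interesse))  // somente o interesse analisado
                .filter(c -> estrategia.test(c.metricas()))    // ocorrências apontadas pela estratégia
                .toList();
        if (ocorrencias.isEmpty()) return 0.0;
        long fp = ocorrencias.stream().filter(julgadaFalsoPositivoPeloEspecialista).count();
        return 100.0 * fp / ocorrencias.size();
    }
}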
As tabelas a seguir apresentam o número de ocorrências
(NO) e percentagem de falsos positivos (FP) para cinco
estratégias da fase anterior. Duas delas tiveram 0% de falsos
positivos na ampla maioria dos casos. A partir dos resultados
apresentados nas Tabelas IX a XIII, percebe-se que não haveria
necessidade de especialização das estratégias para cada
interesse: (i) tanto para os casos de interesses
Autenticação/Segurança e Auxiliar, que são mais gerais (isto é,
podem ocorrer frequentemente em aplicações de outros
domínios), (ii) como para os interesses Ações, Engine e
Serviços, que são características mais específicas deste
domínio. Nesse sentido, ajustar os limiares para as estratégias
considerando o mapeamento de interesses não seria benéfico
para reduzir significativamente o percentual de falsos positivos
nos casos acima. Por outro lado, note que o contrário pode ser
dito para o caso dos interesses Persistência, Interface,
Indicadores e Tarefas. Em todos esses casos de interesses, nota-se nas tabelas que os percentuais de falsos positivos,
independentemente da anomalia analisada, estão bem acima do
limiar de 33% em vários casos.
TABELA IX. OCORRÊNCIAS DA ESTRATÉGIA SHOTGUN SURGERY LIT VISANDO O MAPEAMENTO DE INTERESSES
Interesse              | NO/FP | % FP
Ações                  | 83/15 | 18%
Autenticação/segurança | 1/0   | 0%
Auxiliar               | 81/12 | 15%
Engine                 | 7/1   | 14%
Exceção                | 1/0   | 0%
Indicadores            | 1/1   | 100%
Interface              | 5/2   | 40%
Persistência           | 16/5  | 31%
Serviços               | 11/2  | 18%
Tarefas                | 2/1   | 50%

TABELA X. OCORRÊNCIAS DA ESTRATÉGIA GOD CLASS ESPLOC VISANDO O MAPEAMENTO DE INTERESSES
Interesse              | NO/FP | % FP
Ações                  | 30/7  | 23%
Autenticação/segurança | 2/0   | 0%
Auxiliar               | 52/12 | 23%
Engine                 | 1/0   | 0%
Interface              | 10/6  | 60%
Persistência           | 18/11 | 61%
Serviços               | 10/3  | 30%
Tarefas                | 2/1   | 50%

TABELA XI. OCORRÊNCIAS DA ESTRATÉGIA LONG METHOD ESP VISANDO O MAPEAMENTO DE INTERESSES
Interesse              | NO/FP | % FP
Ações                  | 46/3  | 7%
Autenticação/segurança | 2/0   | 0%
Auxiliar               | 52/5  | 10%
Engine                 | 4/0   | 0%
Persistência           | 13/5  | 38%
Serviços               | 7/0   | 0%
Tarefas                | 2/0   | 0%

TABELA XII. OCORRÊNCIAS DA ESTRATÉGIA GOD CLASS ESPNOM VISANDO O MAPEAMENTO DE INTERESSES
Interesse    | NO/FP | % FP
Ações        | 6/2   | 33%
Auxiliar     | 19/3  | 16%
Indicadores  | 2/1   | 50%
Interface    | 10/6  | 60%
Persistência | 7/6   | 86%
Serviços     | 4/0   | 0%
Tarefas      | 1/0   | 0%

TABELA XIII. OCORRÊNCIAS DA ESTRATÉGIA SHOTGUN SURGERY ESP VISANDO O MAPEAMENTO DE INTERESSES
Interesse    | NO/FP | % FP
Ações        | 24/0  | 0%
Auxiliar     | 16/2  | 13%
Engine       | 3/0   | 0%
Persistência | 2/0   | 0%
Serviços     | 5/0   | 0%
D. Trabalhos relacionados
Em 2011 [19], Zhang, Hall e Baddoo realizaram uma revisão sistemática da literatura para descrever o estado da arte sobre anomalias de código e refatoração. Esse trabalho foi baseado em artigos de conferências e revistas publicados entre 2000 e junho de 2009. Segundo os autores, há poucos trabalhos que relatam estudos empíricos sobre a detecção de anomalias. A
grande maioria dos trabalhos tem o objetivo de mostrar novas
ferramentas e métodos para apoiar a detecção de anomalias.
Em 2010 [18], Guo, Seaman, Zazworka e Shull
propuseram a análise de características do domínio, para a
adaptação das estratégias de detecção de anomalias. Esse
trabalho foi realizado em um ambiente real de manutenção de
sistemas. Além disso, a adaptação dos limiares das estratégias
foi apoiada pela análise de especialistas do domínio. Mesmo
assim, esse trabalho não avalia o reuso das estratégias de
detecção para outras aplicações do mesmo domínio.
Em 2012 [28], Ferreira, Bigonha, Bigonha, Mendes e Almeida identificaram limiares para métricas de software orientado a objetos. Esse trabalho foi realizado em 40 sistemas Java, baixados a partir do SourceForge (www.sourceforge.net). Nesse trabalho foram identificados
limiares para seis métricas, para onze domínios de aplicações.
A partir desse trabalho é necessário investigar o reuso desses
limiares em projetos da indústria.
Em 2012 [29], Fontana, Braione e Zanoni revisaram o
cenário atual das ferramentas de detecção automática de
anomalias. Para isso, realizaram a comparação de quatro
ferramentas de detecção, em seis versões de projetos de
software de tamanho médio. Segundo os autores, é interessante
refinar o uso das estratégias, considerando informações do
domínio dos sistemas analisados. Ainda, existe um esforço
manual para avaliar as anomalias que são caracterizadas como
falsos positivos. Nesse sentido, percebe-se o esforço investido
na adaptação das estratégias de detecção. Dessa forma, torna-se
motivador investigar estratégias de detecção que possam ser
reusadas com sucesso.
E. Ameaças à validade
Ameaças à Validade de Construto. Durante o experimento,
os três especialistas do domínio participaram da definição das
características do domínio em estudo, da escolha dos seis
sistemas, do mapeamento de interesses de cada sistema, da
escolha das anomalias, da definição das estratégias e dos limiares e da classificação das ocorrências de anomalias. Ao
avaliar um domínio específico, é necessária a participação de
alguém que vive o desenvolvimento neste domínio no seu dia a
dia. Além disso, os especialistas possuem conhecimento sobre boas práticas e mais de dois anos de experiência profissional prévia no domínio escolhido.
Validade de Conclusão e Validade Externa. Para a conclusão
do estudo, o percentual de falsos positivos das estratégias é
avaliado a partir da relação entre a quantidade de falsos
positivos classificados pelos especialistas e a quantidade de
ocorrências identificadas pela ferramenta. O limiar que define o
reuso das estratégias é de 33% de falsos positivos. Dessa forma
é possível garantir que a estratégia é capaz de identificar
apenas um falso positivo, a cada três ocorrências. Para
amenizar as ameaças à validade externa, é importante ratificar
que os seis sistemas em estudo foram escolhidos a partir da
especificação do domínio em estudo. Ainda, a escolha dos
sistemas teve o apoio de especialistas que possuem mais de
dois anos de experiência no domínio.
V. CONCLUSÕES
Para que fosse possível investigar o reuso de estratégias de
detecção em vários projetos de software do mesmo domínio,
foi conduzido um estudo de múltiplos casos da indústria. O
estudo investigou o reuso de sete estratégias de detecção,
relacionadas a três anomalias, em seis projetos de um domínio
específico. Segundo o nosso estudo, em alguns casos, o reuso
das estratégias de detecção pode ser melhorado, se aplicadas a
programas do mesmo domínio, sem gerar um efeito colateral.
Mesmo assim, em outros casos, para realizar uma melhoria no
reuso das estratégias, é possível que sejam criados falsos
negativos.
No total, dos sete casos que excederam o limiar de 33%, em
quatro casos existe pelo menos um cenário onde duas classes
com estruturas similares foram classificadas uma como
anomalia e outra como falso positivo. Isso mostrou que em
certos casos é impossível definir um limiar que elimine boa
parte dos falsos positivos sem gerar falsos negativos. Como
uma consequência direta, pode-se afirmar que existe um limite
no grau de reuso das estratégias, isto é, uma nova adaptação na
tentativa de diminuir o percentual de falsos positivos pode
aumentar o número de falsos negativos.
Além disso, percebe-se que tanto em interesses como Autenticação/Segurança e Auxiliar, que são mais gerais, quanto em interesses como Ações, Engine e Serviços, que são características mais específicas deste domínio, não existe a necessidade de especialização das estratégias. Nesse sentido,
ajustar os limiares para as estratégias considerando o
mapeamento de interesses não seria benéfico para reduzir
significativamente o percentual de falsos positivos nos casos
acima. Por outro lado, o contrário pode ser dito para o caso dos
interesses Persistência, Interface, Indicadores e Tarefas.
Ainda, a partir dos resultados, percebeu-se que duas
estratégias de detecção de anomalias escolhidas a partir da
literatura, resultaram em 0% de falsos positivos em todos os
casos em que encontraram ocorrências. Mesmo assim, essas
estratégias não detectaram algumas ocorrências identificadas
pelas estratégias mais simples, para a mesma anomalia. Essas
ocorrências das anomalias mais simples já haviam sido
classificadas pelo especialista do domínio e não eram falsos
positivos. Essa evidência motiva trabalhos futuros sobre a
variedade da complexidade das estratégias de detecção de
anomalias. Ainda, como trabalho futuro, o presente trabalho pode ser estendido a outros cenários (porém não limitado a eles), como: (i) a investigação de estratégias de detecção com reuso
em outros domínios e (ii) a investigação de outras estratégias
de detecção neste e em outros domínios.
REFERÊNCIAS
[1] M. Fowler: “Refactoring: Improving the Design of Existing Code”. New Jersey: Addison Wesley, 1999. 464 p.
[2] R. Marinescu, M. Lanza: “Object-Oriented Metrics in Practice”.
Springer, 2006. 206 p.
[3] N. Tsantalis, T. Chaikalis, A. Chatzigeorgiou: “JDeodorant:
Identification and removal of typechecking bad smells”. In Proceedings
of CSMR 2008, pp 329–331.
[4] PMD. Disponível em http://pmd.sourceforge.net/.
[5] iPlasma. Disponível em http://loose.upt.ro/reengineering/research/iplasma
[6] InFusion. Disponível em http://www.intooitus.com/inFusion.html.
[7] E. Murphy-Hill, A. Black: “An interactive ambient visualization for
code smells”, Proceedings of SOFTVIS '10, USA, October 2010.
[8] F. Fontana, E. Mariani, A. Morniroli, R. Sormani, A. Tonello: “An
Experience Report on Using Code Smells Detection Tools”. IEEE
Fourth International Conference on Software Testing, Verification and
Validation Workshops (ICSTW), 2011.
[9] F. Fontana, V. Ferme, S. Spinelli: “Investigating the impact of code
smells debt on quality code evaluation”. Third International Workshop
on Managing Technical Debt, 2012.
[10] E. Emden, L. Moonen: “Java quality assurance by detecting code
smells”. In Proceedings of the 9th Working Conference on Reverse
Engineering, 2002.
[11] N. Fenton, S. Pfleeger: “Software metrics: a rigorous and practical
approach”. PWS Publishing Co., 1998.
[12] R. Marinescu: “Measurement and Quality in Object-Oriented Design”.
Proceedings of the 21st IEEE International Conference on Software
Maintenance, 2005.
[13] R. Marinescu: “Detection strategies: Metrics-based rules for detecting
design flaws”. Proceedings of the 20th IEEE International Conference on
Software Maintenance, 2004.
[14] N. Moha, Y. Guéhéneuc, A. Meur, L. Duchien, A. Tiberghien: “From
a domain analysis to the specification and detection of code and design
smells“, Formal Aspects of Computing, 2009.
[15] Scoop. Disponível em: http://www.inf.puc-rio.br/~ibertran/SCOOP/
[16] I. Macia, J. Garcia, D. Popescu, A. Garcia, N. Medvidovic, A. Staa: "Are
Automatically-Detected Code Anomalies Relevant to Architectural
Modularity? An Exploratory Analysis of Evolving Systems". In
Proceedings of the 11th International Conference on Aspect-Oriented
Software Development (AOSD'12), Potsdam, Germany, March 2012.
[17] I. Macia, A. Garcia, A. Staa: “An Exploratory Study of Code Smells in
Evolving Aspect-Oriented Systems”. Proceedings of the 10th
International Conference on Aspect-Oriented Software Development,
2011.
[18] Y. Guo, C. Seaman, N. Zazworka, F. Shull: “Domain-specific tailoring
of code smells: an empirical study”. Proceedings of the 32nd
ACM/IEEE International Conference on Software Engineering - Volume
2, 2010.
[19] M. Zhang, T. Hall, e N. Baddoo: “Code bad smells: a review of current
knowledge”. Journal of Software Maintenance and Evolution: research
and practice, 23(3), 179-202, 2011.
[20] I. Macia, R. Arcoverde, A. Garcia, C. Chavez e A. von Staa: “On the
Relevance of Code Anomalies for Identifying Architecture Degradation
Symptoms”. In Software Maintenance and Reengineering (CSMR),
2012 16th European Conference on (pp. 277-286). IEEE.
[21] L. Hochstein e M. Lindvall: "Combating architectural degeneration: a
survey." Information and Software Technology 47.10 (2005): 643-656.
[22] T. J. McCabe: “A Complexity Measure”. IEEE Transactions on
Software Engineering, 2(4):308–320, 1976.
[23] S. R. Chidamber e C. F. Kemerer: “A metrics suite for object oriented
design”. Software Engineering, IEEE Transactions on, v. 20, n. 6, p.
476-493, 1994.
[24] J. Bieman e B. Kang: “Cohesion and reuse in an object-oriented
system.” In Proceedings ACM Symposium on Software Reusability,
1995.
[25] S. Olbrich, D. S. Cruzes, V. Basili, e N. Zazworka: “The evolution and
impact of code smells: A case study of two open source systems”. In
Proceedings of the 2009 3rd International Symposium on Empirical
Software Engineering and Measurement (pp. 390-400). IEEE Computer
Society, 2009.
[26] F. Khomh, M. Di Penta, e Y. G. Guéhéneuc: “An exploratory study of
the impact of code smells on software change-proneness.” In Reverse
Engineering, WCRE'09. 16th Working Conference on (pp. 75-84).
IEEE. 2009.
[27] A. Lozano, M. Wermelinger, e B. Nuseibeh: "Assessing the impact of
bad smells using historical information." In Ninth international
workshop on Principles of software evolution: in conjunction with the
6th ESEC/FSE joint meeting (pp. 31-34). ACM, 2007.
[28] K. A. Ferreira, M. A. Bigonha, R. S. Bigonha, L. F. Mendes e H. C.
Almeida: “Identifying thresholds for object-oriented software metrics.”
Journal of Systems and Software, 85(2), 244-257, 2012.
[29] F. Fontana, P. Braione, M. Zanoni: “Automatic detection of bad smells in code: An experimental assessment”. Publicação eletrônica em JOT: Journal of Object Technology, v. 11, n. 2, ago. 2012.
F3T: From Features to Frameworks Tool
Matheus Viana, Rosângela Penteado, Antônio do Prado
Department of Computing, Federal University of São Carlos, São Carlos, SP, Brazil
Email: {matheus viana, rosangela, prado}@dc.ufscar.br

Rafael Durelli
Institute of Mathematical and Computer Sciences, University of São Paulo, São Carlos, SP, Brazil
[email protected]
Abstract—Frameworks are used to enhance the quality of
applications and the productivity of the development process since
applications can be designed and implemented by reusing framework classes. However, frameworks are hard to develop, learn
and reuse, due to their adaptive nature. In this paper we present
the From Features to Framework Tool (F3T), which supports
the development of frameworks in two steps: Domain Modeling,
in which the features of the framework domain are modeled;
and Framework Construction, in which the source-code and the
Domain-Specific Modeling Language (DSML) of the framework
are generated from the features. In addition, the F3T also
supports the use of the framework DSML to model applications
and generate their source-code. The F3T has been evaluated in
an experiment that is also presented in this paper.
I. INTRODUCTION
Frameworks are reusable software composed of abstract
classes implementing the basic functionality of a domain.
When an application is developed through framework reuse,
the functionality provided by the framework classes is complemented with the application requirements. As the application is
not developed from scratch, the time spent in its development
is reduced and its quality is improved [1]–[3].
Frameworks are often used in the implementation of common application requirements, such as persistence [4] and user
interfaces [5]. Moreover, a framework is used as a core asset
when many closely related applications are developed in a
Software Product Line (SPL) [6], [7]. Common features of
the SPL domain are implemented in the framework and applications implement these features reusing framework classes.
However, frameworks are hard to develop, learn and reuse.
Their classes must be abstract enough to be reused by applications that are unknown beforehand. Framework developers
must define the domain of applications for which the framework is able to be instantiated, how the framework is reused
by these applications and how it accesses application-specific
classes, among other things [7], [8]. Frameworks have a steep
learning curve, since application developers must understand
their complex design. Some framework rules may not be
apparent in its interface [9]. A framework may contain so
many classes and operations that even developers who are
conversant with it may make mistakes while they are reusing
this framework to develop an application.
In a previous paper we presented an approach for building
Domain-Specific Modeling Languages (DSML) to support
framework reuse [10]. A DSML can be built by identifying
framework features and the information required to instantiate
them. Thus, application models created with a DSML can
be used to generate application source-code. Experiments
have shown that DSMLs protect developers from framework
complexities, reduce the occurrence of mistakes made by
developers when they are instantiating frameworks to develop
applications and reduce the time spent in this instantiation.
In another paper we presented the From Features to
Framework (F3) approach, which aims to reduce framework
development complexities [11]. In this approach the domain
of a framework is defined in an F3 model, which is an extended version of the feature model. Then a set of patterns, named F3 patterns, guides the developer to design and implement a white box framework according to its domain. One of the advantages of this approach is that, besides showing how developers can proceed, the F3 patterns systematize the process of framework development. This systematization allowed the development of frameworks to be automated by a tool.
Therefore, in this paper we present the From Features to
Framework Tool (F3T), which is a plug-in for the Eclipse IDE
that supports the use of the F3 approach to develop and reuse
frameworks. This tool provides an editor for developers to
create a F3 model of a domain. Then, the source-code and
the DSML of a framework can be generated from the domain
defined in this model. The source-code of the framework is
generated as a Java project, while the DSML is generated as
a set of Eclipse IDE plug-ins. After being installed, a DSML
can be used to model applications. Then, the F3T can be used
again to generate the application source-code from models
created with the framework DSML. This application reuses
the framework previously generated.
We also have carried out an experiment in order to evaluate
whether the F3T facilitates framework development or not. The
experiment analyzed the time spent in framework development and the number of problems found in the source-code of the resulting frameworks.
The remainder of this paper is organized as follows: background concepts are discussed in Section II; the F3 approach
is described in Section III; the F3T is presented in Section
IV; an experiment that has evaluated the F3T is presented
in Section V; related works are discussed in Section VI; and
conclusions and future works are presented in Section VII.
II. BACKGROUND
The basic concepts applied in the F3T and its approach
are presented in this section. All these concepts have reuse as
their basic principle. Reuse is a practice that aims: to reduce
time spent in a development process, because the software
is not developed from scratch; and to increase the quality
of the software, since the reusable practices, models or code were previously tested and proven successful [12]. Reuse
can occur in different levels: executing simple copy/paste
commands; referencing operations, classes, modules and other
blocks in programming languages; or applying more sophisticated concepts, such as patterns, frameworks, generators and
domain engineering [13].
Patterns are successful solutions that can be reapplied
to different contexts [3]. They provide reuse of experience
helping developers to solve common problems [14]. The
documentation of a pattern mainly contains its name, the
context it can be applied, the problem it is intended to
solve, the solution it proposes, illustrative class models and
examples of use. There are patterns for several purposes, such
as design, analysis, architectural, implementation, process and
organizational patterns [15].
Frameworks act like skeletons that can be instantiated to
implement applications [3]. Their classes embody an abstract
design to provide solutions for domains of applications [9].
Applications are connected to a framework by reusing its
classes. Unlike library classes, whose execution flow is controlled by applications, frameworks control the execution flow by accessing the application-specific code [15]. The fixed parts of
the frameworks, known as frozen spots, implement common
functionality of the domain that is reused by all applications.
The variable parts, known as hot spots, can change according
to the specifications of the desired application [9]. According
to the way they are reused, frameworks can be classified as:
white box, which are reused by class specialization; black box,
which work like a set of components; and gray box, which are
reused by the two previous ways [2].
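As a brief illustration of the white box style of reuse described above, the sketch below (hypothetical names, not taken from this paper) shows a frozen spot that controls the execution flow and a hot spot implemented by an application subclass:

// Illustrative sketch of a white box framework hot spot.
abstract class RentalTransaction {               // framework class

    // Frozen spot: common domain functionality; keeps control of the execution flow.
    public final double checkout(int days) {
        return days * dailyRate();               // calls back into application-specific code
    }

    // Hot spot: application-specific information supplied by subclasses.
    protected abstract double dailyRate();
}

// Application class reusing the framework by class specialization.
class CarRental extends RentalTransaction {
    @Override
    protected double dailyRate() {
        return 49.90;
    }
}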
Generators are tools that transform an artifact into another
[16], [17]. There are many types of generators. The most common are Model-to-Model (M2M), Model-to-Text (M2T) and
programming language translators [18]. Like frameworks, generators are related to domains. However, some generators
are configurable, being able to change their domain [19]. In
this case, templates are used to define the artifacts that can be
generated.
A domain of software consists of a set of applications that share common features. A feature is a distinguishing characteristic that aggregates value to applications [20]–[22]. For example, Rental Transaction, Destination Party and Resource could be features of the domain of rental applications. Different domain engineering approaches can be found in the literature [20], [22]–[24]. Although there are differences between them, their basic idea is to model the features of a domain and develop the components that implement these features and are reused in application engineering.

The features of a domain are defined in a feature model, in which they are arranged in a tree-view notation. They can be mandatory or optional, have variations and require or exclude other features. The feature that most represents the purpose of the domain is put in the root and a top-down approach is applied to add the other features. For example, the main purpose of the domain of rental applications is to perform rentals, so Rental is supposed to be the root feature. The other features are arranged following it.

Domains can also be modeled with metamodel languages, which are used to create Domain-Specific Modeling Languages (DSML). Metamodels, such as defined in the MetaObject Facility (MOF) [25], are similar to class models, which makes them more appropriate to developers accustomed to the UML. While in feature models, only features and their constraints are defined, metaclasses in the metamodels can contain attributes and operations. On the other hand, feature models can define dependencies between features, while metamodels depend on declarative languages to do it [18]. A generator can be used along with a DSML to transform models created with this DSML into code. When these models represent applications, the generators are called application generators.
III. THE F3 APPROACH
The F3 is a Domain Engineering approach that aims to
develop frameworks for domains of applications. It has two
steps: 1) Domain Modeling, in which the framework domain is
determined; and 2) Framework Construction, in which the
framework is designed and implemented according to the
features of its domain.
In Domain Modeling step the domain is defined in a feature
model. However, an extended version of feature model is used
in the F3 approach, because feature models are too abstract
to contain information enough for framework development
and metamodels depend on other languages to define dependencies and constraints. This extended version, called the F3 model, incorporates characteristics of both feature models and
metamodels. As in conventional feature models, the features
in the F3 models can also be arranged in a tree-view, in which
the root feature is decomposed in other features. However, the
features in the F3 models do not necessarily form a tree, since
a feature can have a relationship targeting a sibling or even
itself, as in metamodels. The elements and relationships in F3
models are:
•
Feature: graphically represented by a rounded square,
it must have a name and it can contain any number of
attributes and operations;
•
Decomposition: relationship that indicates that a feature is composed of another feature. This relationship
specifies a minimum and a maximum multiplicity.
The minimum multiplicity indicates whether the target
feature is optional (0) or mandatory (1). The maximum
multiplicity indicates how many instances of the target
feature can be associated to each instance of the source
feature. The valid values to the maximum multiplicity
are: 1 (simple), for a single feature instance; * (multiple), for a list of a single feature instance; and **
(variant), for any number of feature instances.
•
Generalization: relationship that indicates that a feature is a variation generalized by another feature.
•
Dependency: relationship that defines a condition for
a feature to be instantiated. There are two types of dependency: requires, when the A feature requires the
B feature, an application that contains the A feature
also has to include the B feature; and excludes, when
the A feature excludes the B feature, no application
can include both features.
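A minimal, hypothetical sketch of how the elements listed above could be represented as plain Java objects is shown below; the actual F3T metamodel is defined with EMF (Figure 2), so all names here are illustrative only:

import java.util.ArrayList;
import java.util.List;

// Illustrative, simplified representation of F3 model elements.
class Feature {
    String name;
    List<String> attributes = new ArrayList<>();
    List<String> operations = new ArrayList<>();
    List<Decomposition> decompositions = new ArrayList<>();
    Feature generalizes;                         // Generalization: variant generalized by another feature
    List<Dependency> dependencies = new ArrayList<>();

    Feature(String name) { this.name = name; }
}

class Decomposition {
    Feature target;
    int minMultiplicity;                         // 0 = optional, 1 = mandatory
    String maxMultiplicity;                      // "1" (simple), "*" (multiple) or "**" (variant)
}

class Dependency {
    enum Kind { REQUIRES, EXCLUDES }
    Kind kind;
    Feature target;
}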
T HE F3 A PPROACH
The Framework Construction step has as its output a white box framework for the domain defined in the previous step. The F3 approach defines a set of patterns to assist developers in designing and implementing frameworks from F3 models. The patterns address problems ranging from the creation of classes for the features to the definition of the framework interface. Some of the F3
patterns are presented in Table I.
TABLE I: Some of the F3 patterns.

Pattern                  | Purpose
Domain Feature           | Indicates structures that should be created for a feature.
Mandatory Decomposition  | Indicates code units that should be created when there is a mandatory decomposition linking two features.
Optional Decomposition   | Indicates code units that should be created when there is an optional decomposition linking two features.
Simple Decomposition     | Indicates code units that should be created when there is a simple decomposition linking two features.
Multiple Decomposition   | Indicates code units that should be created when there is a multiple decomposition linking two features.
Variant Decomposition    | Indicates code units that should be created when there is a variant decomposition linking two features.
Variant Feature          | Defines a class hierarchy for features with variants.
Modular Hierarchy        | Defines a class hierarchy for features with common attributes and operations.
Requiring Dependency     | Indicates code units that should be created when a feature requires another one.
Excluding Dependency     | Indicates code units that should be created when a feature excludes another one.
In addition to indicating the code units that should be created to implement the framework functionality, the F3 patterns also determine how the framework can be reused by the applications. For example, some patterns suggest including abstract operations in the framework classes that allow the framework to access application-specific information. Moreover, the F3 patterns make the development of frameworks systematic, allowing it to be automated. Thus, the F3T tool was created to automate the use of the F3 approach, enhancing the process of framework development.
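To make the idea of such hook operations concrete, the sketch below shows, in plain Java, how a framework class generated for a feature might expose an abstract operation that an application subclass overrides to supply application-specific information; the class and method names are illustrative only, not prescribed by the F3 patterns.

// Hypothetical framework-side class generated for a feature.
public abstract class RentalTransaction {

    // Hook operation: the framework obtains application-specific
    // information through this abstract method (illustrative name).
    protected abstract Class<?>[] getResourceClasses();

    // Framework logic that relies on the information supplied by the hook.
    public void listHandledResources() {
        for (Class<?> type : getResourceClasses()) {
            System.out.println("Handles resource type: " + type.getSimpleName());
        }
    }
}

// Hypothetical application-side subclass configuring the hot spot.
class CarRental extends RentalTransaction {
    @Override
    protected Class<?>[] getResourceClasses() {
        // Application-specific resource classes would be returned here.
        return new Class<?>[0];
    }
}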
IV. THE F3T
The F3T assists developers to apply the F3 approach in
the development of white box frameworks and to reuse these
frameworks through their DSMLs. The F3T is a plug-in for the Eclipse IDE, so developers can make use of the F3T resources, such as domain modeling, framework construction, application modeling through the framework DSML and application construction, as well as the other resources provided by the IDE. The F3T is composed of three modules, as seen in Figure 1: 1) Domain Module (DM); 2) Framework Module (FM); and 3) Application Module (AM).

Fig. 1: Modules of the F3T.
A. Domain Module
The DM provides an F3 model editor for developers to
define domain features. This module has been developed with
the support of the Eclipse Modeling Framework (EMF) and
the Graphical Modeling Framework (GMF) [18]. The EMF
was used to create a metamodel, in which the elements,
relationships and rules of the F3 models were defined as
described in Section III. The metamodel of F3 models is
shown in Figure 2. From this metamodel, the EMF generated
the source-code of the Model and the Controller layers of the
F3 model editor.
The GMF has been used to define the graphical notation of
the F3 models. This graphical notation can also be seen as the
View layer of the F3 model editor. With the GMF, the graphical
figures and the menu bar of the editor were defined and linked
to the elements and relationships defined in the metamodel of
the F3 models. Then, the GMF generates the source-code of
the graphical notation. The F3 model editor is shown in Figure
3 with an example of an F3 model for the domain of trade and
rental transactions.
Fig. 2: Metamodel containing elements, relationships and rules of F3 models.
Fig. 3: F3 model for the domain of trade and rental transactions.
B. Framework Module
The FM is an M2T (model-to-text) generator that transforms F3 models into framework source-code and a DSML. Despite their graphical notation, F3 models are actually XML files, which makes them more accessible to other tools, such as a generator. The FM was developed with the support of the Java Emitter Templates (JET) in the Eclipse IDE [26].

The JET plug-in contains a framework that is a generic generator and a compiler that translates templates into Java files. These templates are XML files, in which tags are instructions to generate output based on information in the input, and text is fixed content inserted in the output independently of the input. The Java files originated from the JET templates reuse the JET framework to compose a domain-specific generator. Thus, the FM depends on the JET plug-in to work.
The templates of the FM are organized in two groups:
one related to framework source-code; and another related to
framework DSML. Both groups are invoked from the main
template of the DM generator. Part of the JET template which
generates Java classes in the framework source-code from the
features found in the F3 models can be seen as follows:
public
<c:if test="($feature/@abstract)">abstract </c:if>
class <c:get select="$feature/@name"/> extends
<c:choose select="$feature/@variation">
<c:when test="’true’">DVariation</c:when>
<c:otherwise> <c:choose>
<c:when test="$feature/dSuperFeature">
<c:get select="$feature/dSuperFeature/@name"/>
</c:when>
<c:otherwise>DObject</c:otherwise> </c:choose>
</c:otherwise>
</c:choose> { ... }
The framework source-code that is generated by the FM
is organized in a Java project identified by the domain name
and the suffix “.framework”. The framework source-code is
generated according to the patterns defined by the F3 approach.
For example, the FM generates a class for each feature found in an F3 model. These classes contain the attributes and operations defined in their original features. All generated classes also, directly or indirectly, extend the DObject class, which implements non-functional requirements, such as persistence and logging. Generalization relationships result in inheritances and decomposition relationships result in associations between the involved classes. Additional operations are included in the framework classes to treat feature variations and constraints of the domains defined in the F3 models. For example, according to the Variant Decomposition F3 pattern, the getResourceTypeClasses operation was included in the code of the Resource class so that the framework can recognize which classes implement the ResourceType feature in the applications. Part of the code of the Resource class is presented as follows:
/** @generated */
public abstract class Resource extends DObject {
/** @generated */
private int id;
/** @generated */
private String name;
/** @generated */
private List<ResourceType> types;
/** @generated */
public abstract Class<?>[] getResourceTypeClasses();
The framework DSML is generated as an EMF/GMF project
identified only by the domain name. The FM generates the
EMF/GMF models of the DSML, as seen in Figure 4.a, which
was generated from the F3 model shown in Figure 3. Then,
source-code of the DSML must be generated by using the
generator provided by the EMF/GMF in three steps: 1) using
the EMF generator from the genmodel file (Figure 4.a); 2)
using the GMF generator from the gmfmap file (Figure 4.b);
and 3) using the GMF generator from the gmfgen file (Figure
4.c). After this, the DSML will be composed of 5 plug-in
projects in the Eclipse IDE. The projects that contain the
source-code and the DSML plug-ins of the framework for the
trade and rental transactions domain are shown in Figure 4.d.
Fig. 4: Generation of the DSML plugins.
C. Application Module

The AM has also been developed with the support of JET. It generates application source-code from an application model based on a framework DSML. The templates of the AM generate classes that extend framework classes and override operations that configure framework hot spots. After the DSML plug-ins are installed in the Eclipse IDE, the AM recognizes the model files created from the DSML. An application model created with the DSML of the framework for the domain of trade and rental transactions is shown in Figure 5.

Fig. 5: Application model created with the framework DSML.

Application source-code is generated in the source folder of the project where the application model is. The AM generates a class for each feature instantiated in the application model. Since the framework is white box, the application classes extend the framework classes indicated by the stereotypes in the model. It is expected that many class attributes requested by the application requirements have already been defined in the domain. Thus, these attributes are in the framework source-code and they must not be defined in the application classes again. Part of the code of the Product class is presented as follows:

public class Product extends Resource {
/** @generated */
private float value;
/** @generated */
public Class<?>[] getResourceTypeClasses() {
return new Class<?>[] {
Category.class, Manufacturer.class };
}
}

V. EVALUATION

In this section we present an experiment in which we evaluated the use of the F3T to develop frameworks, since the use of DSMLs to support framework reuse has been evaluated in a previous paper [10]. The experiment was conducted following all the steps described by Wohlin et al. (2000) and it can be summarized as: (i) analyse the F3T, described in Section IV; (ii) for the purpose of evaluation; (iii) with respect to time spent and number of problems; (iv) from the point of view of the developer; and (v) in the context of MSc and PhD Computer Science students.

A. Planning

The experiment was planned to answer two research questions: RQ1: "Does the F3T reduce the effort to develop a framework?"; and RQ2: "Does the F3T result in an outcome framework with fewer problems?". All subjects had to develop two frameworks, both applying the F3 approach, but one manually and the other with the support of the F3T. The context of our study corresponds to a multi-test within object study [27], hence the experiment consisted of experimental tests executed by a group of subjects to study a single tool. In order to answer the first question, we measured the time spent to develop each framework. Then, to answer the second question, we analyzed the frameworks developed by the subjects and identified and classified the problems found in the source-code. The planning phase was divided into seven parts, which are described in the next subsections:

1. Context Selection

26 MSc and PhD students of Computer Science participated in the experiment, which was carried out in an off-line situation. All participants had prior experience in software development, Java programming, patterns and framework reuse.
2. Formulation of Hypotheses
The experiment questions have been formalized as follows:
RQ1 , Null hypothesis, H0 : Considering the F3 approach,
there is no significant difference, in terms of time, between
developing frameworks with the support of F3T and doing it
manually. Thus, the F3T does not reduce the time spent to
develop frameworks. This hypothesis can be formalized as:
H0 : µF3T = µmanual
RQ1 , Alternative hypothesis, H1 : Considering the F3
approach, there is a significant difference, in terms of time,
between developing frameworks with the support of F3T and
doing it manually. Thus, the F3T reduces the time spent to
develop frameworks. This hypothesis can be formalized as:
H1: µF3T ≠ µmanual
RQ2 , Null hypothesis, H0 : Considering the F3 approach,
there is no significant difference, in terms of problems found
in the outcome frameworks, between developing frameworks
using the F3T and doing it manually. Thus, the F3T does
not reduce the mistakes made by subjects while they are
developing frameworks. This hypothesis can be formalized as:
H0 : µF3T = µmanual
RQ2 , Alternative hypothesis, H1 : Considering the F3
approach, there is a significant difference, in terms of problems found in the outcome frameworks, between developing
frameworks using the F3T and doing it manually. Thus, the
F3T reduces the mistakes made by subjects while they are
developing frameworks. This hypothesis can be formalized as:
H1: µF3T ≠ µmanual
3. Variables Selection
The dependent variables of this experiment were “time
spent to develop a framework” and “number of problems
found in the outcome frameworks”. The independent variables were as follows:
• Application: Each subject had to develop two frameworks: one (Fw1) for the domain of trade and rental transactions and the other (Fw2) for the domain of automatic vehicles. Both Fw1 and Fw2 had 10 features.

• Development Environment: Eclipse 4.2.1, Astah Community 6.4, F3T.

• Technologies: Java version 6.

4. Selection of Subjects

The subjects were selected through a non-probabilistic approach by convenience, i.e., the probability of all population elements belonging to the same sample is unknown.

5. Experiment Design

The subjects were divided into two blocks of 13 subjects:

• Block 1: development of Fw1 manually and development of Fw2 with the support of the F3T;

• Block 2: development of Fw2 manually and development of Fw1 with the support of the F3T.

We chose blocking to reduce the effect of the experience of the students, which was measured through a form in which the students answered about their level of experience in software development. This form was given to the subjects one week before the pilot experiment described herein. The goal of this pilot experiment was to ensure that the experiment environment and materials were adequate and that the tasks could be properly executed.

6. Design Types

The design type of this experiment was one factor with two treatments, paired [27]. The factor in this experiment is the manner in which the F3 approach was used to develop a framework and the treatments are the support of the F3T against manual development.

7. Instrumentation

All materials necessary to assist the subjects during the execution of this experiment were previously devised. These materials consisted of forms for collecting experiment data, for instance, the time spent to develop the frameworks and a list of the problems found in the outcome frameworks developed by each subject. At the end of the experiment, all subjects received a questionnaire to report on the F3 approach and the F3T.

B. Operation

The operation phase was divided into two parts, as described in the next subsections:

1. Preparation

Firstly, the subjects received a characterization form, containing questions regarding their knowledge about Java programming, the Eclipse IDE, patterns and frameworks. Then, the subjects were introduced to the F3 approach and the F3T.

2. Execution

Initially, the subjects signed a consent form and then answered a characterization form. After this, they watched a presentation about frameworks, which included the description of some known examples and their hot spots. The subjects were also trained on how to develop frameworks using the F3 approach with or without the support of the F3T.

Following the training, the pilot experiment was executed. The subjects were split into two groups considering the results of the characterization forms. Subjects were not told about the nature of the experiment, but were verbally instructed on the F3 approach and its tool. The pilot experiment was intended to simulate the real experiment, except that the applications were different, but equivalent. Beforehand, all subjects were given ample time to read about the approach and to ask questions on the experimental process. Since this could affect the experiment validity, the data from this activity was only used to balance the groups.

When the subjects understood what they had to do, they received the description of the domains and started timing the development of the frameworks. Each subject had to develop the frameworks applying the F3 approach, i.e., creating its F3 model from a document which describes its domain features and then applying the F3 patterns to implement it.
C. Analysis of Data
This section presents the experimental findings. The analysis is divided into two subsections: (1) Descriptive Statistics
and (2) Hypotheses Testing.
1. Descriptive Statistics
The time spent by each subject to develop a framework
and the number of problems found in the outcome frameworks
are shown in Table II. From this table, it can be seen that the
subjects spent more time to develop the frameworks when they
were doing it manually, approximately 72.5% against 27.5%.
This result was expected, since the F3T generates framework
source-code from F3 models. However, it is worth highlighting
that most of the time spent in the manual framework development was due to framework implementation and the effort
to fix the problems found in the frameworks, while most of
the time spent in the framework development supported by the
F3T was due to domain modeling. The dispersion of the time spent by the subjects is also represented graphically in a boxplot on the left side of Figure 6.
In Table II it is also possible to visualize four types of
problems that we analyzed in the outcome frameworks: (i)
incoherence, (ii) structure, (iii) bad smells, (iv) interface.
The problem of incoherence indicates that, during the
experiment, the subjects did not model the domain of the
framework as expected. Consequently, the subjects did not
develop the frameworks with the correct domain features and
constraints (mandatory, optional, and alternative features). As
the capacity to model the framework domains depends more on the subjects' skills than on tool support, incoherence problems
could be found in equivalent proportions, approximately 50%,
when the framework was developed either manually or with
the support of the F3T.
TABLE II: Development timings and number of problems.

The problem of structure indicates that the subjects did not implement the frameworks properly during the experiment. For example, they implemented classes with no constructor and incorrect relationships, or they forgot to declare the classes as abstract. This kind of problem occurred when the subjects did not properly follow the instructions provided by the F3 patterns. In Table II it can be seen that the F3T helped the subjects to develop frameworks with fewer structure problems, i.e., 10% in opposition to 90%.

The problem of bad smells indicates design weaknesses that do not affect functionality, but make the frameworks harder to maintain. In the experiment, this kind of problem occurred when the subjects forgot to apply some F3 patterns related to the organization of the framework classes, such as the Modular Hierarchy F3 pattern. By observing Table II we can remark that the F3T led to a design with higher quality than the manual approach, i.e., 0% against 100%, because the F3T automatically identified which patterns should be applied from the F3 models.

The problem of interface indicates the absence of getter/setter operations, the lack of operations that allow the applications to reuse the framework, and so on. Usually, this kind of problem is a consequence of problems of structure, hence the numbers of problems of these two types are quite similar. It can be observed in Table II that the F3T helped the subjects to design a better framework interface than when they developed the framework manually, i.e., 8.6% against 91.4%.

In the last two columns of Table II it can be seen that the F3T reduced the total number of problems found in the frameworks developed by the subjects. This is also graphically represented in the boxplot on the right side of Figure 6.

Fig. 6: Dispersion of the total time and number of problems.

2. Testing the Hypotheses
The objective of this section is to verify, based on the data
obtained in the experiment, whether it is possible to reject
the null hypotheses in favor of the alternative hypotheses.
Since some statistical tests are applicable only if the population
follows a normal distribution, we applied the Shapiro-Wilk test
and created a Q-Q chart to verify whether or not the experiment
data departs from linearity before choosing a proper statistical
test. The tests were carried out as follows:
1) Time: We have applied the Shapiro-Wilk test on the
experiment data that represents the time spent by
each subject to develop a framework manually or
using the F3T, as shown in Table II. Considering an α
= 0.05, the p-values are 0.878 and 0.6002 and Ws are
0.9802 and 0.9691, respectively, for each approach.
The test results confirmed that the experiment data
related to the time spent in framework development
is normally distributed, as it can be seen in the Q-Q
charts (a) and (b) in Figure 7. Thus, we decided to
apply the Paired T-Test to these data. Assuming a
Paired T-Test, we can reject H0 if | t0 | > tα/2,n−1 .
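For reference, $t_0$ here is the usual paired t statistic (standard definition, in our notation, where $d_i$ is the per-subject difference between the two treatments and $n$ is the number of subjects):

\[ t_0 = \frac{\bar{d}}{S_d/\sqrt{n}}, \qquad \bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \]

with $S_d$ the standard deviation of the differences $d_i$.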
In this case, tα,f is the upper α percentage point of the t-distribution with f degrees of freedom. Therefore, based on the samples, n = 26 and d = {46, 42, 52, 49, 41, 49, 55, 50, 53, 42, 42, 52, 48, 43, 45, 42, 47, 48, 44, 49, 51, 48, 52, 51, 48, 45}, Sd = 9.95 and t0 = 1.6993. The average values of each data set are µmanual = 76.42 and µF3T = 28.96. So, d̄ = 76.42 − 28.96 = 47.46, which implies that Sd = 3.982 and t0 = 60.7760. The number of degrees of freedom is f = n − 1 = 26 − 1 = 25. We take α = 0.025. Thus, according to StatSoft¹, it can be seen that t0.025,25 = 2.05954. Since | t0 | > t0.025,25, it is possible to reject the null hypothesis with a two-sided test at the 0.025 level. Therefore, statistically, we can assume that, when the F3 approach is applied, the time needed to develop a framework using the F3T is less than doing it manually.

¹ http://www.statsoft.com/textbook/distribution-tables/#t

2) Problems: Similarly, we have applied the Shapiro-Wilk test on the experiment data shown in the last two columns of Table II, which represent the total number of problems found in the outcome frameworks that were developed either manually or using the F3T. Considering an α = 0.05, the p-values are 0.1522 and 0.007469, and the Ws are 0.9423 and 0.8853, respectively, for each approach. As can be seen in the Q-Q charts (c) and (d) in Figure 7, the test results confirmed that the data related to manual development are normally distributed, but the data related to the F3T cannot be considered normally distributed. Therefore we applied a non-parametric test, the Wilcoxon signed-rank test, to these data. The signed ranks (S/R) of | problemsmanual − problemsF3T | are S/R = { +3.5, +7.5, +7.5, +16.5, -3.5, +23, +3.5, +3.5, +10.5, +10.5, +3.5, +18.5, +10.5, +14, +24, +18.5, +3.5, +21, +21, +14, +21, +10.5, +14, +16.5 }. As a result we obtained a p-value = 0.001078 at a significance level of 1%. Based on these data, we conclude that there is a considerable difference between the means of the two treatments. We were able to reject H0 at the 1% significance level. The p-value is very close to zero, which further emphasizes that the F3T reduces the number of problems found in the outcome frameworks.

Fig. 7: Normality tests.

D. Opinion of the Subjects

We analyzed the opinion of the subjects in order to evaluate the impact of using the approaches considered in the experiment. After the experiment operation, all subjects received a questionnaire, in which they could report their perception about applying the F3 approach manually or with the support of the F3T.
The answers in the questionnaire have been analyzed in
order to identify the difficulties in the use of the F3 approach
and its tool. As it can be seen in Figure 8, when asked if they
encountered difficulties in the development of the frameworks
by applying the F3 approach manually, approximately 52%
of the subjects reported having significant difficulty, 29%
mentioned partial difficulty and 19% had no difficulty. In
contrast, when asked the same question with respect to the
use of the F3T, 73% of the subjects reported having no difficulty, 16% mentioned partial difficulty and only 11% had significant difficulty.
Fig. 8: Level of difficulty of the subjects.
The reduction of the difficulty to develop the frameworks,
shown in Figure 8, reveals that the F3T assisted the subjects in
this task. The subjects also answered in the questionnaire about
the difficulties they found during framework development. The
most common difficulties pointed out by the subjects when
they developed the frameworks manually were: 1) too much
effort spent on coding; 2) mistakes they made due to lack
of attention; 3) lack of experience for developing frameworks;
and 4) time spent identifying the F3 patterns in the F3 models.
In contrast, the most common difficulties faced by the subjects
when they used the F3T were: 1) lack of practice with the tool;
and 2) some actions in the tool interface, for instance, opening
the F3 model editor, take many steps to be executed. The
subjects said that the F3 patterns helped them to identify which
structures were necessary to implement the frameworks in the
manual development. They also said the F3T automated the tasks of identifying which F3 patterns should be used and of implementing the framework source-code. Thus, they could keep their focus on domain modeling.
E. Threats to Validity

Internal Validity:

• Experience level of the subjects: the subjects had different levels of knowledge, which could affect the collected data. To mitigate this threat, we divided the subjects into two balanced blocks considering their level of knowledge and rebalanced the groups considering the preliminary results. Moreover, all subjects had prior experience in application development reusing frameworks, but not in developing frameworks. Thus, the subjects were trained in common framework implementation techniques and in how to use the F3 approach and the F3T.

• Productivity under evaluation: there was a possibility that this might influence the experiment results, because subjects often tend to think they are being evaluated by experiment results. In order to mitigate this, we explained to the subjects that no one was being evaluated and that their participation was considered anonymous.

• Facilities used during the study: different computers and installations could affect the recorded timings. Thus, the subjects used the same hardware configuration and operating system.

Validity by Construction:

• Hypothesis expectations: the subjects already knew the researchers and knew that the F3T was supposed to ease framework development, which reflects one of our hypotheses. These issues could affect the collected data and cause the experiment to be less impartial. In order to keep impartiality, we enforced that the participants had to keep a steady pace during the whole study.

External Validity:

• Interaction between configuration and treatment: it is possible that the exercises performed in the experiment are not accurate for every framework development for real-world applications. Only two frameworks were developed and they had the same complexity. To mitigate this threat, the exercises were designed considering framework domains based on the real world.

Conclusion Validity:

• Measure reliability: it refers to the metrics used to measure the development effort. To mitigate this threat, we used only the time spent, which was captured in forms filled in by the subjects;

• Low statistic power: the ability of a statistical test to reveal reliable data. To mitigate this threat, we applied two tests: T-Tests to statistically analyze the time spent to develop the frameworks and the Wilcoxon signed-rank test to statistically analyze the number of problems found in the outcome frameworks.

VI. RELATED WORKS
In this section some works related to the F3T and the F3
approach are presented.
Amatriain and Arumi [28] proposed a method for the
development of a framework and its DSL through iterative and
incremental activities. In this method, the framework has its
domain defined from a set of applications and it is implemented
by applying a series of refactorings in the source-code of these
applications. The advantage of this method is a small initial
investment and the reuse of the applications. Although it is not
mandatory, the F3 approach can also be applied in iterative
and incremental activities, starting from a small domain and
then adding features. Applications can also be used to facilitate
the identification of the features of the framework domain.
However, the advantage of the F3 approach is the fact that
the design and the implementation of the frameworks are
supported by the F3 patterns and automated by the F3T.
Oliveira et al. [29] presented the ReuseTool, which assists
framework reuse by manipulating UML diagrams. The ReuseTool is based on the Reuse Description Language (RDL), a
language created by these authors to facilitate the description
of framework instantiation processes. Framework hot spots
can be registered in the ReuseTool with the use of the RDL.
In order to instantiate the framework, application models can
be created based on the framework description. Application
source-code is generated from these models. Thus, the RDL
works as a meta language that registers framework hot spots
and the ReuseTool provides a more friendly interface for
developers to develop applications reusing the frameworks. In
comparison, the F3T supports framework development through
domain modeling and application development through framework DSML.
Pure::variants [30] is a tool that supports the development of applications by modeling domain features (Feature
Diagram) and the components that implement these features
(Family Diagram). Then the applications are developed by selecting a set of features of the domain. Pure::variants generates
only application source-code, maintaining all domain artifacts
at the model level. Besides, this tool has a proprietary license and its free version (Community) has limited functionality. In comparison, the F3T is free, uses only one type of domain model (the F3 model) and generates frameworks as domain artifacts. Moreover, the frameworks developed with the support of the F3T can be reused in the development of applications with or without the support of the F3T.
VII. CONCLUSIONS
The F3T supports framework development and reuse through code generation from models. This tool provides an F3 model editor for developers to define the features of the framework domain. Then, framework source-code and a DSML can be generated from the F3 models. The framework DSML can be installed in the F3T to allow developers to model and to generate the source-code of applications that reuse the framework. The F3T is free software available at:
http://www.dc.ufscar.br/∼matheus viana.
The F3T was created to semi-automate the application of the F3 approach. In this approach, domain features are defined in F3 models in order to separate the elements of the framework from the complexities of developing them. F3 models incorporate elements and relationships from feature models and properties and operations from metamodels.

Framework source-code is generated based on patterns that are solutions to design and implement the domain features defined in F3 models. A DSML is generated along with the source-code; it includes all features of the framework domain, and in the models created with it developers can insert application specifications to configure framework hot spots. Thus, the F3T supports both Domain Engineering and Application Engineering, improving their productivity and the quality of the outcome frameworks and applications. The F3T can be used to help the construction of software product lines, providing an environment to model domains and create frameworks to be used as core assets for application development.

The experiment presented in this paper has shown that, besides the gain in efficiency, the F3T reduces the complexities surrounding framework development, because, by using this tool, developers are more concerned with defining framework features in a graphical model. All the code units that compose these features, provide flexibility to the framework and allow it to be instantiated in several applications are properly generated by the F3T.

The current version of the F3T generates only the model layer of the frameworks and applications. In future works we intend to include the generation of a complete multi-portable Model-View-Controller architecture.

ACKNOWLEDGMENT
The authors would like to thank CAPES and FAPESP for
sponsoring our research.
R EFERENCES
[1] V. Stanojevic, S. Vlajic, M. Milic, and M. Ognjanovic. Guidelines for Framework Development Process. In 7th Central and Eastern European Software Engineering Conference, pages 1–9, Nov 2011.
[2] M. Abi-Antoun. Making Frameworks Work: a Project Retrospective. In ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications, 2007.
[3] R. E. Johnson. Frameworks = (Components + Patterns). Communications of the ACM, 40(10):39–42, Oct 1997.
[4] JBoss Community. Hibernate. http://www.hibernate.org, Jan 2013.
[5] Spring Source Community. Spring Framework. http://www.springsource.org/spring-framework, Jan 2013.
[6] S. D. Kim, S. H. Chang, and C. W. Chang. A Systematic Method to Instantiate Core Assets in Product Line Engineering. In 11th Asia-Pacific Conference on Software Engineering, pages 92–98, Nov 2004.
[7] David M. Weiss and Chi Tau Robert Lai. Software Product Line Engineering: A Family-Based Software Development Process. Addison-Wesley, 1999.
[8] D. Parsons, A. Rashid, A. Speck, and A. Telea. A Framework for Object Oriented Frameworks Design. In Technology of Object-Oriented Languages and Systems, pages 141–151, Jul 1999.
[9] S. Srinivasan. Design Patterns in Object-Oriented Frameworks. ACM Computer, 32(2):24–32, Feb 1999.
[10] M. Viana, R. Penteado, and A. do Prado. Generating Applications: Framework Reuse Supported by Domain-Specific Modeling Languages. In 14th International Conference on Enterprise Information Systems, Jun 2012.
[11] M. Viana, R. Durelli, R. Penteado, and A. do Prado. F3: From Features to Frameworks. In 15th International Conference on Enterprise Information Systems, Jul 2013.
[12] Sajjan G. Shiva and Lubna Abou Shala. Software Reuse: Research and Practice. In Fourth International Conference on Information Technology, pages 603–609, Apr 2007.
[13] W. Frakes and K. Kang. Software Reuse Research: Status and Future. IEEE Transactions on Software Engineering, 31(7):529–536, Jul 2005.
[14] M. Fowler. Patterns. IEEE Software, 20(2):56–57, 2003.
[15] R. S. Pressman. Software Engineering: A Practitioner's Approach. McGraw-Hill Science, 7th edition, 2009.
[16] A. Sarasa-Cabezuelo, B. Temprado-Battad, D. Rodríguez-Cerezo, and J. L. Sierra. Building XML-Driven Application Generators with Compiler Construction. Computer Science and Information Systems, 9(2):485–504, 2012.
[17] S. Lolong and A. I. Kistijantoro. Domain Specific Language (DSL) Development for Desktop-Based Database Application Generator. In International Conference on Electrical Engineering and Informatics (ICEEI), pages 1–6, Jul 2011.
[18] R. C. Gronback. Eclipse Modeling Project: A Domain-Specific Language (DSL) Toolkit. Addison-Wesley, 2009.
[19] I. Liem and Y. Nugroho. An Application Generator Framelet. In 9th International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD'08), pages 794–799, Aug 2008.
[20] J. M. Jezequel. Model-Driven Engineering for Software Product Lines. ISRN Software Engineering, 2012, 2012.
[21] K. Lee, K. C. Kang, and J. Lee. Concepts and Guidelines of Feature Modeling for Product Line Software Engineering. In 7th International Conference on Software Reuse: Methods, Techniques and Tools, pages 62–77, London, UK, 2002. Springer-Verlag.
[22] K. C. Kang, S. G. Cohen, J. A. Hess, W. E. Novak, and A. S. Peterson. Feature-Oriented Domain Analysis (FODA): Feasibility Study. Technical report, Carnegie-Mellon University Software Engineering Institute, Nov 1990.
[23] H. Gomaa. Designing Software Product Lines with UML: From Use Cases to Pattern-Based Software Architectures. Addison-Wesley, 2004.
[24] J. Bayer, O. Flege, P. Knauber, R. Laqua, D. Muthig, K. Schmid, T. Widen, and J. DeBaud. PuLSE: a Methodology to Develop Software Product Lines. In Symposium on Software Reusability, pages 122–131. ACM, 1999.
[25] OMG. OMG's MetaObject Facility. http://www.omg.org/mof, Jan 2013.
[26] The Eclipse Foundation. Eclipse Modeling Project, Jan 2013.
[27] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén. Experimentation in Software Engineering: an Introduction. Kluwer Academic Publishers, Norwell, MA, USA, 2000.
[28] X. Amatriain and P. Arumi. Frameworks Generate Domain-Specific Languages: A Case Study in the Multimedia Domain. IEEE Transactions on Software Engineering, 37(4):544–558, Jul-Aug 2011.
[29] T. C. Oliveira, P. Alencar, and D. Cowan. ReuseTool: An Extensible Tool Support for Object-Oriented Framework Reuse. 84(12):2234–2252, Dec 2011.
[30] Pure Systems. Pure::Variants. http://www.puresystems.com/pure variants.49.0.html, Feb 2013.
A Metric of Software Size as a Tool for IT
Governance
Marcus Vinícius Borela de Castro, Carlos Alberto Mamede Hernandes
Tribunal de Contas da União (TCU)
Brasília, Brazil
{borela, carlosmh}@tcu.gov.br
Abstract—This paper proposes a new metric for software functional size, which is derived from Function Point Analysis (FPA) but overcomes some of its known deficiencies. The statistical results show that the new metric, Functional Elements (EF), and its submetric, Functional Elements of Transaction (EFt), have a higher correlation with the effort in software development than FPA in the context of the analyzed data. The paper illustrates the application of the new metric as a tool to improve IT governance, specifically in assessing, monitoring, and giving directions to the software development area.

Index Terms—Function Points, IT governance, performance, Software engineering, Software metrics.

This work has been supported by the Brazilian Court of Audit (TCU).

I. INTRODUCTION

Organizations need to leverage their technology to create new opportunities and produce change in their capabilities [1, p. 473]. According to ITGI [2, p. 7], information technology (IT) has become an integral part of business for many companies, with a key role in supporting and promoting their growth. In this context, IT governance fulfills an important role of directing and boosting IT in order to achieve its goals aligned with the company's strategy.

In order for IT governance to foster the success of IT and of the organization, ISO 38500 [3, p. 7] proposes three main activities: to assess the current and future use of IT; to direct the preparation and implementation of plans and policies to ensure that IT achieves organizational goals; and to monitor performance and compliance with those policies (Fig. 1).

Fig. 1. Cycle Assess-Direct-Monitor of IT Governance. Source: ISO 38500 [3, p. 7]

A metric of software size can compose several indicators to help reveal the real situation of the systems development area for the senior management of an organization, directly or through IT governance structures (e.g., an IT steering committee). Measures such as the production of software in a period (e.g., measure of software size per month) and the productivity of an area (e.g., measure of software size per effort) are examples of indicators that can support the three activities of governance proposed by ISO 38500.

For the formation of these indicators, one can use Function Point Analysis (FPA) to get function points (FP) as a metric of software size. Created by Albrecht [4], FPA has become an
international standard for measuring the functional size of software, under the ISO 20926 [5] designation. Its rules are
maintained and enhanced by a nonprofit international group of
users called International Function Point Users Group
(IFPUG), responsible for publishing the Counting Practices
Manual (CPM), now in version 4.3.1 [6].
Because it has a direct correlation with the effort expended
in software development [7]-[8], FPA has been used as a tool
for information technology management, not only in Brazil
but worldwide. As identified in the Quality Research in
Brazilian Software Industry report, 2009 [9, p. 93], FPA is the
most widely used metric to evaluate the size of software
among software companies in Brazil, used by 34.5% of the
companies. According to a survey carried out by Dekkers and
Bundschuh [10, p. 393], 80% of the projects registered in the International Software Benchmarking Standards Group (ISBSG) repository, release 10, that applied a metric used FPA.
The FPA metric is considered a highly effective instrument
to measure contracts [11, p. 191]. However, it has the
limitation of not treating non-functional requirements, such as
quality criteria and response-time constraints. Brazilian federal
government institutions also use FPA for procurement of
development and maintenance of systems. The Brazilian
Federal Court of Audit (TCU) points out FPA as an example
of metric to be used in contracts. 2 The metrics roadmap of
SISP [12], a federal manual for software procurement,
recommends its application to federal agencies.
Despite the extensive use of the FPA metric, a large number of criticisms about its validity and applicability, described in Section II-B, cast doubt on the correctness of its use in contracts and on the reliability of its application as a tool for IT management and IT governance.
Thus, the research question arises: is it possible to propose a metric for software development with the acceptance and practicality of FPA, that is, based on its already widely known concepts, but without some of the identified flaws, in order to maximize its use as a tool for IT governance focused on systems development and maintenance?
The specific objectives of this paper are: 1) to present an overview of software metrics and FPA; 2) to present the criticisms of the FPA technique that motivated the proposal of a new metric; 3) to derive a new metric based on FPA; 4) to evaluate the new metric against FPA with respect to its correlation with effort; and 5) to illustrate the use of the proposed metric in IT governance in the context of systems development and maintenance. In the following, each objective is covered in a specific section.
II. DEVELOPMENT
A. Software Metrics
1) Conceptualization, categorization, and application
Dekkers and Bundschuh [10, p. 180-181] describe various
interpretations for metric, measure, and indicator found in the
literature. Concerning this study, no distinction is made among
these three terms. We used Fenton and Pfleeger’s definition
[13, p. 5] for measure: a number or symbol that characterizes
an attribute of a real world entity, object or event, from
formally defined rules. Kitchenham et al. [14] present a
framework for software metrics with concepts related to the
formal model in which a metric is based, for example, the type
of scale used.
According to Fenton and Pfleeger [13, p. 74], software
metrics can be applied to three types of entities: processes,
products, and resources. The authors also differentiate direct metrics, in which only one attribute of an entity is used, from indirect metrics [13, p. 39]. Indirect metrics are derived by rules based on other metrics. The speed of delivery of a team (entity type: resource) is an example of an indirect metric, because it is calculated from the ratio of two measures: size of the developed software (product)
and elapsed time (process). The elapsed time is an example of
direct metric. Moser [15, p. 32] differentiates size metrics
from quality metrics: size metrics distinguish between the
smallest and the largest whereas quality metrics distinguish
between good and bad. Table I consolidates the mentioned
categories of software metrics.
² There are several rulings on the subject: 1.782/2007, 1.910/2007,
2.024/2007, 1.125/2009, 1.784/2009, 2.348/2009, 1.274/2010, 1.647/2010, all
of the Plenary of the TCU.
Moser [15, p.31] notes that, given the relationship between
a product and the process that produced it, a product measure
can be assigned to a process, and vice versa. For example, the
percentage of effort in testing, which is a development process
attribute, can be associated with the generated product as an
indicator of its quality. And the number of errors in production
in the first three months, a system attribute (product), can be
associated with the development process as an indication of its quality.
Fenton and Pfleeger [13, p. 12] set three goals for software
metrics: to understand, to control, and to improve the targeted
entity. They call our attention to the fact that the definition of
the metrics to be used depends on the maturity level of the
process being measured: the more mature, more visible, and
therefore more measurable [13, p. 83]. Chikofsky and Rubin
[16, p. 76] highlight that an initial measurement program for a
development and maintenance area should cover five key
dimensions that address core attributes for planning,
controlling, and improvement of products and processes: size,
effort, time, quality, and rework. The authors remind us that
what matters is not the metric itself, but the decisions that will be taken from it, refuting the possibility of measuring
without foreseeing the goal [16, p. 75].
According to Beyers [17, p. 337], the use of estimates of metrics (e.g., size, time, cost, effort, quality, and allocation of
people) can help in decision making related to software
development and to the planning of software projects.
2) FPA overview
According to the categorization in the previous section, FPA
is an indirect measure of product size. It measures the
functional size of an application (system) as a gauge of the
functionality requested and delivered to the user of the
software. 3 This is a metric understood by users, regardless of
the technology used.
According to Gencel and Demirors [18, p. 4], all functional
metrics ISO standards estimate software size based on the
functionality delivered to users, 4 differing in the considered
objects and how they are measured.
TABLE I
EXAMPLES OF CATEGORIES OF SOFTWARE METRICS

Criterion                     | Category    | Source
Entity                        | Of process  | [13, p. 74]
                              | Of product  |
                              | Of resource |
Number of attributes involved | Direct      | [13, p. 39]
                              | Indirect    |
Target of differentiation     | Size        | [15, p. 32]
                              | Quality     |
³ The overview presented results from the experience of the author Castro with FPA. In 1993, he coordinated the implementation of FPA in the area of systems development at the Brazilian Superior Labor Court (TST). At TCU, he works with metrics, albeit sporadically, without exclusive dedication.
⁴ Besides FPA, there are four other functional metrics that are ISO
standards, as they meet the requirements defined in the six standards of ISO
14143: MKII FPA, COSMIC-FFP, FISMA, and NESMA. Non-functional
attributes of a development process (e.g., development team experience,
chosen methodology) are not in the scope of functional metrics. Functional
requirements are only one dimension of several impacting the effort. All of
them have to be taken into account in estimates. Estimates and non-functional
requirements evaluations are not the goal of this paper.
Functionalities can be of two types: transactions, which implement data exchanges with users and other systems, and data files, which indicate the structure of stored data. There are three types of transactions: external inquiries (EQ), external outputs (EO), and external inputs (EI), according to whether the primary intent of the transaction is, respectively, a simple query, a more elaborate query (e.g., with calculated totals), or a data update. There are two types of logical data files: internal logical files (ILF) and external interface files (EIF), according to whether their data are, respectively, updated or just referenced (accessed) in the context of the application.
Fig. 2 graphically illustrates these five function types. To facilitate understanding, we can consider as an example of EI an employee inclusion form, which includes information in the employees data file (ILF) and validates the tax code (CPF) informed by the user by accessing the taxpayers file (EIF), which is external to the application. Also in the application we could have, hypothetically, an employee report, a simple query containing the names of the employees of a given organizational unit (EQ), and a more complex report with the number of employees per unit (EO).
In the FPA calculating rule, each function is evaluated for
its complexity and takes one of three classifications: low,
medium or high complexity. Each level of complexity is
associated with a size in function points. Table II illustrates
the derivation rule for external inquiries, according to the
number of files accessed (File Type Referenced - FTR) and
the number of fields that cross the boundary of the application
(Data Element Type - DET).
As for EQ, each other type of functionality (EO, EI, ILF, and EIF) has its own specific rules for the derivation of complexity and size, similar to Table II. Table III summarizes the categories
of attributes used for calculating function points according to
each type of functionality.
The software size is the sum of the sizes of its functionalities. This paper does not present an in-depth description of the concepts associated with FPA; details can be obtained in the Counting Practices Manual, version 4.3.1 [6].
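As a simple worked illustration (ours, assuming every function in the example of Fig. 2 below is of low complexity), the size of that small application would be EI (3 fp) + EQ (3 fp) + EO (4 fp) + ILF (7 fp) + EIF (5 fp) = 22 fp.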
Fig. 2. Visualization of the five types of functions in FPA: a user or external system interacts, across the application boundary, with Employee Inclusion (EI), Employee Report (EQ) and Totals per Unit (EO) transactions, the Employee file (ILF) and the Taxpayer file (EIF).
TABLE II
DERIVATION RULE FOR COMPLEXITY AND SIZE IN FUNCTION POINTS OF AN EXTERNAL INQUIRY

FTR (file)  | DET (field): 1-5 | DET: 6-19  | DET: 20 or more
1           | low (3)          | low (3)    | medium (4)
2-3         | low (3)          | medium (4) | high (6)
4 or more   | medium (4)       | high (6)   | high (6)
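For instance (an example of ours derived from Table II), an external inquiry that references 2 files (FTR = 2) and returns 10 fields (DET = 10) falls in the 2-3 FTR row and the 6-19 DET column, so it is classified as medium complexity and sized at 4 function points.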
B. Criticisms to the FPA technique that motivated the
proposal of a new metric
Despite the extensive use of the FPA metric, mentioned in Section I, there are many criticisms about its validity and applicability that call into question the correctness of its use in contracts and the reliability of its application as a tool for IT management and governance ([19], [13], [20], [21], [14], [22], [23], [24], [25]).

Several metrics have been proposed taking FPA as a basis for their derivation, either to adapt it to particular models or to improve it by fixing some known flaws. To illustrate, there is the work of Antoniol et al. [26], proposing a metric for object-oriented models, and the work of Kralj et al. [22], proposing a change in FPA to measure high complexity functions more accurately (item 4 below).
The objective of the metric proposed in this paper is not to
solve all faults of FPA, but to help to reduce the following
problems related to its definition:
1) low representation: the metric restricts the size of a
function to only three possible values, according to its
complexity (low, medium, or high). But there is no limit on
the number of possible combinations of functional elements
considered in calculating the complexity of a function in FPA;
2) functions with different functional complexities have the
same size: as a consequence of the low representation.
Pfleeger et al. [23, p. 36] say that if H is a measure of size and A is greater than B, then H(A) should be greater than H(B).
Otherwise, the metric would be invalid, failing to capture in
the mathematical world the behavior we perceive in the
empirical world. Xia et al. [25, p. 3] show examples of
functions with different complexities that were improperly
assigned the same value in function points because they fall
into the same complexity classification, thus exposing the
problem of ambiguous classification;
3) abrupt transition between functional element ranges: Xia
et al. [25, p. 4] introduced this problem. They present two
logical files, B and C, with apparent similar complexities,
differing only in the number of fields: B has 20 fields and C
has 19 fields. The two files are classified as medium complexity (10 fp, function points) and low complexity (7 fp), respectively. The
difference lies in the transition of the two ranges in the
complexity derivation table: up to 19 fields, it is considered
low complexity; from 20 fields, it is considered medium
complexity. The addition of only one field leading to an
increase of 3 fp is inconsistent, since varying from 1 to 19
fields does not involve any change in the function point size.
A similar result occurs in other ranges of transitions;
TABLE III
CATEGORIES OF FUNCTIONAL ATTRIBUTES FOR EACH TYPE OF FUNCTIONALITY

Function                 | Functional Attributes
Transactions: EQ, EO, EI | referenced files (FTR) and fields (DET)
Logical files: ILF, EIF  | logical records (Record Element Type - RET) and fields (DET)

4) limited sizing of high (and low) complexity functions: FPA sets an upper (and a lower) limit for the size of a function at 6, 7, 10 or 15 fp, according to its type. Kralj et al. [22, p. 83] describe high complexity functions with improper sizes in FPA. They propose a change in the calculation of FPA to support larger sizes for greater complexity;
5) undue operation on an ordinal scale: as previously seen, FPA involves classifying the complexity of functions as low, medium or high, an ordinal scale. In the calculation process these labels are substituted by numbers. An internal logical file, for example, receives 7, 10 or 15 function points, as its complexity is low, medium or high, respectively. Kitchenham [20, p. 29] criticizes the inadequacy of adding up values of an ordinal scale in FPA. He argues that it makes no sense to add the complex label to the simple label, even if using 7 as a synonym for simple and 15 as a synonym for complex;
6) inability to measure changes in parts of the function: this
characteristic, for example, does not allow measuring function points for part of a functionality that needs to be changed in
one maintenance operation. Thus, a function addressed in
several iterations in an agile method or other iterative process
is always measured with full size, even if the change is
considered small in each of them. For example, consider three
maintenance requests at different moments for a report already
with the maximum size of 7 fp, which initially showed 50
distinct fields. Suppose each request adds a single field. The
three requests would be dimensioned with 7 fp each, the same
size of the request that created the report, and would total 21
fp. Aware of this limitation, FPA [6, vol. 4, p. 94] points to the
Netherlands Software Metrics Association (NESMA) metric
as an alternative for measuring maintenance requests. NESMA
presents an approach to solve this problem. According to the
Function Point Analysis for Software Enhancement [27],
NESMA measures a maintenance request as the multiplication
of the original size of a function by a factor of impact of the
change. The impact factor is the ratio of the number of
attributes (e.g., fields and files) included, changed or deleted
by the original number of attributes of the function. The
impact factor assumes values in multiples of 25%, varying up
to a maximum of 150%.
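As a worked illustration of this rule (ours), consider again the 50-field report above: a request that adds a single field changes 1/50 = 2% of its attributes, which maps to the minimum NESMA impact factor of 25% (the smallest value the rule can produce, as also used in Section II-C-1); the request would therefore be sized at 25% × 7 fp = 1.75 fp instead of the full 7 fp.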
Given the reported deficiencies, the correlation between the size of software in function points and the effort required for its development tends not to be adequate, since FPA has these deficiencies in the representation of the real functional size of the software. If there are inaccuracies in measuring the size of what must be done, one cannot expect a proper definition of the effort and, therefore, accuracy in defining the cost of development and maintenance. The mentioned problems motivated this work, whose aim is to propose a quantitative metric with a continuous range of values, called Functional Elements (EF).
C. Derivation process of the new metric
The proposed metric, Functional Elements, adopts the same
concepts of FPA but changes the mechanism to derive the size
of functions. The use of concepts widely known to metric
specialists will enable acceptance and adoption of the new
metric among these professionals.
The reasoning process for deriving the new metric, as
described in the following sections, implements linear
regression similar to that seen in Fig. 3. The objective is to
derive a formula for calculating the number of EF for each
type of function (Table VII in Section II-C-4) from the
number of functional attributes considered in the derivation of
its complexity, as indicated in Table III in Section II-A-2. In
this paper, these attributes correspond to the concept of
functional elements, which is the name of the metric proposed.
The marked points in Fig. 3 indicate the size in fp (Z axis)
of an external inquiry derived from the number of files (X
axis) and the number of fields (Y axis), which are the
attributes used in the derivation of its complexity (see Table II
in Section II-A-2). The grid is the result of a linear regression
of these points, and represents the new value of the metric.
1) Step 1 - definition of the constants
If the values associated with the two categories of
functional attributes are zero, the EF metric assumes the value
of a constant. Attributes can be assigned value zero, for
example, in the case of maintenance limited to the algorithm
of a function not involving changes in the number of fields
and files involved.
The values assigned to these constants come from the
NESMA functional metric mentioned in Section 2-B. This
metric was chosen because it is an ISO standard and supports
the maintenance case with zero-value attributes. For each type
of functionality, the proposed metric uses the smallest possible
value by applying NESMA, that is, 25% of the number of fp
of a low complexity function of each type: EIF - 1.25 (25% of
5); ILF - 1.75 (25% of 7); EQ - 0.75 (25% of 3); EI - 0.75
(25% of 3), and EO - 1 (25% of 4).
Fig. 3. Derivation of number of fp of an external inquiry from the attributes
used in the calculation
2) Step 2 - treatment of ranges with an unlimited number of elements
In FPA, each type of function has its own table to derive the complexity of a function. Table II in Section II-A-2 presents the values of the ranges of functional attributes for the derivation of the complexity of external inquiries. The third and last range of values of each functional attribute, for all types of functions, is unlimited: 20 or more DET in the first cell of the fourth column of that table, and 4 or more FTR in the last cell of the first column.
The number of elements of the larger of the first two ranges was chosen as the upper limit of the third range. In the case of the ranges for external inquiries, the number of fields was limited to 33, so that the third range (20 to 33) has the same 14 elements as the second range (6 to 19), the larger one. The number of referenced files was limited to 5, following the same reasoning.
Bounding the ranges is a mathematical artifice used to generate the regression points; it does not impose an upper limit on the new metric itself (4th criticism in Section II-B).
3) Step 3 - generation of points for regression
The objective of this step was to generate, for each type of function, a set of data records with three values: the two functional attributes and the derived fp, already decreased by the constant defined in step 1. Table IV illustrates some of the points generated for the external inquiry.
An application developed in MS Access generated a dataset
with all possible points for the five types of functions, based
on the tables of complexity with bounded ranges developed in
the previous section. Table V shows all considered
combinations of ranges for EQ.
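As an illustration of steps 2 and 3 (not the authors' MS Access application), a short Python sketch that generates the regression points for EQ under the bounded ranges described above; the fp values per range combination follow the standard FPA complexity table for external inquiries, as reflected in Table V:

# fp of an EQ by FTR range (0-1, 2-3, 4+) x DET range (1-5, 6-19, 20+).
EQ_FP = [[3, 3, 4],
         [3, 4, 6],
         [4, 6, 6]]
EQ_CONSTANT = 0.75  # step 1 constant for EQ (25% of 3 fp)

def ftr_range(ftr):
    return 0 if ftr <= 1 else 1 if ftr <= 3 else 2

def det_range(det):
    return 0 if det <= 5 else 1 if det <= 19 else 2

# Step 2: bounded ranges (FTR limited to 5, DET limited to 33).
# Step 3: one record per combination, with the constant already subtracted.
points = [(ftr, det, EQ_FP[ftr_range(ftr)][det_range(det)] - EQ_CONSTANT)
          for ftr in range(1, 6) for det in range(1, 34)]

print(len(points))  # 165 records, the number reported for EQ in Table VI
print(points[:2])   # [(1, 1, 2.25), (1, 2, 2.25)], as in Table IV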
4) Step 4 - linear regression
The points obtained by the procedure described in the previous section were imported into MS Excel for linear regression using the ordinary least squares (OLS) method. The regression between the size in fp, the dependent variable, and the functional attributes, the independent variables, was performed with the constant held at zero, since the constants were already defined in step 1 and subtracted from the expected values in step 3. The statistical results of the regression are shown in Table VI for each type of function.
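A minimal sketch of this step using NumPy rather than MS Excel (illustrative only): an ordinary least squares fit with no intercept over the EQ points generated above.

import numpy as np

# Rebuild the EQ points: rows of (FTR, DET, fp already decreased by the constant).
EQ_FP = [[3, 3, 4], [3, 4, 6], [4, 6, 6]]
points = [(f, d, EQ_FP[0 if f <= 1 else 1 if f <= 3 else 2]
                      [0 if d <= 5 else 1 if d <= 19 else 2] - 0.75)
          for f in range(1, 6) for d in range(1, 34)]
data = np.array(points, dtype=float)
X, y = data[:, :2], data[:, 2]

# No intercept column is added, so the fitted plane passes through the origin,
# mirroring a regression with the constant held at zero.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately the FTR and DET coefficients of the EQ row in Table VII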
Table VII shows the derived formula for each type of function, with coefficient values rounded to two decimal places. Each formula calculates the number of functional elements, which is the proposed metric, based on the functional attributes impacting the calculation and on the constants indicated in step 1. The acronyms EFt and EFd represent the functional elements associated with transactions (EQ, EI, and EO) and with data (ILF and EIF), respectively.
TABLE IV
PARTIAL EXTRACT OF THE DATASET FOR EXTERNAL INQUIRY
FTR   DET   FP (decreased by the constant of step 1)
1     1     2.25
1     2     2.25
(...)
1     33    3.25
2     1     2.25
(...)
TABLE V
COMBINATIONS OF RANGES FOR CALCULATING FP OF EQ
Function  Initial  Final  Initial  Final  Original  FP decreased
type      FTR      FTR    DET      DET    FP        by the constant
EQ        1        1      1        5      3         2.25
EQ        1        1      6        19     3         2.25
EQ        1        1      20       33     4         3.25
EQ        2        3      1        5      3         2.25
EQ        2        3      6        19     4         3.25
EQ        2        3      20       33     6         5.25
EQ        4        5      1        5      4         3.25
EQ        4        5      6        19     6         5.25
EQ        4        5      20       33     6         5.25
TABLE VI
STATISTICAL REGRESSION - COMPARING RESULTS PER TYPE OF FUNCTION
                                  ILF        EIF        EO         EI         EQ
R2                                0.96363    0.96261    0.95171    0.95664    0.96849
Records                           729        729        198        130        165
Coefficient p-value (FTR or RET)  3.00E-21   1.17E-21   7.65E-57   1.70E-43   4.30E-60
Coefficient p-value (DET)         2.28E-23   2.71E-22   1.44E-59   2.76E-39   2.95E-45
TABLE VII
CALCULATION FORMULAS OF FUNCTIONAL ELEMENTS BY TYPE OF FUNCTION5
Function type   Formula
ILF             EFd = 1.75 + 0.96 * RET + 0.12 * DET
EIF             EFd = 1.25 + 0.65 * RET + 0.08 * DET
EO              EFt = 1.00 + 0.81 * FTR + 0.13 * DET
EI              EFt = 0.75 + 0.91 * FTR + 0.13 * DET
EQ              EFt = 0.75 + 0.76 * FTR + 0.10 * DET
The functional elements metric, EF, is the sum of the functional elements of transactions, EFt, and the functional elements of data, EFd, as expressed by the formulas of Table VII. So the proposed metric is: EF = EFt + EFd.
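To make the use of these formulas concrete, a small illustrative Python sketch (not from the paper) that computes EFt or EFd for a single function from its type and functional attributes:

# (constant, file coefficient, field coefficient) per function type, from Table VII.
FORMULAS = {
    "ILF": (1.75, 0.96, 0.12),  # EFd = 1.75 + 0.96*RET + 0.12*DET
    "EIF": (1.25, 0.65, 0.08),  # EFd = 1.25 + 0.65*RET + 0.08*DET
    "EO":  (1.00, 0.81, 0.13),  # EFt = 1.00 + 0.81*FTR + 0.13*DET
    "EI":  (0.75, 0.91, 0.13),  # EFt = 0.75 + 0.91*FTR + 0.13*DET
    "EQ":  (0.75, 0.76, 0.10),  # EFt = 0.75 + 0.76*FTR + 0.10*DET
}

def functional_elements(func_type, files, fields):
    # 'files' is FTR for transactions or RET for logical files; 'fields' is DET.
    c0, c_file, c_field = FORMULAS[func_type]
    return c0 + c_file * files + c_field * fields

# An external inquiry referencing 2 files and 10 fields:
print(functional_elements("EQ", 2, 10))  # 0.75 + 0.76*2 + 0.10*10 = 3.27
# EF is then the sum of the transaction part (EFt) and the data part (EFd).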
The EFt submetric considers logical files (ILF and EIF) only as they are referenced in the context of transactions; files are not counted separately, as they are in the EFd submetric. Similarly to two other ISO standard metrics of functional size [10, p. 388], MkII FPA [28] and COSMIC-FFP [29], EFt does not take logical files into account. EFt is indicated for cases where the effort of dealing with data structures (EFd) is not subject to evaluation or procurement.
In the next section, the EF and EFt metrics are evaluated, counting and not counting logical files, respectively. The results show a stronger correlation with effort for EFt. Although not evaluated, the EFd submetric has its role, as it reflects the structural complexity of the data of an application.
D. Evaluation of the new metric
The new EF metric and its submetric EFt were evaluated for
their correlation with effort in comparison to the FPA metric.6
The goal was not to evaluate the quality of these correlations,
but to compare their ability to explain the effort.
We obtained a spreadsheet from a federal government agency with records of Service Orders (OS) contracted with private companies for coding and testing activities.
5. The size of a request for deleting a function is equal to the constant value, since no specific attributes are impacted by this operation.
6. Kemerer [8, p. 421] justified linear regression as a means of measuring this correlation.
An OS contained one or more requests for maintenance or development of functions of one system, such as: create a report, change a transaction. The spreadsheet showed for each
OS the real allocated effort and, for each request, the size of
the function handled. The only fictitious data were the system
IDs, functionality IDs and OS IDs, as they were not relevant to
the scope of this paper. Each system was implemented in a
single platform: Java, DotNet or Natural. The spreadsheet
showed the time spent in hours and the number of people
allocated for each OS. The OS effort, in man-hours, was
derived from the product of time by team size. Table VIII
presents the structure of the received data.
Data from 183 Service Orders were obtained. However, 12
were discarded for having dubious information, for example,
undefined values for function type, number of fields, and
operation type. The remaining 171 service orders were related
to 14 systems and involved 505 requests that dealt with 358
different functions. To achieve higher quality in the
correlation with effort, we decided to consider only the four
systems associated with at least fifteen OS, namely, systems
H, B, C, and D. Table IX indicates the number of OS and
requests for each system selected.
The data were imported into MS Excel to perform the linear regression using the ordinary least squares method, after calculating the size in the EF and EFt metrics for each request with an MS Access application developed by the authors.7 The regression considered the effort as the dependent variable and the sizes calculated in the FP, EF, and EFt metrics as the independent ones. As there is no effort if there is no size, the
regression considered the constant with value zero, that is, the
straight line crosses the origin of the axes. Independent
regressions were performed for each system, since the
variability of the factors that influence the effort is low within
a single system, because the programming language is the
same and the technical staff is generally also the same.8 Fig. 4
illustrates the dispersion of points (OS) on the correlation
between size and effort in EFt (man-hour) and the line derived
by linear regression in the context of system H.
The coefficient of determination R2 was used to represent
the degree of correlation between effort and size calculated for
each of the evaluated metrics. According to Sartoris [30, p.
244], R2 indicates, in a linear regression, the percentage of the
variation of a dependent variable Y that is explained by the
variation of a second independent variable X. Table IX shows
the results of the linear regressions performed.
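For illustration (with hypothetical data, not the confidential service orders used in the paper), a regression through the origin and its R2 can be computed as follows; the R2 definition below is the usual one quoted from Sartoris, while the exact variant produced by the authors' spreadsheet setup is not stated:

import numpy as np

def fit_through_origin(size, effort):
    # Least-squares line effort = b * size, forced through the origin.
    size, effort = np.asarray(size, float), np.asarray(effort, float)
    b = (size @ effort) / (size @ size)
    predicted = b * size
    ss_res = np.sum((effort - predicted) ** 2)
    ss_tot = np.sum((effort - effort.mean()) ** 2)
    return b, 1 - ss_res / ss_tot

# Hypothetical OS data for one system: sizes in EFt, efforts in man-hours.
b, r2 = fit_through_origin([5, 12, 20, 33], [180, 420, 700, 1150])
print(b, r2)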
From the results presented in Table IX, comparing the correlation of the metrics with effort, we observed that:
7. A logistic nonlinear regression with a constant was also performed using Gretl, a free open source tool (http://gretl.sourceforge.net). However, the R2 values showed that this alternative was worse than the linear regression for all metrics.
8. The factors that influence the effort and the degree of this correlation are discussed in several articles. We suggest the articles available in the BESTweb database (http://www.simula.no/BESTweb), created as a result of the research of Jorgensen and Shepperd [31].
TABLE VIII
STRUCTURE OF THE RECEIVED DATA TO EVALUATE THE METRIC
Abbreviation - Description - Domain
OS - Identification number of a service order - up to 10 digits
Function - Identification number of a function - up to 10 digits
Type - Type (categorization) of a functionality according to FPA - ALI, AIE, EE, SE or CE (i.e., ILF, EIF, EI, EO or EQ)
Operation - Operation performed, which may be the inclusion (I) of a new feature or the change (A) of a function (maintenance) - I or A
Final FTR RET - Value at the conclusion of the request implementation: if the function is a transaction, indicates the number of referenced logical files (FTR); if it is a logical file, indicates the number of logical records (RET) - up to 3 digits
Operation FTR RET - Number of FTR or RET included, changed or deleted in the scope of a maintenance of a functionality (only in change operations) - up to 3 digits
Original FTR RET - Number of FTR or RET originally found in the functionality (only in change operations) - up to 3 digits
Final DET - Number of DET at the conclusion of the request implementation - up to 3 digits
Operation DET - Number of DET included, changed or deleted in the scope of a maintenance of a functionality (only in change operations) - up to 3 digits
Original DET - Number of DET originally found in the functionality (only in change operations) - up to 3 digits
FP - Number of function points of the functionality at the conclusion of the request - up to 2 digits
%Impact - Percentage of the original function impacted by the maintenance, as measured by NESMA [27] - 25, 50, 75, 100, 125, 150
PM - Number of maintenance points of the functionality handled, as measured by NESMA [27] - up to 4 digits
System - Identification of a system - one char
Hours - Hours dedicated by the team to implement the OS - up to 5 digits
Team - Number of team members responsible for the implementation of the OS - up to 2 digits
1) correlations of the new metrics (EF, EFt) were considered significant at a confidence level of 95% for all systems (p-value less than 0.05).9 However, the correlation of FPA was not significant for system B (p-value 0.088 > 0.05);
2) correlations of the new metrics were higher in the two systems with the highest number of OS (H and B). A better result in larger samples is an advantage, because the larger the sample size, the greater the reliability of the results; indeed, the p-value reached its lowest values for these systems;
3) although no metric achieved a high coefficient of determination (R2 > 0.8), the new metrics achieved medium correlation (0.5 < R2 < 0.8) in the four systems evaluated, whereas FPA obtained weak correlation (R2 < 0.2) in system B; we considered a confidence level of 91.2% for this correlation (p-value 0.088);
4) the correlation of the new metrics was superior in three
out of the four systems (H, B, and D). (A correlation C1 is
classified as higher than a correlation C2 if C1 is significant
and C2 is not significant or if both correlations are significant
and C1 has a higher R2 than C2.)
9. To be considered statistically significant at a confidence level of X%, the correlation must have a p-value less than 1 - X [30, p. 11]. For a 95% confidence level, the p-value must be less than 0.05.
Fig. 4. Dispersion of points (OS) of system H: effort (man-hours, Y axis, 0 to 1500) versus size (EFt, X axis, 0 to 40), with the line derived by linear regression.
TABLE IX
RESULTS OF LINEAR REGRESSIONS - EFFORT VERSUS METRICS OF SIZE
System                        H         B         C         D
Quantity of OS                45        25        21        15
Quantity of Requests          245       44        60        20
FP:  R2                       59.3%     11.2%     67.7%     51.8%
     p-value (F-test)         4.6E-10   8.8E-02   3.3E-06   1.9E-03
EF:  R2                       65.1%     60.3%     53.0%     54.7%
     p-value (F-test)         1.5E-11   2.3E-06   1.4E-04   1.2E-03
     Proportion to FP's R2    +10%      +438%     -22%      +5%
EFt: R2                       66.1%     60.3%     53.0%     54.7%
     p-value (F-test)         8.5E-12   2.3E-06   1.4E-04   1.2E-03
     Proportion to FP's R2    +11%      +438%     -22%      +5%
Given the observations listed above, we conclude, for the analyzed data, that the proposed metrics, EF and EFt, have a better correlation with effort than FPA. A higher correlation of the EFt metric in comparison to EF was perceived for system H. Only system H allowed differentiating the results of the two metrics, as it was the only one presenting requests for changing logical files in its service orders. Therefore, we see that the EFt submetric tends to yield better correlations than EF. This result reinforces the hypothesis that the EFd submetric, which composes the EF metric, does not impact the effort, at least not for coding and testing, which are the tasks addressed in the evaluated service orders.
Table X contains the explanation of how the proposed
metrics, EF and EFt, address the criticisms presented in
Section II-B.
E. Illustration of the use of the new metrics in IT governance
Kaplan and Norton [33, p. 71] claim that what you measure is what you get. According to COBIT 5 [34, p. 13],
governance aims to create value by obtaining the benefits
through optimized risks and costs. In relation to IT
governance, the metrics proposed in this paper not only help to
assess the capacity of IT but also enable the optimization of its
processes to achieve the results.
Metrics support the communication between the different actors of IT governance (see Fig. 5) by enabling the translation of objectives and results into numbers. The quality of a process
can be increased by stipulating objectives and by measuring
results through metrics [15, p. 19]. So, the production capacity
of the process of information systems development can be
enhanced to achieve the strategic objectives with the
appropriate use of metrics and estimates.
TABLE X
JUSTIFICATIONS OF HOW THE NEW METRICS ADDRESS THE CRITICISMS PRESENTED IN SECTION II-B
Critique: Low representation. Solution: each possible combination of the functional attributes considered in deriving the complexity in FPA is associated with a distinct value.
Critique: Functions with different complexities have the same size. Solution: functionalities with different complexities, as determined by the number of functional attributes, assume different sizes.
Critique: Abrupt transition between functional element ranges. Solution: by applying the calculation formulas described in Section II-C-4, the variation in size is uniform for each variation in the number of functional attributes, according to its coefficients.
Critique: Limited sizing of high (and low) complexity functions. Solution: there is no limit on the size assigned to a function by applying the calculation formulas described in Section II-C-4.
Critique: Undue operation on ordinal scale. Solution: the metrics do not use an ordinal scale with finite values, but rather a quantitative scale with infinite discrete values, which provides greater reliability in operations with the values.
Critique: Inability to measure changes in parts of the function. Solution: the metrics enable the measurement of changes in part of a functionality by considering in the calculation only the functional attributes impacted by the change.
Software metrics contribute to the three IT governance
activities proposed by ISO 38500, mentioned in Section I: to
assess, to direct and to monitor. These activities correspond,
respectively, to the goals of software metrics mentioned in
Section II-A-1: to understand, to improve, and to control the
targeted entity of a measurement.
Regarding the direction of the IT area, Weill and Ross [36, p. 188] state that the creation of metrics for the formalization of strategic choices is one of four management principles that summarize how IT governance helps companies achieve their strategic objectives. Metrics must capture the progress toward strategic goals and thus indicate whether IT governance is working or not [36, p. 188].
Kaplan and Norton [37, pp. 75-76] claim that strategies need to be translated into a set of goals and metrics in order to obtain everyone's commitment. They claim that the Balanced Scorecard (BSC) is a tool that provides knowledge of long-term strategies at all levels of the organization and also promotes the alignment of departmental and individual goals with those strategies. According to ITGI [2, p. 29], the BSC, besides providing a holistic view of business operations, also contributes to connecting long-term strategic objectives with short-term actions.
To adapt the concepts of the BSC to the IT function, the perspectives of a BSC were redefined [38, p. 3]. Table XI presents the perspectives of a BSC-IT and their base questions.
Fig. 5. Roles, activities and relationships of IT governance. Roles: Owners and Stakeholders, Governing Body, Management, Operations; activities: delegate/accountable, set direction/monitor, instruct/report. Source: ISACA [35, p. 24]
TABLE XI
PERSPECTIVES OF A BSC-IT
Perspective - Base question - Corporate BSC perspective
Contribution to the business - How do business executives see the IT area? - Financial
Customer orientation - How do customers see the IT area? - Customer
Operational excellence - How effective and efficient are the IT processes? - Internal Processes
Future orientation - How is IT prepared for future needs? - Learning
Source: inspired by ITGI [2, p. 31]
According to ITGI [2, p. 30], BSC-IT effectively helps the
governing body to achieve alignment between IT and the
business. This is one of the best practices for measuring
performance [2, p. 46]. BSC-IT is a tool that organizes
information for the governance committee, creates consensus
among the stakeholders about the strategic objectives of IT,
demonstrates the effectiveness and the value added by IT and
communicates information about capacity, performance and
risks [2, p. 30].
Van Grembergen [39, p. 2] states that the relationship between IT and the business can be more explicitly expressed through a cascade of scorecards, and divides the BSC-IT into two: BSC-IT-Development and BSC-IT-Operations. Rohm and Malinoski [40], members of the Balanced Scorecard Institute, present a nine-step process to build and implement scorecard-based strategies. Bostelman and Becker [41] present a method to derive objectives and metrics from the combination of the BSC with the Goal Question Metric (GQM) technique proposed by Basili and Weiss [42]. The association with the GQM method is consistent with what ISACA [43, p. 74] says: good strategies start with the right questions. The metric proposed in this paper can compose several indicators to be used in a BSC-IT-Development.
Regarding the IT monitoring and assessment activities [3, p. 7], metrics enable monitoring the improvement rate of organizations toward a mature and improved process [1, p. 473]. Performance measurement, which is the object of monitoring and assessment, is one of the five focus areas of IT governance and is classified as a driver to achieve results [2, p. 19].
To complement the illustration of the applicability of the new metric to IT governance, Table XII shows some indicators based on EF.10 The same indicator can be used in different perspectives of a BSC-IT-Development, depending on the targeted entity and the objective of the measurement, as in the following examples. The productivity of a resource (e.g., staff, technology) may be associated with the Future Orientation perspective, as it seeks to answer whether IT is prepared for future needs. The same indicator, if associated with an internal process such as coding, reflects a vision of its production capacity, in the Operational Excellence perspective. In the Customer Orientation
perspective, production can be divided by client, showing the proportion of IT production delivered to each business area. The evaluation of the variation in IT production in contrast to the production of the business would be an example of using the indicator in the Contribution to the Business perspective.
10. The illustration is not restricted to EF, as the indicators could use other software size metrics.
The choice of indicators aimed to encompass the five
fundamental dimensions mentioned in Section II-A-1: size,
effort, time, quality, and rework. A sixth dimension was
added: the expected benefit. According to Rubin [44, p. 1],
every investment in IT, from a simple training to the creation
of a corporate system, should be aligned to a priority of the
business whose success must be measured in terms of a
specific value. Investigating the concepts and processes
associated with the determination of the value of a function (or
a system or the IT area) is not part of the scope of this work.
This is a complex and still immature subject. The dimension
of each indicator is shown in the third column of Table XII.
Some measurements were normalized by dividing them by the number of functional elements of the product or process, a tactic used to allow comparison across projects and systems of different sizes. The ability to standardize comparisons, as in a BSC, is one of the key features of software metrics [45, p. 493]. This is similar to normalizing construction metrics by square meter, a common practice [46, p. 161].
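As a purely illustrative sketch (with hypothetical values), two of the normalized indicators of Table XII computed for one system in one period:

# Hypothetical inputs for one system in one period.
production_ef = 120.0    # functional size (EF) of the requests implemented
effort_mh = 960.0        # man-hours of everyone allocated to the system
failures = 18            # failures observed in use during the period
system_size_ef = 850.0   # functional size of the system at the end of the period

productivity = production_ef / effort_mh    # EF per man-hour (effort dimension)
error_density = failures / system_size_ef   # failures per EF (quality dimension)
print(productivity, error_density)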
As Dennis argues [47, p. 302], one should not make
decisions based on a single indicator, but from a vision formed
by several complementary indicators. As IT has assumed
greater prominence as a facilitator to the achievement of
business strategy, the use of dashboards to monitor its
performance, under appropriate criteria, has become popular
among company managers [43, p. 74]. Abreu and Fernandes
[48, p. 167] propose some topics that may compose such
strategic and tactical control panels of IT.
TABLE XII
DESCRIPTION OF ILLUSTRATIVE INDICATORS
Metric (Unit; Dimension) - Description of the calculation for a system
Functional size (EF; Size) - sum of the functional size of the functionalities that compose the system at the end of the period
Production in the period (EF; Effort) - sum of the functional size of the requests for inclusion, deletion, and change implemented in the period
Production on rework (EF; Rework) - sum of the functional size of the requests for deletion and change implemented in the period
Productivity (Functional Elements / Man-hour; Effort) - sum of the functional size of the requests implemented in the period / sum of the efforts of all persons allocated to the system activities in the period
Error density (Failures / Functional Element; Quality) - number of failures resulting from the use of the system in a period / size of the system at the end of the period
Delivery speed (Functional Elements / Hour; Time) - sum of the size of the features implemented in the period / elapsed time
Density of the expected benefit ($ / EF; Expected benefit) - benefit expected from the system in the period / system size
Fig. 6 illustrates the behavior of the indicators shown in Table XII, with annual verification, for hypothetical systems A, B, C, and D.11 The vertical solid line indicates how the indicator for the system stood in the previous period, allowing a view of how much the values increased or decreased over the period. In the productivity column (column 4), a short line at its base indicates, for example, a reference value obtained by benchmarking. The vertical dashed line associated with the production in the period (column 2) indicates the target set for each system in the period: system A reached it, system D exceeded it, and systems B and C fell short.
In an illustrative and superficial analysis of the indicators for system C, one can associate the failure to achieve the production target in the period (2) with the decrease in delivery speed (6) and the increase in production on rework (3), which most likely resulted from the growth in the error density (5). The reduction in delivery speed (6), which can be associated with the decreased productivity (4), led to low growth of the functional size of the system (1) during the period. These negative results led to a decrease in the density of the expected benefit (7).
Fig. 6 represents one option for visualizing the governance indicators shown in Table XII: a chart of multiple metrics over multiple instances of a targeted entity or attribute. The width of each vertical column varies according to the values of the indicators (horizontal axis) associated with the different instances of the entities or attributes of interest (vertical axis). The same vertical space is allocated to each entity instance. The width of the colored area, which is traced from left to right, graphically indicates the value of the indicator for that instance.
In the hands of the governance committee, correct
indicators can help senior management, directly or through
any governance structure, to identify how IT management is
behaving and to identify problems and the appropriate course
of action when necessary.
Fig. 6. Annual indicators of systems A, B, C, and D. Rows: systems D, C, B, A. Columns: (1) Functional size, (2) Production in the period, (3) Production on rework, (4) Productivity, (5) Error density, (6) Delivery speed, (7) Density of the expected benefit.
11. The fictitious values associated with the indicators were adjusted so that all vertical columns had the same maximum width. The adjustment was done by relating the maximum value of each indicator to the width defined for the column; the other values were derived by a simple rule of three.
III. FINAL CONSIDERATIONS
The five specific objectives proposed for this work in
Section I were achieved, albeit with limitations and with
possibilities for improvement that are translated into proposals
for future work.
The main result was the proposition of a new metric EF and
its submetric EFt. The new metrics, free of some deficiencies
of the FPA technique taken as a basis for their derivation,
reached a higher correlation with effort than the FPA metric,
in the context of the analyzed data.
The paper also illustrated the connection found between metrics and IT governance activities, whether in assessment and monitoring, through their use in dashboards, or in setting direction, through their use in a BSC-IT.
There are possibilities for future work in relation to each of
the five specific objectives.
Regarding the conceptualization and the categorization of software metrics, a comprehensive literature review is necessary for the construction of a broader and more up-to-date categorization of software metrics.
Regarding the presentation of the criticisms of FPA, only the criticisms addressed by the new proposed metrics were presented. Further research on the theme, such as a bibliographic survey to catalog the criticisms, would encourage other propositions of software metrics.
Regarding the process of creating the new metric, it could be improved, or it could be applied to other metrics from any area of knowledge based on ordinal values derived from complexity tables, as in FPA (e.g., the Use Case Points metric proposed by Karner [49]). Future work may also propose and evaluate changes in the rules and in the scope of the new metrics.
Regarding the evaluation of the new metric, the limitation
in using data from only one organization could be overcome in
new works. Practical applications of the metric could also be
illustrated. New works could compare the results of EF with
the EFt submetric as well as compare both with other software
metrics. Different statistical models could be used to evaluate
its correlation with effort even in specific contexts (e.g.,
development, maintenance, development platforms). We
expect to achieve a higher correlation of the new metric with
effort in agile methods regarding to the APF, considering its
capacity of partial functionality sizing. (6th criticism in Section
II-B.)
Regarding the connection with IT governance, work on the use of metrics in all IT governance activities is promising. The proposed chart for visualizing multiple indicators of multiple instances through columns of varying width along their length can also be standardized and improved in future work.12
12. At http://learnr.wordpress.com (accessed on 4 November 2012) there is a chart that functionally resembles the proposed one: heatmap plotting. However, it differs in format and in its possibilities of evolution. As we did not find any similar chart, we presume this to be a new format for viewing the behavior of Multiple Indicators about Multiple Instances through Columns with Varying Widths along their Extension (MIMICoVaWE). An example of evolution would be varying the color tone of a cell according to a specific criterion (e.g., in relation to the achievement of a specified goal).
A suggestion for future work is noteworthy: the definition of an indicator that shows the level of maturity of a company regarding the use of metrics in IT governance. Among other aspects, it could consider: the breadth of the entities evaluated (e.g., systems, projects, processes, teams), the dimensions treated (e.g., size, rework, quality, effectiveness), and the effective use of the indicators (e.g., monitoring, assessment).
Finally, we expect that the new metric EF and its submetric
EFt help increase the contribution of IT to the business, in an
objective, reliable, and visible way.
REFERENCES
[1] H. A. Rubin, "Software process maturity: measuring its impact on productivity and quality," in Proc. of the 15th Int. Conf. on Softw. Eng., IEEE Computer Society Press, pp. 468-476, 1993.
[2] ITGI - IT Governance Institute, Board Briefing on IT Governance, 2nd ed., Isaca, 2007.
[3] ISO/IEC, 38500: Corporate governance of information technology, 2008.
[4] A. J. Albrecht, "Measuring application development productivity," in Guide/Share Application Develop. Symp. Proc., pp. 83-92, 1979.
[5] ISO/IEC, 20926: Software measurement - IFPUG functional size measurement method, 2009.
[6] IFPUG - International Function Point Users Group, Counting Practices Manual, Version 4.3.1, IFPUG, 2010.
[7] A. Albrecht and J. Gaffney Jr., "Software function, source lines of code, and development effort prediction: A software science validation," IEEE Trans. Softw. Eng., vol. 9, pp. 639-648, 1983.
[8] C. F. Kemerer, "An empirical validation of software cost estimation models," Communications of the ACM, vol. 30, no. 5, pp. 416-429, 1987.
[9] Brazil. MCT - Ministério da Ciência e Tecnologia, "Quality Research in the Brazilian Software Industry; Pesquisa de Qualidade no Setor de Software Brasileiro - 2009," Brasília, 204 p. [Online]. Available: http://www.mct.gov.br/upd_blob/0214/214567.pdf
[10] M. Bundschuh and C. Dekkers, The IT Measurement Compendium: Estimating and Benchmarking Success with Functional Size Measurement, Springer, 2008.
[11] C. E. Vazquez, G. S. Simões and R. M. Albert, Function Point Analysis: Measurement, Estimates and Project Management of Software; Análise de Pontos de Função: Medição, Estimativas e Gerenciamento de Projetos de Software, Editora Érica, São Paulo, 2005.
[12] Brazil. SISP - Sistema de Administração dos Recursos de Tecnologia da Informação, "Metrics Roadmap of SISP - Version 2.0; Roteiro de Métricas de Software do SISP - Versão 2.0," Brasília: Ministério do Planejamento, Orçamento e Gestão, Secretaria de Logística e Tecnologia da Informação, 2012. [Online]. Available: http://www.sisp.gov.br/ctgcie/download/file/Roteiro_de_Metricas_de_Software_do_SISP__v2.0.pdf
[13] N. E. Fenton and S. L. Pfleeger, Software Metrics: A Rigorous and Practical Approach, PWS Publishing Co., 1998.
[14] B. Kitchenham, S. L. Pfleeger and N. Fenton, "Towards a framework for software measurement validation," IEEE Trans. Softw. Eng., vol. 21, no. 12, pp. 929-944, 1995.
[15] S. Moser, "Measurement and estimation of software and software processes," Ph.D. dissertation, University of Berne, Switzerland, 1996.
[16] E. Chikofsky and H. A. Rubin, "Using metrics to justify investment in IT," IT Professional, vol. 1, no. 2, pp. 75-77, 1999.
[17] C. P. Beyers, "Estimating software development projects," in IT Measurement, Addison-Wesley Longman Publishing Co., Inc., pp. 337-362, 2002.
[18] C. Gencel and O. Demirors, "Functional size measurement revisited," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 17, no. 3, p. 15, 2008.
[19] A. Abran and P. N. Robillard, "Function Points: A Study of Their Measurement Processes and Scale Transformations," Journal of Systems and Software, vol. 25, pp. 171-184, 1994.
[20] B. Kitchenham, "The problem with function points," IEEE Software, vol. 14, no. 2, pp. 29-31, 1997.
[21] B. Kitchenham and K. Känsälä, "Inter-item correlations among function points," in Proc. 15th Int. Conf. on Softw. Eng., IEEE Computer Society Press, pp. 477-480, 1993.
[22] T. Kralj, I. Rozman, M. Heričko and A. Živkovič, "Improved standard FPA method—resolving problems with upper boundaries in the rating complexity process," Journal of Systems and Software, vol. 77, no. 2, pp. 81-90, 2005.
[23] S. L. Pfleeger, R. Jeffery, B. Curtis and B. Kitchenham, "Status report on software measurement," IEEE Software, vol. 14, no. 2, pp. 33-43, 1997.
[24] O. Turetken, O. Demirors, C. Gencel, O. O. Top, and B. Ozkan, "The Effect of Entity Generalization on Software Functional Sizing: A Case Study," in Product-Focused Software Process Improvement, Springer Berlin Heidelberg, pp. 105-116, 2008.
[25] W. Xia, D. Ho, L. F. Capretz, and F. Ahmed, "Updating weight values for function point counting," International Journal of Hybrid Intelligent Systems, vol. 6, no. 1, pp. 1-14, 2009.
[26] G. Antoniol, R. Fiutem and C. Lokan, "Object-Oriented Function Points: An Empirical Validation," Empirical Software Engineering, vol. 8, no. 3, pp. 225-254, 2003.
[27] NESMA - Netherlands Software Metrics Association, "Function Point Analysis for Software Enhancement." [Online]. Available: http://www.nesma.nl/download/boeken_NESMA/N13_FPA_for_Software_Enhancement_(v2.2.1).pdf
[28] ISO/IEC, 20968: MkII Function Point Analysis - Counting Practices Manual, 2002.
[29] ISO/IEC, 19761: COSMIC: a functional size measurement method, 2011.
[30] A. Sartoris, Estatística e introdução à econometria; Introduction to Statistics and Econometrics, Saraiva S/A Livreiros Editores, 2008.
[31] M. Jorgensen and M. Shepperd, "A systematic review of software development cost estimation studies," IEEE Trans. Softw. Eng., vol. 33, no. 1, pp. 33-53, 2007.
[32] M. L. Orlov, "Multiple Linear Regression Analysis Using Microsoft Excel," Chemistry Department, Oregon State University, 1996.
[33] R. S. Kaplan and D. P. Norton, "The balanced scorecard - measures that drive performance," Harvard Business Review, vol. 70, no. 1, pp. 71-79, 1992.
[34] Isaca, COBIT 5: Enabling Processes, Isaca, 2012.
[35] Isaca, COBIT 5: A Business Framework for the Governance and Management of IT, Isaca, 2012.
[36] P. Weill and J. W. Ross, IT Governance: How Top Performers Manage IT Decision Rights for Superior Results, Harvard Business Press, 2004.
[37] R. S. Kaplan and D. P. Norton, "Using the balanced scorecard as a strategic management system," Harvard Business Review, vol. 74, no. 1, pp. 75-85, 1996.
[38] W. Van Grembergen and R. Van Bruggen, "Measuring and improving corporate information technology through the balanced scorecard," The Electronic Journal of Information Systems Evaluation, vol. 1, no. 1, 1997.
[39] W. Van Grembergen, "The balanced scorecard and IT governance," Information Systems Control Journal, vol. 2, pp. 40-43, 2000.
[40] H. Rohm and M. Malinoski, "Strategy-Based Balanced Scorecards for Technology," Balanced Scorecard Institute, 2010.
[41] S. A. Becker and M. L. Bostelman, "Aligning strategic and project measurement systems," IEEE Software, vol. 16, no. 3, pp. 46-51, May/Jun 1999.
[42] V. R. Basili and D. M. Weiss, "A Methodology for Collecting Valid Software Engineering Data," IEEE Trans. Softw. Eng., vol. SE-10, no. 6, pp. 728-738, Nov. 1984.
[43] Isaca, CGEIT Review Manual 2010, Isaca.
[44] H. A. Rubin, "How to Measure IT Value," CIO Insight, 2003.
[45] B. Hufschmidt, "Software balanced scorecards: the icing on the cake," in IT Measurement, Addison-Wesley Longman Publishing Co., Inc., pp. 491-502, 2002.
[46] C. A. Dekkers, "How and when can functional size fit with a measurement program?," in IT Measurement, Addison-Wesley Longman Publishing Co., Inc., pp. 161-170, 2002.
[47] S. P. Dennis, "Avoiding obstacles and common pitfalls in the building of an effective metrics program," in IT Measurement, Addison-Wesley Longman Publishing Co., Inc., pp. 295-304, 2002.
[48] A. A. Fernandes and V. F. Abreu, Deploying IT governance: from strategy to process and services management; Implantando a governança de TI: da estratégia à gestão de processos e serviços, Brasport, 2009.
[49] G. Karner, "Metrics for Objectory," Diploma thesis, University of Linköping, Sweden, No. LiTH-IDA-Ex9344:21, December 1993.
An Approach to Business Processes Decomposition
for Cloud Deployment
Uma Abordagem para Decomposição de Processos de Negócio para Execução em Nuvens
Computacionais
Lucas Venezian Povoa, Wanderley Lopes de Souza, Antonio Francisco do Prado
Departamento de Computação (DC)
Universidade Federal de São Carlos (UFSCar)
São Carlos, São Paulo - Brazil
{lucas.povoa, desouza, prado}@dc.ufscar.br
Luís Ferreira Pires, Evert F. Duipmans
Faculty of Electrical Engineering, Mathematics and Computing Science (EEMCS)
University of Twente (UT)
Enschede, Overijssel - The Netherlands
[email protected], [email protected]
Resumo—Devido a requisitos de segurança, certos dados
ou atividades de um processo de negócio devem ser mantidos
nas premissas do usuário, enquanto outros podem ser alocados
numa nuvem computacional. Este artigo apresenta uma
abordagem genérica para a decomposição de processos de
negócio que considera a alocação de atividades e dados. Foram
desenvolvidas transformações para decompor processos
representados em WS-BPEL em subprocessos a serem
implantados nas premissas do usuário e numa nuvem
computacional. Essa abordagem foi demonstrada com um
estudo de caso no domínio da Saúde.
Palavras-chave—Gerenciamento de Processos de Negócio;
Computação em Nuvem; Decomposição de Processos; WSBPEL; Modelo Baseado em Grafos.
Abstract—Due to safety requirements, certain data or
activities of a business process should be kept within the user
premises, while others can be allocated to a cloud environment.
This paper presents a generic approach to business processes
decomposition taking into account the allocation of activities
and data. We designed transformations to decompose business
processes represented in WS-BPEL into sub-processes to be
deployed on the user premise and in the cloud. We
demonstrate our approach with a case study from the
healthcare domain.
Keywords—Business Process Management; Cloud Computing; Process Decomposition; WS-BPEL; Graph-based model.
I. INTRODUCTION
Nowadays, several organizations maintain large computing systems in order to meet the growing demand for processing and storing an ever-increasing volume of data. While in industry large companies build large-scale data centers to provide fast and reliable Web services, in academia many research projects involve large-scale datasets and high processing power, usually provided by supercomputers. From this demand for huge data centers emerged the concept of Cloud Computing [1], in which information and communication technologies are offered as services over the Internet. Google App Engine, Amazon Elastic Compute Cloud (EC2), Manjrasoft Aneka, and Microsoft Azure are some examples of computing clouds [2].
The core idea of Cloud Computing is to offer computing resources in such a way that users pay only for what they use, while having the perception that these resources are unlimited. The National Institute of Standards and Technology (NIST) identifies three service models [3]: (a) Software-as-a-Service (SaaS), in which software hosted on a server is offered and users access it through some interface over a local network or the Internet (e.g., Facebook, Gmail); (b) Platform-as-a-Service (PaaS), in which a platform is offered, users deploy their applications on it, and it provides resources such as Web servers and databases (e.g., Windows Azure, Google AppEngine); and (c) Infrastructure-as-a-Service (IaaS), in which a virtual machine with a certain storage capacity is offered and users rent these resources (e.g., Amazon EC2, GoGrid).
Although very promising, Cloud Computing faces obstacles that must be overcome so that they do not hamper its rapid growth. Data security is a major concern of users when they store confidential information on the servers of computing clouds, because these servers are usually operated by commercial providers in which users do not place full trust [4]. In some application domains, confidentiality is not only a matter of security or privacy, but also a legal matter. Healthcare is one of these domains, since the disclosure of information must satisfy legal requirements, such as those in the Health Insurance Portability and Accountability Act (HIPAA) [5].
Business Process Management (BPM) has been widely employed by many companies over the last decade to manage and improve their business processes [6]. A business process consists of activities performed by humans or systems, and a Business Process Management System (BPMS) provides an engine in which instances of a business process are coordinated and monitored. Buying a BPMS can be a high investment for a company, since software and hardware need to be acquired and qualified professionals hired. Scalability can also be a problem, since an engine is only capable of coordinating a limited number of process instances simultaneously, so that additional servers have to be bought to deal with peak-load situations.
BPMSs based on computing clouds and offered as SaaS over the Internet can be a solution to the scalability problem. However, the fear of losing or exposing confidential data is one of the biggest obstacles to deploying BPMSs in computing clouds and, in addition, some activities of a business process may not benefit from these clouds. For example, an activity that does not demand intensive computation may become more costly if placed in a cloud, since the data to be processed by this activity must be sent to the cloud, which may make its execution take longer and cost more, given that data transfer is one of the billing factors of computing clouds [7].
Other cloud service models, besides those identified by NIST, are found in the literature. For example, in the Process-as-a-Service model a business process is executed partially or entirely in a computing cloud [8]. Due to security requirements, in this model certain data or activities must be kept on the user's premises while others can be allocated to a cloud, which requires a decomposition of the process. In this context, this paper presents a generic approach to business process decomposition, offering a technical solution to this problem. The remainder of the paper is organized as follows: Section II discusses BPM; Section III presents the proposed approach; Section IV describes a case study accompanied by performance and cost analyses; Section V addresses related work; and Section VI presents the final considerations, pointing to future work.
II. BUSINESS PROCESS MANAGEMENT
BPM starts from the principle that each product offered by a company is the result of a certain number of activities performed by humans, systems, or both, and the goals of BPM are to identify, model, monitor, improve, and revise the business processes of that company. By identifying these activities via workflows, the company gains a view of its processes, and by monitoring and revising them it can detect problems and make improvements. The life cycle of a business process has the following phases:
 Design: business processes are identified and captured in models, usually graphical ones, allowing stakeholders to understand and refine them with relative ease. The activities of a process are identified by supervising the existing process and considering the structure of the company and its technical resources, Business Process Model and Notation (BPMN) [9] being the language most used in this phase. Once captured in models, the processes can be simulated and validated, giving stakeholders a view of their correctness and adequacy;
 Implementation: a business process model is implemented manually, semi-automatically, or automatically. When automation is not required or not possible, work lists are created with well-defined tasks, which are assigned to company employees. The problem is that there is no central system for monitoring the process instances, so this has to be done by each employee involved. With the participation of information systems, a BPMS can use the process model and create instances of it, being able to monitor each of them and to provide a view of the activities performed, the time consumed, and their completion or failure;
 Enactment: the business process is executed, and for each initiation an instance of it is created. These instances are managed by a BPMS, which follows them via a monitor, providing a picture of those that are running and those that have finished, and detecting eventual problems that may occur with these instances; and
 Evaluation: the information monitored and collected by the BPMS is used to revise the business process, and the conclusions obtained in this phase become the inputs of the next iteration of the life cycle.
A. WS-BPEL
BPMSs need executable languages, especially in the last three phases, and since the languages used in the design phase are usually very abstract, languages such as the Web Services Business Process Execution Language (WS-BPEL) [10] become necessary.
Conceived by the Organization for the Advancement of Structured Information Standards (OASIS) for the description of business processes and their protocols, WS-BPEL was defined on top of the Web standards WSDL 1.1, XML Schema, XPath 1.0, XSLT 1.0, and Infoset. Its main constructs are illustrated below with the example of the Picture Archiving and Communication System (PACS) [11], an archiving and communication system for diagnostic imaging, whose workflow is presented in Fig. 1.
Fig. 1. PACS workflow described as a monolithic process.
A process described in WS-BPEL is a container in which the activities to be executed, the data, the types of handlers, and the relationships with external partners are declared. The WS-BPEL description of PACS may start with
<process name="PACSBusinessProcess"
  targetNamespace="http://example.com"
  xmlns="http://docs.oasis-open.org/wsbpel/2.0/process/executable">
WS-BPEL allows aggregating Web Services, defining the logic of each service interaction, and orchestrating these interactions. An interaction involves two sides (the process and a partner) and is described via a partnerLink, which is a communication channel characterized by a partnerLinkType, a myRole, and a partnerRole, where this information identifies the functionality to be provided by both sides. In PACS, a communication channel between this process and a client can be defined as
<partnerLinks>
  <partnerLink name="client"
    partnerLinkType="tns:PACSBusinessProcess"
    myRole="PACSBusinessProcessProvider"
    partnerRole="PACSBusinessProcessRequester" />
</partnerLinks>
For message exchange, receive, reply, and invoke are employed. The first two allow a process to interact with external components through a communication protocol (e.g., SOAP); receive allows the process to capture requests from these components. In PACS, the request of a radiologist for the persistence and automatic nodule detection of a lung tomography can be captured by
<receive name="ImagePersistenceAndAnalysisReq"
  partnerLink="processProvider" operation="initiate"
  variable="input" createInstance="yes"/>
For a process to send a response to a requester, a reply related to a receive is necessary. A possible reply for the receive above is
<reply name="ImagePersistenceAndAnalysisResponse"
  partnerLink="processProvider" operation="initiate"
  variable="output"/>
A process requests an operation offered by a Web Service through invoke. The medical image persistence operation can be requested by
<invoke name="ImagePersistence"
  partnerLink="ImagPL" operation="persistImage"
  inputVariable="imageVar"
  outputVariable="imageResp"/>
It is common for a business process to contain activities to be executed in sequence. In PACS, the request for the persistence of a medical image, the execution of this task, and the emission of the response to the requester can be described as
<sequence name="ImagePersistenceSequence">
  <receive name="ImagePersistenceRequest" … />
  <invoke name="ImagePersistence" … />
  <reply name="ImagePersistenceResponse" … />
</sequence>
In general, a business process contains branches conditioned on criteria. In PACS, imageResp determines either the invocation of the automatic nodule detection function or the throwing of an exception. This branch can be described as
<if>
  <condition>imageResp</condition>
  <invoke name="AutomaticAnalysis" … />
  <else>
    <throw faultName="PersistenceException"/>
  </else>
</if>
Activities executed iteratively must be declared via while, in which an evaluation is performed before the execution, or via repeat until, in which the evaluation follows the execution. In PACS, the persistence of several images can be described as
<while>
  <condition>
    currentImageNumber <= numberOfImages
  </condition>
  <invoke name="persistImage" … />
  <assign>
    <copy>
      <from>$currentImageNumber + 1</from>
      <to>$currentImageNumber</to>
    </copy>
  </assign>
</while>
Activities executed in parallel must be declared via flow. In PACS, the operations of persisting an image and analyzing it can be declared for parallel execution as
<flow name="parallelRequest">
  <invoke name="MedicalImagePersistence" … />
  <invoke name="AutomaticAnalysis" … />
</flow>
B. BPM in Computing Clouds
The Process enactment, Activity execution and Data (PAD) model is presented in [7], in which possible distributions of a BPM between the premises and a cloud are investigated, considering the partitioning of activities and data but not the partitioning of the process engine. In [12], PAD is extended to also allow the partitioning of the engine, as illustrated in Fig. 2.
Fig. 2. Possibilities of partitioning and distributing BPM.
Business processes define control flows, which regulate the activities and their sequence, and data flows, which determine how data are transferred from one activity to another. An engine has to deal with both types and, if sensitive data are present, the data flows must be protected. The master's thesis [13] proposes a framework for the decomposition of a process into two collaborative processes, based on a distribution list of activities and data, in which data-related constraints can be defined to ensure that sensitive data remain on the premises. Fig. 3 illustrates this decomposition.
based on the evaluation of a condition; simple merge, which joins multiple alternative branches so that a single one of them is executed; and arbitrary cycles, which models recursive behavior. This IR also supports: data dependency, which explicitly represents the data dependencies between nodes, necessary because the original process is decomposed into collaborative processes and sensitive data may be present; and communication, which allows describing how one process invokes another. The IR employs a graph-based model to represent processes, in which a node represents an activity or a control element and an edge represents a relationship between two nodes. These nodes and edges were specialized, and a graphical representation was defined for each specialization:
 Activity: each node generally has one incoming and one outgoing control edge;
 Parallel behavior, illustrated in Fig. 5 (a), is modeled with flow and eflow nodes. The former splits an execution branch into several parallel branches and has at least two outgoing control edges. The latter joins several parallel branches into a single branch and has two or more incoming control edges and at most one outgoing control edge;
Fig. 3. Example of decomposition.
III. PROPOSED APPROACH
The framework presented in [13], whose phases are illustrated in Fig. 4, contains a graph-based Intermediate Representation (IR) in which business process concepts are captured. The decomposition of a process goes through the IR, and the adoption of a given business process language requires transformations from the language to the IR (lifting) and vice versa (grounding). In [13], the abstract language Amber [14] was adopted, an analysis was carried out to define the decomposition rules supported by the framework, algorithms that perform graph transformations were designed to implement these rules, and an algorithm was designed to check whether data-related constraints are violated by the decomposition.
 Conditional behavior, illustrated in Fig. 5 (b), is modeled with if and eif nodes. The former has two outgoing control edges, one labeled true and the other false, and after the evaluation of the condition only one of them is taken. The latter joins conditional branches, converting them into a single outgoing branch;
 Repetitive behavior, illustrated in Fig. 5 (c) and (d), is modeled with a single loop node and, after the evaluation of the condition, the branch with the repetitive behavior is either taken or abandoned. This node can be placed before or after the behavior: in the first case the behavior is executed zero or more times, and in the second at least once;
Fig. 4. Steps involved in the framework.
A. Intermediate Representation
To define the requirements of the IR, the following workflow patterns [15] were adopted: sequence, which models control flows and expresses the execution order of the activities in a process; parallel split, which divides a process into two or more branches for simultaneous execution; synchronization, which joins multiple branches into a single execution branch; conditional choice, which executes a branch
Fig. 5. Constructs for parallel (a), conditional (b), and repetitive behavior with the loop before (c) and after (d).
 Comunicações síncrona e assíncrona são ilustradas
na Fig. 6 (a) e Fig. 6 (b) respectivamente. Por
exemplo, a síncrona é modelada com os nós ireq,
ires, rec e rep, através dos quais dois subprocessos,
partes do processo global, se comunicam;
 Ecom é o conjunto de arestas de comunicação, onde e
= (n1, Communication, n2) com n1, n2 ∈ C;
 L é um conjunto de rótulos textuais que podem ser
atribuídos aos nós e arestas;
 nlabel : N → L, onde N = A∪ C∪ S atribui um rótulo
textual a um nó;
 elabel : E → L atribui um rótulo textual a uma aresta;
Fig. 6. Synchronous (a) and asynchronous (b) communications.
• Control edges, represented by solid arrows, model the control flow. A control edge is fired by its source node as soon as the action associated with that node finishes, and the target node of the edge waits for this firing to start its associated action. If the source node is an if node, the edge is labeled true or false and is fired by the node only when the evaluated condition matches that label;
• Data edges make it possible to investigate the effects on data exchange caused by moving activities from one process to another, allowing one to check whether any data restriction was violated by the partitioning of the original process. A data edge is represented by a dashed arrow. A data edge from a source node to a target node implies that the data defined in the former are used by the latter. Each data edge has a label, which defines the name of the shared data;
• Communication edges allow control and data to be sent to different processes and are labeled with the names of the data items sent through them.
Formally, a graph in the IR is a tuple (A, C, S, ctype, stype, E, L, nlabel, elabel), where (a data-structure sketch of this definition is given after the list of edge sets below):
• A is a set of activity nodes;
• C is a set of communication nodes;
• S is a set of structural nodes ∈ {flow, eflow, if, eif, loop};
• The sets A, C and S are pairwise disjoint;
• The sets N and E are disjoint.
B. Decomposition
In [13], for each IR construct, decompositions were identified for processes located on premises that have activities to be allocated to the cloud, and vice versa. Fig. 7 illustrates a set of sequential activities, marked for the cloud, being allocated to a single process and replaced, in the original process, by synchronous invocation nodes.
Fig. 7. Set of sequential activities moved as a block.
Although semantically different, the parallel and conditional constructs are generalized as composite constructs, since they have the same syntactic structure and can be decomposed in several ways. In this work, the start and end nodes must have the same allocation, and the activities of a branch that have the same allocation as these nodes remain with them. If a given construct is entirely marked for the cloud, the decomposition is similar to that of sequential activities.
In Fig. 8, the start and end nodes are marked for the cloud and one branch remains on premises; the activity of that branch is placed in a new on-premises process, which is invoked by the process in the cloud.
• ctype : C → {InvokeRequest, InvokeResponse, Receive, Reply} assigns a communication type to a communication node;
• stype : S → {Flow, EndFlow, If, EndIf, Loop} assigns a control-node type to a control node;
• E = Ectrl ∪ Edata ∪ Ecom is the set of edges in the graph, where an edge is defined as (n1, etype, n2), with etype ∈ {Control, Data, Communication} being the type of the edge and n1, n2 ∈ A ∪ C ∪ S;
• Ectrl is the set of control-flow edges, where e = (n1, Control, n2) with n1, n2 ∈ A ∪ C ∪ S;
• Edata is the set of data edges, where e = (n1, Data, n2) with n1, n2 ∈ A ∪ C ∪ S;
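For illustration only, the following Java sketch shows one possible in-memory representation of the formal definition above. The class, field and method names are assumptions of this sketch and are not taken from [13]; only the node kinds, communication and structural types, and edge types come from the definition.
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the IR graph: node kinds mirror the sets A (activities),
// C (communication nodes) and S (structural nodes); edge kinds mirror Ectrl, Edata and Ecom.
public class IrGraph {
    public enum NodeKind { ACTIVITY, COMMUNICATION, STRUCTURAL }
    public enum CommType { INVOKE_REQUEST, INVOKE_RESPONSE, RECEIVE, REPLY }
    public enum StructType { FLOW, END_FLOW, IF, END_IF, LOOP }
    public enum EdgeType { CONTROL, DATA, COMMUNICATION }

    public static class Node {
        final NodeKind kind;
        final String label;          // nlabel : N -> L
        final CommType commType;     // set only when kind == COMMUNICATION (ctype)
        final StructType structType; // set only when kind == STRUCTURAL (stype)
        Node(NodeKind kind, String label, CommType ct, StructType st) {
            this.kind = kind; this.label = label; this.commType = ct; this.structType = st;
        }
    }

    public static class Edge {
        final Node source, target;
        final EdgeType type;
        final String label;          // elabel : E -> L, e.g. name of the shared data item
        Edge(Node source, EdgeType type, Node target, String label) {
            this.source = source; this.type = type; this.target = target; this.label = label;
        }
    }

    private final Set<Node> nodes = new HashSet<>();
    private final Set<Edge> edges = new HashSet<>();

    public Node addActivity(String label) {
        Node n = new Node(NodeKind.ACTIVITY, label, null, null);
        nodes.add(n);
        return n;
    }

    public Edge addControlEdge(Node from, Node to) {
        Edge e = new Edge(from, EdgeType.CONTROL, to, "");
        edges.add(e);
        return e;
    }
}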
Fig. 8. One branch of the composite construct remains on premises.
In Fig. 9, the start and end nodes are marked for the cloud and the branches remain on premises, with a new on-premises process being created for each branch.
Fig. 11. Iterative branches.
As already mentioned, the decomposition approach described here employs a distribution list of activities and data, which determines what must be allocated on premises and what must be allocated in a computational cloud. Although the definition of this list is outside the scope of this work, it is assumed to be built manually or automatically according to the following criteria (a code sketch based on these criteria is given after the list):
• Confidential activities, or activities that handle confidential data, must be allocated on premises;
Fig. 9. The branches of the composite construct remain on premises.
In Fig. 10, the start and end nodes remain on premises and the branches are marked for the cloud, with a cloud process being created for each branch.
Fig. 10. The start and end nodes remain on premises.
Loops use the loop node, and if a loop is entirely marked for the cloud, the decomposition is similar to that of sequential activities. When the loop node and the behavior are marked with different allocations, the latter is treated as a separate process. Fig. 11 illustrates a loop in which the loop node is marked for the cloud and the iterative activity stays on premises.
Given the complexity of the decomposition, the algorithms that implement it were designed in four consecutive steps: identification, partitioning, creation of communication nodes and creation of the choreography. These algorithms, presented in [13], are omitted here due to space limitations.
• Activities with low computational cost and low data volume must be allocated on premises; and
• Activities with high computational cost, with a high ratio between processing time and data transfer time, and that do not fall under the first criterion, must be allocated in the cloud.
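The sketch below is only an illustration of how a distribution list could be derived from the three criteria above; the Activity fields, thresholds and names are hypothetical and are not part of the approach in [13].
import java.util.ArrayList;
import java.util.List;

// Sketch: classify each activity as on-premises or cloud using the three criteria.
public class DistributionListBuilder {
    enum Location { PREMISES, CLOUD }

    static class Activity {
        String name;
        boolean handlesSensitiveData;       // criterion 1
        double computeCost;                 // abstract cost measure (assumption)
        double processingToTransferRatio;   // criterion 3
        Activity(String name, boolean sensitive, double cost, double ratio) {
            this.name = name; this.handlesSensitiveData = sensitive;
            this.computeCost = cost; this.processingToTransferRatio = ratio;
        }
    }

    static Location allocate(Activity a, double costThreshold, double ratioThreshold) {
        if (a.handlesSensitiveData) return Location.PREMISES;               // criterion 1
        if (a.computeCost < costThreshold) return Location.PREMISES;        // criterion 2
        if (a.processingToTransferRatio > ratioThreshold) return Location.CLOUD; // criterion 3
        return Location.PREMISES; // default: keep on premises when in doubt
    }

    public static void main(String[] args) {
        List<Activity> activities = new ArrayList<>();
        activities.add(new Activity("persistDiagnostic", true, 1.0, 0.5));
        activities.add(new Activity("detectNodules", false, 50.0, 8.0));
        for (Activity a : activities) {
            System.out.println(a.name + " -> " + allocate(a, 10.0, 2.0));
        }
    }
}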
C. Lifting and Grounding
Due to the XML basis of WS-BPEL, lifting and grounding convert tree structures into graphs and vice versa; lifting has one algorithm for each type of WS-BPEL construct and grounding has one algorithm for each type of IR element. The main mappings are: assign and throw to Activity nodes; flow to parallel behavior; if to conditional behavior, where constructs with more than one condition are mapped to conditional behaviors nested in the false branch; while and repeatUntil to repetitive behavior; receive and reply to communication with rec and rep nodes; sequence to a set of nodes representing the nested constructs, interconnected by control edges; asynchronous invoke to the ireq node and synchronous invoke to the ireq and ires nodes.
The lifting and grounding algorithms were implemented in Java 7 using the XML API based on the W3C specifications, with JUnit as the testing framework. For example, the tree and graph structures for the if construct, shown in Fig. 12, had their lifting and grounding implemented by Algorithms 1 and 2, respectively, whose input and output are illustrated in Fig. 12. The lifting and grounding algorithms for the other IR and WS-BPEL structures are omitted due to space limitations.
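As a simple illustration of how the mappings above could be driven from the XML side, the sketch below walks a WS-BPEL fragment with the standard W3C DOM API and dispatches each construct to a (stubbed) rule. It is not the implementation of [13]; the class name and the printed messages are placeholders.
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Sketch of a lifting dispatcher: one rule per WS-BPEL construct (rule bodies omitted).
public class LiftingDispatcher {

    static void lift(Element e) {
        String name = e.getLocalName() == null ? e.getTagName() : e.getLocalName();
        switch (name) {
            case "assign":
            case "throw":       System.out.println("-> Activity node");            break;
            case "flow":        System.out.println("-> parallel behaviour");       break;
            case "if":          System.out.println("-> conditional behaviour");    break;
            case "while":
            case "repeatUntil": System.out.println("-> repetitive behaviour");     break;
            case "receive":
            case "reply":       System.out.println("-> communication (rec/rep)");  break;
            case "invoke":      System.out.println("-> ireq (and ires if synchronous)"); break;
            case "sequence":    /* children become nodes linked by control edges */ break;
            default:            /* construct not covered by this sketch */          break;
        }
        NodeList children = e.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                lift((Element) children.item(i));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String bpel = "<sequence><receive/><if><assign/></if><reply/></sequence>";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(bpel.getBytes(StandardCharsets.UTF_8)));
        lift(doc.getDocumentElement());
    }
}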
sent to FalseGenerator. Otherwise, if all the conditions have been assumed false and there is an activity to be executed, an else node is added to the tree.
Algorithm 2 Grounding for the if graph
function IfGenerator(g)
    t ← IfTree()
    t.children ← t.children ∪ {CondGenerator(g.cond)}
    t.children ← t.children ∪ {Generator(g.true)}
    t.children ← t.children ∪ {FalseGenerator(g.false)}
    return t
end function
Fig. 12. Tree and graph structures for the if construct.
IfParser walks through the nested nodes of the tree, checking the condition and building the true branch of the if graph with the related activities; the remaining constructs are sent to FalseParser so that the false branch is built. If the tree has more than one condition, the false branch will contain an if graph for the second condition, that graph will have a false branch containing another if graph for the third condition, and so on.
Algorithm 1 Lifting for the tree of the if construct
function IfParser(t)
    cond ← {}
    if t of type IfTree then
        cond ← IfGraph()
        for all c ∈ t.children do
            if c of type Condition then
                cond.cond ← CondParser(c)
            else if c of type ElseTree ∨ c of type ElseIfTree then
                cond.false ← FalseParser(t.children)
                return cond
            else if c of type Tree then
                cond.true ← Parser(c)
            end if
            t.children ← t.children − {c}
        end for
    end if
    return cond
end function
function FalseParser(s)
    if s = {} then return s end if
    falseBranch ← Graph()
    if s.first of type ElseIfTree then
        cond ← IfBranch()
        cond.true ← ElseParser(s.first)
        cond.false ← FalseParser(s − {s.first})
        falseBranch.nodes ← {cond}
    else if s.first of type ElseTree then
        falseBranch.nodes ← {ElseParser(s.first)}
    else
        return FalseParser(s − {s.first})
    end if
    return falseBranch
end function
IfGenerator walks through the true branch of the graph, checking and adding to the if tree the condition together with the related activities; the false branch is sent to FalseGenerator, which checks whether there is a nested if node. If an elseif construct exists, it is added to the tree with its condition and related activities, and its false branch is
function FalseGenerator(f)
    r ← {}
    while f ≠ {} do
        if #f.nodes = 1 ∧ f of type ElseIfTree then
            t ← ElseIfTree()
            t.children ← CondGenerator(f.cond) ∪ Generator(f.true)
            r ← r ∪ t
        else
            r ← r ∪ ElseGenerator(f)
        end if
        f ← f.false
    end while
    return r
end function
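To make the nesting behavior of Algorithm 1 concrete, the self-contained Java sketch below lifts an if tree with elseif/else alternatives into a chain of if graphs, each extra condition nested in the false branch of the previous one, as described in the text. The IfTree/IfGraph classes are illustrative only and are not the classes used in [13].
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch of the lifting rule for the if construct.
public class IfLiftingSketch {

    static class IfTree {                 // <if> with a condition, a body and alternatives
        String condition;
        String body;                      // activities executed when the condition holds
        List<IfTree> elseIfs = new ArrayList<>();
        String elseBody;                  // may be null when there is no <else>
    }

    static class IfGraph {                // IR counterpart: if node with true/false branches
        String condition;
        String trueBranch;
        IfGraph falseBranch;              // nested if graph for an elseif chain
        String elseBranch;                // plain activities for a final <else>
    }

    static IfGraph lift(IfTree t) {
        IfGraph g = new IfGraph();
        g.condition = t.condition;
        g.trueBranch = t.body;
        if (!t.elseIfs.isEmpty()) {
            // the first elseif becomes the false branch; the remaining ones nest recursively
            IfTree head = t.elseIfs.get(0);
            IfTree rest = copyWithoutFirstElseIf(t);
            head.elseIfs = rest.elseIfs;
            head.elseBody = rest.elseBody;
            g.falseBranch = lift(head);
        } else {
            g.elseBranch = t.elseBody;
        }
        return g;
    }

    static IfTree copyWithoutFirstElseIf(IfTree t) {
        IfTree copy = new IfTree();
        copy.elseIfs = new ArrayList<>(t.elseIfs.subList(1, t.elseIfs.size()));
        copy.elseBody = t.elseBody;
        return copy;
    }

    public static void main(String[] args) {
        IfTree t = new IfTree();
        t.condition = "c1"; t.body = "A1";
        IfTree e1 = new IfTree(); e1.condition = "c2"; e1.body = "A2";
        t.elseIfs.add(e1);
        t.elseBody = "A3";
        IfGraph g = lift(t);
        System.out.println(g.condition + " true:" + g.trueBranch
                + " false:(" + g.falseBranch.condition + " true:" + g.falseBranch.trueBranch
                + " else:" + g.falseBranch.elseBranch + ")");
    }
}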
IV. CASE STUDY
The case study used to validate the decomposition was based on PACS, a healthcare process whose goal is to persist diagnoses and breast tomography images and to apply a function that detects possible nodules in them. PACS accepts a set of images with their respective pre-diagnoses and identifiers, persists each image and diagnosis, runs the automatic nodule detection function on the breast tomography images, and outputs a vector containing the identifiers of the images with potential nodules.
In the workflow of the monolithic PACS process, illustrated in Fig. 1, the constructs marked for allocation in the cloud have a highlighted background. Fig. 13(a) shows the IR of the monolithic PACS after lifting, while Fig. 13(b) shows the IR after the decomposition has been executed.
Fig. 14 shows the decomposed PACS after grounding, with the addition of two observers: an external one, whose view is the same as that of the observer of the monolithic PACS, i.e., it only sees the interactions between the Client and PACS; and an internal one that, besides these interactions, also sees the interactions between the on-premises and cloud processes.
Fig. 15 illustrates, through UML communication diagrams, examples of traces obtained by executing the monolithic process (a) and the decomposed process (b); the interactions highlighted in the latter are visible only to the internal observer. If these interactions are hidden, both traces become observationally equivalent for the external observer.
A. Performance Analysis
To compare the performance of the monolithic and decomposed processes, they were implemented using the following tools: the Debian 6 operating system; the Apache Tomcat 6 application server; Java 6; the Apache ODE BPEL process engine; and the Apache Axis2 framework to expose the Web Services. The monolithic process and the on-premises part of the decomposed process were executed on an infrastructure with 1 GB of RAM, 20 GB of disk and 1 virtual core at 2.6 GHz. The cloud part of the decomposed process was executed on an IaaS model, in a private cloud managed by the OpenStack software, with the different configurations described in Table I.
TABLE I. CONFIGURATIONS OF THE CLOUD INSTANCES.
Fig. 13. IRs of the monolithic (a) and decomposed (b) processes.
Code     Memory   HD      Cores   Frequency
conf#1   2 GB     20 GB   1       2.6 GHz
conf#2   2 GB     20 GB   2       2.6 GHz
conf#3   4 GB     20 GB   1       2.6 GHz
conf#4   4 GB     20 GB   2       2.6 GHz
conf#5   6 GB     20 GB   1       2.6 GHz
conf#6   6 GB     20 GB   2       2.6 GHz
The process executions employed a workload composed of two tuples of the form <id, diagnostic, image>, where id is a 4-byte identifier, diagnostic is a 40-byte text and image is an 11.1 MB breast tomography image. One hundred samples of the response times of the monolithic and decomposed processes were collected for each configuration i. According to [16], the percentage Pi of performance gained by the decomposed process over the monolithic one for the i-th configuration can be defined as
$P_i = 1 - \frac{T_{decomposed,i}}{T_{monolithic}}$   (1)
where:
• $T_{decomposed,i}$ is the mean response time of the decomposed process in configuration i; and
• $T_{monolithic}$ is the mean response time of the monolithic process.
The additional communication time was not considered, since this measure depends on the available resources and on the size of the workload. Fig. 16 shows the percentage of performance gained by the decomposed process over the monolithic one for each configuration; the minimum percentage is above 10%.
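A minimal sketch of Equation (1) is given below; the response-time values used in the example are illustrative and are not measurements reported in this paper.
// Sketch of Equation (1): performance gained by the decomposed process.
public class PerformanceGain {

    static double gain(double meanDecomposed, double meanMonolithic) {
        return 1.0 - meanDecomposed / meanMonolithic;   // P_i = 1 - Tdecomposed_i / Tmonolithic
    }

    public static void main(String[] args) {
        double tMonolithic = 100.0;                      // mean response time (e.g. seconds)
        double[] tDecomposed = {88.0, 87.5, 86.0};       // one mean per cloud configuration (illustrative)
        for (int i = 0; i < tDecomposed.length; i++) {
            System.out.printf("conf#%d: P = %.1f%%%n", i + 1, 100 * gain(tDecomposed[i], tMonolithic));
        }
    }
}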
Fig. 14. Decomposed PACS with external and internal observers.
Fig. 15. UML communication diagrams of the monolithic (a) and decomposed (b) processes.
To verify the hypothesis that the mean response times of the decomposed process were significantly lower than that of the monolithic process, the t-test statistic [17] was employed at a 5% significance level. The tests yielded p-values on the order of $2.2 \times 10^{-16}$, confirming this hypothesis.
Fig. 18. Percentage of performance gained and cost per hour of the cloud resource.
Fig. 16. Percentage of performance gained by the decomposed process.
B. Cloud-related Costs
To determine the additional costs associated with these performance gains, a linear regression model [18] was built with data obtained from 45 price observations of three major IaaS providers, using the following independent variables: amount of RAM in MB; amount of disk in GB; number of virtual cores; and the frequency of each of these cores. Thus, the estimated value $\hat{y}$ of the price, in dollars per hour, of the resource allocated in the cloud is defined as
$\hat{y} = \alpha + \boldsymbol{\beta} X$   (2)
where:
• $\alpha = -2.4882 \times 10^{-16}$ is the intercept of the model;
• $\boldsymbol{\beta}$ = [0.013506, 0.072481, 0.083593, 0.000092282] is the vector of regression coefficients; and
• X = [memory_in_gb, number_of_virtual_cores, ghz_by_core, hard_disk_in_gb] is the vector of independent variables.
This model has a coefficient of determination R² of 89.62% and a mean random error of US$ 0.0827, which was determined with the leave-one-out cross-validation technique [19]. Fig. 17 shows how well the values estimated via Equation (2) fit the observed values, while Fig. 18 shows the relationship between the additional cost of each configuration, computed via Equation (2), and the percentage of performance gained with it.
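The sketch below merely evaluates Equation (2) as a dot product. It assumes that the regression coefficients follow the order of the X vector exactly as printed above, which is an assumption of this sketch; the configuration values and the resulting figure are illustrative only and are not the estimates reported in the paper.
// Sketch of Equation (2): estimated price (US$/hour) of a cloud resource.
public class PriceEstimate {

    static final double ALPHA = -2.4882e-16;
    // beta, assumed to be in the same order as X = [memory_in_gb,
    // number_of_virtual_cores, ghz_by_core, hard_disk_in_gb]
    static final double[] BETA = {0.013506, 0.072481, 0.083593, 0.000092282};

    static double estimate(double[] x) {
        double y = ALPHA;
        for (int i = 0; i < BETA.length; i++) {
            y += BETA[i] * x[i];
        }
        return y;
    }

    public static void main(String[] args) {
        double[] config = {4.0, 1.0, 2.6, 20.0}; // hypothetical 4 GB, 1 core, 2.6 GHz, 20 GB disk
        System.out.printf("estimated cost: US$ %.4f/hour%n", estimate(config));
    }
}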
Fig. 18 shows that the largest performance gain is obtained with conf#6, which yields a reduction of more than 12% in the response time of the business process, accompanied by an additional cost of approximately US$ 0.20 per hour of the resource allocated in the cloud.
V. RELATED WORK
In [20], new orchestrations are created for each service used by a business process, resulting in direct communication among them instead of a single coordination point. The WS-BPEL process is converted into a control-flow graph, which generates a Program Dependency Graph (PDG), from which the transformations are performed; the new graphs are then converted back to WS-BPEL. Since in that algorithm each service in the process corresponds to a fixed node for which a partition is generated, that work is not suitable for the approach proposed here, which aims to create processes in which multiple services can be used.
The results described in [21] focus on decentralizing orchestrations of WS-BPEL processes, using Dead Path Elimination (DPE) to guarantee the completion of the execution of decentralized processes; however, DPE also makes the approach highly dependent on the language used to specify the business process. The IR presented here is independent of that language and, consequently, so is the decomposition; only the appropriate lifting and grounding transformations need to be developed.
In [22] it is reported that most research on the decentralization of orchestrations focuses too heavily on specific business process languages. Not focusing so much on these languages was one of the main challenges of the research presented here; another challenge was to be concerned not only with performance problems, but also with security measures regulated by governments or organizations. Consequently, in this work, the decision to execute an activity on premises or in the cloud is already made in the design phase of the BPM life cycle.
Fig. 17. Fit of the estimated values to the observed values.
VI. FINAL REMARKS AND FUTURE WORK
This work is a continuation of the work presented in the master's thesis [13] and focused on the decomposition rules for business processes; the following additional contributions are worth highlighting:
• To demonstrate the generality of the proposed approach, WS-BPEL was used for the specification of business processes instead of the Amber language used in [13];
• For this approach to be applicable, lifting and grounding transformations had to be developed for WS-BPEL;
• The fact that WS-BPEL is executable made it possible to implement the created processes and to compare their behavior with the behavior of the original process, thus validating the proposed approach; and
• These implementations also made it possible to carry out a comparative performance analysis between the original and decomposed processes and an evaluation of the costs inherent to allocating part of the decomposed process in the cloud.
The results obtained in this work indicate that the proposed approach is generic, feasible and effective from both the performance and the financial points of view.
The IR is currently being extended to support more workflow patterns and to model WS-BPEL exception behavior. In the near future, this research will continue in the following directions: complementing the decomposition rules to support composite constructs in which the start and end nodes have different locations, and to allow the number of locations to be extended, since multiple clouds may be used and/or multiple on-premises sites may exist in organizations; and developing a calculation framework that takes into account the actual costs of the original process and of the created processes, in order to recommend which activities and data should be allocated to which locations.
ACKNOWLEDGMENTS
[5] D. L. Banks, "The Health Insurance Portability and Accountability
Act: Does It Live Up to the Promise?," Journal of Medical Systems,
vol. 30, no. 1, pp. 45-50, February 2006.
[6] R. K. L. Ko, "A computer scientist's introductory guide to business
process management (BPM)," Crossroads, vol. 15, no. 4, pp. 11-18,
June 2009.
[7] Y.-B. Han, J.-Y. Sun, G.-L. Wang and H.-F. Li, "A Cloud-Based
BPM Architecture with User-End Distribution of Non-Compute-Intensive Activities and Sensitive Data," Journal of Computer Science
and Technology, vol. 25, no. 6, pp. 1157-1167, 2010.
[8] D. S. Linthicum, Cloud Computing and SOA Convergence in Your
Enterprise: A Step-by-Step Guide, Boston, MA, USA: Pearson
Education Inc., 2009.
[9] OMG, "Business Process Model and Notation (BPMN) Version 2.0,"
January 2011. [Online]. Available: http://goo.gl/k2pvi. [Accessed 17
March 2013].
[10] A. Alves, A. Arkin, S. Askary, C. Barreto, B. Bloch, F. Curbera, M.
Ford, Y. Goland, A. Guízar, N. Kartha, C. K. Liu, R. Khalaf, D.
König, M. Marin, V. Mehta, S. Thatte, D. van der Rijn, P. Yendluri
and A. Yiu, "Web Services Business Process Execution Language
Version 2.0," OASIS Standard, 11 April 2007. [Online]. Available:
http://goo.gl/MTrpo. [Accessed 1 March 2013].
[11] P. M. d. Azevedo-Marques and S. C. Salomão, "PACS: Sistemas de
Arquivamento e Distribuição de Imagens," Revista Brasileira de
Física Médica, vol. 3, no. 1, pp. 131-139, 2009.
[12] E. Duipmans, L. F. Pires and L. da Silva Santos, "Towards a BPM
Cloud Architecture with Data and Activity Distribution," Enterprise
Distributed Object Computing Conference Workshops (EDOCW),
2012 IEEE 16th International, pp. 165-171, 2012.
[13] E. F. Duipmans, Business Process Management in the Cloud with
Data and Activity Distribution, master's thesis, Enschede, The
Netherlands: Faculty of EEMCS, University of Twente, 2012.
[14] H. Eertink, W. Janssen, P. O. Luttighuis, W. Teeuw and C. Vissers,
"A business process design language," World Congress on Formal
Methods, vol. I, pp. 76-95, 1999.
[15] W. v. d. Aalst, A. t. Hofstede, B. Kiepuszewski and A. Barros.,
"Workflow Patterns," Distributed and Parallel Databases, vol. 3, no.
14, pp. 5-51, 2003.
[16] R. Jain, The art of computer systems performance analysis: techniques
for experimental design, measurement, simulation, and modeling,
Wiley, 1991, pp. 1-685.
The authors thank CNPq for its support through the INCT-MACC.
[17] R Core Team, "R: A Language and Environment for Statistical
Computing," 2013. [Online]. Available: http://www.R-project.org/.
[Accessed 5 April 2013].
REFERENCES
[18] J. D.Kloke and J. W.McKean, "Rfit: Rank-based Estimation for
Linear Models," The R Journal, vol. 4, no. 2, pp. 57-64, 2012.
[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A.
Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica and M.
Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing,"
EECS Department, University of California, Berkeley, 2009.
[19] R. Kohavi, "A Study of Cross-Validation and Bootstrap for Accuracy
Estimation and Model Selection," in Proceedings of the 14th
international joint conference on Artificial intelligence, vol. 2, San
Francisco, CA: Morgan Kaufmann Publishers Inc., 1995, pp. 1137-1143.
[2] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg and I. Brandic, "Cloud
computing and emerging IT platforms: Vision, hype, and reality for
delivering computing as the 5th utility," Future Generation Computer
Systems, vol. 25, no. 6, pp. 599-616, June 2009.
[20] M. G. Nanda, S. Chandra and V. Sarkar, "Decentralizing execution of
composite web services," SIGPLAN Notices, vol. 39, no. 10, pp.
170-187, October 2004.
[3] P. Mell and T. Grance, "The NIST Definition of Cloud Computing,"
National Institute of Standards and Technology, vol. 53, no. 6, pp. 150, 2009.
[4] S. Yu, C. Wang, K. Ren and W. Lou, "Achieving secure, scalable, and
fine-grained data access control in cloud computing," in Proceedings
of the 29th conference on Information communications, Piscataway,
NJ: IEEE Press, 2010, pp. 534-542.
[21] O. Kopp, R. Khalaf and F. Leymann, "Deriving Explicit Data Links in
WS-BPEL Processes," Services Computing, 2008. SCC '08, vol. 2, pp.
367-376, July 2008.
[22] W. Fdhila, U. Yildiz and C. Godart, "A Flexible Approach for
Automatic Process Decentralization Using Dependency Tables," Web
Services, 2009. ICWS 2009, pp. 847-855, 2009.
On the Influence of Model Structure and Test Case
Profile on the Prioritization of Test Cases in the
Context of Model-based Testing
João Felipe S. Ouriques*, Emanuela G. Cartaxo*, Patrícia D. L. Machado*
* Software Practices Laboratory/UFCG, Campina Grande, PB, Brazil
Email: {jfelipe, emanuela}@copin.ufcg.edu.br, [email protected]
Abstract—Test case prioritization techniques aim at defining an ordering of test cases that favors the achievement of a goal during test execution, such as revealing faults as early as possible. A number of techniques have already been proposed and investigated in the literature, and experimental results have discussed whether a technique is more successful than others. However, in the context of model-based testing, only a few attempts have been made towards either proposing or experimenting with test case prioritization techniques. Moreover, a number of factors that may influence the results obtained still need to be investigated before more general conclusions can be reached. In this paper, we present empirical studies that focus on observing the effects of two factors: the structure of the model and the profile of the test case that fails. Results show that the profile of the test case that fails may have a definite influence on the performance of the techniques investigated.
context or in a more specific context, such as regression testing, depending on the information that is considered by the techniques [4]. Moreover, both code-based and specification-based test suites can be handled, although most techniques presented in the literature have been defined and evaluated for code-based suites in the context of regression testing [5] [6].
Keywords—Experimental Software Engineering, Software Testing, Model-Based Testing, Test Case Prioritization.
Techniques for ordering the test cases may be required to support test case selection, for instance, to address constrained costs of running and analysing the complete test suite and also to improve the rate of fault detection. However, to the best of our knowledge, there are only a few attempts presented in the literature to define test case prioritization techniques based on model information [10] [11]. Generally, the empirical studies are preliminary, making it difficult to assess the current limitations and applicability of the techniques in the MBT context.
I. INTRODUCTION
The artifacts produced and the modifications applied during software development and evolution are validated by the execution of test cases. Often, the produced test suites are also subject to extensions and modifications, making their management a difficult task. Moreover, their use can become increasingly less effective due to the difficulty of abstracting and obtaining information from test execution, for instance, when test cases that fail are either run too late or are difficult to locate due to the size and complexity of the suite.
To cope with this problem, a number of techniques have
been presented in the literature. These techniques can be
classified as: test case selection, test suite reduction and test
case prioritization. The general test case selection problem is
concerned with selecting a subset of the test cases according
to a specific (stop) criterion, whereas test suite reduction
techniques focus on selecting a subset of the test cases, but
the selected subset must provide the same coverage as the
original suite [1]. While the goal of selection and reduction is
to produce a more cost-effective test suite, studies presented
in the literature have shown that the techniques may not
work effectively, since some test cases are discarded and
consequently, some failures may not be revealed [2].
On the other hand, test case prioritization techniques have
been investigated in order to address the problem of defining
an execution order of the test cases according to a given testing
goal, particularly detecting faults as early as possible [3]. These
techniques can be applied either in a general development
Model-based Testing (MBT) is an approach to automate the design and generation of black-box test cases from specification models, together with all the oracle information needed [7]. MBT can be applied to any model from which specification-based test cases can be derived, with different purposes and at different testing levels. As usual, automatic generation produces a large number of test cases that may also have a considerable degree of redundancy [8] [9].
To provide useful information that may influence the development of prioritization techniques, empirical studies must focus on controlling and/or observing factors that may determine the success of a given technique. Given the goals of prioritization in the context of MBT, a number of factors can be determinant, such as the size and the coverage of the suite, the structure of the model (which may determine the size and structure of the test cases), the amount and distribution of failures, and the degree of redundancy of the test cases.
In this paper, we investigate mainly the influence of two
factors: the structure of the model and the profile of the test
cases that fail. For this, we conduct 3 empirical studies, where
real application models, as well as automatically generated
ones, are considered. The focus is on general prioritization
techniques that can be applied to MBT test suites.
The purpose of the first study was to acquire preliminary observations by considering real application models. From this study, we concluded that a number of different factors may influence the performance of the techniques. Therefore, the purpose of the second and third studies, the main contribution of this paper, was to investigate specific factors by controlling them through the use of generated models. Results from these studies show that, although the models may or may not present certain constructions (for instance, loops1), it is not possible to differentiate the performance of the techniques when focusing on the presence of the construction investigated. On the other hand, depending on the profile of the test case that fails (longest, shortest, essential, and so on), one technique may perform better than the other.
of the source code/model. Then, a set of permutations PTS is obtained and the TS′ that has the highest value of f is chosen.
In the studies presented in this paper, we focus on system-level models, which can be represented as activity diagrams
and/or as labelled transition systems with inputs and outputs as
transitions. Models are generated according to the strategy presented by Oliveira Neto et al. [12]. Test cases are sequences of
transitions extracted from a model by a depth-search algorithm
as presented by Cartaxo et al. [9] and Sapna and Mohanty
[11]. Prioritization techniques receive a test suite as input and produce an ordering of the test cases as output.
When the goal is to increase fault detection, the Average Percentage of Fault Detection (APFD) metric has been largely used in the literature. The higher the APFD value, the faster and better the fault detection rate [14].
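The paper does not reproduce the APFD formula; the sketch below follows the usual definition from the literature [14], where n is the number of test cases in an ordering, m the number of faults, and TF_i the 1-based position of the first test case that reveals fault i. The example values are illustrative only.
// Sketch of the usual APFD computation: 1 - (sum TF_i)/(n*m) + 1/(2n).
public class Apfd {

    static double apfd(int[] firstRevealingPosition, int numTestCases) {
        int m = firstRevealingPosition.length;
        double sum = 0;
        for (int tf : firstRevealingPosition) {
            sum += tf;
        }
        return 1.0 - sum / (numTestCases * (double) m) + 1.0 / (2.0 * numTestCases);
    }

    public static void main(String[] args) {
        // Ordering of 10 test cases revealing 3 faults at positions 1, 2 and 7.
        System.out.println(apfd(new int[] {1, 2, 7}, 10)); // ~0.7167
    }
}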
The paper is structured as follows. Section II presents fundamental concepts along with a brief definition of the prioritization techniques considered in this paper. Section III discusses related work. Section IV presents a preliminary study where the techniques are investigated in the context of two real applications, varying the amount of faults. Sections V and VI present the main empirical studies conducted: the former reports a study with automatically generated models where the presence of certain structural constructions is controlled, whereas the latter depicts a study with automatically generated models that are investigated for different profiles of the test case that fails. Section VII presents concluding remarks about the results obtained and pointers for further research. Details on the input models and the data collected in the studies can be found at the project site2. Empirical studies have been defined according to the general framework proposed by Wohlin [13] and the R tool3 has been used to support data analysis.
II. BACKGROUND
This section presents the test case prioritization concept
(subsection II-A) and the techniques considered in this paper
(subsection II-B).
A. Test Case Prioritization
Test case prioritization is a technique that orders test cases
in an attempt to maximize an objective function. This problem
was defined by Elbaum et al. as follows [14]:
Given: TS, a test suite; PTS, the set of permutations of TS; and f, a function that maps PTS to the real numbers (f : PTS → R).
Problem: Find TS′ ∈ PTS such that, for all TS′′ ∈ PTS with TS′′ ≠ TS′, f(TS′) ≥ f(TS′′).
The objective function is defined according to the goal of
the test case prioritization. For instance, the manager may need
to quickly increase the rate of fault detection or the coverage
1 A number of loops distributed in a model may lead to huge test suites
with a certain degree of redundancy between the test cases even if they are
traversed only once for each test case.
2 https://sites.google.com/a/computacao.ufcg.edu.br/mb-tcp/
3 http://www.r-project.org/
Note that the key point for the test case prioritization is the
goal, and the success of the prioritization is measured by this
goal. However, it is necessary to have some data (according to
the defined goal) to calculate the function for each permutation.
Then, for each test case, a priority is assigned and test cases
with the highest priority are scheduled to execute first.
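As an illustration of this formulation, the sketch below evaluates an objective function f over a set of candidate orderings and keeps the one with the highest value. The objective used in the example is a placeholder; in practice f depends on the data gathered for the chosen goal (e.g. fault or coverage information).
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the problem statement: pick the permutation that maximizes f.
public class BestOrdering {

    interface Objective {
        double score(List<String> ordering);   // f : PTS -> R
    }

    static List<String> best(List<List<String>> candidates, Objective f) {
        return candidates.stream()
                .max(Comparator.comparingDouble(f::score))
                .orElseThrow(IllegalArgumentException::new);
    }

    public static void main(String[] args) {
        List<List<String>> pts = new ArrayList<>();
        pts.add(List.of("tc1", "tc2", "tc3"));
        pts.add(List.of("tc3", "tc1", "tc2"));
        // toy objective: reward orderings that run tc3 first
        Objective f = ordering -> ordering.indexOf("tc3") == 0 ? 1.0 : 0.0;
        System.out.println(best(pts, f));       // [tc3, tc1, tc2]
    }
}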
Test case prioritization can be applied in code-based and specification-based contexts, but it has been applied more often in the code-based context, where it is usually related to regression testing. In this context, Rothermel et al. [4] have proposed the following classification:
•
General test case prioritization - test case prioritization is applied any time in the software development
process, even in the initial testing activities;
•
Regression testing prioritization - test case prioritization is performed after a set of changes was made.
Therefore, test case prioritization can use information
gathered in previous runs of existing test cases to help
prioritize the test cases for subsequent runs.
B. Techniques
This subsection presents general test case prioritization
techniques that will be approached in this paper.
Optimal. This technique is largely used in experiments as an upper bound on the effectiveness of the other techniques, since it presents the best result that can be obtained. To obtain this result, it needs as input information that is not available in practice, for example the faults themselves (when the goal is to increase fault detection); therefore the technique is not feasible in practice and we can only use applications with known faults. This lets us determine the ordering of test cases that maximizes a test suite's rate of fault detection.
Random. This technique is largely used in experiments as a lower bound on the effectiveness of the other techniques [6]; it is based on a random choice strategy.
Adaptive Random Testing (ART). This technique spreads the selected test cases out as much as possible, based on a distance function [15]. To apply this technique, two sets of test cases are required: the executed set (the set of distinct test cases that have been executed without revealing any failure) and the candidate set (a set of test cases randomly selected without replacement). Initially, the executed set is empty and the first test case is randomly chosen from the input domain. The executed set is then updated with the element selected from the candidate set. From the candidate set, the element that is farthest away from all executed test cases is selected as the next one. There are several ways to implement the concept of farthest away. In this paper, we consider:
•
Jaccard distance: The use of this function in the
prioritization context was proposed by Jiang et al. [6].
It calculates the distance between two sets and it is
defined as 1 minus the size of the intersection divided
by the size of the union of the sample sets. In our
context, we consider a test case as an ordered set of
edges (that represent transitions). Considering p and c
as test cases and B(p) and B(c) as a set of branches
covered by the test cases p and c, respectively, the
distance between them can be defined as follows:
$J(p, c) = 1 - \frac{|B(p) \cap B(c)|}{|B(p) \cup B(c)|}$
•
Manhattan distance: This distance, proposed by Zhou [16], is calculated using two arrays, each with size equal to the number of branches in the model. Since this function evaluates the distance between two test cases, each test case is associated with one array. Each position of the array is assigned 1 if the test case covers the corresponding branch and 0 otherwise.
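The sketch below combines the ART scheme described above with the Jaccard distance, adapted to prioritization: the already prioritized test cases play the role of the executed set, and each test case is represented only by the set of branches it covers. It is an illustration under these assumptions, not the implementation evaluated in this paper.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Sketch of ART-based prioritization with the Jaccard distance.
public class ArtJaccardSketch {

    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0.0 : 1.0 - (double) inter.size() / union.size();
    }

    static List<Set<String>> prioritize(List<Set<String>> suite, int candidateSetSize, Random rnd) {
        List<Set<String>> remaining = new ArrayList<>(suite);
        List<Set<String>> ordered = new ArrayList<>();
        ordered.add(remaining.remove(rnd.nextInt(remaining.size())));   // first test case: random
        while (!remaining.isEmpty()) {
            // candidate set: random sample (without replacement) of the remaining test cases
            List<Set<String>> candidates = new ArrayList<>();
            List<Set<String>> pool = new ArrayList<>(remaining);
            while (candidates.size() < candidateSetSize && !pool.isEmpty()) {
                candidates.add(pool.remove(rnd.nextInt(pool.size())));
            }
            // pick the candidate whose minimum distance to the ordered tests is largest
            Set<String> farthest = null;
            double bestMinDist = -1;
            for (Set<String> c : candidates) {
                double minDist = Double.MAX_VALUE;
                for (Set<String> done : ordered) {
                    minDist = Math.min(minDist, jaccard(c, done));
                }
                if (minDist > bestMinDist) { bestMinDist = minDist; farthest = c; }
            }
            ordered.add(farthest);
            remaining.remove(farthest);
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<Set<String>> suite = List.of(
                new HashSet<>(List.of("b1", "b2")),
                new HashSet<>(List.of("b1", "b3")),
                new HashSet<>(List.of("b4")));
        System.out.println(prioritize(suite, 2, new Random(42)));
    }
}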
Fixed Weights. This technique was proposed by Sapna and Mohanty [11] and is a prioritization technique based on UML activity diagrams. The structures of the activity diagram are used to prioritize the test cases. First of all, the activity diagram is converted into a tree structure. Then, weights are assigned according to the structure of the activity diagram (3 for fork/join nodes, 2 for branch/merge nodes, 1 for action/activity nodes). Next, the weight of each path is calculated (the sum of the weights assigned to its nodes and edges) and the test cases are prioritized according to the weight sums obtained.
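A simplified sketch of the fixed-weights idea follows; for brevity it sums node weights only (the technique also weights edges) and represents each test path just as a list of node kinds, so it is an illustration rather than the technique's actual implementation.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of fixed-weights prioritization: heavier paths are executed first.
public class FixedWeightsSketch {

    enum NodeKind { FORK_JOIN, BRANCH_MERGE, ACTION }

    static int weight(NodeKind kind) {
        switch (kind) {
            case FORK_JOIN:    return 3;
            case BRANCH_MERGE: return 2;
            default:           return 1;
        }
    }

    static int pathWeight(List<NodeKind> path) {
        int sum = 0;
        for (NodeKind k : path) {
            sum += weight(k);
        }
        return sum;
    }

    static List<List<NodeKind>> prioritize(List<List<NodeKind>> paths) {
        List<List<NodeKind>> ordered = new ArrayList<>(paths);
        Comparator<List<NodeKind>> byWeight = Comparator.comparingInt(FixedWeightsSketch::pathWeight);
        ordered.sort(byWeight.reversed());
        return ordered;
    }

    public static void main(String[] args) {
        List<NodeKind> p1 = List.of(NodeKind.ACTION, NodeKind.BRANCH_MERGE, NodeKind.ACTION);
        List<NodeKind> p2 = List.of(NodeKind.FORK_JOIN, NodeKind.ACTION, NodeKind.FORK_JOIN);
        System.out.println(prioritize(List.of(p1, p2)).get(0) == p2); // heavier path comes first
    }
}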
STOOP. This technique was proposed by Kundu et al. [17]. The inputs are sequence diagrams. These diagrams are converted into a graph representation called a sequence graph (SG). After this, the SGs are merged. From the merged sequence graph, the test cases are generated. Lastly, the set of test cases is prioritized: the test cases are sorted in descending order according to the average weighted path length (AWPL) metric, defined as follows:
$AWPL(p_k) = \frac{\sum_{i=1}^{m} eWeight(e_i)}{m}$
where $p_k = \langle e_1, e_2, \ldots, e_m \rangle$ is a test case and $eWeight(e_i)$ is the number of test cases that contain the edge $e_i$.
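The sketch below computes AWPL directly from the definition above and sorts the suite in descending AWPL order. Test cases are represented simply as lists of edge identifiers, which is an assumption of this sketch.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the AWPL metric used by STOOP.
public class AwplSketch {

    static double awpl(List<String> testCase, List<List<String>> suite) {
        double sum = 0;
        for (String edge : testCase) {
            int eWeight = 0;                       // number of test cases containing this edge
            for (List<String> other : suite) {
                if (other.contains(edge)) {
                    eWeight++;
                }
            }
            sum += eWeight;
        }
        return testCase.isEmpty() ? 0 : sum / testCase.size();
    }

    static List<List<String>> prioritize(List<List<String>> suite) {
        List<List<String>> ordered = new ArrayList<>(suite);
        Comparator<List<String>> byAwpl = Comparator.comparingDouble(tc -> awpl(tc, suite));
        ordered.sort(byAwpl.reversed());
        return ordered;
    }

    public static void main(String[] args) {
        List<List<String>> suite = List.of(
                List.of("e1", "e2"), List.of("e1", "e3"), List.of("e4"));
        System.out.println(prioritize(suite)); // test cases sharing e1 come first
    }
}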
III. RELATED WORK
Several test case prioritization techniques have been proposed and investigated in the literature. Most of them focus on code-based test suites and the regression testing context [18], [19]. The experimental studies presented have discussed whether a technique is more effective than others, comparing them mainly by the APFD metric, and, so far, no experiment has produced general results. This evidences the need for further investigation and empirical studies that can contribute to advances in the state of the art.
Regarding code-based prioritization, Zhou et al. [20] compared fault-detection capabilities of the Jaccard-distance-based
ART and Manhattan-distance-based ART. Branch coverage
information was used for test case prioritization and the results
showed that Manhattan is more effective than Jaccard distance
in the context considered [20]. Also, Jeffrey and Gupta [21]
proposed an algorithm that prioritizes test cases based on
coverage of statements in relevant slices and discuss insights
from an experimental study that considers also total coverage.
Moreover, Do et al. [22] presented a series of controlled
experiments evaluating the effects of time constraints and
faultiness levels on the costs and benefits of test case prioritization techniques. The results showed that time constraints
can significantly influence both the cost and effectiveness.
Moreover, when there are time constraints, the effects of
increased faultiness are stronger. Furthermore, Elbaum et al.
[5] compared the performance of 5 prioritization techniques
in terms of effectiveness, and showed how the results of
this comparison can be used to select a technique (regression
testing) [18]. They applied the prioritization techniques to 8
programs. Characteristics of each program (such as: number
of versions, KLOC, number and size of the test suites, and
average number of faults) were taken into account.
By considering the use of models in the regression testing
context, Korel et al. [10], [19], [23] presented two model-based
test prioritization methods: selective test prioritization and
model dependence-based test prioritization. Both techniques
focus on modifications made to the system and models. The
inputs are the original EFSM system model and the modified
EFSM. Models are run to perform the prioritization. On the
other hand, our focus is on general prioritization techniques
where modifications are not considered.
Generally, in the MBT context, we can find proposals to
apply general test case prioritization from UML diagrams, such
as: i) the technique proposed by Kundu et al. [17], where
sequence diagrams are used as input; and ii) the technique
proposed by Sapna and Mohanty [11] where activity diagrams
are used as input. Both techniques are investigated in this
paper.
In summary, the original contribution of this paper is to
present empirical studies in the context of MBT that consider
different techniques and factors that may influence on their
performance such as the structure of the model and the profile
of the test case that fails.
IV. FIRST EMPIRICAL STUDY
The main goal of this study is to “analyze general prioritization techniques for the purpose of comparing their performances, observing the impact of the number of test cases that
fail, with respect to their ability to reveal failures earlier, from
the point of view of the tester and in the context of MBT”. We
worked with the following research hypothesis: “The general
test case prioritization techniques present different abilities to reveal failures, considering different amounts of failing test cases in the test suite". In the next sections, we present the study design and the analysis of the collected data.
A. Planning
We conducted this experiment in a research laboratory – a
controlled environment. This characteristic leads to an offline
study. Moreover, all the techniques involved in the study only
require the test cases to prioritize and the mapping from the
branches of the system model to the test cases that satisfy each
branch. Thus, no human intervention is required, eliminating
the “expertise” influence.
As objects, from which system models were derived, we
considered real systems. Although the applications are real ones, they do not constitute a representative sample of the whole set of applications and, therefore, this experiment deals with a specific context.
In order to analyze the performance of the techniques,
observing the influence of the number of test cases that fail,
we defined the following variables:
Independent variables and factors
•
•
General prioritization techniques: Techniques defined in Section II. We will consider the following
short-names for the sake of simplicity: optimal, random, ARTjac (Adaptive Random Testing with Jaccard distance), ARTman (Adaptive Random Testing
with Manhattan distance), fixedweights, stoop;
Number of test cases that fail: low (lower than 5%
of the total), medium (between 5% and 15% of the
total), high (higher than 15% of the total);
Dependent variable
•
Average Percentage of Fault Detection - APFD
In this study, we used two system models from two real-world applications: i) Labelled Transition System-Based Tool – LTS-BT [24] – a tool that supports MBT activities, developed in the context of our research group; and ii) PDF Split and Merge – PDFsam4 – a tool for PDF file manipulation.
They were modelled with UML activity diagrams, using the available use case documents and the applications themselves. From each diagram a graph model was obtained, from which test cases were generated using a depth-search-based algorithm proposed by Sapna and Mohanty [11], where each loop is traversed at most twice. Table I
shows some structural properties from the models and the test
cases that were generated from them to be used as input to the
techniques.
It is important to remark that test cases for all techniques
were obtained from the same model using a single algorithm.
Also, even though the STOOP technique has been generally
proposed to be applied from sequence diagrams, the technique
itself works on an internal model that combines the diagrams.
Therefore, it is reasonable to apply STOOP in the context of
this experiment.
TABLE I. STRUCTURAL PROPERTIES OF THE MODELS IN THE EXPERIMENT.
Property              LTS-BT   PDFSam
Branching Nodes       26       11
Loops                 0        5
Join Nodes            7        6
Test Cases            53       87
Shortest Test Case    10       17
Longest Test Case     34       43
Defects               4        5
TCs that reveal failures  14   32
The "number of test cases that fail" variable was defined considering real and known defects in the models and allocated as shown in Table II.
4 Project's site: http://www.pdfsam.org
TABLE II. DEFINITION OF THE TEST CASES THAT FAIL VARIABLE.
Level    Failures in LTS-BT       Failures in PDFSam
low      2 test cases → 3.77%     4 test cases → 4.59%
medium   4 test cases → 7.54%     7 test cases → 8.04%
high     8 test cases → 15.09%    16 test cases → 18.39%
The relationship between a defect (associated with a specific edge in the model) and a failure (a test case that fails) is that when a test case exercises that edge, it reveals the failure. For each level we considered a different set of defects of each model, and at the high level two defects originate the failures. Moreover, for both models, these test cases do not reveal the two defects at the same time.
By using the defined variables and detailing the informal hypothesis, we postulated eight pairs of statistical (null and alternative) hypotheses: three pairs evaluating the techniques at each level of the number of test cases that fail (e.g. H0: APFD(low,i) = APFD(low,j) and H1: APFD(low,i) ≠ APFD(low,j), for techniques i and j with i ≠ j) and five pairs evaluating the levels for each technique (e.g. H0: APFD(random,k) = APFD(random,l) and H1: APFD(random,k) ≠ APFD(random,l), for levels k and l with k ≠ l), excluding the optimal technique. Due to space limitations, the hypothesis pairs are not written here.
Based on the elements already detailed, the experimental design for this study is one-factor-at-a-time [25]. The data analysis for the hypothesis pairs is based on 2-way ANOVA [26] [27], after checking the assumptions of normality of residuals and equality of variances. If any assumption is not satisfied, a nonparametric analysis is performed. We calculated the number of replications based on a pilot sample, using a sample-size formula proposed by Jain [27]. We obtained 815 replications as a result, for a precision (r) of 2% of the sample mean and a significance (α) of 5%.
The following steps were executed to perform the experiment: 1) Instantiate lists for data collection for each replication
needed; 2) Instantiate the failure models to be considered; 3)
Generate test cases; 4) Map branches to test cases; 5) Execute
each technique for each object considering the replications
needed; 6) Collect data and compute dependent variable; 7)
Record and analyse results. All techniques were automatically
executed.
B. Data Analysis
When analysing the collected data, we must verify the ANOVA assumptions. Figure 1 shows that the residuals are not normally distributed, since the black solid line should lie close to the straight continuous line of the normal distribution. Thus, we proceeded with a nonparametric analysis.
A confidence interval analysis, based on the 95% confidence intervals of the pseudomedians5 of the collected APFD values shown in Table III, gives a first insight into the rejection of some null hypotheses.
The set of hypotheses defined for this experiment compares the techniques from two points of view: i) the whole set of
5 The pseudomedian is a nonparametric estimator for the median of a population [28].
C. Threats to Validity
Fig. 1. QQ-plot of the residuals against the quantiles of the normal distribution.
the growth of the level, while fixedweights decreases its performance when the level goes from low to medium and increases when it goes from medium to high. These different patterns provide evidence that other factors influence the investigated techniques, which motivated the experiments presented in Sections V and VI.
TABLE III. CONFIDENCE INTERVALS OF THE PSEUDOMEDIANS.
Technique      Low              Medium           High
optimal        [0.992, 0.992]   [0.992, 0.992]   [0.992, 0.992]
random         [0.807, 0.829]   [0.864, 0.876]   [0.834, 0.847]
ARTJac         [0.902, 0.906]   [0.888, 0.900]   [0.877, 0.885]
ARTMan         [0.808, 0.830]   [0.863, 0.876]   [0.839, 0.850]
fixedweights   [0.540, 0.543]   [0.436, 0.439]   [0.679, 0.679]
stoop          [0.244, 0.244]   [0.319, 0.319]   [0.560, 0.560]
techniques at each single level, and ii) each technique isolated
in the different levels.
For the first set of hypotheses, when considering the levels of the number of test cases that fail separately (the column for each level in Table III), some confidence intervals do not overlap, and therefore the corresponding null hypotheses of equality must be rejected. However, at the three levels there is an overlap between random and ARTMan, and the p-values of Mann-Whitney tests between the two techniques are 0.9516, 0.9399 and 0.4476 for low, medium and high, respectively. These p-values are greater than the 5% significance, thus the performance of these techniques is statistically similar at this significance.
For the second set of hypotheses, analyzing each technique separately (rows of Table III), all the null hypotheses of equality must be rejected, since for every technique there is no overlap between the confidence intervals of the different levels. This means that the performance of the techniques can vary when more or fewer test cases fail.
As general observations, ARTJac presented the best performance at the three levels. Moreover, the techniques presented slight variations across the three levels (increasing or decreasing), except for fixedweights and stoop, which increase more than the other techniques. These techniques, which are mostly based on structural elements of the test cases, may be more affected by the number of test cases that fail than the random-based ones.
Furthermore, by increasing the level of the number of test cases that fail, different evolution patterns arise in the techniques' performance, e.g. stoop increases its performance with
As a controlled experiment with statistical analysis, measures were rigorously taken to address conclusion validity regarding data treatment, assumptions, the number of replications and the tests needed. Regarding the internal validity of this experiment, it is often difficult to represent a defect at a high abstraction level, since a code defect may refer to detailed contents. Therefore, an abstract defect may correspond to one or more defects at code level, and so on. To mitigate this threat, we considered test cases that fail as the measure instead of counting defects (even though we had data on the real defects). This decision suits our experiment well, since the APFD metric focuses on failures rather than defects.
The construct validity regarding the set of techniques and the evaluation metric chosen for the study was supported by a systematic review [29] that revealed suitable techniques and evaluation metrics, properly representing the research context. The low number of system models used in this experiment threatens its external validity, since two models do not represent the whole universe of applications. However, as a preliminary study, we aimed only at the observation of a specific context.
V. SECOND EMPIRICAL STUDY
Motivated by the study reported in Section IV, this section reports an empirical study that aims to "analyze general prioritization techniques for the purpose of observing the influence of the model structure on the studied techniques, with respect to their ability to reveal failures earlier, from the point of view of the tester and in the context of Model-Based Testing". Complementing this definition, we postulated the following research hypothesis: "The general test case prioritization techniques present different abilities to reveal failures, considering models with different structures".
A. Planning
We also conducted this experiment in a research environment, and the techniques involved in the study need the same artifacts as in the first experiment – the test suite generated by an MBT test case generation algorithm. Thus, the execution of the techniques does not need human intervention, which eliminates the factor "experience level" from the experiment.
The models that originate the test suites processed in the
experiment were generated randomly using a parametrized
graph generator (Section V-B). Thus, the models do not
represent real application models.
For this study, we defined the following variables:
Independent variables
•
General prioritization techniques (factor): ARTJac,
stoop;
•
Number of branch constructions to be generated
in the input models (factor): 10, 30, 80;
•
Number of join constructions to be generated in the
input models (factor): 10, 20, 50;
•
Number of loop constructions to be generated in the
input models (factor): 1, 3, 9;
•
Maximum depth of the generated models (fixed value equal to 25);
•
Rate of test cases that fail (fixed value equal to 10%);
Dependent variable
•
Average Percentage of Fault Detection - APFD.
For the sake of simplicity of the experimental design required when considering all techniques and variables, in this study we decided to focus on only two of the techniques considered in Section IV – ARTJac and stoop – namely the ones with the best and worst performance, respectively. They can be seen as representatives of the random-based and structure-based techniques considered. Moreover, we defined the values of the variables that shape the models based on the structural properties of the models considered in the motivational experiment reported in Section IV.
In this experiment, we do not want the location of the failures to affect the techniques, so we selected failures randomly. To mitigate the effect of the number of test cases that fail, we assigned a constant rate of 10% of the test cases to reveal failures.
In order to evaluate the model structure, we defined
three different experimental designs and according to Wu and
Hamada [25], each one is a one-factor-at-a-time. The designs
are described in the next subsections.
1) Branches Evaluation: In order to evaluate the impact of the number of branches on the capacity of revealing failures, we defined three levels for this factor and fixed the number of joins and loops at zero. For each considered level of the number of branches, with the other parameters fixed, 31 models were generated by the parametrized generator. For each model, the techniques were executed with 31 different random failure assignments, and we gathered the APFD value of each execution.
We postulated five pairs of statistical hypotheses: three analyzing each level of the number of branches, with the null hypothesis of equality between the techniques and the alternative indicating that they perform differently (e.g. H0: APFD(ARTJac, 10 branches) = APFD(Stoop, 10 branches) and H1: APFD(ARTJac, 10 branches) ≠ APFD(Stoop, 10 branches)), and two related to each technique in isolation, comparing its performance across the three levels, with the null hypothesis of equality and the alternative indicating some difference (e.g. H0: APFD(ARTJac, 10 branches) = APFD(ARTJac, 30 branches) = APFD(ARTJac, 80 branches) and H1: the APFD values differ for at least one of the levels).
2) Joins Evaluation: For the evaluation of the number of joins, we proposed a similar design, but varying only the number of joins and fixing the other variables. We fixed the number of branches at 50 and the number of loops at zero, and all the details presented for the branch evaluation apply to this design. The reason for allowing 50 branches is that branches may be part of a join; therefore, we cannot consider 0 branches.
The corresponding set of hypotheses follows the same structure as in the branch evaluation, but considering the number of joins.
3) Loops Evaluation: For the evaluation of the number of loops, once again we proposed a similar design, but varying only the number of loops and fixing the number of branches at 30 and the number of joins at 15 (again, these structures are commonly part of a loop, so it is not reasonable to consider 0 branches and 0 joins). We structured a set of hypotheses similar to that of the branch evaluation, but considering the three levels of the number of loops variable.
The following steps were executed to perform the experiment: 1) Generate test models as described in Section V-B; 2) Instantiate lists for data collection for each replication needed; 3) Instantiate the failure models to be considered; 4) Generate test cases; 5) Map branches to test cases; 6) Execute each technique for each object considering the replications needed; 7) Collect data and compute the dependent variable; 8) Record and analyse results. All techniques were executed automatically and test cases were generated by using the same algorithm as in Section IV.
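The sketch below ties these steps together for one experimental design. It is only an illustration: generate_graphs, generate_test_cases and prioritize are hypothetical helper names (not the authors' implementation), and apfd refers to the sketch given earlier.

```python
import random

N_MODELS = 31          # models per level (balanced design)
N_REPLICATIONS = 31    # random failure attributions per model
FAIL_RATE = 0.10       # constant rate of failing test cases

def run_design(levels, technique_names):
    """Run one one-factor-at-a-time design and collect APFD values (sketch)."""
    results = {(t, lvl): [] for t in technique_names for lvl in levels}
    for lvl in levels:
        for model in generate_graphs(level=lvl, n_models=N_MODELS):      # step 1
            suite = generate_test_cases(model)                            # steps 4-5
            for _ in range(N_REPLICATIONS):                               # step 6
                failing = random.sample(suite, max(1, int(FAIL_RATE * len(suite))))
                failures = {i: {tc} for i, tc in enumerate(failing)}      # step 3
                for t in technique_names:
                    order = prioritize(t, suite, model)                   # hypothetical
                    results[(t, lvl)].append(apfd(order, failures))       # step 7
    return results                                                        # step 8
```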
B. Model Generation
The considered objects for this study are the randomly generated models. The generator receives five parameters:
1) Number of branch constructions;
2) Number of join constructions;
3) Number of loop constructions;
4) The maximum depth of the graphs;
5) The number of graphs to generate.
The graph is created by executing operations that include these constructions in sequences of transitions (edges). The first step is to create an initial sequence using the fourth parameter; e.g. if the maximum depth is equal to five, a sequence with five edges is created, as in Figure 2.
Fig. 2. Initial configuration of a graph with maximum depth equal to 5.
Over this initial configuration, the generator executes the operations. To increase the probability of generating structurally different graphs, the generator executes the operations in random order, but respecting the numbers passed as parameters. The generator performs the operations of adding branching, joining, and looping in the following way:
• Branching: from a non-leaf random node x, create two new nodes y and z and create two new edges (x, y) and (x, z) (Figure 3a);
• Joining: from two different non-leaf random nodes x and y, create a new node z and create two new edges (x, z) and (y, z) (Figure 3b);
• Looping: from two different non-leaf random nodes x and y, with depth(x) > depth(y), create a new edge (x, y) (Figure 3c).
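A minimal sketch of such a generator is shown below, assuming an edge-list representation and illustrative bookkeeping of node depths; it is not the authors' implementation, only an interpretation of the three operations above.

```python
import random

def generate_graph(n_branches, n_joins, n_loops, max_depth):
    """Build an initial chain of max_depth edges (Figure 2), then apply the
    branch/join/loop operations in random order at random non-leaf nodes."""
    edges = [(i, i + 1) for i in range(max_depth)]
    depth = {i: i for i in range(max_depth + 1)}
    next_id = max_depth + 1

    def non_leaf_nodes():
        sources = {u for u, _ in edges}          # nodes with at least one outgoing edge
        return [n for n in depth if n in sources]

    ops = ['branch'] * n_branches + ['join'] * n_joins + ['loop'] * n_loops
    random.shuffle(ops)                           # random order increases structural diversity

    for op in ops:
        if op == 'branch':                        # x -> y and x -> z (two new nodes)
            x = random.choice(non_leaf_nodes())
            y, z = next_id, next_id + 1
            next_id += 2
            depth[y] = depth[z] = depth[x] + 1
            edges += [(x, y), (x, z)]
        elif op == 'join':                        # x -> z and y -> z (one new node)
            x, y = random.sample(non_leaf_nodes(), 2)
            z = next_id
            next_id += 1
            depth[z] = max(depth[x], depth[y]) + 1
            edges += [(x, z), (y, z)]
        else:                                     # loop: edge back from the deeper node
            x, y = random.sample(non_leaf_nodes(), 2)
            if depth[x] < depth[y]:
                x, y = y, x                       # equal depths are kept as-is in this sketch
            edges.append((x, y))
    return edges
```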
Fig. 3. Examples of operations performed by the parameterized graph generator: (a) branching the node 4 to nodes 7 and 8; (b) joining the nodes 2 and 5 to node 7; (c) looping the node 4 to 2.
The generator executes the same process as many times as the number-of-graphs-to-generate parameter indicates.
C. Data Analysis
As we divided the whole experiment into three experimental designs, the data analysis respects this division. Basically, we followed the same chain of tests for the three designs. First, we tested the normality assumption over the samples using the Anderson-Darling test and the equality of variances through the F-test. Depending on the result of these tests, we chose the next one, which evaluates the equality of the samples: Mann-Whitney or T-test. After evaluating the levels separately, we tested each technique separately across the three levels using the ANOVA or Kruskal-Wallis test. For every test we considered a significance level of 5%.
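This chain of tests can be reproduced with standard statistical routines. The sketch below is an illustration in Python/SciPy (the authors used R), assuming APFD samples are given as numeric arrays and computing a two-sided F-ratio test directly, since SciPy does not ship a dedicated variance F-test.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05

def normal(sample):
    # Anderson-Darling: reject normality when the statistic exceeds the 5% critical value.
    res = stats.anderson(sample, dist='norm')
    crit = res.critical_values[list(res.significance_level).index(5.0)]
    return res.statistic < crit

def compare_level(apfd_a, apfd_b):
    """Compare the two techniques at one level of the factor: Anderson-Darling,
    then t-test for normal samples (with an F-ratio check on the variances)
    or Mann-Whitney otherwise."""
    if normal(apfd_a) and normal(apfd_b):
        f = np.var(apfd_a, ddof=1) / np.var(apfd_b, ddof=1)
        df1, df2 = len(apfd_a) - 1, len(apfd_b) - 1
        p_var = 2 * min(stats.f.sf(f, df1, df2), stats.f.cdf(f, df1, df2))
        return stats.ttest_ind(apfd_a, apfd_b, equal_var=p_var > ALPHA).pvalue
    return stats.mannwhitneyu(apfd_a, apfd_b, alternative='two-sided').pvalue

def compare_levels(*samples_per_level):
    """Compare one technique across the levels: ANOVA when all samples look
    normal, Kruskal-Wallis otherwise."""
    if all(normal(s) for s in samples_per_level):
        return stats.f_oneway(*samples_per_level).pvalue
    return stats.kruskal(*samples_per_level).pvalue
```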
The objective of this work is to expose influences of the studied structural aspects of the models on the performance of the techniques. Thus, if the p-value analysis in a hypothesis test suggests that the null hypothesis of equality cannot be rejected, this is evidence that the variable considered alone does not affect the performance of the techniques. On the other hand, if the null hypothesis must be rejected, this represents evidence of some influence.
1) Branches Analysis: The first activity of the analysis is the normality test, and Table IV summarizes this step. The two samples from the low level (10 branches) had the null hypothesis of normality rejected.
TABLE IV. P-VALUES FOR THE ANDERSON-DARLING NORMALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE FIRST EXPERIMENTAL DESIGN SAMPLES. NORMAL SAMPLES ARE IN BOLD FACE.
              10 Branches      30 Branches   80 Branches
ART Jaccard   3.569 · 10^-15   0.3406        0.3566
Stoop         2.207 · 10^-13   0.273         0.06543
Following the analysis, we performed three tests, as summarized in Table V. We chose each test according to the normality of the samples: for normal samples, we performed the T-test and, for non-normal samples, the Mann-Whitney test.
TABLE V. P-VALUES FOR THE SAMPLES EQUALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE FIRST EXPERIMENTAL DESIGN SAMPLES.
10 Branches   30 Branches   80 Branches
0.7497        0.9565        0.1745
All the p-values in Table V are greater than the defined significance of 5%, so the null hypothesis of equality of the techniques cannot be rejected at the defined significance level; in other words, the two techniques presented similar performance at each level separately.
The next step of the analysis is to evaluate each technique separately across the levels, and we performed a non-parametric Kruskal-Wallis test for the corresponding hypotheses. The tests yielded, for ARTJac and stoop, p-values equal to 0.6059 and 0.854, respectively. Comparing these p-values against the significance level of 5%, we cannot reject the null hypothesis of equality between the levels for each technique, so the performance is similar at this significance level.
2) Joins Analysis: Following the same approach as in the first experimental design, Table VI shows the p-values of the normality tests. The bold face p-values indicate the samples that are normally distributed at the considered significance.
TABLE VI. P-VALUES FOR THE ANDERSON-DARLING NORMALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE SECOND EXPERIMENTAL DESIGN SAMPLES. NORMAL SAMPLES ARE IN BOLD FACE.
              10 Joins   20 Joins   50 Joins
ART Jaccard   0.9394     0.8015     0.6733
Stoop         0.5039     0.5157     0.05941
Based on these normality tests, we tested the equality of the performance of the techniques at each level and, according to Table VII, the techniques perform statistically in a similar way at all levels.
TABLE VII. P-VALUES FOR THE SAMPLES EQUALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE SECOND EXPERIMENTAL DESIGN SAMPLES.
10 Joins   20 Joins   50 Joins
0.9816     0.62       0.06659
The next step is to assess each technique separately. We executed a Kruskal-Wallis test comparing the three samples for ARTJac and for stoop, and the p-values were 0.4418 and 0.3671, respectively. Comparing them with the considered significance level of 5%, neither null hypothesis of equality was rejected, which means the techniques behave similarly across the levels.
3) Loops Analysis: Following the same line of argumentation, the first step is to evaluate the normality of the measured data, and Table VIII summarizes these tests.
According to the results of the normality tests, we tested the equality of the techniques at each level of this experimental design. As we can see in Table IX, the null hypotheses for 1 loop, 3 loops and 9 loops cannot be rejected because their p-values are greater than 5%, thus the techniques present similar behaviour at all levels of the factor.
TABLE VIII. P-VALUES FOR THE ANDERSON-DARLING NORMALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE THIRD EXPERIMENTAL DESIGN SAMPLES. NORMAL SAMPLES ARE IN BOLD FACE.
              1 Loop    3 Loops   9 Loops
ART Jaccard   0.07034   0.02681   2.75 · 10^-10
Stoop         0.985     0.08882   9.743 · 10^-11
TABLE IX. P-VALUES FOR THE SAMPLES EQUALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE THIRD EXPERIMENTAL DESIGN SAMPLES.
1 Loop   3 Loops   9 Loops
0.141    0.6049    0.07042
Analyzing the two techniques separately across the levels, we performed the non-parametric Kruskal-Wallis test and the p-values were 0.9838 and 0.3046 for ARTJac and stoop, respectively. These p-values, compared with the significance level of 5%, indicate that the null hypotheses of the considered pairs cannot be rejected; in other words, the techniques perform statistically similarly across the different levels of the number of looping operations.
D. Threats to Validity
Regarding the validity of this experiment, we can point out some threats. Concerning internal validity, we defined different designs to evaluate the factors separately; therefore, it is not possible to analyze the interaction between the number of joins and branches, for example. We did it because some of the combinations of the three variables might be unfeasible, e.g. a model with many joins and without any branch.
Moreover, we did not calculate the number of replications needed to achieve a defined precision because the execution would be infeasible (conclusion validity). The executed configuration took several days because some test suites were huge. To deal with this limitation, we limited the generation to 31 graphs for each experimental design and 31 failure attributions for each graph, keeping the balancing principle [13]; samples with size greater than, or equal to, 31 are wide enough to test for normality with confidence [26], [27].
Furthermore, the application models were generated randomly to deal with the lack of application models but, at the same time, this reduces the capability of representing reality, threatening the external validity. To deal with this, we used structural properties, e.g. depth and number of branches, from existing models.
VI. THIRD EMPIRICAL STUDY
This section contains a report of an experiment that aims
to “analyze general prioritization techniques for the purpose
of observing the failure profile influence over the studied
techniques, with respect to their ability to reveal failures
earlier, from the point of view of the tester and in the context
of Model-Based Testing”.
Complementing the definition, we postulated the following
research hypothesis: “The general test case prioritization techniques present different abilities to reveal failures, considering
that the test cases that fail have different profiles”. We are
considering profiles as the characteristics of the test cases that
reveal failures.
A. Planning
We performed the current experiment in the same environment as the previous ones, and the application models used in this experiment are the same as in Section V. Since we do not aim at observing variations of the model structure, we considered the 31 models that were generated with 30 branches, 15 joins, 1 loop and maximum depth 25.
For this experiment, we defined these variables:
Independent variables
• General prioritization techniques (factor): ARTJac, stoop;
• Failure profiles, i.e., characteristics of the test cases that fail (factor):
  ◦ Long test cases – with many steps (longTC);
  ◦ Short test cases – with few steps (shortTC);
  ◦ Test cases that contain many branches (manyBR);
  ◦ Test cases that contain few branches (fewBR);
  ◦ Test cases that contain many joins (manyJOIN);
  ◦ Test cases that contain few joins (fewJOIN);
  ◦ Essential test cases (ESSENTIAL) – the ones that uniquely cover a given edge in the model;
• Number of test cases that fail: fixed value equal to 1.
Dependent variable
• Average Percentage of Fault Detection - APFD.
A special step is the failure assignment according to the profile. As the first step, the algorithm sorts the test cases according to the profile. For instance, for the longTC profile, the test cases are sorted decreasingly by length, i.e. number of steps. If more than one test case has the greatest length (same profile), one of them is chosen randomly. For example, if the maximum size of the test cases is 15, the algorithm randomly selects one of the test cases with size equal to 15.
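A minimal sketch of this assignment step is given below; the profile metrics and test case attributes (steps, n_branches) are hypothetical names used only for illustration, and only four of the seven profiles are shown.

```python
import random

# Illustrative profile metrics (names are assumptions, not from the paper).
PROFILE_METRIC = {
    'longTC':  lambda tc: len(tc.steps),
    'shortTC': lambda tc: -len(tc.steps),
    'manyBR':  lambda tc: tc.n_branches,
    'fewBR':   lambda tc: -tc.n_branches,
}

def assign_failure(test_cases, profile):
    """Pick the test case that will reveal the failure for a given profile:
    rank by the profile metric and break ties among the best ones randomly."""
    metric = PROFILE_METRIC[profile]
    best = max(metric(tc) for tc in test_cases)
    candidates = [tc for tc in test_cases if metric(tc) == best]
    return random.choice(candidates)
```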
Considering the factors, this experiment is a one-factor-at-a-time design, and we can perform analyses between the techniques at each failure profile and between the levels for each technique. In the execution of the experiment, each one of the 31 models was executed with 31 different random failures assigned to each profile, with just one failure at a time (a total of 961 executions for each technique). This number of replications keeps the design balanced and gives confidence for testing normality [27].
Based on these variables and on the design, we defined the corresponding pairs of statistical hypotheses: i) to analyse each profile, with the null hypothesis of equality between the techniques and the alternative indicating that they have a different performance (e.g. H0: APFD(ARTJac, longTC) = APFD(stoop, longTC) and H1: APFD(ARTJac, longTC) ≠ APFD(stoop, longTC)), and also ii) to analyse each technique, with the null hypothesis of equality between the profiles (∀ f1, f2 ∈ {longTC, shortTC, manyBR, fewBR, manyJOIN, fewJOIN, ESSENTIAL}, f1 ≠ f2: H0: APFD(ARTJac, f1) = APFD(ARTJac, f2), and H1: APFD(ARTJac, f1) ≠ APFD(ARTJac, f2)). If the tests reject the null hypotheses, this will be considered as evidence of the influence of the failure profile on the techniques.
Experiment execution followed the same steps defined in
Section V. However, as mentioned before, each technique was
run by considering one failure profile at a time.
B. Data Analysis
The boxplots in Figures 4 and 5 summarize the trends of the collected data. The notches in the boxplots are a graphical representation of the confidence interval calculated by the R software. When these notches overlap, a deeper investigation of the statistical similarity of the samples is suggested.
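For reference, notched boxplots like these can be produced with matplotlib; the snippet below is a minimal illustration using made-up APFD samples, not the authors' plotting code or data.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
profiles = ['longTC', 'shortTC', 'manyBR', 'fewBR', 'manyJOIN', 'fewJOIN', 'ESSENTIAL']
# Made-up APFD samples, one array of 961 values per failure profile.
samples = [rng.beta(5, 2, size=961) for _ in profiles]

plt.boxplot(samples, notch=True, labels=profiles)  # notches depict a median confidence interval
plt.ylabel('APFD')
plt.title('APFD per failure profile (illustrative data)')
plt.show()
```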
Fig. 4. Boxplot with the samples from ARTJac.
Fig. 5. Boxplot with the samples from stoop.
Testing the performance of the two techniques at every failure profile through a visual analysis of the boxplots of the samples in Figures 4 and 5, we can see that there are no overlaps between the techniques in any profile (the notches in the box plots do not overlap); in other words, at 5% of significance, ARTJac and stoop perform statistically differently in every researched profile.
Comparing each technique separately across the failure profiles, both of them present differences between the profiles, a sufficient condition to also reject the null hypothesis of equality.
By observing the profiles longTC and manyBR in Figures 4 and 5, they incur similar performances for the two techniques, because frequently a test case among the longest ones is also among the ones with the largest number of branches. The same happens with the profiles shortTC and fewBR, by the same reasoning.
There is a relationship between the profiles fewJOIN and ESSENTIAL, as we can see in Figures 4 and 5. The essential test cases are the ones that cover some requirement uniquely, in this case a branch only covered by that test case and, by this definition, the test cases with the fewest joins frequently are essential.
In summary, the rejection of the null hypotheses is strong evidence of the influence of the failure profiles on the performance of the general prioritization techniques. Furthermore, the data suggest that ARTJac may not have a good performance when the test case that fails is either long or has many branches. In this case, stoop has a slightly better performance. In the other cases, ARTJac has a better performance, similarly to the results obtained in the first experiment with real applications (Section IV).
C. Threats to Validity
Regarding conclusion validity, we did not calculate the number of replications needed. To deal with this threat to precision, we limited the random failure attributions for each profile and each graph to 31, keeping the balancing principle [13]; samples with size greater than, or equal to, 31 are wide enough to test for normality with confidence [26], [27].
Construct validity is threatened by the definition of the failure profiles. We chose the profiles based on data and observations from previous studies, not necessarily the specific results. Thus, we defined them according to our experience, and there might be other profiles not investigated yet. This threat is reduced by the experiment's objective, which is to expose the influence of different profiles on the prioritization techniques' performance, and not to show all the possible profiles.
VII. CONCLUDING REMARKS
This paper presents and discusses the results obtained from empirical studies on the use of test case prioritization techniques in the context of MBT. It is widely accepted that a number of factors may influence the performance of the techniques, particularly due to the fact that the techniques can be based on different aspects and strategies, including random choice or not.
In this sense, the main contribution of this paper is to investigate the influence of two factors: the structure of the model and the profile of the test case that fails. The intuition behind this choice is that the structure of the model may determine the size of the generated test suites and the degree of redundancy among their test cases. Therefore, this factor may affect all of the techniques involved in the experiment, due either to the use of distance functions or to the fact that the techniques consider certain structures explicitly. On the other hand, depending on the selection strategy, the techniques may favor the selection of certain profiles of test cases over others. Therefore, whether the test cases that fail have a certain structural property may also determine the success of a technique. To the best of our knowledge, there are no similar studies in the literature.
In summary, in the first study, performed with real applications in a specific context, the different growth patterns of APFD for the techniques can be considered as evidence that more factors influence the performance of the general prioritization techniques than just the number of test cases that fail. This result motivated the execution of the other studies.
On one hand, the second study, aimed at investigating the influence of the number of occurrences of branches, joins and loops on the performance of the techniques, showed that there is no statistical difference in the performance of the studied techniques at a significance of 5%. On the other hand, in the third study, based on the profile of the test case that fails, the fact that all of the null hypotheses were rejected may indicate a strong influence of the failure profile on the performance of the general prioritization techniques. Moreover, from the perspective of the techniques, this study exposed weaknesses associated with these profiles. For instance, ARTJac presented low performance when long test cases (and/or test cases with many branches) reveal failures, and high performance when short test cases (and/or test cases with few branches) reveal failures. On the other hand, stoop showed low performance with almost all profiles. From these results, testers may opt to use one technique or the other based on failure prediction and the profile of the test cases.
As future work, we will perform a more complex factorial experiment, calculating the interaction between the factors analyzed separately in the experiments reported in this paper. Moreover, we plan an extension of the third experiment to consider other techniques and also to investigate other profiles of test cases that may be of interest. From the analysis of the results obtained, new (possibly hybrid) techniques may emerge.
ACKNOWLEDGMENT
This work was supported by CNPq grants 484643/2011-8 and 560014/2010-4. Also, this work was partially supported by the National Institute of Science and Technology for Software Engineering6, funded by CNPq/Brasil, grant 573964/2008-4. The first author was also supported by CNPq.
6 www.ines.org.br
REFERENCES
[1] M. J. Harrold, R. Gupta, and M. L. Soffa, “A methodology for controlling the size of a test suite,” ACM Trans. Softw. Eng. Methodol., vol. 2, no. 3, pp. 270–285, Jul. 1993.
[2] D. Jeffrey and R. Gupta, “Improving fault detection capability by selectively retaining test cases during test suite reduction,” IEEE Transactions on Software Engineering, vol. 33, no. 2, pp. 108–123, 2007.
[3] G. Rothermel, R. Untch, C. Chu, and M. Harrold, “Test case prioritization: an empirical study,” in Proc. IEEE International Conference on Software Maintenance (ICSM ’99), 1999, pp. 179–188.
[4] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold, “Prioritizing test cases for regression testing,” IEEE Transactions on Software Engineering, vol. 27, pp. 929–948, 2001.
[5] S. G. Elbaum, A. G. Malishevsky, and G. Rothermel, “Test case prioritization: A family of empirical studies,” IEEE Transactions on Software Engineering, February 2002.
[6] B. Jiang, Z. Zhang, W. K. Chan, and T. H. Tse, “Adaptive random test case prioritization,” in ASE, 2009, pp. 233–244.
[7] M. Utting and B. Legeard, Practical Model-Based Testing: A Tools Approach, 1st ed. Morgan Kauffman, 2007.
[8] E. G. Cartaxo, P. D. L. Machado, and F. G. O. Neto, “Seleção automática de casos de teste baseada em funções de similaridade,” in XXIII Simpósio Brasileiro de Engenharia de Software, 2008, pp. 1–16.
[9] E. G. Cartaxo, P. D. L. Machado, and F. G. Oliveira, “On the use of a similarity function for test case selection in the context of model-based testing,” Software Testing, Verification and Reliability, vol. 21, no. 2, pp. 75–100, 2011.
[10] B. Korel, G. Koutsogiannakis, and L. Tahat, “Application of system models in regression test suite prioritization,” in IEEE International Conference on Software Maintenance, 2008, pp. 247–256.
[11] S. P. G. and H. Mohanty, “Prioritization of scenarios based on uml activity diagrams,” in CICSyN, 2009, pp. 271–276.
[12] F. G. O. Neto, R. Feldt, R. Torkar, and P. D. L. Machado, “Searching for models to test software technology,” in Proc. of the First International Workshop on Combining Modelling and Search-Based Software Engineering, CMSBSE/ICSE 2013, 2013.
[13] C. Wohlin, P. Runeson, M. Host, M. C. Ohlsson, B. Regnell, and A. Wesslen, Experimentation in software engineering: an introduction. Norwell, MA, USA: Kluwer Academic Publishers, 2000.
[14] S. Elbaum, A. G. Malishevsky, and G. Rothermel, “Prioritizing test cases for regression testing,” in Proc. of the Int. Symposium on Software Testing and Analysis. ACM Press, 2000, pp. 102–112.
[15] T. Y. Chen, H. Leung, and I. K. Mak, “Adaptive random testing,” in Advances in Computer Science - ASIAN 2004, ser. Lecture Notes in Computer Science, vol. 3321/2005. Springer, 2004, pp. 320–329.
[16] Z. Q. Zhou, “Using coverage information to guide test case selection in adaptive random testing,” in IEEE 34th Annual COMPSACW, July 2010, pp. 208–213.
[17] D. Kundu, M. Sarma, D. Samanta, and R. Mall, “System testing for object-oriented systems with test case prioritization,” Softw. Test. Verif. Reliab., vol. 19, no. 4, pp. 297–333, Dec. 2009.
[18] S. Elbaum, G. Rothermel, S. K, and A. G. Malishevsky, “Selecting a cost-effective test case prioritization technique,” Software Quality Journal, vol. 12, p. 2004, 2004.
[19] B. Korel, L. Tahat, and M. Harman, “Test prioritization using system models,” in Proc. of the 21st IEEE International Conference on Software Maintenance (ICSM’05), 2005, pp. 559–568.
[20] Z. Q. Zhou, A. Sinaga, and W. Susilo, “On the fault-detection capabilities of adaptive random test case prioritization: Case studies with large test suites,” in HICSS, 2012, pp. 5584–5593.
[21] D. Jeffrey, “Test case prioritization using relevant slices,” in Intl. Computer Software and Applications Conf., 2006, pp. 411–418.
[22] H. Do, S. Mirarab, L. Tahvildari, and G. Rothermel, “The effects of time constraints on test case prioritization: A series of controlled experiments,” IEEE Transactions on Software Engineering, vol. 36, no. 5, pp. 593–617, 2010.
[23] B. Korel, G. Koutsogiannakis, and L. H. Tahat, “Model-based test prioritization heuristic methods and their evaluation,” in Proceedings of the 3rd International Workshop on Advances in Model-Based Testing, ser. A-MOST ’07. New York, NY, USA: ACM, 2007, pp. 34–43. [Online]. Available: http://doi.acm.org/10.1145/1291535.1291539
[24] E. G. Cartaxo, W. L. Andrade, F. G. O. Neto, and P. D. L. Machado, “LTS-BT: a tool to generate and select functional test cases for embedded systems,” in Proc. of the 2008 ACM Symposium on Applied Computing, vol. 2. ACM, 2008, pp. 1540–1544.
[25] C. F. J. Wu and M. S. Hamada, Experiments: Planning, Analysis, and Optimization, 2nd ed. John Wiley and Sons, 2009.
[26] D. C. Montgomery and G. C. Runger, Applied Statistics and Probability for Engineers. John Wiley and Sons, 2003.
[27] R. K. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley, 1991.
[28] E. Lehmann, Nonparametrics, ser. Holden-Day series in probability and statistics, H. D’Abrera, Ed. San Francisco: Holden-Day, 1975.
[29] J. F. S. Ouriques, “Análise comparativa entre técnicas de priorização geral de casos de teste no contexto do teste baseado em especificação,” Master’s thesis, UFCG, Janeiro 2012.
The Impact of Scrum on Customer Satisfaction: An
Empirical Study
Bruno Cartaxo1 , Allan Araújo1,2 , Antonio Sá Barreto1 and Sérgio Soares1
Informatics Center - CIn / Federal University of Pernambuco - UFPE1
Recife Center for Advanced Studies and Systems - C.E.S.A.R2
Recife, Pernambuco - Brazil
Email: {arsa,bfsc,acsbn,scbs}@cin.ufpe.br
Abstract—In the beginning of the last decade, agile methodologies emerged as a response to software development processes that were based on rigid approaches. In fact, the flexible characteristics of agile methods are expected to suit the less-defined and uncertain nature of software development. However, many studies in this area lack the empirical evaluation that could provide more confident evidence about the contexts in which the claims are true. This paper reports an empirical study performed to analyze the impact of Scrum adoption on customer satisfaction as an external success perspective for software development projects in a software intensive organization. The study uses data from real-life projects executed in a major software intensive organization located in a nationwide software ecosystem. The empirical method applied was a cross-sectional survey using a sample of 19 real-life software development projects involving 156 developers. The survey aimed to determine whether there is any impact on customer satisfaction caused by the Scrum adoption. Considering that sample, our results indicate that it was not possible to establish any evidence that using Scrum may help to achieve customer satisfaction and, consequently, increase the success rates in software projects, contrary to general claims made by Scrum advocates.
I. INTRODUCTION
Since the term software engineering emerged in 1968 [1], it has motivated a tremendous amount of discussion, work, and research on processes, methods, techniques, and tools for supporting high-quality software development on a wide and industrial scale.
Initially, industrial work, based on manufacturing, introduced several contributions to the software engineering body of knowledge. Many software processes have been supported by industrial work concepts such as functional decomposition and localized labor [2]. During the last decades, techniques and tools have been created as an analogy to production lines. The first generation of the software process family was based on the waterfall life cycle, assuming that the software development life cycle was linear and sequential, similar to a production line [3]. Then, in the early 90's, other initiatives were responsible for creating iterative and incremental processes such as the Unified Process [4].
Despite these efforts and investments, software project success rates present a dramatic situation in which less than 40% of projects achieve success (Figure 1). Obviously, these numbers may not be compared to other profitable industries [5].
Fig. 1. 2011 Chaos Report - Extracted from [5]
Some specialists argue that software development is different from traditional industrial work with respect to its nature. Software engineering may be described as knowledge work, which is “focused on information and collaboration rather than manufacturing, placing value on the ownership of knowledge and the ability to use that knowledge to create or improve goods and services” [2]. There are several differences between these two kinds of work. While the work is visible and stable in industrial work, it is invisible and changing in knowledge work. Considering that knowledge work (including software development) is more uncertain and less defined than industrial work, which is based on predictability, the application of industrial work techniques to knowledge work may lead to projects with increased failure rates.
Since 2001, agile methods have emerged as a response for overcoming the difficulties related to software development. Some preliminary results show that agile methodologies may increase success rates, as shown in Figure 2 [5].
Although some results may indicate that agile methodologies help to achieve success in software development, much of this research fails to present evidence through empirical evaluation. Only through such evaluation is it possible to establish whether, and in which context, a proposed method or technique is efficient, effective, and can be applied [6] [7] [8]. In particular, for the agile context, only a minor part of the studies contains an empirical evaluation, as shown in Figure 3 [9].
Fig. 2. Waterfall vs. Agile - Extracted from [5]
Fig. 3. Agile empirical evaluation rate - Extracted from [9]
Thus, the scope of this work was defined with the intention of providing a comparison between agile methods and traditional software development approaches. First, it is necessary to point out that there are several agile methodologies, such as Scrum, Extreme Programming (XP), Feature-Driven Development, Dynamic Systems Development Method (DSDM) and Lean Software Development, that are intended to support knowledge work (less defined and more uncertain) [2]. In parallel, there also exist many traditional approaches that are intended to support industrial work (more defined and less uncertain). These methods and processes are usually based on remarkable frameworks such as the PMBoK (Project Management Body of Knowledge) [10] and the Unified Process [4]. These methods may include several perspectives, such as software engineering, project management, design and so on. For an objective analysis, the project management perspective was chosen. On one hand, for agile methods, Scrum was selected (project management based); on the other hand, any traditional approach that includes a project management perspective was considered.
In this context, a survey was executed at C.E.S.A.R (Recife Center for Advanced Studies and Systems) using a random sample containing 19 different projects, involving 156 developers, adopting Scrum or any other traditional approach for managing the initiative. The main contributions expected from this study are listed below:
this study are listed below:
•
•
Increase the body of knowledge about Scrum and
agile methods using a systematic approach through evidences within an industrial environment. In particular
it is intended to reduce the lack of empirical evaluation
in software development discussions.
Help the organization to understand how to increase
internal success rates by analyzing and discussing the
In particular, it is necessary to recognize that customers
probably have different definitions for “success” within a software project. In order to establish an external perspective, the
model assumes seven critical factors for customer satisfaction
(dependent variables), and consequently, for project success:
time, goals, quality, communication and transparency, agility,
innovation and benchmark. The next subsections provide more
details for each one.
A. Time
In general, “time to market” is a critical variable within
a software project. Thus, we define a project as successful
if agreed and negotiated deadlines are met. Since Scrum is based on small iterations, early delivery of valuable software [11] and a short time to market are expected. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates regarding the time constraints by meeting the agreed and negotiated deadlines. Hypothesis 1: Scrum-based projects provide increased customer satisfaction from the time perspective.
B. Goals
Software projects are launched for strategic purposes, such
as costs reduction, legal compliance, market-share increase,
etc. Thus, we define a project as successful if the goals
that motivated the endeavor are met. Since Scrum considers deeper and more frequent stakeholder participation and collaboration, a continuous adjustment of goals is expected [11]. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the customer needs regarding the goals defined within a project. Hypothesis 2: Scrum-based projects provide increased customer satisfaction from the goals perspective.
C. Quality
By definition, “quality is the degree to which a set of
inherent characteristics fulfill requirements” [10]. Product and
process quality depend on the software project criticality
demanded by the customers. Thus, we define a project as successful if the required quality standards for that specific situation are met. Regular inspections (one of the Scrum pillars) are among the most effective quality tools within a software development project [2]. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the customer needs regarding the quality standards defined within a project. Hypothesis 3: Scrum-based projects provide increased customer satisfaction from the quality perspective.
D. Communication and Transparency
Software projects are expected to create intangible products under a dynamic and uncertain environment. Therefore,
frequent and continuous communication is required in order
to provide confidence to the stakeholders regarding the work progress. One of the Scrum pillars is transparency [11]. Thus, we define a project as successful if the customers feel confident as a result of the communication and transparency. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the customer needs regarding the expected level of communication and transparency within a project. Hypothesis 4: Scrum-based projects provide increased customer satisfaction from the communication and transparency perspective.
E. Agility
Some projects, occurring in fast-moving or time-constrained environments, call for an agile approach [2]. The main characteristics of an agile software project are the “early and continuous delivery of valuable software” and the “ability to provide fast response to changes”. Thus, we define a project as successful if the agility expected by the customers is met. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the agility demanded by the customer. Hypothesis 5: Scrum-based projects provide increased customer satisfaction from the agility perspective.
F. Innovation
Software projects are expected to deliver new software-based products and services for users' and customers' existing and emerging needs. Therefore, innovation comes through new ways of work, study, entertainment, healthcare, etc. supported by software. Since Scrum also supports the principle of “early and continuous delivery of valuable software”, it is expected that Scrum software development might help to create innovative products and services for the customer business. Thus, we define a project as successful if the innovation expected by the customer is met. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the customer expectation through innovative products and services generated by the project. Hypothesis 6: Scrum-based projects provide increased customer satisfaction from the innovation perspective.
G. Benchmark
Usually, software projects are launched as a procurement initiative in which an organization (buyer) hires a development organization (seller) to create a product or service that may be developed by several companies. It is natural that buyer organizations compare their suppliers. In this sense, we consider “benchmark” as a comparison between organizations that develop software. Thus, we define a project as successful if customers would recommend a development organization when comparing its project results to other organizations' project results. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates when a project executed by a specific organization is compared with others. Hypothesis 7: Scrum-based projects provide increased customer satisfaction from the benchmark perspective.
III. RESEARCH METHOD
In order to define a methodology to guide this study, we chose an approach based on surveys and selected five of the six steps recommended by Kitchenham [12], as below:
• Setting the objectives: This study investigates the relationship between Scrum adoption (as a software development approach) and customer satisfaction;
• Survey design: Cross-sectional, since the survey instrument was applied only once at a fixed point in time. It is not intended to provide a forward-looking view of changes in the specific population through time;
• Developing the survey instrument: It was based on a questionnaire designed to identify the customer satisfaction within a particular project, which determines its success degree from the external point of view;
• Obtaining valid data: The questionnaire was sent through e-mail to each customer's business representatives (e.g. sponsor, product or project managers);
• Analyzing the data: Finally, the data analysis was executed using techniques from descriptive and inferential statistics.
The following subsections present discussions related to the population, sample, variables, data collection procedure, and data analysis techniques used for this study.
A. Population
The population for this study is targeted on software
intensive organizations, including companies of different sizes,
developing several software-based solutions for a wide variety
of markets.
B. Sample
It was selected a random sample of projects executed by
C.E.S.A.R - Recife Center for Advanced Studies and Systems1
which belongs to the target population. C.E.S.A.R is an innovation institute which has more than 500 employees working
on projects from different business domains (e.g. finance, third sector, manufacturing, service, energy, government, telecommunication, etc.), creating solutions for several platforms (mobile, embedded, web, etc.). The number of projects varies from 70 to 100 in a year.
1 http://www.cesar.org.br/site/
Initially, the sample contained 27 projects, but it was reduced to 19 projects because incomplete questionnaire responses were eliminated from the sample. Even so, this represents an effective response rate of 70.3%, which is above the minimum norm of 40% suggested by [13] for academic studies.
Furthermore, additional information related to each project was collected, including project type and team size, as below (Figure 4):
• Project type: 5 private and 14 public/Brazilian tax incentive law.
• Team size: from 4 to 21.
• Project nature: Consulting: 4; Information Systems: 3; Telecommunications: 4; Maintenance: 1; Research & Development (R&D): 6; Embedded Systems: 4.
Fig. 4. Contextual variables
Notice that one project may have different natures. For this reason, the numbers may differ slightly from the sample size.
C. Variables
This study contains several variables, as follows:
• Independent Variable: The software process is the independent variable and may assume two different values: Scrum (agile method) and Non-Scrum (any traditional approach).
• Dependent Variables: The success of a software project is the result of customer satisfaction from an external point of view considering several aspects: time, goals, quality, communication and transparency, agility, innovation and benchmark. In order to measure customer satisfaction, a Likert scale was used, assuming values from 1 (poor) to 5 (excellent).
• Contextual Variables: Project type, team size, and project nature were identified as variables that may potentially influence the results. The project type and nature categorization was previously defined. The team size was the number of people involved during the development, including engineers, designers and managers.
D. Data Collection Procedure
First, the questionnaires were sent to customer business representatives through e-mail in a Microsoft Excel spreadsheet format. Each document contained the project categorization regarding the contextual variables (project type, nature, and team size) and the independent variable (Scrum/Non-Scrum).
Thus, the customer business representatives were responsible for answering the questionnaire and then sending it back
to the C.E.S.A.R project management office (PMO).
E. Data Analysis Techniques
The data analysis considered two different techniques. First, an exploratory data analysis (descriptive statistics) was executed using tools such as barplots and boxplots in order to identify preliminary insights about the data characteristics regarding measures such as mean, position and variation.
Then, hypothesis tests (inferential statistics) were conducted to provide more robust information for the data analysis process, as shown in Table I. After the exploratory data analysis, no apparent relevant difference was found in the obtained results. Thus, the alternative hypotheses were modified to verify inequality, instead of superiority.
TABLE I. STUDY HYPOTHESES
Null Hypotheses (NH)   Alternative Hypotheses (AH)
(NH1) Ts = Tns         (AH1) Ts ≠ Tns
(NH2) Gs = Gns         (AH2) Gs ≠ Gns
(NH3) Qs = Qns         (AH3) Qs ≠ Qns
(NH4) CTs = CTns       (AH4) CTs ≠ CTns
(NH5) As = Ans         (AH5) As ≠ Ans
(NH6) Is = Ins         (AH6) Is ≠ Ins
(NH7) Bs = Bns         (AH7) Bs ≠ Bns
IV. RESULTS
A. Descriptive Statistics - Exploratory Data Analysis
Initially, the final sample - the one containing 19 projects
- was divided into two groups (Scrum and Non-Scrum). Then,
some exploratory data analysis techniques (descriptive statistics) were applied in order to find out central tendency, position
and dispersion related to the data set. On one hand, barplots
(Figure 5) helped to identify the means (central tendency)
for each variable representing different aspects of customer
satisfaction. On the other hand, boxplots (Figure 6) helped to
reveal the data dispersion and position [14].
Fig. 5. Dependent variables means
Fig. 6. Dependent variables boxplots
According to the barplots in Figure 5, we can notice that the projects using Scrum presented better results considering the following aspects: time, communication and transparency, and agility. The projects that did not use Scrum presented better results for the quality, goals, innovation and benchmark aspects. Despite these results, it is not possible to assume that either group (Scrum or Non-Scrum) has an absolute advantage.
According to the boxplots in Figure 6, it is possible to make some comments about each aspect of customer satisfaction considering the grades obtained from the sample observations:
• Time (T): For the Scrum projects, the grades presented a dispersion from two to five, and the second and third quartiles are coincident, showing that many grades of four were given by the customers. For the Non-Scrum projects, the grades presented a more concentrated behavior, with a dispersion from three to five, and the first and second quartiles are coincident.
• Goals (G): For both groups, it was possible to identify a more concentrated data dispersion: from four to five in the Scrum projects, and three to four in the Non-Scrum projects. Besides, there are many occurrences of grade four in both groups. In particular, for the Non-Scrum group, an outlier may be seen (the grade five).
• Quality (Q): For the Scrum group, the variation (dispersion) was from three to four and the mode (most frequent value) was four, with two outliers (the grades two and five). For the Non-Scrum group, there was a lot of data dispersion, from grade one to five, and three was the mode.
• Communication and Transparency (CT): For the Scrum group, there was a variation (data dispersion) from grade two to five without a predominance of any value. For the Non-Scrum group, the grades were more concentrated, from grade four to five, and the mode was five.
• Agility (A): Both boxplots (Scrum and Non-Scrum groups) for the agility variable were extremely similar, presenting a variation from grade three to five, and the mode was the grade four.
• Innovation (I): For the Scrum group, the variation was from grade four to five with an outlier (the grade three). For the Non-Scrum group, the grades presented a dispersion from grade two to five.
• Benchmark (B): For both groups, the variation was the same: from grade three to five, without any additional information.
Finally, it is not possible to determine a relevant difference between the results of the groups considering the seven dependent variables as aspects of customer satisfaction. Therefore, there is no evidence of an advantage for the projects in which Scrum was applied.
B. Inferential Statistics - Hypotheses Tests
Since the exploratory data analysis (descriptive statistics) was not able to provide any conclusion within this study, it was decided to go ahead with another method. Hypothesis tests (inferential statistics) were then used, intending to establish a systematic basis for a decision about the behavior of the data set.
First, the same previous segmentation was applied, separating the sample into two groups: Scrum (seven elements) and Non-Scrum (12 elements) projects. Thus, we assumed both as independent samples containing ordinal data. In this case, it is recommended to use a nonparametric test for ordinal variables. In particular, the Mann-Whitney U test was chosen [15]. When performing a nonparametric (or distribution-free) test, there is no need to perform any kind of normality test (goodness of fit).
The choice of the Mann-Whitney U test did not harm the analysis: in situations where the data are normal, the loss of efficiency compared to using Student's t test is only 5%; in situations where the data distribution has a “heavier” tail than normal, the U test will be more efficient [14].
Thus, hypothesis tests were performed (using the U test) through the R language2 to determine equality or inequality considering the sample means for each group (Scrum and Non-Scrum) from the perspective of each aspect (dependent variable).
2 http://www.r-project.org/
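For illustration, the same kind of comparison can be reproduced with SciPy's Mann-Whitney U implementation (the authors used R); the grades below are made-up Likert scores, not the survey data.

```python
from scipy.stats import mannwhitneyu

# Made-up 1-5 Likert grades for one aspect (e.g. Time), per group.
scrum     = [4, 5, 3, 4, 2, 4, 5]                   # 7 Scrum projects
non_scrum = [3, 4, 4, 3, 5, 4, 3, 4, 5, 3, 4, 4]    # 12 Non-Scrum projects

# Two-sided test: H0 says the two groups have the same distribution of grades.
stat, p_value = mannwhitneyu(scrum, non_scrum, alternative='two-sided')
print(f"U = {stat}, p-value = {p_value:.4f}")
# Reject H0 only when p < 0.05 (the threshold used in the paper).
```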
TABLE II. HYPOTHESES TEST RESULTS
Criterion   NH            AH             P-Value   W
T           Ts = Tns      Ts ≠ Tns       0.09736   60
G           Gs = Gns      Gs ≠ Gns       0.1137    26
Q           Qs = Qns      Qs ≠ Qns       0.7911    39
CT          CTs = CTns    CTs ≠ CTns     0.4849    49.5
A           As = Ans      As ≠ Ans       0.7126    46
I           Is = Ins      Is ≠ Ins       0.4681    34
B           Bs = Bns      Bs ≠ Bns       0.8216    39.5
According to the previous hypothesis definitions, equality was to be accepted if the null hypothesis could not be rejected. Otherwise (in case of null hypothesis rejection), we were to recognize a difference related to the means of each group and assume the alternative hypothesis. The results obtained are presented in Table II.
The reference parameter used to decide on the acceptance or rejection of the null hypothesis was the p-value, which is the test significance level. The p-value obtained in each test was compared to Fisher's scale [14], which states that any p-value less than 0.05 should cause the rejection of the null hypothesis. The obtained p-values were all clearly above 0.05; in this case, no null hypothesis should be rejected.
Therefore, there is no evidence that the Scrum group results were higher than the Non-Scrum group results. Thus, it is not possible to infer that the adoption of Scrum increases customer satisfaction (and project success as well) within the scope of this research work.
V. LIMITATIONS AND FUTURE WORK
In this research work, internal validity was reduced at the expense of external validity. On one side, the data was collected from real-life industrial software development projects, which helped to increase the study's external validity. On the other hand, there are several contextual variables (e.g. organization culture and environmental factors) that were not controlled and may influence the results, harming the internal validity. For studies with reduced internal validity, it is not possible to determine causality or to generalize to other contexts.
Furthermore, it is important to point out that no test was executed to evaluate the questionnaire's psychometric properties, which may jeopardize the construct validity of this research.
In spite of these limitations, the study is expected to contribute to the agile methodologies body of knowledge and to the Scrum discussion (in particular), since it is supported by real-life experiences and an empirical evaluation. Thus, some refinements are listed as possible future work below:
• Increase the sample size to obtain more robust results. The larger the sample, the stronger the inferences about the behavior of the data population will be.
• Investigate perspectives other than customer satisfaction, such as team satisfaction, related to the definition of success within a software project.
• Execute a different empirical evaluation technique. An experiment would be promoted in order to determine causality relationships. Additionally, a case study would be included, intended to figure out behaviors and phenomena that may lead to increased customer satisfaction.
VI. RELATED WORK
França et al. [16] conducted a survey aimed at investigating the relationship between the usage of agile practices and the success of projects using Scrum. The context of that research was similar to the one considered in this work: software development companies located in the Porto Digital initiative, Recife, Pernambuco, Brazil. Among the 25 attributes of agile methodologies, only 8 (32%) correlated with the success of the projects. Thus, as in our study, the agile methodology practices do not seem to show evidence of being decisive for project success.
In contrast, a longitudinal case study conducted for two years by Mann [17] obtained quantitative indications that Scrum adoption may lead to increased customer satisfaction and overtime reduction.
Begel [18] also presented an industrial survey with Microsoft employees about the use of agile practices. In this context, improved communication, quick releases, and flexibility/rapid response to changes were reported as the main benefits. On the other side, disadvantages were also reported, including an excessive number of meetings, difficulty scaling up to large projects, and buy-in decision management.
VII. CONCLUSION
This paper has described an empirical evaluation designed to provide insights for the question: “What is the impact of Scrum on customer satisfaction?”. In general, people who are enthusiastic about agile methods (including Scrum) argue that these approaches are more suitable for software development, which is uncertain and requires flexibility to accommodate changes. In this context, we aimed to investigate the relationship between the adoption of agile methodologies and increased success rates in software development projects. In order to provide an accurate comparison, we defined the scope as considering an external perspective of success based on customer satisfaction according to several aspects, including time, goals, quality, communication and transparency, agility, innovation and benchmark (dependent variables). Thus, other perspectives and aspects were considered out of scope for this research. Additionally, for a proper comparison, the study focused on the project management property of software development approaches.
We chose a cross-sectional survey using a real-life project sample as our empirical evaluation method. The sample was separated into two groups, Scrum and Non-Scrum (independent variable). This segmentation was intended to allow a comparison between projects using Scrum and those using some other traditional approach for managing software development. In particular, the comparison was performed for each dependent variable, intending to promote a detailed analysis instead of an overall comparison.
The preliminary results from the exploratory analysis showed no differences regarding the data behavior of both groups (Scrum and Non-Scrum), considering several properties such as central tendency, position, and dispersion. Then, quantitative analysis using a Mann-Whitney hypothesis test (U test) also showed no relevant difference between the results of both groups. Therefore, it was not possible to establish any superiority associated with the use of Scrum in software development projects.
We recognize some limitations of this study. First, the internal validity might be threatened since we did not control any contextual variable. Then, the construct validity might be harmed because we were not able to verify the psychometric properties of the questionnaire or the standard application of Scrum practices and guidelines. In spite of these limitations, we expect this research to help industry and academia in developing the software development body of knowledge by combining scientific rigor with industry experience. We also expect to contribute to the organization (C.E.S.A.R) in understanding how to increase success rates internally.
In the future, we intend to execute another survey with an increased sample size (containing projects from several organizations), considering the contextual variables as criteria for data categorization. By promoting these refinements, we aim to figure out patterns of data behavior for specific groups. In addition, other empirical evaluation techniques (experiments, case studies) might be applied in order to overcome the limitations mentioned previously.
APPENDIX - QUESTIONNAIRE
This appendix describes the questionnaire used as the instrument to measure each specific aspect of customer satisfaction, as well as the Likert scale anchoring approach, as recommended by Uebersax [19]. It is intended to provide a common understanding about the conceptual model and its qualitative values.
A. Time: What is the customer feeling regarding the project deadlines?
5. Excellent: All deadlines defined or negotiated with the customer have been achieved. Deadlines adjusted due to dependencies external to the customer must be considered here.
4. Good: All deadlines were met, including the ones renegotiated due to internal technical problems within the executing organization. In this classification, each deadline may not have been rescheduled more than once.
3. Fair: Existence of deadlines negotiated more than once, due to problems with the executing organization, but that were met.
2. Unsatisfactory: Existence of some deadlines that were not met, with late deliveries.
1. Poor: Many of the time constraints were not met, or there were delays that seriously impacted the customer.
B. Goals: Does the customer think the project objectives were met?
5. Excellent: All agreed objectives were met.
4. Good: Nearly all agreed objectives were met; the goals not met had lower priority.
3. Fair: Some important goals were not met according to the customer expectations.
2. Unsatisfactory: Several important goals were not met.
1. Poor: The executing organization staff showed a lack of ability to identify customer needs, and the care of the goals was very unsatisfactory.
C. Quality: What is the perception about the quality of the project and its products and services?
5. Excellent: No defects, or only a few minor ones, were found.
4. Good: Some low severity defects were found and they were resolved in a satisfactory manner and within the agreed time.
3. Fair: A few moderate severity defects were detected and they were resolved in a satisfactory manner and within the agreed time.
2. Unsatisfactory: Various defects of low severity were identified, or defects in general were not resolved within the time agreed with the client.
1. Poor: Critical severity defects were identified at the stage of acceptance tests.
D. Communication and Transparency: Does the customer
feel comfortable due to the information provided on the
progress of the project?
5. Excellent: Very effective communication between the executing organization and the customer is performed proactively, without the client's request, providing the proper level of information.
4. Good: Continuous transparency to the project through its
execution. Communication is established when requested by
the customer.
3. Fair: Some problems related to at least one of the following: information display, lack of information, form of presentation, or data confusion.
2. Unsatisfactory: At various times it was not easy to see the actual project progress, and information was not available when it should have been.
1. Poor: Transparency about the project progress was
nonexistent throughout its execution.
E. Agility: What is the customer perception about
the organization agility within a specific project?
5. Excellent: Expectations exceeded and high level of
professionalism.
4. Good: There was satisfactory flexibility.
3. Fair: There was flexibility, but sometimes expectations were not met.
2. Unsatisfactory: There were some problems that did not impact the project execution; however, there are several areas for improvement.
1. Poor: There were major problems that impacted the project execution, including unresolved and controversial issues.
F. Innovation: What is the customer perception about the team's capacity to bring innovation and innovative solutions?
5. Excellent: Team with excellent ability to present innovative
and efficient solutions, beyond expectations.
4. Good: Team presented satisfactory innovative solutions,
meeting expectations.
3. Fair: Team presented some innovative solutions, but not
all expectations were met.
2. Unsatisfactory: Team with low capacity to present
innovative solutions to the tasks. Several problems were faced
when trying to resolve more complex requirements.
1. Poor: Lack of ideas / innovative solutions, not meeting the
expectations.
G. Benchmark: What is the organization's performance compared to other suppliers considering the project execution?
5. Excellent
4. Good
3. Fair
2. Unsatisfactory
1. Poor
ACKNOWLEDGMENT
The authors would like to thank C.E.S.A.R (Recife Center for Advanced Studies and Systems) for kindly providing real-life project data obtained from its PMO (Project Management Office). We would also like to thank the Federal University of Pernambuco (UFPE) and the Informatics Center (CIn) for supporting this research work. This work was partially supported by the National Institute of Science and Technology for Software Engineering (INES), grants 573964/2008-4 (CNPq) and APQ-1037-1.03/08 (FACEPE). Bruno Cartaxo and Antonio Sa Barreto are supported by FACEPE; Sérgio Soares is partially supported by CNPq grant 305085/2010-7.
REFERENCES
[1] P. Naur and B. Randell, Eds., Software Engineering: Report of a conference sponsored by the NATO Science Committee, Garmisch, Germany, 7-11 Oct. 1968, Brussels, Scientific Affairs Division, NATO, 1969.
[2] M. Griffiths, PMI-ACP Exam Prep: Rapid Learning to Pass the Pmi Agile Certified Practitioner (Pmi-acp) Exam - on Your First Try!: Premier Edition. Rmc Publications Incorporated, 2012. [Online]. Available: http://books.google.com.ar/books?id=mM6rtgAACAAJ
[3] I. Sommerville, Software Engineering, 9th ed. Harlow, England: Addison-Wesley, 2010.
[4] P. Kruchten, The Rational Unified Process: An Introduction, 3rd ed. Boston: Addison-Wesley, 2003.
[5] “2001 chaos report,” Tech. Rep.
[6] B. Cartaxo, I. Costa, D. Abrantes, A. Santos, S. Soares, and V. Garcia, “Eseml: empirical software engineering modeling language,” in Proceedings of the 2012 workshop on Domain-specific modeling, ser. DSM ’12. New York, NY, USA: ACM, 2012, pp. 55–60. [Online]. Available: http://doi.acm.org/10.1145/2420918.2420933
[7] D. I. K. Sjoberg, T. Dyba, and M. Jorgensen, “The future of empirical methods in software engineering research,” in 2007 Future of Software Engineering, ser. FOSE ’07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 358–378. [Online]. Available: http://dx.doi.org/10.1109/FOSE.2007.30
[8] W. F. Tichy, “Should computer scientists experiment more?” Computer, vol. 31, no. 5, pp. 32–40, May 1998. [Online]. Available: http://dx.doi.org/10.1109/2.675631
[9] T. Dyba and T. Dingsoyr, “What do we know about agile software
development?” Software, IEEE, vol. 26, no. 5, pp. 6–9, 2009.
[10] PMI, Ed., A Guide to the Project Management Body of Knowledge (PMBOK Guide): An American National Standard ANSI/PMI 99-001-2008, 4th ed. Newtown Square, PA: Project Management Institute, 2008.
[11] K. Schwaber, Agile Project Management With Scrum. Redmond, WA,
USA: Microsoft Press, 2004.
[12] B. A. Kitchenham and S. L. Pfleeger, “Personal Opinion Surveys,” in
Guide to Advanced Empirical Software Engineering, F. Shull, J. Singer,
and D. I. K. Sjøberg, Eds. Springer, 2008, pp. 63–92+.
[13] Y. Baruch, “Response Rate in Academic Studies-A Comparative
Analysis,” Human Relations, vol. 52, no. 4, pp. 421–438, Apr. 1999.
[Online]. Available: http://dx.doi.org/10.1177/001872679905200401
[14] W. O. Bussab and P. A. Morettin, Estatística Básica, 6th ed. Saraiva, 2010.
[15] S. Siegel and N. Castellan, Nonparametric statistics for the behavioral
sciences, 2nd ed. McGraw–Hill, Inc., 1988.
[16] A. C. C. França, F. Q. B. da Silva, and L. M. R. de Sousa Mariz,
“An empirical study on the relationship between the use of
agile practices and the success of scrum projects,” in Proceedings
of the 2010 ACM-IEEE International Symposium on Empirical
Software Engineering and Measurement, ser. ESEM ’10. New
York, NY, USA: ACM, 2010, pp. 37:1–37:4. [Online]. Available:
http://doi.acm.org/10.1145/1852786.1852835
[17] C. Mann and F. Maurer, “A case study on the impact of scrum
on overtime and customer satisfaction,” in Proceedings of the
Agile Development Conference, ser. ADC ’05. Washington, DC,
USA: IEEE Computer Society, 2005, pp. 70–79. [Online]. Available:
http://dx.doi.org/10.1109/ADC.2005.1
[18] A. Begel and N. Nagappan, “Usage and perceptions of agile
software development in an industrial context: An exploratory study,”
in Proceedings of the First International Symposium on Empirical
Software Engineering and Measurement, ser. ESEM ’07. Washington,
DC, USA: IEEE Computer Society, 2007, pp. 255–264. [Online].
Available: http://dx.doi.org/10.1109/ESEM.2007.85
[19] J. S. Uebersax, “Likert scales: Dispelling the confusion,” Apr. 2013. [Online]. Available: http://www.john-uebersax.com/stat/likert.htm
Identifying a Subset of TMMi Practices to Establish
a Streamlined Software Testing Process
Kamilla Gomes Camargo∗, Fabiano Cutigi Ferrari∗, Sandra Camargo Pinto Ferraz Fabbri∗
∗Computing Department – Federal University of São Carlos – Brazil
Email: {kamilla camargo, fabiano, sfabbri}@dc.ufscar.br
Abstract—Context: Testing is one of the most important phases
of software development. However, in industry this phase is
usually compromised by the lack of planning and resources. Due to this, the adoption of a streamlined testing process can lead to the construction of software products with desirable quality levels. Objective: To present the results of a survey conducted to identify
a set of key practices to support the definition of a generic,
streamlined software testing process, based on the practices described in the TMMi (Test Maturity Model integration). Method:
Based on the TMMi, we have performed a survey among software
testing professionals who work in both academia and industry.
Their responses were analysed quantitatively and qualitatively in
order to identify priority practices to build the intended generic
process. Results: The analysis enabled us to identify practices that
were ranked as mandatory, i.e. those that are essential and should
be implemented in all cases. This set of practices (33 in total)
represents only 40% of the TMMi’s full set of practices, which
sums up to 81 items related to the steps of a testing process.
Conclusion: The results show that there is consensus on a subset
of practices that can guide the definition of a lean testing process
when compared to a process that includes all TMMi practices.
It is expected that such a process encourages a wider adoption
of testing activities in software development.
I. INTRODUCTION
Since software became widely used, it has played an important role in people's daily lives. Consequently,
its reliability cannot be ignored [1]. In this context, quality
assurance (QA) activities should monitor the whole development process, promoting the improvement of the final product
quality, and hence making it more reliable. One of the main
QA activities is software testing that, when well executed, may
deliver a final product with a low number of defects.
Despite the importance of software testing, many companies
face difficulties in devising a software testing process and customising it to their reality. A major barrier is the difficulty in
adapting testing maturity models for the specific environment
of the organisation [2]. Many organisations realise that process
improvement initiatives can solve these problems. However, in
practice, defining the steps that can be taken to improve and
control the testing process phases and the order they should
be implemented is, in general, a difficult task [3].
Reference models, such as TMMi [4], point out what should
be done for the improvement of the software testing process.
However, such models do not indicate how to do it. Despite
the model organisation in levels (such as in CMMI [5]), which
suggests an incremental implementation from the lowest level,
TMMi has a large number of practices that must be satisfied,
though not all of them are feasible for all sizes of companies
and teams. In addition, the establishment of a testing process
relying on a reference model becomes a hard task due to the
difficulty of model comprehension. Moreover, the models do
not define priorities in case of lack of time and/or resources,
thus hindering the whole model adoption.
According to Purper [6], the team responsible for defining the testing process usually outlines a mind map of the model
requirements in relation to the desired testing process. During
the elaboration of the real testing process, this team manually
verifies whether the mandatory practices, required by the
model, are addressed. In general, these models indicate some
prioritisation through their levels; however, within each level,
it is not clear what should be satisfied at first.
For better results, the testing process should include all
phases of software testing. However, the process should be as
minimal as possible, according to the reality of the company
and the model used for software development. This adequacy can make the testing process easier to apply without requiring many resources or a large team. This shall help the
company achieve the goal of improving the product quality.
Based on this scenario, we conducted a survey in order
to identify which are the practices of TMMi that should
be always present in a testing process. Our goal was to
characterise the context of Brazilian companies to provide
them with a direction on how to define a lightweight, still
complete testing process. Therefore, the survey results reflect
the point of view of Brazilian testing professionals. Given that
a generic testing process encompasses phases such as planning,
test case design, execution and analysis, and monitoring [7, 8],
we expected the survey could indicate which are the essential
practices for each phase. The assumption was that there are
basic activities related to each of these phases that should never
be put aside, even though budget, time or staff are scarce.
The remainder of this paper is organised as follows: Section II describes the underlying concepts of this research. Section III presents the survey planning, how the participants were
invited, and the data evaluation methods. Section IV shows the survey results and the participants' profiles. Section V discusses
these results for each stage of the generic testing process.
Finally, Section VI presents possible threats to the validity
of the survey, and Section VII presents the conclusions.
II. BACKGROUND
TMMi [4] is a reference model that complements CMMI [5]
and was established to guide the implementation and improvement of testing processes. It is similar to CMMI in structure,
because it includes maturity levels that are reached through
the achievement of goals and practices. For TMMi, a process
evolves from a chaotic initial state (Level 1), to a state in
which the process is managed, controlled and optimised (Level
5). Each specific goal indicates a single characteristic that
must be present in order to satisfy the corresponding process
area. A specific goal is divided into specific practices that
describe which activities are important and can be performed
to achieve the goal. Generic goals are related to more than
one process area and describe features which may be used
to institutionalise the testing process. Figure 1 illustrates the
structure of TMMi. The survey questionnaire of this study was
developed based on the TMMi specific goals and practices.
Each goal was represented by a question and each practice by a sub-question.
Fig. 1. TMMi structure and components [4]
Höhn [9] has defined a mind map of TMMi. The map
distributes process areas, specific goals and their practices
throughout phases of a generic testing process. This map is
called KITMap and was developed to facilitate the TMMi
understanding and to share information. In the map, the
root node is the name of the treated theme, i.e. the testing
process. Nodes of the second level are the phases of a generic
testing process. Such phases guided the grouping of the survey
questions. They are: Planning, Test Case Design, Setup of
Test Environment and Data, Execution and Evaluation, and
Monitoring and Control. At the third level of KITMap are
process areas of TMMi.
Höhn [9] organised the process areas of TMMi according
to their relation to each phase of the generic testing process.
Figure 2 illustrates, from the left side, (i) the phase of the
generic testing process (i.e. Test Case Design); (ii) two process
areas that are related to that phase; (iii) the specific goal
related to the first process area; and (iv) the various specific
practices related to the specific goal. Note that process areas
from different TMMi levels may be associated to the same
phase of the generic testing process. This can be observed
in Figure 2, in which one process area is from Level 2 of
TMMi while the other is from Level 3. Despite this, both are
associated to the same phase (Test Case Design).
III. SURVEY PLANNING
A. Survey Goals
This study was performed with the aim of identifying
which are the most important practices of TMMi, and hence
should be prioritised during the testing process execution,
according to the opinion of professionals who have worked with software testing for three years or more. This study is motivated by our
experience and, equally important, real life observations that
some testing-related practices should never be put aside, even
though time, budget or human resources are scarce.
B. Survey Design
The survey was developed using the Lime Survey tool [10].
Lime Survey allows one to organise survey questions in groups
and visualise them in separate web pages. The questionnaire
is based on TMMi Version 3.1 [4].
The questions were split into six groups. The first group
aims to characterise the subject profiles; it includes questions
related to the level of knowledge on software quality reference
models, namely, CMMI [5], MR-MPS [11] and TMMi. The
remaining groups of questions (i.e. 2 to 6) each focuses on
a phase of a generic testing process, as defined by Höhn [9].
The phases are: (1) Planning (2) Test Case Design (3) Setup of
Testing Environment and Data (4) Execution and Evaluation
(5) Monitoring and Control.
Each questionnaire page includes a single group of questions. The first page also brings some directions regarding
how to fill in the forms, including a table to describe the
values the subjects could use to assign each TMMi practice
a level of importance. The values are described in Table I.
Note that we decided not to include a neutral value for the
scale of importance. This was intended to make the subject
decide between practices that should be classified as priority
(i.e. levels 4 or 3 of importance) or not (i.e. levels 2 or 1).
All questions related to testing practices (i.e. groups 2 to 6)
are required; otherwise, a subject cannot go ahead to the next
group of questions.
TABLE I
LEVELS OF IMPORTANCE FOR SURVEYED PRACTICES.
1 - Dispensable: Dispensable activity; does not need to be performed.
2 - Optional: Activity that does not necessarily need to be performed.
3 - Desirable: Activity that should be implemented, though it may be put aside.
4 - Mandatory: Essential activity that must always be performed.
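As an illustration, the importance scale of Table I and the priority split described above (levels 3-4 versus 1-2) could be encoded as follows; the enum and helper names are assumptions made for this sketch.

# Sketch encoding Table I's four-level importance scale and the "priority"
# split described in the text (levels 3-4 vs. 1-2). Names are illustrative.
from enum import IntEnum

class Importance(IntEnum):
    DISPENSABLE = 1
    OPTIONAL = 2
    DESIRABLE = 3
    MANDATORY = 4

def is_priority(level: Importance) -> bool:
    """Levels 3 and 4 are treated as priority; 1 and 2 are not."""
    return level >= Importance.DESIRABLE

assert is_priority(Importance.MANDATORY) and not is_priority(Importance.OPTIONAL)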
We highlight two key points in this survey: (1) none of the
subjects were told the questionnaire was based on the TMMi
structure – this intended to avoid bias introduced by knowledge
on the process maturity model; and (2) the subjects should
answer the questionnaire according to their personal opinion
– this intended to avoid bias introduced by the company or
institution context.
To build the questionnaire, which is available online1 , we
translated TMMi goals, practices and other items of interest to
Portuguese, since there is no official translation of TMMi to
languages other than English. The translation took into account
technical vocabulary in the target language (i.e. Portuguese).
In the questionnaire, every question within a given group
(i.e. within a specific testing process phase) regards a TMMi
Specific Goal. Each question includes a set of sub-questions
regarding the TMMi Specific Practices (SPs). Note that in
TMMi a Specific Goal is achieved when the associated SPs
are performed in a testing process. Therefore, assigning a
particular set of SPs levels of importance shall allow us to also draw conclusions about the Specific Goal relevance, according to the subject's personal experience.

1 http://amon.dc.ufscar.br/limesurvey/index.php?sid=47762&lang=pt-BR – accessed on 17/04/2013.

Fig. 2. KITMap excerpt (adapted from Höhn [9]).
Figure 3 illustrates a question related to the Planning
Phase. This question addresses the Perform a Product Risk
Assessment Specific Goal, and includes three sub-questions
regarding the associated Spa’s. As previously described, the
subject should assign a level of importance ranging from 1
to 4 to each SP (see Table I), according to its opinion about
the relevance of the SP to achieving the goal defined in the
question. Note that all questions bring a side note to help the
subject understand and properly answer the question. This help
note can be seen in the bottom of Figure 3.
Characterising the Profiles: The first group of questions aims
to characterise the profile of subjects taking into account their
work environment. Figure 4 shows part of the profile form.
To design the profile form, we considered that the subject’s
experience and the process maturity level of its institution
or company impact on the subject’s knowledge on testing.
Therefore, the following information is required:
• Experience with software testing (research, industry and
teaching): it is well-known that tacit knowledge is different
from explicit knowledge. Due to this, this information aims
to characterise different types of knowledge, acquired either
with industrial, research or teaching experience.
• Testing process in the company: this information is required only for those who report experience in industry, in
order to characterise their work environment.
• Certification in process maturity model: this information is
required for those who report their companies have any certification in maturity models; if applicable, the subject is required to inform which maturity model (namely, MR-MPS,
CMMI, TMMi or any other) and the corresponding level.
This might have impact on the subject’s personal maturity
regarding the model.
• Knowledge of TMMi and MR-MPS: knowledge of reference models, especially TMMi, grants the subject a higher
maturity regarding testing processes.
C. Obtained Sample
For this survey, a personal e-mail announcement was sent to Brazilian software testing professionals from both academia and industry. It was also announced in a mailing list2 that includes more than 3,000 subscribers from Brazil. Furthermore,
we invited professionals that work for a pool of IT companies
named PISO (Pólo Industrial de Software)3 from the city of
Ribeirão Preto, Brazil.
The questionnaire was made available in December, 2011,
and remained open for a period of 45 days. In total, we
registered 113 visits, from which 39 resulted in fully answered
questionnaires that were considered for data analysis. Even though the sample is not large, these 39 answers allowed us to analyse the data statistically, albeit with less rigour than in analyses applied to large samples. The analysis procedures
are described in the following section.
D. Data Analysis Procedures
Initial Analysis: An initial data analysis revealed that the
practices were mostly ranked as 3 and 4 in regard to their level
of importance. This is depicted in Figure 5, which groups
the answers of all subjects for all questions according to
the assigned levels of importance4 . This initial analysis also
allowed us to identify two outliers which were removed from
the dataset: the first regards a subject that assigned level 4 to
all practices, while the second inverted all values in his/her
answers (i.e. he/she interpreted the value of 4 as the lowest
level of importance and the value of 1 as the highest level).
Therefore, the final dataset, depicted in Figure 5, comprises
37 fully filled in questionnaires.
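A minimal sketch of this kind of screening is shown below, assuming the answers are stored as a pandas DataFrame with one row per respondent and one column per practice; this layout and the code are illustrative assumptions, not the scripts used in the study.

# Sketch of the outlier screening described above. A respondent whose
# answers are constant (e.g. all 4s) is dropped; a fully inverted scale,
# like the second outlier reported above, still requires manual inspection.
import pandas as pd

def screen_outliers(answers: pd.DataFrame) -> pd.DataFrame:
    """Drop respondents whose answers are all identical."""
    constant = answers.nunique(axis=1) == 1
    return answers.loc[~constant]

raw = pd.DataFrame({"p1": [4, 4, 3], "p2": [4, 3, 4], "p3": [4, 4, 2]})
clean = screen_outliers(raw)   # respondent 0 (all 4s) is removed
print(len(raw), "->", len(clean))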
2 http://br.dir.groups.yahoo.com/group/DFTestes – accessed on 17/04/2013.
3 http://www.piso.org.br/ – accessed on 17/04/2013.
4 Note that we had a total of 37 sets of answers for 81 questions; thus, the four groups shown in Figure 5 sum up to 2997 individual answers.
Fig. 3. Example of question, structured according to a testing process phase and TMMi Specific Goals and Practices.
Fig. 4. Part of the profile characterisation form (translated to English).
Fig. 5. Frequency distribution of levels of importance, considering all answers of all subjects.
In this survey, we considered the following independent
variables: (i) industrial experience with testing process; (ii)
knowledge and usage experience with MR-MPS; and (iii)
knowledge of TMMi. The dependent variable is the level of
importance assigned to each practice. The scale used for the
dependent variable characterises data with ordinal measurement level, i.e. we were dealing with discrete values. Besides,
the data distribution was non-symmetric since the vast majority
of practices were ranked as 3 and 4, as shown in Figure 5.
The characteristics of the data led us to use the nonparametric Sign Test [12]. This test evaluates if the median, for
a given set of values (in our case, for each practice), is higher
than a fixed value. We used the fixed value of 3.5, which would
allow us to identify which practices were indeed classified
as mandatory (i.e. with the maximum level of importance), since more than 50% of the subjects would have ranked those practices as mandatory.
Due to the size of our sample, we adopted a p-value=0.15 to
draw conclusions on the executed tests. Even though this is not
a widely adopted level of confidence, some other exploratory
studies [13, 14], which dealt with similar small samples, also
adopted relaxed levels of confidence instead of the traditional
p-value=0.01 or p-value=0.05.
The results of this analysis did not show statistical significance for some practices, even when the majority of subjects assigned levels 3 or 4 to those practices. For instance,
the Identify and prioritise test cases practice was ranked as
mandatory by most of the subjects (19 out of 37); however,
the Sign Test did not show statistical significance. Obviously, the sample size may have impacted the sensitivity of the statistical test, leading to inconclusive results even in cases where the majority of answers ranged from 3 to 4. This is the case of the Identify and prioritise test conditions practice. The answer distribution for this practice is summarised in Table II. The figures show that the number of subjects that assigned this practice level 3 of importance is higher than the number that assigned it level 4; despite this, we could not observe any statistically significant difference in favour of the former (i.e. level 3).
TABLE II
LEVELS OF IMPORTANCE ASSIGNED TO THE Identify and prioritise test conditions PRACTICE.
Level | Number of answers
4     | 14
3     | 19
2     | 2
1     | 2
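Because the scale has no values above 4, testing whether the median exceeds 3.5 reduces to asking whether more than half of the answers are 4s, which can be sketched as a one-sided binomial test. The use of scipy.stats.binomtest below is an illustrative assumption, not necessarily the tool used by the authors; the counts come from Table II.

# Sketch of the Sign Test used in Section III-D. On a 1-4 scale, a median
# above 3.5 means that more than 50% of the answers are 4s, so the test can
# be expressed as a one-sided binomial test on the number of 4s.
from scipy.stats import binomtest

answers_at_level_4 = 14        # Table II: answers at level 4 (mandatory)
total_answers = 37             # Table II: 14 + 19 + 2 + 2

result = binomtest(answers_at_level_4, total_answers, p=0.5, alternative="greater")
print(f"p-value = {result.pvalue:.3f}")

# The p-value is far above the adopted threshold of 0.15, so this practice
# cannot be classified as mandatory on statistical grounds alone.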
After this first analysis, we elaborated a new set of three
questions to help us clarify some open issues. These new
questions, which require simple “Yes” or “No” answers, aimed
to resolve some dependencies observed in the results. Such
dependencies were identified by Höhn [9] and indicate that
the implementation of some testing-related practices require
the previous implementation of others. This new questionnaire
was announced by e-mail to all subjects who answered the
first one, and remained open for a period of 14 days. We had
feedback from 14 subjects. The results of this new round of
questions are discussed in Section V.
A new analysis, based on the frequency of answers in the first set of questions, indicated some trends the statistical tests could not reveal. It consisted of a descriptive analysis of the data, since we were unable to draw conclusions on some practices, even when they were ranked as mandatory by many subjects. In short, we identified the practices that were mostly ranked as mandatory when compared to the other values of the scale (desirable, optional and dispensable – see Table I).
In spite of the weak confidence such an analysis may provide, the identified subset of practices was similar to the subset obtained solely on a statistical basis. In fact, this set of practices included all practices identified through the aforementioned statistical procedures.
results is depicted in the Venn diagram of Figure 7. Details
are discussed in the next section.
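A minimal sketch of this descriptive selection is shown below, again assuming a respondent-by-practice DataFrame; the selection rule (level 4 being the most frequent answer) follows the description above, while the data layout and names are assumptions.

# Sketch of the descriptive analysis: a practice enters the reduced set when
# level 4 ("mandatory") is its most frequent answer.
import pandas as pd

def mostly_mandatory(answers: pd.DataFrame) -> list[str]:
    """Return the practices whose most frequent rating is 4."""
    selected = []
    for practice in answers.columns:
        counts = answers[practice].value_counts()
        if counts.idxmax() == 4:
            selected.append(practice)
    return selected

demo = pd.DataFrame({"Identify product risks": [4, 4, 3, 4],
                     "Define entry criteria": [3, 3, 2, 4]})
print(mostly_mandatory(demo))  # -> ['Identify product risks']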
IV. RESULTS
The results of the survey are described in this section.
Initially, Section IV-A defines some profiles, each representing
a group of subjects, based on the experience reported in
the profile characterisation form. Then, Section IV-B shows
the results with respect to the level of importance of TMMi
practices according to each profile.
A. Profile Definition
Figure 6 summarises the level of knowledge of both the subjects and their institutions according to the profile characterisation questions. A description of the charts follows.
Fig. 6. Summary of profile characterisation.
a) Experience: this chart shows that 46% of the subjects (17
out of 37) have more than three years of experience in testing, either in industry or academia; only 11% (4 out of
37) have less than one-year experience.
b) Testing Process: this chart shows that 65% of the subjects
(24 out of 37) work (or have worked) in a company that
has a testing process officially implemented (i.e. an explicit
testing process). From the remaining subjects, 22% (8 out
of 37) do not (or have not) worked in a company with an
explicit testing process, while around 14% (5 out of 37)
have not answered this question.
c) Certification: this chart shows that 59% of the subjects (22
out of 37) work (or have worked) in a company that has
been certified with respect to a software process maturity
model (e.g. CMMI, MR-MPS). The remaining subjects
have never worked in a certified company (24%) or have
not answered this question (16%).
d) Type of Certification: from the subjects that reported to
work (or have worked) in a certified company – chart (c)
of Figure 6 –, half of them (i.e. 11 subjects) are (or were)
in a CMMI-certified company, while the remaining are (or
were) in a MR-MPS-certified company.
e) TMMi: this chart reveals that only 8% of the subjects (3
out of 37) have had any practical experience with TMMi.
Besides this, 59% of subjects (22 out of 37) have stated to
have only theoretical knowledge of TMMi, whereas 32%
(12 out of 37) do not know this reference model.
Based on the results depicted in Figure 6, we concluded that
the sample is relevant with respect to the goals established for
this work. This conclusion relies on the fact that, amongst the
37 subjects who have fully answered the questionnaire, (i) 89%
of them have good to high knowledge of software testing
(i.e. more than one-year experience); (ii) 65% work (or have
worked) in companies that officially have a software testing
process; (iii) 59% work (or have worked) in a CMMI- or
MR-MPS-certified company; and (iv) 67% are knowledgeable
of TMMi, at least in theory. For CMMI-certified companies,
the maturity levels vary from 2 to 5 (i.e. from Managed to
Optimising). For MR-MPS-certified companies, the maturity
levels range from G to E (i.e. from Partially Managed to
Partially Defined).
To analyse the results regarding the level of importance of
TMMi practices according to the subjects’ personal opinion,
we defined three different profiles as follows:
• Profile-Specialist: composed of 12 subjects who have at
least three years of experience with software testing and
work (or have worked) in a company that has a formally
implemented software testing process.
• Profile-MR-MPS: composed of 20 subjects that are
knowledgeable of MR-MPS and use this reference model
in practice.
• Profile-TMMi: composed of 25 subjects that are knowledgeable of TMMi.
The choice for an MPS.BR-related profile was motivated by the close relationship between the reference model and the context of Brazilian software companies. Furthermore, these
three specific profiles were defined because we believe the
associated subjects’ tacit knowledge is very representative.
Note that the opinion of experts in CMMI was not overlooked at all; instead, such experts' opinions are spread across the analysed profiles. Finally, we also considered the answers of
all subjects, in a group named Complete Set.
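The profiles above amount to simple filters over the characterisation answers, as in the sketch below; the column names and the DataFrame layout are assumptions made for illustration only.

# Sketch of the profile definitions as filters over the characterisation
# answers. Column names ("years_experience", "has_testing_process",
# "uses_mr_mps", "knows_tmmi") are assumptions made for this example.
import pandas as pd

def build_profiles(subjects: pd.DataFrame) -> dict[str, pd.DataFrame]:
    return {
        "Profile-Specialist": subjects[(subjects["years_experience"] >= 3)
                                       & subjects["has_testing_process"]],
        "Profile-MR-MPS": subjects[subjects["uses_mr_mps"]],
        "Profile-TMMi": subjects[subjects["knows_tmmi"]],
        "Complete Set": subjects,
    }

demo = pd.DataFrame({"years_experience": [5, 2, 4],
                     "has_testing_process": [True, True, False],
                     "uses_mr_mps": [True, False, True],
                     "knows_tmmi": [True, True, False]})
for name, group in build_profiles(demo).items():
    print(name, len(group))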
B. Characterising the Importance of TMMi Practices
As previously mentioned, the results described herein are
based on the three profiles (namely, Profile-Specialist,
Profile-MR-MPS and Profile-TMMi) as well as on the whole
survey sample. Within each profile, we identified which practices were mostly ranked as mandatory. The Venn diagram
depicted in Figure 7 includes all mandatory practices, according to each profile. The practices are represented by numbers
and are listed in the table shown together with the diagram.
In Figure 7, the practices with grey background are also
present in the set obtained solely from the statistical analysis
described in Section III-D. As the reader can notice, this
set of practices appears in the intersection of all profiles.
Furthermore, practices with bold labels (e.g. practices 5, 7,
22, 31 etc.) are present in the set aimed to compose a lean
testing process (this is discussed in detail in Section V). Next
we describe the results depicted in Figure 7.
Practices referenced in Figure 7 (ID – practice):
2 – Identify product risks
3 – Analyse product risks
4 – Identify items and features to be tested
5 – Define the test approach
7 – Define exit criteria
9 – Establish a top-level work breakdown structure
10 – Define test lifecycle
11 – Determine estimates for test effort and cost
12 – Establish the test schedule
13 – Plan for testing staffing
15 – Identify test project risks
16 – Establish the test plan
17 – Review test plan
19 – Obtain test plan commitments
20 – Elicit test environment needs
21 – Develop the test environment requirements
22 – Analyse the test environment requirements
23 – Identify non-functional product risks
25 – Identify non-functional features to be tested
26 – Define the non-functional test approach
27 – Define non-functional exit criteria
28 – Identify work products to be reviewed
29 – Define peer review criteria
30 – Identify and prioritise test conditions
31 – Identify and prioritise test cases
32 – Identify necessary specific test data
33 – Maintain horizontal traceability with requirements
38 – Develop and prioritise test procedures
41 – Develop test execution schedule
42 – Implement the test environment
45 – Perform test environment intake test
46 – Develop and prioritise non-functional test procedures
49 – Execute test cases
50 – Report test incidents
51 – Write test log
52 – Decide disposition of test incidents in configuration control board
53 – Perform appropriate action to close the test incident
54 – Track the status of test incidents
55 – Execute non-functional test cases
56 – Report non-functional test incidents
57 – Write test log (non-functional)
58 – Conduct peer reviews
60 – Analyse peer review data
63 – Monitor test commitments
66 – Conduct test progress reviews
67 – Conduct test progress milestone reviews
69 – Monitor defects
71 – Monitor exit criteria
72 – Monitor suspension and resumption criteria
73 – Conduct product quality reviews
74 – Conduct product quality milestone reviews
75 – Analyse issues
76 – Take corrective action
77 – Manage corrective action
79 – Perform test data management
80 – Co-ordinate the availability and usage of the test environments
81 – Report and manage test environment incidents

Fig. 7. Venn diagram that shows the intersections of results with respect to the practices ranked as mandatory by the majority of subjects within each profile.

• Complete Set: taking the full sample into account, 31 practices were assigned level 4 of importance (i.e. ranked
as mandatory) by most of the subjects. The majority of
them are also present in the other profile-specific sets,
as shown in Figure 7. The reduced set of practices to
compose a lean testing process includes these 31 items, and
is complemented with practices 5 and 7 (the justification is
presented in Section V).
• Profile-Specialist: 49 practices were ranked as mandatory
by most subjects within this profile. From these, 27
practices appear in the intersection with at least another set;
• Profile-MR-MPS: subjects of this profile ranked 33 practices as mandatory, of which 30 appear in intersections with the other profiles; only 3 practices are considered mandatory exclusively by subjects of this profile.
• Profile-TMMi: for those who know TMMi, 42 practices are mandatory, 41 of which appear in the intersections with other profiles. (A sketch of the underlying set intersections is shown after this list.)
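The sketch below illustrates the set intersections behind Figure 7, using a small, made-up subset of practice IDs rather than the actual survey results.

# Sketch of the intersection analysis behind Figure 7: each profile yields a
# set of practice IDs ranked as mandatory, and the diagram shows how those
# sets overlap. The IDs below are an illustrative subset only.
specialist = {2, 3, 4, 9, 10, 22}
mr_mps = {2, 3, 4, 22, 31}
tmmi = {2, 3, 4, 10, 22, 31}
complete_set = {2, 3, 4, 22}

core = specialist & mr_mps & tmmi & complete_set
print("in all profiles:", sorted(core))
print("only Profile-Specialist:", sorted(specialist - (mr_mps | tmmi | complete_set)))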
V. ANALYSIS AND DISCUSSION
Before the definition of the aimed reduced set of practices,
we analysed the results of the second questionnaire, which
has been designed to resolve some dependencies observed in
the initial dataset (i.e. based on the 37 analysed answers).
The dependencies have been identified by Höhn [9], who has
pointed out some practices that must be implemented before
the implementation of others. Based on the feedback of 14
subjects, all included in the initial sample, we were able to
resolve the observed dependencies, which are related to the
following practices: Analyse product risks, Define the test
approach, and Define exit criteria.
Regarding Analyse product risks, the subjects were asked if
this task should be done as part of the testing process. We got
12 positive answers, thus indicating this practice is relevant,
for example, to support the prioritisation of test cases. In fact,
the Analyse product risks practice was already present in the
reduced set of practices identified from the first part of the
survey. In spite of this, we wanted to make sure the subjects
had a clear comprehension that it should be performed as
part of the testing process.
The subjects were also asked whether a testing approach
could be considered fully defined when the product risks
were already analysed, and items and features to be tested
were already defined. This question was motivated by the fact
that Define the test approach (practice 5 in Figure 7) was
not present in the reduced set of practices derived from the
initial questionnaire. For this question, we received 10 negative
answers; that is, one cannot consider the testing approach fully
defined only by analysing product risks and defining items and
features to be tested. Therefore, we included practice 5 in the
final set, thus resolving a dependency reported by Höhn [9].
The third question of the second questionnaire addressed the
Define exit criteria practice (#7 in Figure 7), since it was not
identified as mandatory after the first data analysis. Subjects
were asked whether it is possible to run a test process without
explicit exit criteria (i.e. information about when test should
stop). Based on 9 negative answers (i.e. 65%), this practice
was also included in the reduced set.
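The decision rule applied to these three follow-up questions can be sketched as a simple majority tally over the 14 answers; the counts below come from the text (12 positive answers; 10 negative; 9 negative), while the helper code is an illustrative assumption.

# Sketch of the decision rule for the second questionnaire: each yes/no
# question is settled by the majority of the 14 answers. Yes-vote counts are
# derived from the text (12 yes; 10 no -> 4 yes; 9 no -> 5 yes).
RESPONDENTS = 14

def majority(count_in_favour: int, total: int = RESPONDENTS) -> bool:
    return count_in_favour > total / 2

follow_up = {
    "Analyse product risks belongs in the testing process": 12,
    "Risk analysis + items/features fully define the test approach": 4,
    "A testing process can run without explicit exit criteria": 5,
}
for question, yes_votes in follow_up.items():
    print(question, "->", "yes" if majority(yes_votes) else "no")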
This second analysis helped us to either clarify or resolve
the aforementioned dependencies amongst TMMi practices.
In the next sections we analyse and discuss the survey results.
For this, we adapted Höhn’s mind map [9] (Figures 8–12),
according to each phase of a generic testing process. Practices
highlighted in grey are identified as mandatory and should be
implemented in any testing process.
A. Planning
Planning the testing activity is definitely one of the most
important process phases. It comprises the definition of how
testing will be performed and what will be tested; it enables
proper activity monitoring, control and measurement. The
derived test plan includes details of the schedule, team, items
to be tested, and the approach to be applied [15]. In TMMi,
planning-related practices also comprise non-functional testing, definition of the test environment and peer reviews. In
total, 29 practices are related to planning (see Figure 8), spread
over the nine specific goals (labelled with SG in the figure).
To achieve these goals, the organisation must fulfil all the
practices shown in Figure 8. Despite this, our results show that
only 8 out of these 29 practices are mandatory, according to the
Complete Set subject group. According to Höhn’s analysis,
TMMi has internal dependencies amongst practices, some
related to the Planning phase. Therefore, 2 other practices are necessary to resolve such dependencies (this is discussed in the sequence). Thus, the final set of 10 mandatory practices for the Planning phase is shown in grey background in Figure 8.

Fig. 8. TMMi practices related to Planning.
Amongst these practices, Identify product risks and Analyse
product risks demonstrate the relevance of evaluating product
risks. Their output plays a key role in the testing approach
definition and test case prioritisation. The product risks consist
of a list of potential problems that should be considered while
defining the test plan. Figure 7 shows that these two practices
were mostly ranked as mandatory considering all profiles.
According to the IEEE-829 Standard for Software and
System Test Documentation [15], a test plan shall include:
a list of what will be and will not be tested; the approach
to be used; the schedule; the testing team; test classes and
conditions; exit criteria etc. In our survey, Identify items and
features to be tested, Establish the test schedule and Plan for
test staffing practices were mostly ranked as mandatory. They
are directly related to Establish the test plan, and address the
definition of most of the items listed in the IEEE-829 Standard.
This is complemented by the Define exit criteria practice, selected after the dependency resolution. This evinces the coherence of the survey subjects' choices of mandatory practices with respect to the Planning phase.
The Planning phase also includes practices that address the definition of the test environment. In regard to this, Elicit test environment needs and Analyse the test environment requirements are ranked as mandatory and are clearly inter-related.
To conclude this analysis regarding the Planning phase, note
that not all TMMi specific goals are achieved only with the
execution of this selection of mandatory practices. Despite this,
the selected practices are able to yield a feasible test plan and
make the process clear, managed and measurable.
After Planning, the next phase is related to Test Case
Design. The input to this phase is the test plan, which includes
some essential definitions such as risk analysis, the items
which will be tested and the adopted approach.
B. Test Case Design
Figure 9 summarises the results of our survey for this phase,
based on the set of TMMi practices identified by Höhn [9]. As
the reader can notice, only two practices were mostly ranked
as mandatory by the Complete Set group of subjects: Identify
and prioritise test cases and Identify necessary specific test
data (both shown in grey background in Figure 9).
Fig. 9. TMMi practices related to Test Case Design.

According to the IEEE-829 Standard, the test plan encompasses some items related to test case design, such as the definition of test classes and conditions [15]. Due to this, it is likely that part of the subjects consider that the test plan itself already fulfils the needs regarding test case design; thus, most of the practices are not really necessary. For instance, if we considered solely the Profile-MR-MPS, none of the practices within this phase would appear in the results (see Figure 7 to crosscheck this finding). On the other hand, subjects of the other profiles consider that some other practices of this phase should be explicitly performed in a testing process. For instance, subjects of the Profile-Specialist profile ranked Identify and prioritise test conditions, Identify necessary specific test data and Maintain horizontal traceability with requirements as mandatory. For the Profile-TMMi subjects, Identify and prioritise test cases and Maintain horizontal traceability with requirements should be mandatory.

From these results, we can conclude that there is uncertainty about what should indeed be done during the test case design phase. Moreover, this uncertainty may also indicate that test cases are not always documented separately from the test plan; the plan itself includes the testing approach (and its underlying conditions) and the exit criteria. Thus, the two selected practices for this phase complement the needs to compose a feasible, streamlined testing process.

C. Setup of Test Environment and Data

As discussed in Section V-A, in the Planning phase test environment requirements are identified and described. The Setup of Test Environment and Data phase addresses the prioritisation and implementation of such requirements. Figure 10 shows the TMMi specific goals and practices for this phase.

Fig. 10. TMMi practices related to Setup of Test Environment and Data.
According to TMMi, Develop and prioritise test procedures consists in determining the order in which test cases will be executed. Such order is defined in accordance with the product risks. The classification of this practice as mandatory is aligned with the practices selected for the Planning phase, some of which are related to risk analysis. Another practice ranked as mandatory
is Develop test execution schedule, which is directly related to
the prioritisation of test case execution. The other two practices
(i.e. Implement the test environment and Perform test environment intake test) address the environment implementation
and ensuring it is operational, respectively. The conclusion
regarding this phase is that the four practices are sufficient to
create an adequate environment to run the tests.
D. Execution and Evaluation
The next phase of a generic testing process consists of
test case execution and evaluation. At this point, the team
runs the tests and, when necessary, creates the defect reports. The
evaluation aims to assure the test goals were achieved and
to inform the results to the stakeholders [8]. For this phase,
Höhn [9] identified 13 TMMi practices, which are related to
test execution goals, management of incidents, non-functional
test execution and peer reviews. This can be seen in Figure 11.
As the reader can notice, only four practices were not ranked
as mandatory. This makes evident the relevance of this phase,
since it encompasses the activities which are related to test
execution and management of incidents.
Fig. 11. TMMi practices related to Execution and Evaluation.

The results summarised in Figure 11 include practices that regard the execution of non-functional tests. However, in the Planning and Test Case Design phases, the selected practices do not address the definition of such type of tests. Although this sounds incoherent, it may indicate that, from the planning and design viewpoints, there is not a clear separation between functional and non-functional testing. The separation is a characteristic of the TMMi structure, but for the testing community these two types of testing are performed in conjunction, since the associated practices as described in TMMi are very similar in both cases.

E. Monitoring and Control

The execution of the four phases of a generic testing process yields a substantial amount of information. Such information needs to be organised and consolidated to enable rapid status checking and, if necessary, corrective actions. This is addressed during the Monitoring and Control phase [7]. Figure 12 depicts the TMMi practices with respect to this phase. Again, the practices ranked as mandatory by most of the subjects are highlighted in grey. Note that there is consensus amongst all profile groups (i.e. Profile-Specialist, Profile-MR-MPS, Profile-TMMi and the Complete Set) about what is mandatory regarding Monitoring and Control. This can be crosschecked in Figure 7.

Fig. 12. TMMi practices related to Monitoring and Control.
Performing the Conduct test progress reviews and Conduct product quality reviews practices means keeping track
of both the testing process status and the product quality,
respectively. Monitor defects addresses gathering metrics that
concern incidents (also referred to as issues), while Analyse
issues, Take corrective action and Manage corrective action
are clearly inter-related practices. The two other practices
considered mandatory within this phase are Co-ordinate the
availability and usage of the test environments and Report
and manage test environment incidents. Both are important
since either unavailability or incidents in the test environment
may compromise the activity as a whole.
As a final note with respect to the survey results, we emphasise that the subjects were not provided with any information
about dependencies amongst TMMi practices. Besides this, we
were aware that the inclusion of practices not mostly ranked as
mandatory might have created new broken dependencies.
Despite this, the analysis of the final set of mandatory practices
shows that all dependencies are resolved.
VI. VALIDITY THREATS
This section describes some issues that may threaten the
validity of our results. Despite this, the study limitations did
not prevent the achievement of significant results with respect
to software testing process definition, based on the opinion of
software testing professionals.
A first limitation concerns the questionnaire design. The
questions were based on the TMMi structure, as were the
help notes provided together with the questions. Even though
the intent of the help notes was facilitating the subjects’
understanding regarding the questions, they might not have
been enough to allow for correct comprehension. Although the TMMi structure is very detailed, aiming to facilitate its implementation, this structure can become confusing for readers, who may not comprehend the difference between some
activities. For instance, in this survey it was clear that the
practices related to functional and non-functional testing were
not understood as distinct activities, since they were ranked as
mandatory only in the Execution and Evaluation phase.
Another threat regards the scale of values used in the first
questionnaire. The answer scale was composed of four values.
This represented a limitation for the statistical analysis, since
the responses were mostly concentrated in values 3 and 4. If
a wider scale were used, e.g. from 1 to 10, this could have
yielded a better distribution of answers, thus enabling us to
apply a more adequate interpretation model.
The sample size was also a limitation of the study. In practice, although the sample includes only software testing professionals, its size is small compared to the real population. Perhaps the way the call for participation was announced and the time it remained available limited the sample.
VII. CONCLUSIONS AND FUTURE WORK
This paper described a survey that was conducted in two
stages and investigated whether there is a subset of TMMi
practices that can be considered essential for a generic testing
process. The survey was applied amongst professionals who
work with software testing. The analysis led us to conclude
that, from the set of 81 TMMi practices distributed by Höhn
[9] across the phases of a generic testing process, 33 are
considered essential for maintaining consistency when such a
process is defined. This represents a reduction of around 60%
in the number of TMMi practices. Note that the other TMMi
practices are not disposable; however, when the goal is to
implement a streamlined process, or even when the company
does not have the necessary know-how to implement its own
testing process, it can use this reduced set of practices to do so.
Thus, the results reported in this paper represent a simplified
way to create or improve testing processes, which is based on
a recognised reference model.
The practices highlighted in Figures 8–12 can also indicate
the priority of implementation for a company that is using
TMMi as a reference for its testing process. This model does
not indicate what can be implemented first, or the possible
dependencies amongst the process areas. Nonetheless, the
results of this study point out a set of activities that can be
implemented as a priority. At a later stage, the company may
decide to continue to deploy the remaining practices required
by the model in order to obtain the TMMi certification.
TMMi is fine-grained in terms of practices and their distribution across the specific goals and process areas. Even though this may ease the implementation of practices, it also makes the model complex and difficult to understand. When a company decides to build a testing process based on a reference model, this process must be in accordance with its reality. Not all TMMi practices are feasible for all sizes of companies and teams. Thus, it is important to be aware of a basic set of practices that, if not performed, may compromise the quality of the process, and hence the quality of the product under test. In this context, we hope the results of this work can support small and medium companies that wish to implement a new testing process, or even improve their current processes.
ACKNOWLEDGEMENTS
We thank CAPES and CNPq for their financial support.
REFERENCES
[1] P. Cao, Z. Dong, and K. Liu, “An Optimal Release Policy for
Software Testing Process,” in 29th Chinese Control Conference,
2010, pp. 6037–6042.
[2] A. Rodrigues, P. R. Pinheiro, and A. Albuquerque, “The definition of a testing process to small-sized companies: The Brazilian scenario,” in QUATIC’10. IEEE, 2010, pp. 298–303.
[3] J. Andersin, “TPI- a model for test process improvement,”
University of Helsinki, Helsinki - Finland, Seminar, 2004.
[4] TMMi Foundation, “Test Maturity Model integration (TMMi) (Version 3.1),” pp. 1–181, 2010.
[5] SEI, “Capability Maturity Model Integration Version 1.2.
(CMMI-SE/SW, V1.2 – Continuous Representation),” Carnegie
Mellon University, Tech. Report CMU/SEI-2006-TR-001, 2006.
[6] C. B. Purper, “Transcribing Process Model Standards into Meta-Processes,” in EWSPT’00. London, UK: Springer-Verlag, 2000, pp. 55–68.
[7] A. N. Crespo, M. Jino, M. Argollo, P. M. S. Bueno, and
C. P. Barros, “Generic process model for software testing,” Online, 2010, http://www.softwarepublico.gov.br/5cqualibr/xowiki/
Teste-item13 - accessed on 16/04/2013 (in Portuguese).
[8] A. M. J. Hass, “Testing processes,” in ICSTW’08. IEEE, 2008,
pp. 321–327.
[9] E. N. Höhn, “KITest: A framework of knowledge and improvement of testing process,” Ph.D. dissertation, University of São
Paulo, São Carlos, SP - Brazil, Jun. 2011, (in Portuguese).
[10] “LimeSurvey,” Apr. 2011. [Online]. Available: http://www.limesurvey.org/
[11] Softex, Improvement of Brazilian Software Process - General Guide (in Portuguese), Online, Softex - Association for Promoting Excellence in Brazilian Software, 2011.
[12] E. Whitley and J. Ball, “Statistics review 6: Nonparametric
methods,” Critical Care, vol. 6, no. 6, p. 509, Sep. 2002.
[13] J. Miller, “Statistical significance testing–a panacea for software
technology experiments?” Journal of Systems and Software,
vol. 73, no. 2, pp. 183–192, 2004.
[14] V. Basili and R. W. Reiter, “A controlled experiment quantitatively comparing software development approaches,” IEEE
Trans. Soft. Engineering, vol. SE-7, no. 3, pp. 299–320, 1981.
[15] “IEEE standard for software and system test documentation,”
IEEE Std 829-2008, pp. 1 –118, 2008.
On the Relationship between Features Granularity
and Non-conformities in Software Product Lines:
An Exploratory Study
Iuri Santos Souza1,2 , Rosemeire Fiaccone1 , Raphael Pereira de Oliveira1,2 , Eduardo Santana de Almeida1,2,3
1
Federal University of Bahia (UFBA), Salvador, BA, Brazil
Reuse in Software Engineering (RiSE), Recife, PE, Brazil
3
Fraunhofer Project Center (FPC) for Software and Systems Engineering, Brazil
Email: {iurisin,esa,raphaeloliveira}@dcc.ufba.br, r [email protected]
2
Abstract—Within Software Product Lines (SPL), features are well understood and facilitate the communication among SPL developers and domain experts. However, the feature specification task is usually based on natural language, which can lack clarity and contain non-conformities and defects. In order to understand feature non-conformity in SPL, this paper presents an empirical study investigating the possible correlation between feature granularity and feature non-conformity, based on an industrial SPL project in the medical domain. The investigation aims at exploring the feature non-conformities and their likely root causes using results from a previous study, which captured and classified 137 feature non-conformities identified in 92 features. The findings indicate that there is a significant association between the variables feature interaction and feature granularity. In predictive models built to estimate feature non-conformities from feature granularity and feature interaction values, the variable feature interaction presented a positive influence on feature non-conformity, whereas the variable feature granularity presented a negative influence.
Keywords—Software Product Lines; Feature Non-Conformity;
Features Granularity; Exploratory Study
I. INTRODUCTION
In Software Product Lines (SPL), mass customization is a crucial aspect, and different products can be tailored to cover distinct customers by selecting a particular set or subset of features [1]. A feature can be defined as “a prominent user-visible aspect, quality, or characteristic of a software system or systems” [2]. The feature concept has been successfully applied in product portfolio definition, domain analysis, and product derivation in the context of product lines [3].
Feature granularity is defined as the level or degree of extension in the source code necessary to implement a given feature [4]. Kästner et al. classified the granularity of features into coarse granularity, which represents code extensions such as the addition of new classes or methods to the source code or the addition of source code at explicit extension points; and fine granularity, which represents code extensions such as the addition of new statements into existing methods, expressions or even method signatures [4]. They also discussed the effects of feature granularity on different SPL development approaches and identified challenges in handling feature granularity, mainly when creating an SPL by decomposing a legacy application.
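To make the coarse/fine distinction concrete, the sketch below contrasts a coarse-grained extension (a feature realized by adding a whole new function) with a fine-grained one (a single guarded statement inserted into an existing method). The feature names and the flag are hypothetical and only illustrate the definitions above; they do not come from the studied project.

    # Hypothetical illustration of feature granularity, in the spirit of Kästner et al. [4].
    AUDIT_FEATURE_ENABLED = True  # illustrative feature flag, not from the studied SPL

    # Coarse-grained extension: the feature is realized by adding a new function
    # (or class) next to the existing code base.
    def export_audit_report(entries):
        return "\n".join(str(entry) for entry in entries)

    # Fine-grained extension: the feature adds a single statement inside an
    # existing method, interleaved with code belonging to other features.
    def save_record(record, storage, audit_log):
        if AUDIT_FEATURE_ENABLED:      # fine-grained: one guarded statement
            audit_log.append(record)
        storage.append(record)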
During the SPL Scoping phase, features are specified to define the capabilities of a product line [5]. The feature specification document is composed of feature information, such as feature type (mandatory or optional), feature granularity (coarse or fine-grained), feature priority, binding time, parent feature, required feature, excluded feature, and so on [2]. The feature specification task is usually performed in natural language, which can lack clarity and contain non-conformities and defects. Consequently, scoping analysts can introduce ambiguity, inconsistency, and non-conformities. A feature non-conformity is an undesirable occurrence identified in the feature specification, meaning the absence of compliance with the required quality attributes of a feature specification document [6]. In the SPL context, quality assurance techniques, such as inspections and testing, have a fundamental role [7][8], since the assets developed can be reused in several products. Nevertheless, the literature has shown that this aspect has been poorly investigated [7][8][9], mainly with respect to data analysis after performing quality assurance techniques.
In this context, this paper presents an exploratory study to investigate the influence of feature granularity on feature non-conformity information (amount, types, and occurrences). The study object (dataset) is based on an industrial SPL project in the health information systems domain. We believe that the findings and insights discussed in this work can be useful to SPL researchers and practitioners in understanding which software project variables can influence feature non-conformities.
The remainder of this paper is organized as follows: Section II presents the related work. Section III details the context and the design of the empirical study carried out in this work. Section IV presents the analysis and the results of this work, grouped by research questions. Section V discusses the main findings of the study. Section VI discusses the threats to validity and, finally, Section VII describes the conclusions and future directions.
II. RELATED WORK
This section presents four similar studies to our proposal,
in terms of exploring the effects of feature granularity in the
SPL context and data analysis from software quality assurance
techniques.
Murphy et al. presented an exploratory study that characterized the effects of applying mechanisms to separation of
concerns (or features) in codebases of object-oriented systems
(within methods and among classes) from two perspectives
[10]. In the first perspective, they observed the effect of
mechanisms on the structure of codebases. In the second
perspective, they characterized the restructuring process required to perform the separation. The study applied three
different separation of concern mechanisms: Hyper/J, a tool
that supports the concept of hyperspaces [11], AspectJ, a
tool that supports the concept of aspect-oriented programming
[12], and a lightweight lexically-based approach [13], which
considers that the separation is possible without advanced tool support. The study concluded that manual restructuring is time-consuming and error-prone, so automated support would ease the problems of restructuring codebases and separating concerns. Moreover, Murphy et al. argued that the exploratory study provides a guideline to help practitioners choose an appropriate target structure, prepare their codebase for separation of concerns, and perform the necessary restructurings.
Kästner et al. explored the effects of feature granularity (coarse and fine-grained) in two types of SPL development approaches (compositional and annotative) [4]. The study identified that compositional approaches do not support fine-grained extensions, and workarounds are required, which raises the implementation complexity. On the other hand, annotative approaches can implement fine-grained extensions but introduce readability problems by obfuscating the source code. Thus, they concluded that compositional and annotative approaches are not able to implement fine-grained extensions satisfactorily and analyzed possible solutions that allow implementing SPLs without sacrificing understandability. Furthermore, the work presents a tool (Colored Integrated Development Environment - CIDE) that intends to avoid the cited problems by developing SPLs with fine-grained extensions to implement features. Based on two case studies performed to evaluate the tool, Kästner et al. argue that CIDE allows implementing fine-grained features, including statement and expression extensions and even signature changes, without workarounds.
In [14], Kalinowski et al. described the concepts incorporated into an evolved approach for software process improvements based on defect data, called Defect PreventionBased Process Improvement (DPPI). DPPI approach was assembled based on Defect Causal Analysis (DCA) guidance
[15] obtained from a Systematic Review in the DCA area and
feedback gathered from experts in the field [16]. DPPI provides a framework for conducting, measuring and controlling
DCA in order to use it efficiently for process improvement.
This approach integrates cause-effect learning mechanisms
into DCA meetings. The learning mechanisms consider the
characteristics of the product and the defects introduced in its
artifacts to enable the construction of a causal model for the
organization using a Bayesian network. Kalinowski et al. argue that the possibility of using the resulting Bayesian network for defect prediction can support the definition of risk mitigation strategies by performing “what-if” scenario simulation.
In the work that investigated the relationship between inspection and evolution within SPL [17], we presented an empirical study searching for evidence relating feature non-conformity information to corrective maintenance data. The study sample was analyzed using statistical techniques, such as the Spearman rank correlation and Poisson regression models. The findings indicated that there is a significant positive correlation between feature non-conformities and corrective maintenance. Also, sub-domains with a high number of feature non-conformities had a higher number of corrective maintenance actions, and sub-domains qualified as high risk also had a positive correlation with corrective maintenance. This correlation allowed us to build predictive models to estimate corrective maintenance based on the sub-domain risk attribute values.
In a previous work [6], we performed an empirical study
investigating the effects of applying an inspection approach in
feature specification. Our data was gathered from an industrial
SPL project. The study sample was analyzed using statistical
and economical techniques, such as (i) Pareto’s principle,
which showed that incompleteness and ambiguity reported
higher non-conformity occurrences, (ii) Spearman correlation
rank, which showed that sub-domain risk information can
be a good indicator for prioritization of sub-domains in the
inspection activity, and (iii) Poisson regression models, which enabled us to build a predictive model for estimating non-conformities in feature specifications using the risk attribute. Besides, the analysis identified that optional features presented a higher non-conformity density than mandatory features.
Although our two previous works used feature non-conformity data and the same industrial SPL project, the research in this paper aims at investigating the feature non-conformity influences and their likely root causes from a different perspective, using feature granularity information.
To the best of our knowledge, there is no work exploring or investigating empirical data and evidence from inspection results and feature granularity information in the SPL context. Thus, the main contribution of this work is to analyze, simultaneously, software inspection data gathered from a previous empirical study and feature granularity information in an industrial SPL project.
III. THE STUDY
A. Background
The SPL industrial project is being conducted in partnership with a company that has developed information systems in the medical domain for almost twenty years. The company has more than 50 customers and a total of 51 staff members, distributed across different areas. It has four main products, comprising a total of 42 modules, which are responsible for specific functions in different sub-domains (e.g., financial, inventory control, nutritional control, home care, nursing and medical assistance, and so on). Each of the four products, with their sub-domains, is described below:
• SmartDoctor: a web-based product composed of 11 sub-domains. Its goal is to manage the tasks and routines of a doctor's office.
• SmartClin: a desktop-based product composed of 28 sub-domains. It performs clinical management support activities (e.g., medical exams, diagnostics and so on).
• SmartLab: a desktop-based product composed of 28 sub-domains. It integrates a set of features to manage clinical pathology labs.
• SmartHealth: a desktop-based product composed of 35 sub-domains. It manages the whole area of a hospital, from financial to patient issues.
Some sub-domains are common (present in all products), others are variable (present in two or more products) and some are specific (present in just one product). Market trends, technical constraints and competitiveness motivated the company to migrate its products from single-system development to an SPL approach.
In the Scoping phase [18], based on the list of products previously developed by the company, the scoping analysts identified the domains with the best market potential and selected the products, sub-domains and features for the product line. After that, the scoping analysts collected feature information (e.g., feature type, feature hierarchy, and feature granularity). The feature granularity definition is associated with the object-oriented paradigm. As output of the Scoping phase, some documents were developed: the product map document, composed of the products and their respective features; and the feature specification documents, organized by sub-domain. The features were implemented in PowerBuilder 1, an object-oriented language.
The assets (e.g., product map and feature specifications) built in the Scoping phase [19] of the SPL project were reviewed by inspection activities, which were performed in order to assess the quality of the artifacts. For example, the feature specification documents from the first two iterations of the project (9 sub-domains with 92 features) were inspected and, as a result, we found 137 feature non-conformities, which were fixed by the scoping analysts.
The non-conformities were classified into nine types, as defined by van Lamsweerde [20], who proposes a classification of non-conformities based on requirements specification. As the literature does not present a similar classification related to feature specification, we believe that this more general classification can be used in this context. The types are described next:
• Incompleteness or omission: absence or omission of information necessary to specify the domain and sub-domain features, e.g. the feature document template contains incomplete or partially complete items and entries.
• Ambiguity: presence of a specification item that allows more than one interpretation or understanding, e.g. an ambiguous term or statement.
• Incorrectness or inadequacy: presence of a specification item that is incorrectly or inappropriately described, e.g. a feature specification that does not justify the feature's presence in products of the SPL.
• Inconsistency or contradiction: a situation in which a specified feature contains constraints, priority or composition rules that are in conflict with other features and/or work products, such as requirements or use case documents.
• Non-traceability or opacity: presence of features that do not specify, or wrongly specify, their identifier or interaction with other features, e.g. a feature that does not have a unique identifier, or a child feature that does not specify its respective parent feature.
• Incomprehensibility or unintelligibility: a situation in which a specification item is stated in such a way that it is incomprehensible to the target stakeholders.
• Non-organization or poor structuring: when the specified features do not facilitate reading and understanding and do not clearly state their relationships, e.g. a feature does not specify the name of the respective sub-domain, or the feature specification document is not organized by sub-domain.
• Unnecessary information or over-specification: when the specification provides more details than required, e.g. when it brings information regarding later phases of the development cycle of the product line (anticipating decisions).
• Business rule: a situation where the definition of the domain business rules is incorrectly specified.
1 PowerBuilder - http://goo.gl/flmo3
B. Empirical Study Definition
1) Data Collection: For the exploratory study, we applied interview and archival data methods [21][22] to collect data and information. The collected inspection data were treated as strictly confidential, in order to assure the anonymity of the company.
Archival data refers to documents from different development phases, organizational charts, financial records, and previously collected measurements in an organization [21]. For this study, we used archival data to collect feature non-conformity information, gathered from the software inspection activity, and feature information (feature type, feature hierarchy, feature interaction and feature granularity) related to all the features in the dataset.
2) Analysis Procedure: This phase comprises the qualitative and quantitative analysis of the collected data. We performed quantitative data analysis based on descriptive statistics, correlation analysis [23], and the development of predictive models [24]. The objective of using qualitative analysis is to draw conclusions, based on the collected data, that may lead us to a clear chain of evidence [25]. Moreover, the relevant data from documents, assets and extracted statements, as well as the observations, were grouped and stored in the study database in order to optimize the exploration of the sources of evidence in this study [25].
IV. ANALYSIS AND RESULTS OF THE STUDY
In this section, the analysis and the results are grouped by research questions. The investigation of this study was guided by four research questions. The data necessary to answer the stated research questions were organized by sub-domain, considering the number of features, feature non-conformities, feature granularity, feature type, feature hierarchical profile, and feature interaction, as shown in Table I.
In order to better understand and answer the research questions, some statistical techniques were used. The first research question is related to the distribution of non-conformities per feature granularity, interaction, hierarchy and type. To answer this research question, we performed a descriptive analysis over the box plots.
The second question used a statistical test to investigate the presence of an association between feature information (feature type, feature hierarchy profile and feature interaction profile) and feature granularity. The Pearson chi-square test (or Fisher exact test) [26] is the most widely used method to detect association between categorical variables based on the frequencies in two-way tables (also known as contingency tables). These tables provide a foundation for statistical inference, where statistical tests check the relationship between the variables in the observed dataset. Once an association was detected, the odds ratio was calculated as a measure of association (effect size) between two binary categorical variables.
To answer the third and fourth questions, a regression model [27] was fitted: a truncated Poisson regression model [26] in which the dependent variable (outcome) is the number of non-conformities. This approach assumes that the logarithm of its expected value (the average number of non-conformities) can be modeled by a linear combination of unknown parameters. The linear combination can be composed of one or more independent variables. Modeling count variables is a common task in several sciences, such as health, microeconometrics, and social and political sciences. The classical Poisson regression model [27] for count data is often of limited use in these disciplines because empirical count datasets typically exhibit over-dispersion and/or an excess number of zeros, or structurally exclude zero counts (a left-truncated count component). Examples of truncated counts include the number of bus trips made per week in surveys taken on buses, the number of shopping trips made by individuals sampled at a mall, and the number of unemployment spells among a pool of unemployed people. The most common form of truncation in count models is left truncation at zero: the observation apparatus is activated only by the occurrence of an event. Equation 1 represents the probability distribution for this special case of the Poisson without zeros:

P(Yi = yi | xi, Yi > 0) = e^(−λi) λi^yi / (yi! (1 − e^(−λi))),   yi = 1, 2, . . .   (1)

Let Y1, . . . , YNobs be a random sample from the zero-truncated Poisson distribution with parameter λi, i = 1, . . . , Nobs. Considering the regression model [27], Equation 2 gives

log(λi) = β^T xi   (2)

where β = (β0, β1, . . . , βp)^T and xi is the vector of covariate values for subject i, that is, xi = (1, xi1, . . . , xip)^T. Furthermore, the above model can be fitted by maximizing the likelihood.
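As a minimal sketch of how Equations 1 and 2 can be fitted, the code below maximizes the zero-truncated Poisson log-likelihood with scipy; the count vector and the binary covariate are hypothetical placeholders, not the paper's dataset (the authors used Stata), and the coding of the covariate is an assumption.

    # Sketch: maximum-likelihood fit of a zero-truncated Poisson regression
    # (Equations 1 and 2). Data arrays are hypothetical placeholders.
    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import gammaln

    def neg_log_likelihood(beta, X, y):
        eta = X @ beta                 # log(lambda_i) = x_i . beta (Equation 2)
        lam = np.exp(eta)
        # zero-truncated Poisson log-likelihood (Equation 1)
        ll = y * eta - lam - np.log1p(-np.exp(-lam)) - gammaln(y + 1)
        return -ll.sum()

    y = np.array([1, 2, 1, 3, 1, 2, 4, 1, 2, 1])       # counts, all >= 1
    x = np.array([0, 1, 0, 1, 1, 0, 1, 0, 1, 0])       # e.g. 1 = coarse, 0 = fine (assumed coding)
    X = np.column_stack([np.ones_like(x), x])          # intercept + covariate

    fit = minimize(neg_log_likelihood, x0=np.zeros(2), args=(X, y), method="BFGS")
    print(fit.x)                                        # [intercept, coefficient] estimates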
A. RQ1: What is the distribution of feature non-conformities
per feature granularity, feature interaction, feature type and
feature hierarchy?
In order to answer this question, samples for each variable, according to Table I, are shown in box plots and then analyzed. To compare the two samples in each box plot (feature granularity, interaction, type and hierarchy), the medians, the interquartile ranges (the box lengths), the overall spreads (distances between adjacent values) and the skewness were analyzed in order to draw some possible conclusions.
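A small, purely illustrative sketch of this kind of descriptive comparison is given below; the data frame and its column names are hypothetical and do not reproduce the study's dataset.

    # Hypothetical sketch of the RQ1 descriptive analysis: box plots of
    # non-conformity counts grouped by a feature attribute.
    import pandas as pd
    import matplotlib.pyplot as plt

    features = pd.DataFrame({
        "granularity": ["coarse", "coarse", "fine", "coarse", "fine", "fine"],
        "non_conformities": [2, 1, 1, 3, 0, 1],
    })

    features.boxplot(column="non_conformities", by="granularity")
    plt.suptitle("")                                   # drop pandas' automatic super-title
    plt.title("Feature non-conformities per granularity")
    plt.ylabel("Number of non-conformities")
    plt.show()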
According to the feature non-conformities per granularity box plot (Figure 1), the medians are separated, meaning that there is a significant difference between the medians of the two boxes, with the median for coarse-grained features being higher. The length of the fine-grained feature box is slightly bigger than the coarse-grained one. The overall spreads are different, being smaller for coarse-grained features. The box plot for coarse-grained features shows a slight upper skew: the upper whisker is longer than the lower. The box plot for fine-grained features also shows an upper skew. The median for fine-grained features is lower than the median for coarse-grained features, which indicates that the number of non-conformities varies according to the feature granularity. For the samples shown, coarse-grained features have more feature non-conformities than fine-grained ones.
Fig. 1. Boxplot - Feature non-conformities per feature granularity
TABLE I
DATASET SUMMARY BOARD BY SUB-DOMAIN

Sub-domain | Features | Granularity (Fine / Coarse) | Type (Mandatory / Optional) | Hierarchy* (yes / no) | Interaction** (yes / no) | Non-conformities
A          |   4      |  0 /  4                     |  4 / 0                      |  0 /  4               |  3 /  1                  |   4
B          |  22      |  1 / 21                     | 21 / 1                      |  5 / 17               | 15 /  7                  |  33
C          |  23      |  6 / 17                     | 22 / 1                      | 15 /  8               |  9 / 14                  |  31
D          |   8      |  1 /  7                     |  8 / 0                      |  0 /  8               |  6 /  2                  |  20
E          |   4      |  1 /  3                     |  4 / 0                      |  2 /  2               |  4 /  0                  |   8
F          |  11      |  1 / 10                     | 11 / 0                      |  1 / 10               | 11 /  0                  |  13
G          |   3      |  0 /  3                     |  3 / 0                      |  3 /  0               |  2 /  1                  |   3
H          |   8      |  2 /  6                     |  8 / 0                      |  0 /  8               |  3 /  5                  |   7
I          |   9      |  3 /  6                     |  5 / 4                      |  9 /  0               |  6 /  3                  |  18
Total      |  92      | 15 / 77                     | 86 / 6                      | 35 / 57               | 59 / 33                  | 137

* A feature with at least one hierarchy profile has the value "yes"; otherwise it has the value "no".
** A feature with at least one interaction profile has the value "yes"; otherwise it has the value "no".
For the feature non-conformities per feature interaction box plot (Figure 2), the medians are more separated, with the median for features with interaction being higher. The length of the box corresponding to the features without interaction is bigger than that of the box corresponding to the features with interaction. The overall spreads are different, being smaller for features with interaction. The box plot for features with interaction shows an upper skew: the upper whisker is longer than the lower. On the other hand, the box plot for features without interaction shows a lower skew: the lower whisker is longer than the upper. The median for features without interaction is close to the lower adjacent value for features with interaction, which indicates that the number of non-conformities varies according to the feature interactions. According to this box plot, the number of non-conformities for features with interaction is bigger than the number of non-conformities for features without interaction.
Analyzing the feature non-conformities per feature hierarchy box plot (Figure 3), the medians are the same. The lengths of the boxes are different: the box for features with hierarchy is much bigger than the box for features without hierarchy. The overall spreads are also different, being bigger for the features with hierarchy, even considering the two outliers from the features without hierarchy. The outliers in Figure 3 are related to features without hierarchy. The upper outlier represents the number of non-conformities for Domain D per feature without hierarchy in the same domain (20/8 = 2.5). The lower outlier in Figure 3 represents the number of non-conformities for Domain G per feature without hierarchy in the same domain (0), and also the number of non-conformities for Domain I per feature without hierarchy in the same domain (0). The box plot for features with hierarchy shows a lower skew: the lower whisker is longer than the upper. On the other hand, the box plot for features without hierarchy shows an upper skew: the upper whisker is longer than the lower one. The medians of both boxes are the same; however, disregarding the two outliers of the features without hierarchy, the number of non-conformities for features with hierarchy is bigger than the number of non-conformities for features without hierarchy.
Fig. 2. Boxplot - Feature non-conformities per feature interaction
Fig. 3. Boxplot - Feature non-conformities per feature hierarchy
Analyzing the feature non-conformities per feature type box plot (Figure 4), the medians are well separated, with the median for mandatory features being higher and very distant from the optional feature median. The mandatory feature box is bigger than the optional feature one. The overall spreads are also different, being bigger for the mandatory features, even considering the two outliers from the optional features. The outliers in Figure 4 are related to optional features. The upper outlier represents the number of non-conformities for Domain I per optional feature in the same domain (9/4 = 2.25). The lower outlier in Figure 4 represents the number of non-conformities for Domain B per optional feature in the same domain (1/1 = 1). The box plot for mandatory features has the same size for the lower and upper whiskers. For the optional feature box plot, almost all the non-conformity values were zero, collapsing the box into a line. The median for mandatory features is bigger than the upper quartile of the optional features, which leads to the conclusion that the number of non-conformities varies according to the feature type. The number of non-conformities for mandatory features is bigger than the number of non-conformities for optional features in these samples.
Fig. 4. Boxplot - Feature non-conformities per feature type
According to the box plots (feature granularity, interaction, hierarchy and type), it can be observed that the feature non-conformities are more spread over coarse-grained features, features with interactions, features with hierarchy and mandatory features. During the feature specification activity, this feature information should be considered in order to achieve the required quality [6].
B. RQ2: Is there any association between feature information
and feature granularity?
This question aims to investigate whether there is a significant association between feature information and feature granularity from some defined perspectives, which are described next:
• Feature hierarchy and feature granularity: The feature
hierarchy defines that the feature immediately above any
feature is its parent feature and the feature immediately
below a parent feature is its child feature in a feature
model [2]. This perspective investigates if there is a
correlation between hierarchical feature information and
feature granularity.
• Feature type and feature granularity: The feature
type specifies whether a feature is contained within
all products of the product line (mandatory feature) or
only in some products (optional feature). Furthermore,
when children features have the same parent and only
one of them can be chosen to a product configuration,
they are called alternative features [2]. This perspective
investigates if there is correlation between feature type
information and feature granularity.
• Feature interaction and feature granularity: The feature interaction specifies whether a feature has a dependency relationship with another one. Thus, (I) a feature can require the existence of another feature because they are interdependent (this second feature is a required feature), and (II) a feature can be mutually exclusive with another, so that they cannot coexist (excluded features) [2]. This perspective investigates the possible correlation between feature interaction and feature granularity (a small illustrative encoding of these attributes is sketched after this list).
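The sketch below shows one possible, purely hypothetical encoding of the three attributes above (type, hierarchy and interaction) for a feature; the names and values are illustrative and are not taken from the studied product line.

    # Hypothetical encoding of the feature attributes analysed in RQ2.
    feature_model = {
        "ExamScheduling": {
            "type": "mandatory",            # contained in all products
            "parent": None,                 # root-level feature: no hierarchy profile
            "requires": ["PatientRecord"],  # interaction: required feature
            "excludes": [],
        },
        "HomeCareBilling": {
            "type": "optional",             # present only in some products
            "parent": "Billing",            # child feature: hierarchy profile
            "requires": [],
            "excludes": ["SimplifiedBilling"],  # interaction: mutually exclusive feature
        },
    }

    def has_interaction_profile(name):
        feature = feature_model[name]
        return bool(feature["requires"] or feature["excludes"])

    print(has_interaction_profile("HomeCareBilling"))   # True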
To better understand and answer the issues from these
perspectives, this question was split into three sub-questions.
1) RQ2.1: Is there any association between feature type
and feature granularity?: The investigated feature sample is
composed of 6 optional features and 86 mandatory features.
Analyzing the relationship between the feature type and the
feature granularity (Table II), we observed that among the
optional features, 2 optional features are fine-grained and 4
are coarse-grained features. In addition, among the mandatory
features, 13 mandatory features are fine-grained features and
73 mandatory features are coarse-grained features.
Observing the sample, among the optional features, 66.7% are coarse-grained; within the mandatory features, this percentage grows to 84% (coarse-grained features). Using the Fisher exact test [26], there was not enough evidence to state that there is a statistical association between feature type and feature granularity (Table II).
2) RQ2.2: Is there any association between feature hierarchy profiles and feature granularity?: Considering features with hierarchical profiles, the sample is composed of 5 parent features, 19 child features, 2 features that take the parent and child profiles simultaneously, and 66 features that do not have any hierarchical profile (they are at the root level of the feature model, without any child feature). Analyzing the relationship between the feature hierarchy profiles and the feature granularity (Table II), we observed that all 5 parent features are coarse-grained; regarding the child features, 4 are fine-grained and 15 are coarse-grained; and both parent-children features are coarse-grained.
Based on the sample, among the features that do not have any hierarchical profile, 83% are coarse-grained; on the other hand, for the features that have at least one hierarchical profile, this percentage grows to 85% (coarse-grained features). Thus, using a generalization of the Fisher exact test, there was not enough evidence to state that there is a statistical association between the variables feature hierarchy and feature granularity (Table II).

TABLE II
DATA FROM FEATURE INFORMATION AND FEATURE GRANULARITY PERSPECTIVES

Feature information                    | Fine | %    | Coarse | %    | p-value
Feature type                           |      |      |        |      | 0.252*
  Mandatory Features                   |  13  | 15.1 |  73    | 84.9 |
  Optional Features                    |   2  | 33.3 |   4    | 66.7 |
Feature hierarchy                      |      |      |        |      | 0.784**
  Parent Features                      |   0  |  0   |   5    | 100  |
  Children Features                    |   4  | 21.1 |  15    | 78.9 |
  Parent-Children Features             |   0  |  0   |   2    | 100  |
  Features without hierarchy profile   |  11  | 16.7 |  55    | 83.3 |
Feature interaction                    |      |      |        |      | 0.010**
  Required Features                    |   3  | 10   |  27    | 90   |
  Requesting Features                  |   0  |  0   |  17    | 100  |
  Requesting-Required Features         |   1  |  8.3 |  11    | 91.7 |
  Features without interaction profile |  11  | 33.3 |  22    | 66.7 |

* Results from the Fisher exact test
** Results from a generalization of the Fisher exact test
3) RQ2.3: Is there any association between feature interaction and feature granularity?: Considering features with interactions (dependencies), the sample is composed of 30 features which require another one, 17 features which are requested by another one, 12 features that require another one and are requested by another feature simultaneously, and 33 features that do not have interaction profiles. Analyzing the relationship between the feature interaction profiles and the feature granularity (Table II), we observed that, among the features which require another one, 3 are fine-grained and 27 are coarse-grained; regarding the features which are requested by another one, all 17 are coarse-grained; and among the features that require another one and are requested by another feature simultaneously, 1 is fine-grained and 11 are coarse-grained.
Among the features which did not have any interaction profile, 67% are coarse-grained; among the features that have at least one interaction profile, this percentage grows to 93% (coarse-grained features). Using a generalization of the Fisher exact test, we obtained relevant evidence, at the 5% significance level, of an association between the variables feature interaction and feature granularity (Table II). Moreover, using the odds ratio to measure the ratio of the odds that an event occurs to the odds of the event not occurring, we observed in our data that the odds of being coarse-grained among the features with at least one interaction are 6.9 times higher than for the features without any interaction profile. Thus, the variable feature interaction presented a significant statistical association with the variable feature granularity.
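The 6.9 odds ratio can be checked directly from the Table II counts; the sketch below collapses the interaction profiles into a 2x2 table, so the Fisher p-value it reports is for this collapsed table only and need not coincide with the 0.010 obtained from the generalized test on the full table.

    # Sketch: association between interaction profile and granularity from Table II counts.
    from scipy.stats import fisher_exact

    # rows: [fine, coarse]; columns: [at least one interaction, no interaction]
    table = [[4, 11],     # fine-grained: 3 + 0 + 1 with interaction, 11 without
             [55, 22]]    # coarse-grained: 27 + 17 + 11 with interaction, 22 without

    _, p_value = fisher_exact(table)
    print(round(p_value, 3))                    # p-value for the collapsed 2x2 table only

    odds_with = 55 / 4                          # odds of coarse among features with interaction
    odds_without = 22 / 11                      # odds of coarse among features without interaction
    print(round(odds_with / odds_without, 1))   # about 6.9, as reported above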
C. RQ3: Is there any influence from feature granularity on
feature non-conformity data?
Assuming the importance of the variable feature granularity [4], and in order to explore the variables that influence the feature non-conformity values, this question aims at investigating whether the variable feature granularity presents any influence on the values collected for the variable feature non-conformity. Considering the studied sample of 92 features, the average number of non-conformities was slightly lower for coarse-grained features (1.47) than for fine-grained features (1.60). The same behavior was observed regarding the dispersion of the feature non-conformities (Table III).
Thus, in order to assess the influence of the feature granularity (X) on the feature non-conformity data, some models were fitted to estimate feature non-conformities through a truncated Poisson regression. In this case, 72 observations were considered, representing all the features with positive counts, i.e., those that presented non-conformities after inspection (Y1, . . . , YNobs). Based on Equation 3, four univariate regression models were fitted in an attempt to answer this research question.
log(λi) = β0 + β1 X   (3)
In Equation 3, λi represents the average intensity of feature non-conformities, and β0 and β1 are the unknown parameters. The estimates were obtained with the Stata2 software and, according to the results (Table IV), the log of the average intensity of feature non-conformities increases by 0.067 when comparing coarse-grained features with fine-grained features. Based on the investigated sample, this result means that feature granularity presented a small influence on feature non-conformity; however, this finding has a low significance level, p-value = 0.785 (Table IV).
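On the response scale, that 0.067 log-scale estimate corresponds to a multiplicative factor of roughly exp(0.067) ≈ 1.07 on the expected number of non-conformities; the reference level of the granularity coding is not stated in the paper, so the direction of the comparison is an assumption.

    # Reading the Table IV granularity coefficient on the response scale.
    import math
    print(math.exp(0.067))   # ~1.069: about a 7% difference, not significant (p = 0.785)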
D. RQ4: Can feature information and feature granularity simultaneously influence feature non-conformity data?
To answer this question, a regression model was used to assess the effects (influence) of all the independent variables simultaneously on the positive cases of feature non-conformities.
2 http://www.stata.com
TABLE III
DESCRIPTIVE STATISTICAL ANALYSIS - FEATURE GRANULARITY AND FEATURE NON-CONFORMITY DATA

Feature Granularity | Feature amount | Mean | Std. Deviation | Std. error mean
Fine                | 15             | 1.60 | 0.986          | 0.254
Coarse              | 77             | 1.47 | 1.165          | 0.133

TABLE IV
ZERO-TRUNCATED SIMPLE POISSON REGRESSION - FEATURE INFORMATION AND FEATURE NON-CONFORMITY

Parameters          | Estimate | Std Error | z-value | p-value
Intercept           | 0.324    | 0.217     | 1.50    | 0.134
Feature Granularity | 0.067    | 0.245     | 0.27    | 0.785
Intercept           | 0.373    | 0.107     | 3.48    | 0
Feature Type        | 0.933    | 0.356     | 0.26    | 0.793
Intercept           | 0.416    | 0.120     | 3.49    | 0.183
Feature Hierarchy   | −0.139   | 0.229     | −0.61   | 0.543
Intercept           | 0.027    | 0.264     | 0.10    | 0.919
Feature Interaction | 0.491    | 0.283     | 1.74    | 0.082
Each Intercept/variable pair corresponds to one univariate model (Equation 3).
Thus, a truncated Poisson model was fitted to predict the average intensity of feature non-conformities as a function of the feature information, namely feature type (X1), feature interaction (X2), feature hierarchy (X3) and feature granularity (X4), simultaneously (Equation 4). The regression parameters were estimated by maximum likelihood (Table V(a)).

log(λi) = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4   (4)
Comparing Table IV and Table V(a), we can observe that there was an increase in the estimated influence of the variable feature interaction on the average intensity of feature non-conformity; the estimated influence values of the variables feature hierarchy and feature granularity remained negative; and the variable feature type had its estimated influence decreased. Despite the estimated value of the variable feature interaction, its final influence in the presence of the other independent variables in the multivariate Poisson regression model (Equation 4) was reduced due to the negative values of the variables feature hierarchy and feature granularity. In addition, a reduced model was fitted to take into account the influence of the independent variables feature interaction (X̂1) and feature granularity (X̂2) on the average intensity of feature non-conformities (Equation 5).

log(λi) = β0 + β1 X̂1 + β2 X̂2   (5)
From Table V(b), it was observed that the log of the average intensity of feature non-conformity decreases when comparing coarse-grained features with fine-grained features. However, this effect was not significant at the 10% level. On the other hand, the features that had an interaction profile presented a significant effect on the average intensity of feature non-conformities in comparison with the features that did not have an interaction profile.
Considering the data in Table V and based on the selected multiple Poisson regression model to estimate feature non-conformities (Equation 5), we could compute the predicted values for feature non-conformities according to the combination of the values of the model parameters, feature interaction and feature granularity (Table VI). These results can then be used to make predictions. For example, the
expected average intensity of feature non-conformities for features without interaction and with fine granularity would be 1.126; the expected average intensity for features without interaction and with coarse granularity would be 0.971; the expected average intensity for features with at least one interaction and fine granularity would be 1.921; and the expected average intensity for features with at least one interaction and coarse granularity would be 1.657. These results highlight that fine-grained features had larger predicted values (feature non-conformities) than coarse-grained features, and that features with an interaction (dependency) with another feature had larger estimated values than features without interactions.
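These predicted intensities follow directly from the Table V(b) estimates through Equation 5, as the short check below shows; the 0/1 coding of the two covariates is assumed from the table layout.

    # Check of the Table VI predictions: log(lambda) = 0.119 + 0.534*interaction - 0.148*coarse
    import math

    b0, b_interaction, b_coarse = 0.119, 0.534, -0.148
    for interaction in (0, 1):
        for coarse in (0, 1):
            lam = math.exp(b0 + b_interaction * interaction + b_coarse * coarse)
            print(interaction, coarse, round(lam, 3))
    # prints approximately 1.126, 0.971, 1.921 and 1.657, matching Table VI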
V. MAIN FINDINGS OF THE STUDY
The main findings of this study can be summarized as follows:
a) During the feature specification activity, the scope analysts should give more attention to coarse-grained features, features with interactions, features with hierarchy and mandatory features in order to achieve better quality within the documents, since these features showed a higher number of non-conformities.
b) The variables feature type and feature hierarchy do not present a significant statistical association with the variable feature granularity, using the Fisher exact test and a generalization of the Fisher exact test.
c) The variable feature interaction presented a significant statistical association with the variable feature granularity using the generalization of the Fisher exact test (p-value = 1%). Likewise, the odds of being coarse-grained among the features with at least some interaction are 6.9 times higher than for features which do not have any interaction profile.
d) The variable feature granularity alone presents a small influence on the variable feature non-conformity (estimated value = 0.067); however, the model returned a low significance level (p-value = 0.785) using the zero-truncated simple Poisson regression model.
TABLE V
ZERO-TRUNCATED MULTIPLE POISSON REGRESSION MODELS TO ESTIMATE FEATURE NON-CONFORMITY

(a) Candidate model
Parameters          | Estimate | Std Error | z-value | p-value
Intercept           | 0.110    | 0.309     | 0.36    | 0.722
Feature Interaction | 0.530    | 0.324     | 1.63    | 0.102
Feature Hierarchy   | −0.013   | 0.228     | −0.06   | 0.954
Feature Type        | 0.1      | 0.292     | 0.34    | 0.733
Feature Granularity | −0.139   | 0.302     | −0.46   | 0.646
AIC: 185.9765

(b) Selected model
Parameters          | Estimate | Std Error | z-value | p-value
Intercept           | 0.119    | 0.281     | 0.42    | 0.673
Feature Interaction | 0.534    | 0.309     | 1.73    | 0.084
Feature Granularity | −0.148   | 0.291     | −0.51   | 0.61
AIC: 182.0345

The Akaike information criterion (AIC) is a measure of the relative goodness of fit of a statistical model.

TABLE VI
VALUES TO THE FEATURE NON-CONFORMITIES PREDICTION MODEL - EQUATION 5

Feature Interaction | Feature Granularity | λ̂i
No                  | Fine                | 1.126
No                  | Coarse              | 0.971
Yes                 | Fine                | 1.921
Yes                 | Coarse              | 1.657
e) From the zero-truncated multiple Poisson regression model, the variable feature interaction presented a relevant positive influence on feature non-conformity (estimated value = 0.534) within a significance level of 10% (p-value = 0.084). On the other hand, the variable feature granularity presented a negative influence on the variable feature non-conformity (estimated value = −0.148) with a low significance level (p-value = 0.61).
f) Features that had at least one interaction were estimated to have more feature non-conformities than features which did not have any interaction. This finding needs more investigation (empirical studies) to observe whether it is reproduced and to understand its possible reasons.
g) Fine-grained features were estimated to have more feature non-conformities than coarse-grained features. Kästner et al. [4] also reported problems in handling the implementation of fine-grained features in the SPL context. In this study, we identified that fine-grained features stood out, since they were estimated to have more feature non-conformities (RQ4).
VI. THREATS TO VALIDITY
In order to reduce the threats to validity, countermeasures were taken during the whole study. The countermeasures followed the quality criteria in terms of construct, external and internal validity, as discussed in [25]. Moreover, we also briefly describe the mitigation strategies related to the research questions definition and to negative results.
Construct validity: Two strategies were used, as described below:
• Longstanding involvement: In this strategy, the researchers had a long involvement with the object of study, allowing them to gather tacit knowledge which helped avoid misunderstandings and misinterpretations [28].
• Peer debriefing: It recommends that the analysis and conclusions be shared and reviewed by other researchers [28]. This was achieved by conducting the analysis with three researchers, by holding discussion groups in which analyses and conclusions were discussed, and by the supervision of a statistical researcher (the second author of this paper).
Internal validity: This threat was mitigated by ensuring the company's anonymity and by providing the research team with free access to the company.
External validity: Although the study was applied in only
one company, our intention is to build a knowledge base to
enable future analytical generalization where the results are
extended to cases that have similar characteristics.
Reliability: This aspect was achieved by using two tactics:
a detailed empirical study protocol and a structured study
database with all relevant and raw data such as interviews
and meetings tapes, transcripts, documents, and outline of
statistical models.
Research Questions: The set of questions might not have properly covered all the aspects of the relationship between SPL inspection and feature information. As this was considered a feasible threat, discussions among the authors of this work and some members of the research group (RiSE Labs 3) were conducted in order to calibrate the questions.
3 http://labs.rise.com.br/
Negative results: Some correlation and prediction analyses presented negative results. However, these should not be discarded immediately, and future analyses of these cases must be provided to increase the validity of the conclusions.
VII. CONCLUSIONS
The use of the feature concept is a key aspect for organizations interested in achieving improvements in software reuse, productivity, quality, and cost reduction [2]. Software product lines, as a software reuse approach, have proven their benefits in different industrial environments and domains [29][30].
Thus, to achieve these benefits, quality assurance techniques, such as software inspection, should be performed on the feature specification artifacts, since feature units can be considered the starting point for reusing assets in different products. In this exploratory study, we investigated the relationships between feature granularity data and feature non-conformity data based on a sample of 92 features and 137 feature non-conformities. We believe that, although there were some negative results, some variables need further empirical investigation to validate the results of this study.
Based on the dataset, we identified that there was no significant statistical association between the variables feature type and feature granularity, nor between the variables feature hierarchy and feature granularity. On the other hand, there was a significant statistical association between the variables feature interaction and feature granularity. The variable feature granularity did not present a significant statistical influence on the variable feature non-conformity. When investigating the simultaneous influence of the variables feature interaction and feature granularity on the variable feature non-conformity, the outcome was that the variable feature interaction presented a positive influence on feature non-conformity, whereas the variable feature granularity presented a negative influence with a low significance level. Also, during the feature specification activity, coarse-grained features, features with interactions, features with hierarchy and mandatory features should receive more attention, since they revealed more non-conformities.
Furthermore, this work can be seen as a further step towards understanding which software project variables influence feature non-conformity data in the SPL context. As future work, we plan to replicate this study in another company, within the financial domain.
REFERENCES
[1] P. Clements and L. Northrop, Software Product Lines: Practices and
Patterns. Boston, MA, USA: Addison-Wesley, 2001.
[2] K. Kang, S. Cohen, J. Hess, W. Nowak, and S. Peterson, FeatureOriented Domain Analysis (FODA) Feasibility Study. Technical Report
CMU/SEI-90-TR-21, 1990.
[3] T. Von Der Massen and H. Lichter, “Deficiencies in feature models,” in
Workshop on Software Variability Management for Product Derivation
Towards Tool Support, SPLC 2004, T. Mannisto and J. Bosch, Eds.
Springer Verlag, 2004.
[4] C. Kästner, S. Apel, and M. Kuhlemann, “Granularity in software
product lines,” Proceedings of the 30th international conference on
Software engineering, pp. 311–320, 2008.
[5] I. John and M. Eisenbarth, “A decade of scoping: a survey,” Proceedings
of the 13th International Software Product Line Conference, pp. 31–40,
2009.
[6] I. S. Souza, G. S. S. Gomes, P. A. M. S. Neto, I. C. Machado, E. S.
Almeida, and S. R. L. Meira, “Evidence of software inspection on
feature specification for software product lines,” Journal of Systems and
Software, vol. 86, no. 5, pp. 1172–1190, 2013.
[7] P. A. da Mota Silveira Neto, I. do Carmo Machado, J. D. McGregor, E. S.
de Almeida, and S. R. de Lemos Meira, “A systematic mapping study
of software product lines testing,” Information & Software Technology,
vol. 53, no. 5, pp. 407–423, 2011.
[8] E. Engström and P. Runeson, “Software product line testing - a systematic mapping study,” Information & Software Technology, vol. 53, no. 1,
pp. 2–13, 2011.
[9] M. A. Babar, L. Chen, and F. Shull, “Managing variability in software
product lines,” IEEE Software, vol. 27, no. 3, pp. 89–91, 94, 2010.
[10] G. C. Murphy, A. Lai, R. J. Walker, and M. P. Robillard, “Separating
features in source code: an exploratory study,” Proc. 23rd Int. Conf. on
Soft. Eng. ICSE 2001, vol. 12, pp. 275–284, 2001.
[11] H. Ossher and P. Tarr, “Hyper/j: Multi-dimensional separation of concerns for java,” Proceedings of the 23rd International Conference on
Software Engineering, pp. 821–822, 2000.
[12] G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. V. Lopes, J.-M.
Loingtier, and J. Irwin, “Aspect-oriented programming,” 11th European
Conference on Object-Oriented Programming, pp. 220–242, 1997.
[13] M. P. Robillard and G. C. Murphy, An Exploration of a Lightweight
Means of Concern Separation. Aspects and Dimensions of Concern
Workshop, 2000, pp. 1–6.
[14] M. Kalinowski, E. Mendes, D. N. Card, and G. H. Travassos, “Applying
dppi: A defect causal analysis approach using bayesian networks,”
Proceedings of the 11th international conference on Product-Focused
Software Process Improvement, pp. 92–106, 2010.
[15] M. Kalinowski, D. N. Card, and G. H. Travassos, “Evidence-based
guidelines to defect causal analysis,” Software, IEEE, vol. 29, no. 4,
pp. 16–18, 2012.
[16] M. Kalinowski, G. Travassos, and D. Card, “Towards a defect prevention based process improvement approach,” Software Engineering and
Advanced Applications, SEAA ’08., pp. 199–206, 2008.
[17] I. S. Souza, R. P. de Oliveira, G. Gomes, and E. S. de Almeida, “On
the relationship between inspection and evolution in software product
lines: An exploratory study,” in 26th Brazilian Symposium on Software
Engineering. IEEE, 2012, pp. 131–140.
[18] M. S. M. Balbino, E. S. Almeida, and S. R. L. Meira, “A scoping process for software product lines,” 23rd International Conference on Software Engineering and Knowledge Engineering, pp. 717–722, 2011.
[19] I. John, “Using documentation for product line scoping,” IEEE Software,
vol. 27, pp. 42–47, 2010.
[20] A. van Lamsweerde, Requirements Engineering: From System Goals to
UML Models to Software Specifications. Wiley, March 2009.
[21] P. Runeson and M. Höst, “Guidelines for conducting and reporting case
study research in software engineering,” Empirical Software Engineering, vol. 14, pp. 131–164, April 2009.
[22] C. B. Seaman, “Qualitative methods in empirical studies of software
engineering,” IEEE Trans. Softw. Eng., vol. 25, pp. 557–572, July 1999.
[23] T. Gauthier, “Detecting trends using spearman’s rank correlation coefficient,” Environmental Forensics, vol. 2, no. 4, pp. 359–362, 2001.
[24] T. M. Khoshgoftaar, K. Gao, and R. M. Szabo, “An application of zeroinflated poisson regression for software fault prediction,” Proc. 12th Int.
Symp. on Soft. Reliability Eng., pp. 66–73, 2001.
[25] R. K. Yin, Case Study Research: Design and Methods. Sage Publications, 2008.
[26] D. C. Montgomery and G. C. Runger, Applied Statistics and Probability
for Engineers. Wiley, 2006.
[27] A. C. Cameron and P. K. Trivedi, Regression Analysis of Count Data.
Cambridge: Cambridge University Press, September 1998.
[28] D. Karlström and P. Runeson, “Integrating agile software development
into stage-gate managed product development,” Empirical Software
Engineering, vol. 11, pp. 203–225, June 2006.
[29] F. Ahmed, L. Capretz, and S. Sheikh, “Institutionalization of software
product line: An empirical investigation of key organizational factors,”
Journal of Systems and Software, vol. 80, no. 6, pp. 836–849, 2007.
[30] J. F. Bastos, P. A. M. Silveira, E. S. Almeida, and S. R. L. Meira, “Adopting software product lines: A systematic mapping study,” 15th International Conference on Evaluation and Assessment in Software Engineering, pp. 11–20, 2011.
An Extended Assessment of Data-driven Bayesian
Networks in Software Effort Prediction
Ivan A. P. Tierno and Daltro J. Nunes
Instituto de Informática
UFRGS
Porto Alegre, Brazil
Email: {iaptierno,daltro}@inf.ufrgs.br
Abstract—Software prediction is a difficult but important task which can aid the manager in decision making, possibly allowing for savings of time and resources and higher software quality, among other benefits. Bayesian Networks are
one of the machine learning techniques proposed to perform
this task. However, the data pre-processing procedures related
to their application remain scarcely investigated in this field. In
this context, this study extends a previously published paper,
benchmarking data-driven Bayesian Networks against mean and
median baseline models and also against ordinary least squares
regression with a logarithmic transformation across three public
datasets. The results were obtained through a 10-fold cross
validation procedure and measured by five accuracy metrics.
Some current limitations of Bayesian Networks are highlighted
and possible improvements are discussed. Furthermore, we assess
the effectiveness of some pre-processing procedures and bring
forward some guidelines on the exploration of data prior to
Bayesian Networks’ model learning. These guidelines can be
useful to any Bayesian Networks that use data for model learning.
Finally, this study also confirms the potential benefits of feature
selection in software effort prediction.
I. I NTRODUCTION
Accurate software predictions can provide significant advantages in project planning and are essential for effective
project management, being strongly linked to the success of
software projects. Underestimating the effort can cause delays,
degrade software quality and bring about increased costs and
dissatisfied customers. On the other hand, overestimating the
project’s effort can lose a contract bid or waste resources that
could be allocated elsewhere. Although the primary objective
of software effort prediction is budgeting, there are also
other important objectives. Boehm et al. [1] mention tradeoff
and risk analysis, project planning and control and software
improvement investment analysis.
Since the nineties, researchers began applying machine
learning techniques for software effort prediction [2] [3]
[4]. Ever since, studies on machine learning techniques for
software prediction have grown more and more common.
Currently this is visibly a thriving trend with many empirical
studies being published regularly and comprising a very active
research field. In a systematic review, Wen et al. [5] identified eight machine learning techniques employed in software
prediction including CART (a type of decision tree) [3],
Case-based Reasoning (CBR) [4], Artificial Neural Networks,
Genetic algorithms, Support Vector Regression among others.
CBR, Artificial Neural Networks and Decision Trees were
considered by Wen et al. [5] the most popular machine learning
techniques in software development effort prediction research.
One of these machine learning techniques is Bayesian Networks (henceforth BNs), the technique we assess
in this study. BNs were initially proposed and are generally
more common in software quality prediction. Since then there
has been a steady increase of efforts towards BNs in software effort prediction and in software projects management
in general. Wen et al. ranked BNs fourth in popularity in
software development effort prediction among the machine
learning techniques. This technique has some distinguishing
features that make it look suitable to deal with the uncertainties
prevalent in this field. BNs will be discussed briefly in the next
section.
This research field has suffered with contradictions and few
clear conclusions. In spite of the large number of empirical
studies there are conflicting results and conclusion instability
[6] [7] [8]. Shepperd and MacDonell [9] state that ‘empirical
evaluation has not led to consistent or easy to interpret results’.
This matters because it is hard to know what advice to offer
to practitioners. There are many examples of contradictions in
comparisons among different machine learning and statistical
techniques, as described, e.g., in [7] and [9]. Part of these inconsistencies stems from differences in the experiments and sometimes from errors in the procedures, as discussed in, e.g., [9] and [10]. The latter study points out mistakes
in the application of regression models. Myrtveit, Stensrud
and Shepperd [11] discuss reasons for the unreliability of
conclusions in detail, chiefly focusing on validation and measuring, and concluded that more reliable research procedures
are necessary. Several other researchers have made suggestions
about the validation of results in comparative studies, e.g.,
[12], [13] and [9].
With regard to BNs, details on their employment and the
preparation and pre-processing prior to model learning remain
scarcely investigated. There is some uncertainty about their effectiveness and about the pre-processing procedures applied prior to model learning. Given the relevance of BNs in software prediction research, investigations into their employment and effectiveness are necessary.
In this context, this study strives to assess the employment
of data-driven BNs in software effort prediction through extensive validation procedures, including analyses on data preprocessing, providing guidelines on how to best explore data,
and discussing BNs’ current limitations and possibilities of
improvements. The investigation of data-driven BNs matters
because even if this might not become the best way to apply
them, the optimization of data exploration is an important
direction of development for this technique. By finding ways
to optimize the exploration of data there can be benefits to any
BNs that use data.
This paper extends a preliminary work [14] by assessing
other pre-processing steps, and extending significantly the
validation by including other metrics and another dataset, and
also by refining the observations on the results.
This paper is organized as follows. In section II we present
a brief overview on BNs. In section III we mention some
closely related studies. In section IV we bring forward the
empirical procedures, datasets used, and how we compared
the prediction systems. In section V we analyze and discuss
the results and finally put forth the conclusions in the last
section.
II. BAYESIAN N ETWORKS
BNs [15] [16] are a modeling technique which boasts
some distinguishing characteristics. A striking feature of this
modelling approach is the possibility, through application of
probability theory, to model uncertainty or subjectivity. The
probability distributions allow for the integration of objective
evaluations, learned from data, with subjective evaluations
defined by experts. Furthermore, this allows the model to
output several possible outcomes with varying degrees of
certainty, unlike deterministic models like linear regression
which simply output a single possible outcome, i.e., a numeric
value.
BNs comprise a qualitative part, i.e., the graph structure
that models the dependencies among a set of variables, and a
quantitative part made up of node probability tables (NPT’s)
which contain the probability distributions for each node. The
graph structure is a directed acyclic graph (DAG) encoding
the dependencies among the variables. The nodes represent the
relevant variables or factors in the domain being modeled, and
each directed arc depicts the dependencies among these factors
which can be causality relationships. The NPT’s contain the
prior probabilities (in case the variable has no parents) or
conditional probabilities (in case the variable has one or more
parents). The conditional probabilities define the probability of each state of a variable given the combination of states of its parents. With
the definition of these probabilities during the training phase
a test record can later be classified. These components are
illustrated on a simple example in Fig. 1.
Fig. 1. A simple Bayesian Network.
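To make the qualitative/quantitative split concrete, the following minimal Python sketch uses entirely hypothetical nodes and probabilities (not the network of Fig. 1): two parentless nodes carry prior probabilities, the effort node carries a conditional table (NPT), and a distribution over effort classes is derived from whatever evidence is observed.

```python
# Hand-built illustration of a BN's quantitative part (hypothetical nodes and numbers).

# Priors for the parentless nodes
p_size = {"small": 0.6, "large": 0.4}
p_team_exp = {"low": 0.5, "high": 0.5}

# NPT for Effort given (Size, TeamExp): P(Effort = class | parents)
p_effort = {
    ("small", "low"):  {"low": 0.7, "medium": 0.2,  "high": 0.1},
    ("small", "high"): {"low": 0.8, "medium": 0.15, "high": 0.05},
    ("large", "low"):  {"low": 0.1, "medium": 0.3,  "high": 0.6},
    ("large", "high"): {"low": 0.2, "medium": 0.5,  "high": 0.3},
}

def predict_effort_distribution(size=None, team_exp=None):
    """Distribution over Effort classes, marginalizing unobserved parents."""
    dist = {"low": 0.0, "medium": 0.0, "high": 0.0}
    for s, ps in p_size.items():
        if size is not None and s != size:
            continue
        for t, pt in p_team_exp.items():
            if team_exp is not None and t != team_exp:
                continue
            weight = (1.0 if size else ps) * (1.0 if team_exp else pt)
            for cls, p in p_effort[(s, t)].items():
                dist[cls] += weight * p
    total = sum(dist.values())
    return {cls: p / total for cls, p in dist.items()}

print(predict_effort_distribution(size="large"))                   # TeamExp marginalized out
print(predict_effort_distribution(size="large", team_exp="high"))  # both parents observed
```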
BNs can be modeled fully based on data, through a hybrid approach, i.e., integrating data modeling and expert knowledge, or fully expert-based. When the BNs are learnt from data, the learning algorithm strives to identify the dependencies among the variables, thus making up the network structure,
i.e., the DAG. The algorithm will identify a model that best
fits the relationship between the attribute set and the response
variable on the input data (training data). Thereafter, the
probability distributions are learned for every combination
of variables. This happens during the so called training or
learning phase.
The BNs found in this research field most frequently consist
of discrete variables. The tool used in this study currently does
not support continuous variables. Although some tools offer
support to continuous variables, this support has limitations,
e.g., imposing restrictions in the relationships among the
variables or making assumptions about the distributions of
the continuous variables. There is ongoing progress concerning
continuous variables in machine learning research and there
are also constant developments in the BNs tools, so these
limitations could be overcome in the future. For a more
detailed review on BNs we refer the reader to other works
in the field, e.g., [17], [18], [19] and to data mining literature
[15], [16].
III. R ELATED WORK
In this section we describe some closely related studies.
Radlinski and Hoffman [20] carried out a comprehensive
benchmarking study comparing 23 classifiers in WEKA over
four public datasets. The authors state their main research
question is: “Is it possible to easily predict software development effort from local data?”. So, they establish two
specific constraints: easy predictions and using local data,
i.e., data from a single company. This paper focused more
on the practitioners' viewpoint, trying to avoid complex and
time-consuming procedures. So, the authors do not address
specific details of the techniques but provide a wide-ranging
assessment of easy-to-use machine learning classifiers. By
comparing so many classifiers this study illustrates very well
the lack of stability of the ranking of the techniques across
different datasets. They mentioned that due to the ranking instability it is difficult to recommend a particular model to practitioners, even though they did conclude that the K* technique with feature selection was the most accurate overall.
BNs were among the most accurate predictors in two of the
four datasets but did not particularly stand out. They also
demonstrate that most techniques achieve higher accuracy by
performing feature selection.
Mendes and Mosley [13] outline thorough experiments
comparing BNs, CBR, manual stepwise regression and simple
mean and median based models for web effort prediction using
Tukutuku, a proprietary cross-company dataset. The study
compares four automatic and four hybrid BN models. The
results were unfavourable to BNs, with most of the models
being more inaccurate than the median model and two of them
barely matching it. The authors conclude that manual stepwise
regression may be the only effective technique for web effort
estimation. Furthermore, they recommend that researchers
benchmark proposed models against mean and median based
models as they show these can be more effective than more
complex models.
One of the most recent investigations can be found in [21], wherein
comprehensive experiments are laid out yielding a benchmark
of some statistical and data mining techniques, not including
however, BNs. This study benchmarks numeric predictors, as
opposed to [20] which assesses classifiers, i.e., discrete class
predictors. This study included thirteen techniques over eight
public and private datasets. Their results “indicate that ordinary least squares regression with a logarithmic transformation
performs best”. They also investigate feature subset selection with a wrapper approach confirming the improvements
brought by this technique. The authors also discuss appropriate
procedures and address efforts towards statistically rigorous
analyses.
A survey covering BNs for software development effort
prediction can be found in [19].
IV. E XPERIMENTS SETUP
We assess data-driven BNs by comparing them to ordinary
least squares regression with a logarithmic transformation,
which was found in [21] to be invariably among the most
accurate predictors. We remind the reader once again that we
are comparing a classifier, i.e., a discrete class predictor, to
a regression technique, i.e., a numerical predictor. We do this
by converting the BN’s class predictions to numeric ones by
means of a variant of the method originally proposed in [18]
which will be explained in subsection C.
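For illustration only, a minimal numpy sketch of ordinary least squares regression with a logarithmic transformation is shown below; it is not the WEKA implementation used in the experiments, and the design matrix and effort values are made up.

```python
# Minimal sketch of OLS regression with a logarithmic transformation (toy data).
import numpy as np

def fit_log_ols(X, y):
    """Fit ordinary least squares on log(effort); returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])            # add intercept column
    coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    return coef

def predict_log_ols(coef, X):
    """Predict in log space and transform back to the original effort scale."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.exp(A @ coef)

# Toy usage with made-up numbers (e.g. size and team size as predictors)
X = np.array([[100.0, 3], [250.0, 5], [80.0, 2], [400.0, 7]])
y = np.array([1200.0, 3100.0, 900.0, 6200.0])             # effort in person-hours
coef = fit_log_ols(X, y)
print(predict_log_ols(coef, X))
```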
We decided to experiment performing a logarithmic transformation on the data prior to BNs’ building. So, this variant
is included in the comparison amounting so far to three prediction systems. Furthermore, we also assess the effectiveness
of feature subset selection [22] [15] as a pre-processing step.
This technique has been employed with good results in this
field, e.g., [23], [21], [20]. So, for each of the aforementioned
models there is a variant with the application of feature
selection prior to model building which multiplies by two the
number of prediction systems. So, there are four variants of
BNs and two variants of OLS regression amounting so far to
six prediction systems.
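As an illustration of the wrapper idea behind feature subset selection, the sketch below performs a plain greedy forward search scored by the leave-one-out mean absolute residuals of the log-OLS helpers from the previous sketch. The actual experiments used WEKA's wrapper evaluator with BestFirst search, so this is only a rough stand-in for that procedure.

```python
# Rough sketch of wrapper-style feature subset selection (greedy forward search,
# scored by leave-one-out MAR). Reuses fit_log_ols / predict_log_ols and the toy
# X, y defined in the previous sketch.
import numpy as np

def loo_mar(X, y):
    """Leave-one-out mean absolute residual of log-OLS on the given columns."""
    errors = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        coef = fit_log_ols(X[mask], y[mask])
        pred = predict_log_ols(coef, X[i:i + 1])[0]
        errors.append(abs(y[i] - pred))
    return np.mean(errors)

def forward_select(X, y):
    remaining, selected = list(range(X.shape[1])), []
    best_score = np.inf
    while remaining:
        scored = [(loo_mar(X[:, selected + [j]], y), j) for j in remaining]
        score, j = min(scored)
        if score >= best_score:          # stop when no candidate improves MAR
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected

print(forward_select(X, y))   # indices of the retained feature columns
```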
Finally, we include in the comparison mean and median
based models like proposed in [13]. These models simply
use the mean and median of all projects effort as a constant
prediction. These are very simple benchmark models and an
effective model should be able to be more accurate than
them. The comparison with such models allows us to better
assess the effectiveness of the other techniques by establishing
a minimum benchmark of accuracy. The inclusion of such
benchmark models is another recent trend proposed in several
studies like [13] and [9], with the goal of verifying whether
the models are effectively predicting and therefore bringing
clarity to the results. So, with these two benchmark models
we have in total eight prediction systems.
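The two benchmark models reduce to constant predictors fitted on the training efforts; a possible sketch, with made-up training values:

```python
# Mean and median baseline models as constant predictors (illustrative only).
import numpy as np

def baseline_predictor(train_effort, kind="median"):
    """Every new project gets the training mean or median as its prediction."""
    constant = np.median(train_effort) if kind == "median" else np.mean(train_effort)
    return lambda n_projects: np.full(n_projects, float(constant))

train_effort = [1200.0, 3100.0, 900.0, 6200.0]   # made-up training efforts
print(baseline_predictor(train_effort, "median")(3))
print(baseline_predictor(train_effort, "mean")(3))
```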
An abstract outline of the experiments we carried out is
shown in Fig. 2. We omitted the different versions of the
dataset and the two models of BNs on log-transformed data
to avoid cluttering up the figure, for intuitiveness’ sake. So,
prepared data is an abstract entity which represents any of
the datasets versions (log-transformed or not, and discretized
or not) and besides the six prediction systems depicted in this
figure there are two BNs on log-transformed data (with and
without FSS) which are not shown. We will explain these
procedures in the next sections.
These experiments were carried out in the WEKA data
mining tool [24]. The next subsections describe briefly the
datasets, the conversion method necessary to compare the
techniques and the metrics used to assess accuracy.
Fig. 2. Experiments outline.
A. Datasets
A significant barrier for analysis of findings and replication
of experiments has been the lack of publicly available datasets
since the employment of proprietary datasets inhibits the
replication of experiments and confirmation of results. The
PROMISE repository [25] is an initiative that attempts to
counter to some extent the lack of transparency that pervades
this research field. Datasets are made available allowing for
replication and scrutiny of findings with the intent of improving research efforts and stirring up analyses and discussion.
In this work, we used three widely studied datasets available
in the PROMISE repository [25]. These are the Desharnais,
Maxwell and Cocomo81 datasets. These datasets are relatively
clean in comparison to other datasets we have checked. They
are local datasets, i.e., data was collected within a single
company. Table I describes basic information on the datasets.
TABLE I
BASIC INFORMATION ON DATASETS

Data set     Local data   Domain            Effort unit     Range years
Desharnais   Yes          Unknown           Person-Hours    1981-1988
Maxwell      Yes          Finnish bank      Person-Hours    1985-1993
Cocomo81     Yes          Various domains   Person-Months   1970-1981
The histograms in Fig. 3, Fig. 4 and Fig. 5 illustrate the
distribution of data over effort, the dependent variable. Effort
is measured in person-hours on Desharnais and Maxwell and
in person-months of 152 hours in Cocomo81. In all three cases
the variables are positively skewed, i.e., variables with most
records situated towards lower values and a few very high
outlying values. Desharnais is the least skewed of the three
at 2.00. Maxwell is significantly more skewed at 3.35 and
Cocomo81 is the most skewed one at 4.48. Skewness is a
very common characteristic in software project datasets.
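As a small illustration (with made-up effort values, not the actual datasets), skewness can be checked and reduced with a log transformation as follows:

```python
# Checking skewness and the effect of a log transformation (illustrative values).
import numpy as np
from scipy.stats import skew

effort = np.array([546, 1300, 2100, 3900, 5000, 9000, 23940], dtype=float)  # made-up efforts
print(skew(effort))            # positive value: right-skewed distribution
log_effort = np.log(effort)    # the transformation applied before model building
print(skew(log_effort))        # closer to zero after the transformation
```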
This characteristic poses some hindrances for modeling. In
order to carry out linear regression these variables must be
transformed as to approximate a Gaussian distribution. With
regard to BNs, this is also a problem since the discretization
could yield very uneven class intervals. In such a scenario, the equal-widths discretization technique [26] [15] can produce empty classes and place most of the dataset population within just a couple of classes, thus rendering the validation
highly dubious. If almost all of the data is within just a couple of classes the model can hardly predict wrong or find meaningful patterns. A very high hit-rate would not be surprising, but the predictions would be meaningless.

Fig. 3. Desharnais data set.

Fig. 4. Maxwell data set.
When software managers carry out effort predictions they
do not know, for instance, how long a project will last,
even though they can have an estimate. Therefore, variables
whose values are unknown at the time the prediction is to be
performed must be removed, e.g., ‘Duration’, ‘Defects’. This
is standard practice in the software prediction field. On the
other hand, when a sizing variable is quantified in Function points it is usually included, since it can be obtained in the
specification phase, depending on the process model.
On Desharnais dataset three variables were removed: the
ID variable, ‘YearEnd’ and ‘Length’. On Maxwell dataset
three variables were removed: ‘Duration’, ‘Time’ and ‘Syear’.
Finally, no variables were removed from Cocomo81 dataset.
In order to carry out OLS regression we removed records
with missing values. This amounts to four records on Desharnais dataset and two records on Maxwell dataset. There
were no missing values on Cocomo81 dataset. For the BNs
models all the records were kept. We also experimented
not removing the missing values for the OLS regression model by performing median imputation, and the difference on the Desharnais dataset, which is the one with more missing values, was minimal. So, we decided to show the results on the dataset without records with missing values because these are the same we used in our previous paper [14]. On the Maxwell dataset there were two records with missing values and on the Cocomo81 dataset there were none.

Fig. 5. Cocomo81 data set.
The categorical variables were coded to dummy variables
for the linear regression model following good statistical
practices [10]. This study also suggests the removal of outliers.
Although this is the standard practice for statistical procedures,
we decided to keep the outliers for both models for two
reasons: To keep the same conditions for both models; and
chiefly because these outliers are actual projects which are
rare but can happen. They are not noisy or irrelevant entries.
Other studies in software prediction also keep the outliers, e.g.,
[21], [20].
For more detailed information on these datasets we refer
the reader to [20] and to the original works referenced in the
PROMISE repository [25].
B. Comparing the Predictions
The prediction systems are compared through numerical
metrics. This has been another controversial topic. There is
no consensus on what is the most reliable metric [11]. The standard metric some years ago used to be MMRE [27],
but due to some flaws it lost popularity. MMRE, like other
numerical metrics used in this study, is based on the magnitude
of relative error. MRE is a measure of the relative error of the
actual effort e_i against the predicted effort ê_i:

MRE_i = |e_i − ê_i| / e_i    (1)
MMRE measures the mean of all the predictions’ MREs.
This metric has not passed without criticism [27] [6], for it is
highly affected by outliers and it favours models that underestimate. MMRE is biased towards underestimates because the
magnitude of the error is unbounded when overestimating and
limited to at most 1 (or 100%) when underestimating. This is
well explained by means of a didactic example in [9]. This bias
entails that models that tend to underestimate will be likely
to have smaller MREs overall, therefore performing better
according to MRE based metrics. Even though this bias is
present in all MRE based metrics it is specially so in MMRE.
MdMRE is the median of the MRE’s. It smoothes out
MMRE's bias, for it is more robust to outliers. Highly inaccurate predictions do not affect MdMRE as much as they affect MMRE.
So, on the one hand it shows which models are generally
more accurate, but on the other hand it conceals which models
can be occasionally very inaccurate. This effect is even more
pronounced on Pred metric because it ignores completely the
predictions with large errors. Pred measures how frequently
predictions fall within a specified percentage of the actual
effort, e.g., Pred25 tells us how often the predicted effort is
within 25% of the project’s actual effort (25 is a common
parameter value for this metric). Therefore, this metric ignores
the predictions whose errors are in excess of 25% magnitude,
i.e., it does not matter for this metric if the error is 30%
or 200% (assuming Pred25 ). This is a limitation which we
criticize about these metrics. Obtaining a model whose predictions rarely lie too far from the actual value is certainly
advantageous. This is a desirable quality in a model and these
metrics overlook this aspect.
Several studies proposed new metrics discussing their characteristics. But none of these metrics was widely adopted in
the research field. MdMRE and Pred appear to be still the
most popular. Foss et al. [27] concluded that every metric
studied has flaws or limitations and that it is unlikely that a
single entirely reliable metric will be found. So, the use of
complementary metrics is recommended.
Miyazaki et al. [28], being the first to observe MRE’s bias
towards underestimates, proposed MBRE (Mean Balanced
Relative Error). This metric addresses this flaw because it
makes the relative error unbounded towards both underestimates and overestimates. By making the ratio relative to the
lowest value (between actual and predicted values) the bias of
MRE based metrics is eliminated, therefore avoiding favouring
models that underestimate.
BRE_i = |e_i − ê_i| / min(e_i, ê_i)    (2)
However, it has a flaw in that it does not account for negative
predictions. Linear regression models can at times predict a
negative number and therefore slightly distort the results under
MBRE. Kitchenham et al. [12] propose the use of the absolute
residuals as another alternative to bypass these problems of
MRE based metrics. MAR (Mean Absolute Residuals) being
an absolute measure also avoids this bias of ratio metrics like
MRE. MAR has the disadvantage of not being comparable
across different datasets.
MAR = (1/n) · Σ_{i=1..n} |e_i − ê_i|    (3)
TABLE II
NUMERICAL CONVERSION FOR BNS ON DESHARNAIS DATA SET

Prediction System             MMRE    MdMRE   Pred    MAR
Bayesian Networks (mean)      70      35.65   38.27   2556.98
Bayesian Networks (median)    57.23   32.66   33.33   2153.52
TABLE III
NUMERICAL CONVERSION FOR BNS WITH FSS ON DESHARNAIS DATA SET

Prediction System                   MMRE    MdMRE   Pred    MAR
Bayesian Networks + FSS (mean)      68.94   35.49   39.5    2509.52
Bayesian Networks + FSS (median)    56.18   34.16   39.5    2133.84
We consider our selection of metrics to be robust with MAR
and MBRE being complementary to the MRE based metrics
and making the evaluation more reliable. Higher accuracy in
MMRE, MdMRE, MAR and MBRE is inferred from lower
values, whereas for Pred metric, the higher the value the more
accurate the model. In our result tables, the results under
MMRE, MdMRE, Pred and MBRE are multiplied by 100 to
keep them in a percentage perspective, e.g., 0.253 turns into
25.3.
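For reference, the five accuracy measures defined above can be computed as in the sketch below (our own illustration with toy values; the reported numbers were produced by the 10-fold cross-validation procedure, not by this snippet):

```python
# Minimal implementations of the accuracy measures defined in Eqs. (1)-(3).
import numpy as np

def mre(actual, pred):
    return np.abs(actual - pred) / actual                      # Eq. (1)

def mbre(actual, pred):
    return np.abs(actual - pred) / np.minimum(actual, pred)    # Eq. (2)

def metrics(actual, pred):
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return {
        "MMRE":   100 * np.mean(mre(actual, pred)),
        "MdMRE":  100 * np.median(mre(actual, pred)),
        "Pred25": 100 * np.mean(mre(actual, pred) <= 0.25),
        "MAR":    np.mean(np.abs(actual - pred)),               # Eq. (3)
        "MBRE":   100 * np.mean(mbre(actual, pred)),
    }

print(metrics([1000, 2500, 400], [1200, 2000, 900]))   # toy actual vs. predicted efforts
```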
C. Comparing the Bayesian Classifier to regression techniques
In order to compare BNs’ results to linear regression we
used a variant of the conversion method first proposed in [18],
and also used in [13], in which the numerical prediction is the
sum of the multiplication of each class’ mean by its respective
class probability after the probabilities are normalized so that
their sum equals one. Instead of using the mean however, we
used the median. Each class’ median value Md is multiplied
by its respective normalized class probability ρ, output in the
probability distributions of the BN’s predictions. See formula
below.
Effort = ρ_class1 · Md_class1 + ... + ρ_classN · Md_classN    (4)

Like the aforementioned studies, we used the mean in a preliminary study [14]. We report here accuracy improvements under the MdMRE and Pred metrics and significant and consistent improvements in the MMRE and MAR results when using the median for the numerical conversion. This modification increased accuracy and lessened the number of outliers, i.e., wildly inaccurate predictions. This happens because the mean of each class is more affected by outliers than the median. These datasets are positively skewed, so each class's mean value (and especially that of the highest effort class) is closer to where the outliers are and farther from the majority of the data, pushing the numerical conversion of the output towards higher values. Therefore, when skewness is present the median is a more faithful and accurate representative of the data that makes up each class. Evidence supporting this reasoning is that the larger improvements were achieved on the Maxwell dataset, which is the more skewed one.
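A small sketch of this median-based conversion of Eq. (4), with hypothetical class probabilities and per-class medians:

```python
# Numerical conversion of a BN's class prediction (Eq. 4), using class medians.
def to_numeric(class_probs, class_medians):
    """class_probs and class_medians are dicts keyed by effort class."""
    total = sum(class_probs.values())                 # normalize so probabilities sum to one
    return sum((p / total) * class_medians[c] for c, p in class_probs.items())

# Hypothetical BN output for one test project, and per-class medians from training data
probs = {"low": 0.15, "medium": 0.60, "high": 0.25}
medians = {"low": 1100.0, "medium": 3400.0, "high": 12000.0}
print(to_numeric(probs, medians))
```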
The effectiveness of this modification can be seen in the tables. Table II shows the results for BNs on the Desharnais dataset. Table III shows the results for BNs with the employment of feature subset selection on the same dataset. Tables IV and V show the corresponding results on the Maxwell dataset. These tables show
the results comparing the conversion with the mean against
the conversion with the median. BNs with and without feature
subset selection are different prediction systems. So, the effect
of the conversion method can be assessed by comparing the
results on the same prediction system. Comparisons between
the two prediction systems do not belong in this section and
will be discussed in the results section. Here we are discussing
only the improvements provided by this adaptation to the
method proposed in [18].
TABLE IV
NUMERICAL CONVERSION FOR BNS ON MAXWELL DATA SET

Prediction System             MMRE     MdMRE   Pred    MAR
Bayesian Networks (mean)      132.69   64.44   22.58   6726.23
Bayesian Networks (median)    86.18    58.77   24.19   4655.29
TABLE V
NUMERICAL CONVERSION FOR BNS WITH FSS ON MAXWELL DATA SET

Prediction System                   MMRE     MdMRE   Pred    MAR
Bayesian Networks + FSS (mean)      163.53   67.74   19.35   6281.83
Bayesian Networks + FSS (median)    97.50    55.99   27.42   4854.74
The effect of using the median in the conversion is quite
clear for both prediction systems and in both datasets. However, on the Desharnais dataset under the Pred metric there is no
improvement. This can be ascribed to the limitation about
Pred discussed in the previous section. This metric ignores
predictions whose errors are larger than the parameter used,
i.e., 25. All errors over this threshold are ignored. So, an error
that is reduced from 100% MRE to 50% MRE will not affect
this metric despite being a valuable improvement. We can infer
from this that the improvements happened in the predictions
that lie outside the 25% error range since all the other metrics
clearly show there were improvements.
We can see the impact of this adaptation is quite significant
on the Maxwell dataset, which is the more skewed one.
This result can probably be more easily grasped in all detail
by the reader after reading the analysis and discussion of
results in the next section.
V. R ESULTS AND A NALYSIS
Table VI reports on the results for the Desharnais dataset
according to the continuous metrics previously exposed. On
the Desharnais dataset there is an obvious improvement in
the BNs' hit-rates when applying feature selection. The hit-rates are simply the percentage of times the classifier predicted
the right class (therefore, the higher the value the more
precise the model). However, when we consider the continuous
metrics there were generally no improvements except under
Pred. Pred metric resembles the hit-rates in its characteristic
of only considering the accurate predictions and ignoring
predictions lying far from the actual value. This shows there
were more accurate predictions but that there were also more
wrong predictions since the other metrics do not show improvements. This illustrates the limitation of Pred metric that
we highlighted in subsection B of the previous section. For
OLS regression, there is a small improvement under MMRE,
MdMRE, MAR and MBRE and a marginal degradation under
Pred metric. The improvements were relatively small because
the number of variables is already small in this dataset and the feature selection technique cannot find much improvement by further decreasing the number of variables.
The accuracy of the BNs on log-transformed data was about
the same as on the non-transformed data. The log transformation did not bring improvements to BNs' predictions.
BNs’ performance was very constant regardless of data preprocessing. So, on this dataset, BNs performed relatively well
but were more prone to large inaccuracies.
Finally, BNs clearly overcame the baseline models.
TABLE VI
MODELS PERFORMANCE ON DESHARNAIS DATA SET

Predictor       Hit-rate   MMRE     MdMRE   Pred    MAR       MBRE
BNs             46.91%     57.23    32.66   33.33   2153.52   65.83
BNs+FSS         54.32%     56.18    34.16   39.5    2133.84   64.52
BNs+log         44.44%     56.37    33.61   34.57   2128.65   67.38
BNs+log+FSS     48.15%     57.64    36.42   38.27   2165.47   72.19
OLS+log         -          37.62    29.19   46.75   1731.53   48.04
OLS+log+FSS     -          34.24    27.66   45.45   1567.93   42.54
Mean model      -          121.66   59.49   18.51   3161.52   140.04
Median model    -          78.46    42.03   29.62   2861.53   120.42
Table VII reports on the results for the Maxwell dataset.
Being the dataset with the largest number of variables in this study, it is likely to contain irrelevant variables and to benefit the most from feature selection. This expectation is
fulfilled for OLS regression. Feature selection reduced by half
the mean of residuals and all the other metrics show large
improvements as well.
But again, like on Desharnais dataset, BNs’ performance
did not improve convincingly with the application of feature
selection. There is a clear improvement on the hit-rates and
an improvement under Pred metric, but the other metrics show
that the increase of good predictions (i.e., predictions close to
the actual value) was offset by larger errors.
It is interesting to observe that when data did not undergo
feature selection, the performance of BNs is comparable to
the performance of OLS regression. But with the application
of feature selection OLS regression has a large improvement
in accuracy, as opposed to BNs, which do not show any
improvement. This highlights that the BNs models are missing
very significant improvements in accuracy which are expected
with the application of feature selection.
With regard to the logarithmic transformation, the results
show small improvements for BNs under all metrics but Pred
as opposed to the Desharnais dataset in which there was no
effect.
Like on Desharnais dataset, BNs clearly overcame the
baseline models. In our view, an important observation on this
dataset is the improvement with feature selection that is being
missed by BNs. We will discuss the reasons for this after
exposing all results.
TABLE VII
MODELS PERFORMANCE ON MAXWELL DATA SET

Predictor       Hit-rate   MMRE     MdMRE   Pred    MAR       MBRE
BNs             40.32%     86.18    58.77   24.19   4655.29   110.48
BNs+FSS         51.61%     97.5     55.99   27.41   4854.74   122.19
BNs+log         40.32%     73.41    52.72   19.35   4550.90   104.54
BNs+log+FSS     51.61%     70.67    53.88   25.81   4576.05   106.44
OLS+log         -          76.86    43.78   30      4932.6    101.19
OLS+log+FSS     -          42.57    28.62   40      2500.04   52.28
Mean model      -          119.67   52.96   19.35   5616.54   225.64
Median model    -          108.95   66.28   20.97   5654.11   180.91
Table VIII reports on the results for Cocomo81. On this
dataset, the logarithmic transformation did yield an observable improvement in the BNs' predictions, especially under MMRE. This suggests a decrease in large overestimates. We
can observe the difference in performance compared to OLS
regression grew in comparison to the previous datasets, even
though this effect can be slightly reduced by the application
of the logarithmic transformation.
Feature selection brought an improvement for OLS regression though not as pronounced as on Maxwell. For BNs, the
same pattern of improved hit-rates and no improvements under
the other metrics that was observed in the other datasets holds
on this dataset. This appears to be related to the skewness
of the datasets and the loss of precision brought about by
the discretization process. Skewness increases this imprecision
because it makes the classes more uneven. The logarithmic
transformation is only to some extent able to reduce this effect.
Nevertheless, even in this very skewed dataset they were
able to overcome both baseline models.
Table IX shows the frequency of underestimates and overestimates for each model over the three datasets. OLS models
have a tendency to underestimate, which is considered less
desirable than a tendency to overestimate.
The variables most frequently identified by the feature
selection algorithm were related to ‘Size’. In all datasets
studied here, a size variable was selected. This variable appears
to be frequently the one with the highest predictive value for
effort estimation.
TABLE VIII
MODELS PERFORMANCE ON COCOMO81 DATA SET

Predictor       Hit-rate   MMRE      MdMRE    Pred    MAR      MBRE
BNs             50.79%     134.85    58.64    25.81   551.95   197.82
BNs+FSS         55.56%     270.64    130.37   9.68    606.22   336.39
BNs+log         52.38%     91.19     53.64    19.35   536.54   233.15
BNs+log+FSS     55.56%     76.94     64.93    25.81   530.61   212.73
OLS+log         -          46.6      30.49    44.44   278      61.83
OLS+log+FSS     -          44.28     22.98    53.96   297.47   55.97
Mean model      -          1775.35   571.16   4.76    891.64   1905.81
Median model    -          235.42    86.25    15.87   642.63   842.24
We can observe in all of these results that feature selection
improved clearly and consistently the hit-rates of BNs and the
accuracy of linear regression over all datasets. This effect is
very pronounced on the Maxwell dataset which is the one
with the highest number of variables. Such improvements
are expected because the larger the amount of variables in a
dataset, the more likely it is for the dataset to contain irrelevant
or redundant variables. This emphasizes the importance of
applying the feature selection especially on datasets with
many variables. It also highlights the fact that many variables
in software projects datasets have a small predictive value
and can actually make the models less accurate. Therefore,
collecting a smaller number of variables while focusing on high data
quality may be more interesting for data-based predictions.
This finding is a confirmation of the findings of previous
studies, e.g., [23], [21] and [20].

Fig. 6. BNs missing expected improvements from FSS.
In spite of these clear improvements however, we can see
that the improvements in BNs' predictions, when measured by the continuous metrics, were small, or at times the accuracy even worsened. This is especially true on the Maxwell and Cocomo81 datasets, on which the predictions were significantly less accurate than without feature selection, contrary to what one would expect.
This contradiction is illustrated in Fig. 6, where we can see improvements in hit-rates and a degradation according to MBRE.
According to data mining literature, wrapper approaches like
the one applied here use the algorithm’s own accuracy measure
to assess the feature subset [22] [15]. And it is obvious
the BNs algorithm is not using this numerical conversion to
measure accuracy. The model selection is clearly favouring
the hit-rates. This brings into question the validity of hit-rates
as an accuracy measure or at least highlights its limitation.
Improved hit-rates were offset by larger magnitude errors,
i.e., fewer wrong predictions, but when the predictions were
wrong they were wrong by a larger margin. This could also
be seen in the confusion matrices, but they were omitted due
to lack of room. So, does the improved hit-rate really reflect
a more accurate model? In all these experiments, BNs ended
up missing the improvements expected from feature selection.
This could make a significant difference in Maxwell and
Cocomo81 datasets which are the ones with larger amounts
of variables.
It follows from this observation that an interesting development for BNs would be to investigate the feasibility of
incorporating this numerical conversion into the BNs algorithms and tools, using it as a measure of accuracy instead of
the hit-rates or error-rates. This modification could bring in
some improvements in the predictions and also in the effect
of the feature selection technique. The application of feature
selection would find improvements in overall accuracy even
if with lower hit-rates. As it is, the potential improvements
expected from feature selection are being wasted in the pursuit of higher hit-rates. Alternatively, a suggestion for future research is to experiment with other BNs search algorithms, score types and CPT estimators and check whether these
bypass this focus on hit-rates. In this study we restricted
ourselves to the K2 search algorithm [29] with Bayes method
for scoring of the networks and Simple estimator to estimate
the NPTs.
TABLE IX
FREQUENCY OF UNDERESTIMATES AND OVERESTIMATES

Prediction System   Overestimates (count)   Underestimates (count)
BNs                 110                     96
BNs + FSS           127                     79
BNs + log           99                      107
BNs + log + FSS     104                     102
OLS + log           86                      114
OLS + log + FSS     91                      109
We can also observe a trend in these results. BNs accuracy
degrades according to the datasets’ skewness. With increases
in skewness BNs struggle to predict accurately. BNs best
performance in these experiments was achieved in the least
skewed dataset, i.e., Desharnais. When the data is too skewed
the discretized classes become too uneven and there is an increased loss of precision with the largest discretized intervals.
The highest effort classes tend to be very sparse. An example is
the highest effort class defined for the Maxwell dataset which
spans a wider interval than all others put together (ranges
from 10000 to 64000 person-hours), thus being very imprecise.
Besides the effect on the discretization, there is also an effect
on the numerical conversion because even a small probability
of the highest effort class (Very High) affects the conversion
quite significantly. In Fig. 7 we illustrate this degradation by
dividing the error margin of BNs by the error of OLS, for each
dataset and according to two metrics. We can see that BNs’
error margin increases significantly in comparison to OLS as the skewness of the dataset increases under both metrics (datasets are sorted from left to right according to skewness).

Fig. 7. Accuracy degradation of BNs according to dataset skewness.
Much of the imprecision of the BNs can be ascribed to
the discretization process. This subject has been neglected
to some extent in this research field and the establishment
of guidelines on this could benefit research initiatives. The
imprecision brought about by the discretization process is
directly related to the skewness of the datasets. In this scenario
of highly skewed datasets, the equal-frequencies discretization
generates class intervals of very different widths and the
numerical conversion will show larger error margins. The
alternative of equal-widths discretization causes meaningless
results, for there will be empty or near empty classes and
the model learning will simply state the obvious, predicting
nearly always the same class which is the lowest effort class
since it contains most of the records. High hit-rates are not
only unsurprising but very likely when using equal-widths in
very skewed datasets. Unless a log transformation is applied to
the data, predictions based on skewed data discretized with the
equal-widths method yield misleading results. Related to these
findings are the results of [30], which compared equal-widths,
equal-frequencies and k-means discretization on a subset of
a well known dataset and concluded that equal-frequencies
with a log transformation can improve the accuracy results
according to most evaluation criteria. Further investigations
on discretization methods are necessary.
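The contrast between the two discretization strategies can be seen with a few lines of pandas (illustrative effort values only; pandas.cut builds equal-width bins, pandas.qcut builds equal-frequency bins):

```python
# Equal-width vs. equal-frequency discretization on a skewed effort variable.
import pandas as pd

effort = pd.Series([300, 450, 600, 800, 900, 1200, 1500, 2000, 2600, 9000, 64000],
                   dtype=float)   # made-up, heavily right-skewed efforts

equal_width = pd.cut(effort, bins=3)    # almost all projects fall into the lowest bin
equal_freq = pd.qcut(effort, q=3)       # roughly equal counts per bin, very uneven widths

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```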
An interesting undertaking was to investigate the effect of
the log transformation on the Bayesian classifier. Even though
a couple of studies used this transformation, we are not aware
of studies assessing its effects. The log transformation was
able to provide only slight improvements of accuracy. The
results show that in very skewed datasets, transforming the
data can be beneficial. Fig. 8 illustrates this improvement
according to MdMRE metric. As another suggestion for future
research, we observe that it would be interesting to try out
this data transformation with BNs that support continuous
variables since in these experiments much of the benefit of
performing this transformation appears to have been lost with
the discretization.

Fig. 8. Effect of log transformation on BNs (MdMRE).

These experiments on data-driven BNs are relevant because the way data is explored can have a significant impact on the model's performance. Much of the excitement over BNs revolves around their capability to integrate objective and subjective knowledge. Therefore, learning how to optimize the use of data (i.e., the objective part) can improve the performance
of not only data-driven BNs, but also hybrid BNs which appear
to be the most promising for this research field. Even though
BNs solely based on data may not become the most accurate
approach in software effort prediction, improvements on the
use of data for BNs benefit this technique as a whole and given
its relevancy in software engineering, these investigations are
necessary. Optimizing the performance of the data mining
capabilities of BNs is an essential part in the development
of this modelling technique.
Our results on these datasets are more optimistic for BNs
than the ones reported in [13], which were obtained on another
dataset. Our experiments show the BNs models struggle in
very skewed datasets but are still capable of achieving a
minimum standard of accuracy. In [13], most BNs, including
hybrid BNs, performed worse than the baseline models. Fig. 9
compares the BNs prediction systems to the baseline models
according to MBRE metric.
Fig. 9. Comparison between BNs and baseline models (MBRE).
From our studies on the literature and our own experiments,
we observe that it appears to be hard to overcome OLS
regression when it is properly applied. Our results on OLS
regression confirm the results of [21] and the results of [13].
While OLS regression does perform better with regard to
accuracy, one must observe that OLS regression, as a well-established statistical technique, is already optimized to its best. On the
other hand, we have shown in this study that techniques like
BNs have room for improvements and are under constant development. As BNs theory evolves and the tools catch up with
the developments, more accurate predictions will be possible.
Ideally, if data-driven BNs catch up with OLS regression, they
will be very advantageous due to their flexibility and powerful
experimenting features. When such a standard is achieved BNs
users will be able to trust that this technique explores data as
well as the most accurate data-based models.
Specifically, we have observed room for improvements
for BNs with regard to discretization techniques and experimenting with different model selection methods which could
provide improvements in accuracy under other metrics than the
hit-rates and also optimizing the effects of feature selection.
This appears to be a fundamental problem. Furthermore,
there are developments in data mining research concerning
support for ordinal and continuous variables. These could also
bring further improvements in accuracy. And besides these
improvements on BNs’ data mining capabilities, there are also
improvements concerning support for experts’ model building.
The BNs tools are currently a limitation [19]. The latest
developments are not available for most of the tools. In these
experiments we did not have the opportunity to experiment
with continuous variables nor with dynamic discretization. It
would be interesting to verify the improvements techniques
like dynamic discretization proposed in [31] could bring in.
Although WEKA offers validation advantages over other tools,
it does not have other developments from BNs theory. As we
already mentioned, an interesting development would be the
incorporation of the numerical conversion method. This conversion is not automated in the tools and it can be somewhat
cumbersome to perform, which may hinder its employment.
Having this conversion automated into the tools could be
interesting.
Some studies on BNs indicate that BNs’ main strength for
the software prediction area lies in their possibility to incorporate domain knowledge and qualitative factors, therefore
favouring hybrid or expert-driven approaches. Currently, an
advantage of data-driven models like these, as pointed out in
[20], is that by owning a projects dataset it is possible to
obtain quick predictions as supporting evidence for the expert’s
prediction, as opposed to expert based networks which take
much more effort to build and to have the NPT’s elicited. The
employment of data-based models to support expert estimates
has been indicated to practitioners as a means to increase
the safety and reliability of experts' estimates, since the situation with expert-based estimation has not been any easier than the situation seen in this research field.
Finally, an observation obtained with this study and the
difficulties in the field is that it is important to show faithful
and realistic results even if they are not positive towards a
particular technique. This research field has suffered in the last
twenty years due to over-optimism towards some techniques.
In recent years, efforts towards correcting inconsistencies and
addressing reasons for conflicting results are on the rise even if
these show a less than flattering state of affairs in the field. To
move forward it is important to recognize the actual situation
paving the way for improvements and solutions.
VI. C ONCLUSION
This study provided a sound assessment of automatic BNs
by means of a comparison with a well established statistical
technique and with benchmark models, thereby illustrating its
current limitations and possibilities of improvements. BNs’
limitations are discussed and some guidelines on its employment are provided. Specifically, the skewness of datasets
prevalent in this research field and the discretization are shown
to bring about inaccuracies that limit BNs’ effectiveness.
One suggestion arising from these observations and set forth
to the research community is to investigate the feasibility
of incorporating the numerical conversion into BNs model
building as we consider it portrays accuracy more faithfully
than the basic hit-rates. This could make BNs models generally
more accurate even if achieving lower hit-rates. Also, the
inclusion of this conversion in the tools would be interesting
for research undertakings.
We consider this study discusses important matters that
are scarcely discussed in software prediction studies and that
can be a source of confusion. Most studies have not devoted much attention to dataset properties and their implications for models' functioning. Shedding light on these somewhat
neglected topics is an important step to address some of the
current difficulties in the field. This study showed some of
the problems arising from the datasets in the field and the
constraints they impose especially on classifiers. Much of this
is related to the discretization process and the uneven classes
that it generates. We brought forward some points concerning
the exploration of data which we believe to be important for
the development of BNs.
There is a limit on how accurate data-driven prediction
techniques can be depending on the data used. Therefore, more
efforts should be addressed in studying software prediction
datasets properties and data pre-processing in order to increase
prediction accuracy. The performance of these models is
highly dependent on data quality, which is a subject that
has not received sufficient attention. Significant improvements
could come from investigations on this.
Our observations indicate that BNs have a potential for
data-based predictions but still need improvements to catch
up with the most accurate data-based models. In spite of the apparent advantage of linear models in this scenario, i.e., data-driven modeling, it must be observed that this is only part of
the potentiality of BNs. BNs offer experimenting possibilities
beyond that of linear regression. The linear regression method
can only provide a point estimate, whereas BNs meet other
requirements expected from a prediction model.
Furthermore, due to the human factors and inherent uncertainties in software projects, the capability to incorporate
expert’s subjective knowledge can provide an advantage over
models solely based on data. Bayesian Networks appear to be
one of the most suitable techniques for future progresses in this
aspect. BNs theory and tools are under constant development
and some technical breakthroughs regarding discretization
and NPT’s elicitation appear to herald progresses for BNs
in software prediction and software projects management in
general.
A. Future Work
A topic that could provide some improvements for the software prediction field and that warrants investigations is data
pre-processing. Carrying out this work we observed the impact
discretization, data transformations and feature selection can
have on the models’ performance. Moreover, we observed the
implications of and hindrances posed by the characteristics
of software projects datasets. In our view, discretization is a
topic that needs thorough investigations as there are currently
no guidelines on this.
In this work we applied a specific feature subset selection
technique (a Wrapper approach with the BestFirst algorithm). It
would be interesting to assess whether other feature selection
techniques can bypass this focus on hit-rates that this wrapper
approach demonstrated. Good improvements could be obtained if BNs could better extract the accuracy improvements
expected from feature selection.
Another suggestion is to experiment with other learning and
selection algorithms, as in this work we restricted ourselves to
the K2 search algorithm with Bayes method for scoring of the
networks and Simple estimator to estimate the NPTs. We have
the expectation that other algorithms could assess accuracy in
a different way, as in this study the algorithms were clearly
favouring the hit-rates, which we questioned as an accuracy
measure.
Furthermore, investigating BNs with continuous variables
and the related pre-processing procedures could yield interesting results.
Also, statistical significance tests could be performed to
enhance the validation of the results.
R EFERENCES
[1] B. W. Boehm, Software Engineering Economics. Englewood Cliffs,
NJ: Prentice Hall, 1981.
[2] N. E. Fenton and M. Neil, “A critique of software defect prediction
models,” IEEE Trans. Softw. Eng., vol. 25, no. 5, pp. 675–689, 1999.
[3] G. R. Finnie, G. E. Wittig, and J.-M. Desharnais, “A comparison
of software effort estimation techniques: using function points with
neural networks, case-based reasoning and regression models,” J. Syst.
Softw., vol. 39, no. 3, pp. 281–289, Dec. 1997. [Online]. Available:
http://dx.doi.org/10.1016/S0164-1212(97)00055-1
[4] M. Shepperd and C. Schofield, “Estimating software project effort using
analogies,” IEEE Trans. Softw. Eng., vol. 23, pp. 736–743, Nov. 1997.
[Online]. Available: http://dl.acm.org/citation.cfm?id=269857.269863
[5] J. Wen, S. Li, Z. Lin, Y. Hu, and C. Huang, “Systematic literature review
of machine learning based software development effort estimation
models,” Inf. Softw. Technol., vol. 54, no. 1, pp. 41–59, Jan. 2012.
[Online]. Available: http://dx.doi.org/10.1016/j.infsof.2011.09.002
[6] M. Korte and D. Port, “Confidence in software cost estimation results
based on mmre and pred,” in Proceedings of the 4th international
workshop on Predictor models in software engineering, ser. PROMISE
’08. New York, NY, USA: ACM, 2008, pp. 63–70. [Online]. Available:
http://doi.acm.org/10.1145/1370788.1370804
[7] C. Mair and M. J. Shepperd, “The consistency of empirical comparisons
of regression and analogy-based software project cost prediction,” in
Proceedings ISESE’05, 2005, pp. 509–518.
[8] T. Menzies, O. Jalali, J. Hihn, D. Baker, and K. Lum, “Stable rankings
for different effort models,” Automated Software Engg., vol. 17, pp.
409–437, Dec. 2010. [Online]. Available: http://dx.doi.org/10.1007/
s10515-010-0070-z
[9] M. Shepperd and S. MacDonell, “Evaluating prediction systems in
software project estimation,” Inf. Softw. Technol., vol. 54, no. 8, pp.
820–827, Aug. 2012. [Online]. Available: http://dx.doi.org/10.1016/j.
infsof.2011.12.008
[10] B. Kitchenham and E. Mendes, “Why comparative effort prediction studies may be invalid,” in Proceedings of the 5th International Conference
on Predictor Models in Software Engineering, PROMISE ’09. New
York, NY, USA: ACM, 2009, pp. 1–5.
[11] I. Myrtveit, E. Stensrud, and M. Shepperd, “Reliability and validity in
comparative studies of software prediction models,” IEEE Trans. Softw.
Eng., vol. 31, no. 5, pp. 380–391, May 2005. [Online]. Available:
http://dx.doi.org/10.1109/TSE.2005.58
[12] B. Kitchenham, L. Pickard, S. G. MacDonell, and M. J. Shepperd, “What
accuracy statistics really measure,” IEE Proceedings - Software, vol. 148,
no. 3, pp. 81–85, 2001.
[13] E. Mendes and N. Mosley, “Bayesian network models for web
effort prediction: A comparative study,” Software Engineering
IEEE Transactions on, vol. 34, no. 6, pp. 723–737, 2008.
[Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.
htm?arnumber=4589218
[14] I. A. Tierno and D. J. Nunes, “Assessment of automatically built
bayesian networks in software effort prediction,” Ibero-American Conference on Software Engineering, Buenos Aires - Argentina, pp. 196–
209, Apr. 2012.
[15] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining,
(First Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2005.
[16] I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine
Learning Tools and Techniques, 3rd ed. San Francisco, CA, USA:
Morgan Kaufmann Publishers Inc., 2011.
[17] M. N. N. Fenton and L. Radlinski, “Software project and quality modelling using bayesian networks,” in Artificial Intelligence Applications
for Improved Software Engineering Development: New Prospects. (Part
of the Advances in Intelligent Information Technologies (AIIT) Book
Series). Information Science Reference. ISBN: 978-1-60566-758-4,
2009, pp. 223–231, edited by: F. Meziane and S. Vadera.
[18] P. C. Pendharkar, G. H. Subramanian, and J. A. Rodger, “A probabilistic
model for predicting software development effort,” IEEE Trans. Softw.
Eng., vol. 31, no. 7, pp. 615–624, 2005.
[19] L. Radlinski, “A survey of bayesian net models for software development
effort prediction,” International Journal of Software Engineering and
Computing, vol. 2, no. 2, pp. 95–109, 2010.
[20] L. Radlinski and W. Hoffmann, “On predicting software development
effort using machine learning techniques and local data,” International
Journal of Software Engineering and Computing, vol. 2, no. 2, pp. 123–
136, 2010.
[21] K. Dejaeger, W. Verbeke, D. Martens, and B. Baesens, “Data mining
techniques for software effort estimation: A comparative study,” IEEE
Trans. Software Eng., vol. 38, no. 2, pp. 375–397, 2012.
[22] M. A. Hall and G. Holmes, “Benchmarking attribute selection techniques
for discrete class data mining,” IEEE Trans. on Knowl. and Data Eng.,
vol. 15, no. 6, pp. 1437–1447, 2003.
[23] Z. Chen, B. Boehm, T. Menzies, and D. Port, “Finding the right data for
software cost modeling,” IEEE Softw., vol. 22, no. 6, pp. 38–46, 2005.
[24] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and
I. H. Witten, “The weka data mining software: an update,” SIGKDD
Explor. Newsl., vol. 11, no. 1, pp. 10–18, 2009. [Online]. Available:
http://dx.doi.org/10.1145/1656274.1656278
[25] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters,
and B. Turhan, “The promise repository of empirical software
engineering data. Available at: <http://promisedata.googlecode.com>. Viewed Apr. 16th, 2013,” June 2012. [Online]. Available: http:
//promisedata.googlecode.com
[26] H. Liu, F. Hussain, C. L. Tan, and M. Dash, “Discretization: An enabling
technique,” Data Min. Knowl. Discov., vol. 6, pp. 393–423, Oct. 2002.
[Online]. Available: http://dl.acm.org/citation.cfm?id=593435.593535
[27] T. Foss, E. Stensrud, B. Kitchenham, and I. Myrtveit, “A simulation
study of the model evaluation criterion mmre,” IEEE Trans.
Softw. Eng., vol. 29, pp. 985–995, Nov. 2003. [Online]. Available:
http://dl.acm.org/citation.cfm?id=951850.951936
[28] Y. Miyazaki, A. Takanou, H. Nozaki, N. Nakagawa, and K. Okada,
“Method to estimate parameter values in software prediction models,”
Inf. Softw. Technol., vol. 33, no. 3, pp. 239–243, Apr. 1991. [Online].
Available: http://dx.doi.org/10.1016/0950-5849(91)90139-3
[29] G. F. Cooper and E. Herskovits, “A bayesian method for the induction
of probabilistic networks from data,” Mach. Learn., vol. 9, pp.
309–347, Oct. 1992. [Online]. Available: http://dl.acm.org/citation.cfm?
id=145254.145259
[30] M. Fernández-Diego and J.-M. Torralba-Martínez, “Discretization
methods for nbc in effort estimation: an empirical comparison based
on isbsg projects,” in Proceedings of the ACM-IEEE international
symposium on Empirical software engineering and measurement, ser.
ESEM ’12. New York, NY, USA: ACM, 2012, pp. 103–106. [Online].
Available: http://doi.acm.org/10.1145/2372251.2372268
[31] M. Neil, M. Tailor, and D. Marquez, “Inference in hybrid
bayesian networks using dynamic discretization,” Statistics and
Computing, vol. 17, pp. 219–233, 2007. [Online]. Available: http:
//dl.acm.org/citation.cfm?id=1285820.1285821