New methodologies towards an automatic optical
Transcrição
New methodologies towards an automatic optical
Ana Maria Rebelo New methodologies towards an automatic optical recognition of handwritten musical scores Dissertação de Mestrado Universidade do Porto Outubro 2008 Universidade do Porto Departamento de Matemática Aplicada Tese submetida à Faculdade de Ciências da Universidade do Porto para obtenção do grau de Mestre em Engenharia Matemática New methodologies towards an automatic optical recognition of handwritten musical scores Ana Maria Rebelo ([email protected]) Dissertação realizada sob a supervisão do Professor Doutor Jaime S. Cardoso (INESC Porto, FEUP) e sob a co-orientação do Professor Doutor Joaquim F. Pinto da Costa (DMA, FCUP). Porto, Outubro 2008 A painter paints pictures on canvas. But musicians paint their pictures on silence. Leopold Stokowski Abstract Many music works produced in the past are currently available only as original manuscripts or as photocopies. Their preservation has entailed a vast amount of research in the last years. The digitalization has been commonly used as a possible preservation method which comprises setting up a digital copy. Despite the fact of an easy accessibility in a machine-readable format, which encourages browsing, retrieval, search and analysis and, most importantly, the preservation of endangered works, while providing a generalized access to digital content, actually it only keeps a portion of the information. Several systems have been presented in order to fulll the lack of a tool able to analyse and perform semantic search operations. Carrying this task manually is very time consuming and error prone. While optical music recognition systems usually perform well on printed scores, the processing of handwritten musical scores by computers remains far from ideal. One of the fundamental stages in the optical music recognition is the detection and subsequent removal of sta lines. In this project we investigate a general-purpose, knowledge-free method for the automatic detection of music sta lines based on stable paths approach. Lines aected by curvature, discontinuities, and inclination are robustly detected. A sta removal algorithm is also developed by adapting an existing line removal approach to use the stable paths algorithm at the detection stage. Experimental results show that the proposed technique consistently outperforms well-established algorithms. The developed approach will be integrated, as future work, in a web based system providing seamless access to browsing, retrieval, search and analysis of submitted scores. Symbol detection is also crucial for a good performance in an optical music recognition system. In this work we propose a segmentation method based on the hierarchical decomposition of music sheets. The symbols are split into four dierent types to facilitate their extraction. Since we are working with handwritten scores many situations may occur. The work degradation, highly handwritten dierent types as well as other possible undesirable discontinuities are situations to take into account. One way to diminish these situations in the classication process is to create a database where several distortions are simulated. So, a new methodology is presented where elastic matching is used in conjunction with several classiers, for instance neural networks and hidden Markov models. A profound comparative study is made about classiers with and without the elastic matching applied to handwritten scores. It is expected that the obtained results outperform the actual state of the art. I Resumo Muitos dos trabalhos musicais produzidos no passado estão actualmente disponíveis apenas como manuscritos originais ou fotocópias. de pesquisas nos últimos anos. A sua preservação tem causado um vasto leque A digitalização tem sido usualmente utilizada como possível método de preservação em que se resume à criação de uma cópia digital. Apesar de constituir uma acessibilidade fácil num formato legível por computador, que permite a consulta, a recuperação, a procura e a análise, e mais importante, a preservação das obras em risco, enquanto promove um acesso generalizado ao material digital, na verdade apenas guarda uma parte da informação. Vários sistemas têm sido apresentados para preencher a falta de uma ferramenta que seja capaz de realizar análises e executar operações de pesquisa semântica. Executar esta tarefa manualmente é deveras dispendioso e susceptível a erros. Enquanto que os sistemas de reconhecimento óptico musical apresentam, geralmente, um bom desempenho para partituras impressas, para partituras manuscritas o processo recorrendo ao computador continua longe do ideal. Uma das etapas fundamentais em reconhecimento óptico é a detecção e subsequente remoção das linhas de pauta. Neste projecto é investigado um método de uso geral e conhecimento livre para a detecção automática das linhas de música baseado na aproximação do caminho estável. Linhas afectadas por curvatura, discontinuidades e inclinação são detectadas robustamente. Também é desenvolvido um algoritmo de remoção de linhas adaptando um processo já existente de remoção para usar o algoritmo do caminho estável na fase de detecção. Os resultados experimentais obtidos mostram que a técnica proposta supera consistentemente os algoritmos bem estabelecidos. A aproximação desenvolvida será agora integrada, num trabalho futuro, num sistema web, proporcionando um acesso à navegação, recuperação, pesquisa e análise das partituras submetidas. A detecção dos símbolos é também crucial para um bom desempenho de um sistema de reconhecimento óptico musical. Neste trabalho é proposto um método de segmentação baseado numa decomposição hierárquica da folha de música. Os símbolos são divididos em quatro tipos diferentes para facilitar a sua extracção. Como estamos a lidar com partituras manuscritas, diversas situações são susceptíveis de ocorrer. A degradação do trabalho, diferentes tipos de manuscritos, assim como outras possíveis descontinuidades não desejáveis são situações a ter em conta. Uma forma de diminuir estes problemas, no processo de classicação, é criar uma base de dados onde se simulam várias distorções. Neste sentido, uma nova metodologia é apresentada, onde se recorre à modelação elástica em conjunção com vários classicadores, como por exemplo, redes neuronais e cadeias de Markov escondidas. É apresentado um profundo estudo comparativo sobre estes classicadores com e sem modelação elástica aplicada às partituras manuscritas. É esperado que os resultados obtidos superem o estado actual da arte. Acknowledgements Was this project successfully accomplished? The period was long and sometimes dicult; however I believe that the answer is yes. I am thankful to a set of persons who made all this work possible and fun to perform. I came to INESC Porto to nish my undergraduate in Mathematical Applied to Technology and I ended up staying for another year for the realization of my Master course in Mathematical Engineering. I am sincerely grateful to Prof. Dr. Jaime S. Cardoso, my thesis supervisor, for giving me the opportunity to continue to work with him, by the trust in my work, by the availability and patience always demonstrated, for his guidance given with high rigor and professionalism and by his wise knowledge. I also thank my thesis co-supervisor, Prof. Dr. Joaquim Pinto da Costa, for the availability to help and guide with all the rigor and erudition. It was a great pleasure to work in the INESC Porto with a team as fabulous as the UTM that done everything to integrate me in the group as soon as possible. To all my colleagues who accompanied me on this journey and in one way or another supported me and encouraged me, my sincere thanks. I also do a special thank to Artur Capela, my colleague in the OMR project, for all the patience and support given, and to Prof. Dr. Carlos Guedes, professor from ESMAE associated with this Project. INESC Porto for providing the right environment for (Fundação para a Ciência e Tecnologia ) for nancial support. I thank high-quality research and FCT To my parents who made a huge eort to tolerate and help the realization of my dreams and aspirations. To my twin sister my greatest appreciation for all that she represents for me, for her support and the countless hours of complicity available. To D. Lurdes, a very special friend who made me face the diculties with courage and determination. Last but not least, I thank a person very special to me, accompanied me on this journey with great patience and spirit winner, helping me in this work with his critical and rigorous eye. Thank you for your presence at all times. I would like to nish these acknowledgments stressing that with determination and humbleness our dreams can take place. Ana Maria October, Silva Rebelo, 2008 III Contents Contents IV List of Figures VI List of Tables VIII Glossary X I Introduction 1 1 Introduction 3 1.1 OMR Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3 Contributions of this Project and Related Publications . . . . . . . . . . . . . . 7 1.4 Dissertation Structure 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Related Works 10 2.1 State of the Art in Sta Lines Detection . . . . . . . . . . . . . . . . . . . . . . 10 2.2 State of the Art in Musical Symbol Extraction and Classication . . . . . . . . 12 2.3 Background Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 II Detection 21 3 Sta Lines Detection and Removal 23 3.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 Underlying Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.3 Algorithm Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.4 Stable Paths on a Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.5 Proposed Algorithm 28 3.6 Design of the Weight Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.7 Sta Line Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.8 Database of Music Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.9 Evaluation Metrics and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV IIISegmentation and Classication 47 4 Segmentation and Classication Process 49 4.1 Musical Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.2 Segmentation Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.3 Classication Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 IVConclusions and Future Work 71 5 Conclusion 73 5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 References 76 V Appendix 81 A Fundamentals 83 A.1 Primal Problem vs Dual Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 83 A.2 Error Backpropagation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 85 A.3 Dalitz's Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 A.4 Matching in Bipartite Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 A.5 Otsu Threshold Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 A.6 Sta Line Removal Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 A.7 Baum-Welch Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 A.8 Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 B Table of confusion 97 B.1 Results Obtained Without Elastic Matching for the Handwritten Music Symbols 97 B.2 Results Obtained Without Elastic Matching for the Printed Music Symbols 99 B.3 Results Obtained With Elastic Matching for the Handwritten Music Symbols B.4 Results Obtained With Elastic Matching for the Printed Music Symbols C Articles Submited to Conferences . . . 101 . . . . 103 105 List of Figures 1.1 Generic system architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 OMR architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 A directed acyclic graphs and its linearization. . . . . . . . . . . . . . . . . . . . . 14 3.1 An exemplicative example of the methodology. . . . . . . . . . . . . . . . . . . . . 26 3.2 Stable paths on a toy example. 27 3.3 Exemplication of stable paths for Figure 3.1(a). 3.4 Length of a chord through a skeleton point at some angle 3.5 Example (from [DDCF08]). 3.6 Two examples of music scores from the test set used on the experimental evaluation. 3.7 Some examples of applied deformations from the original image: a) Original; b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ϕ. 28 . . . . . . . . . . . . . 32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 34 Curvature c) Degradation after Kanungo; d) Sta line thickness variation; e) Sta line y-variation; f ) Typeset emulation; g) Rotation; h) White Speckles; i) Sta line Interruptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.8 Generation of the deformed images and the respective Ground-Truth. . . . . . . . . 38 3.9 Examples of the results obtained using the sta line removal algorithm in our test set. 45 4.1 Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2 Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.3 Example: a) Tie above noteheads; b) Tie under noteheads. . . . . . . . . . . . . . 54 4.4 An exemplicative example of the variability in the music symbols. . . . . . . . . . 57 4.5 Segmentation process I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.6 Segmentation process II. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.7 An exemplicative example of two dierent connected components in the same bounding box. 4.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 An example of the existence of inconsistency in the beam thickness and in the link with the stems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Beam segmentation process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.10 Notes segmentation process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.11 An example of sharp detection on a real score. . . . . . . . . . . . . . . . . . . . . 60 4.12 An example of sharp detection on a ideal score. . . . . . . . . . . . . . . . . . . . . 61 4.9 4.13 An example of how the music symbols that are in the ideal scores were manually splited. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.14 Results of the error metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 VI 4.15 Results of the error metrics for the rotation deformation. . . . . . . . . . . . . . . . 63 4.16 Results of the error metrics for the curvature deformation. . . . . . . . . . . . . . . 63 4.17 Neural network topology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 A.1 Neural network architecture with three layers, two inputs and one output. . . . . . 87 A.2 Forward propagation of the signal in the neural network. . . . . . . . . . . . . . . . 88 A.3 Comparison between the signal output and the target. . . . . . . . . . . . . . . . . 88 A.4 Propagation of the error signal in backward mode in the neural network. . . . . . 89 A.5 The calculation of weights in the neural network. . . . . . . . . . . . . . . . . . . . 90 d List of Tables 3.1 Deformations in the images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Ranges of deformation parameters used in the tests: 3.3 Eect of dierent deformations on the overall sta detection error rates in percent- min:step:max. . . . . . . . . 35 37 age: average (standard deviation) of the false detection rate and miss detection rate. See [DDCF08] for parameter details. 3.4 . . . . . . . . . . . . . . . . . . . . . . . . . . Detection performance on real music scores in percentage: average (standard deviation). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5 r. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Eect of dierent deformations on the overall sta removal error rates in percentage: average (standard deviation) Sta line interruption error. 3.9 43 Eect of dierent deformations on the overall sta removal error rates in percentage: average (standard deviation) Individual pixels error. 3.8 42 Removal performance on real music scores (in percentage): average (standard deviation). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.7 41 Sta segment extraction errors based on the number of segments in an equivalence class 3.6 40 . . . . . . . . . . . . . 44 Eect of dierent deformations on the overall sta removal error rates in percentage: average (standard deviation) Segmentation region level. . . . . . . . . . . . . . . 45 4.1 Clefs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.2 Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3 Flags and beams. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.4 Rests. 53 4.5 Ties and Slurs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.6 Accents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.7 Time signatures. 56 4.8 Number of neurons in the hidden layer (not)using elastic matching method (EM) in . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . the dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.9 65 The number of states in the hidden Markov model with and without the elastic matching method (EM) in the dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 4.10 The values of the parameters C and γ 66 in the support vector machines with and without the elastic matching method (EM) in the dataset. . . . . . . . . . . . . . . 66 4.11 Classes list of handwritten and printed music symbols. . . . . . . . . . . . . . . . . 67 4.12 Results obtained with the test set in the classication process for the handwritten music symbols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VIII 68 4.13 Results obtained with the test set in the classication process for the printed music symbols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.14 Results obtained with the test set in the classication process for the handwritten music symbols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.15 Results obtained with the test set in the classication process for the printed music symbols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.16 The performed of the natural and sharp symbols. . . . . . . . . . . . . . . . . . . . 70 B.1 Table of confusion of the nearest neighbor classier. . . . . . . . . . . . . . . . . . 97 B.2 Table of confusion of the neural network. . . . . . . . . . . . . . . . . . . . . . . . . 97 B.3 Table of confusion of the support vector machines. . . . . . . . . . . . . . . . . . . 98 B.4 Table of confusion of the hidden Markov model. . . . . . . . . . . . . . . . . . . . . 98 B.5 Table of confusion of the nearest neighbor classier. . . . . . . . . . . . . . . . . . 99 B.6 Table of confusion of the neural network. . . . . . . . . . . . . . . . . . . . . . . . . 99 B.7 Table of confusion of the support vector machines. . . . . . . . . . . . . . . . . . . 99 B.8 Table of confusion of the hidden Markov Model. . . . . . . . . . . . . . . . . . . . . 100 B.9 Table of confusion of the nearest neighbor classier. . . . . . . . . . . . . . . . . . 101 B.10 Table of confusion of the neural network. . . . . . . . . . . . . . . . . . . . . . . . . 101 B.11 Table of confusion of the support vector machines. . . . . . . . . . . . . . . . . . . 101 B.12 Table of confusion of the hidden Markov model. . . . . . . . . . . . . . . . . . . . . 102 B.13 Table of confusion of the nearest neighbor classier. . . . . . . . . . . . . . . . . . 103 B.14 Table of confusion of the neural network. . . . . . . . . . . . . . . . . . . . . . . . . 103 B.15 Table of confusion of the support vector machines. . . . . . . . . . . . . . . . . . . 103 B.16 Table of confusion of the hidden Markov model. . . . . . . . . . . . . . . . . . . . . 104 Glossary OMR Optical Music Recognition SVM Support Vector Machine HMM Hidden Markov Model EM Elastic Matching LTH Line Track Height X Part I Introduction 1 Chapter 1 Introduction Music, from Greek µυσική (τ ´χνη ) musiké (téchne), which means the art of the muses, can be dened as an organized sequence of sounds and silences so as to produce aesthetic pleasure in the listener. There are evidences, by Pictographs, that music is known and practiced since prehistory. Over the years, the music expanded in many several music styles and for many dierent purposes, like educational or therapy. All known cultures have their own musical practice. Music is a pivotal part in the cultural heritage of any society. In this way, its preservation, in all of its forms, must be pursued. On the face of it, the Universal Declaration on Cultural Diversity adopted by the General Conference of UNESCO on 2001 asserts that cultural diversity is as necessary for humankind as biodiversity is for nature, and that policies to promote and protect cultural diversity thus are an integral part of sustainable development1 . Portugal has a notorious lack in music publishing from virtually all eras of its musical history. In spite of most of the original manuscripts of music known before twentieth century being kept in the national library in Lisbon, there is not any repository of musical information from the last century. Although there are recent eorts to catalogue and to preserve in digital form 2 the Portuguese music from the twentieth century notably the Music Information Center 3 and the section on musical heritage from the Institute of the Arts website most of the mu- sic pre-dating computer notation software was never published and still exists in the form of manuscripts or photocopies spread out all over the country in discreet places. The risk of irreversibly losing this rich cultural heritage is a reality that must be taken seriously and dealt with accordingly. Digitization has been commonly used as a possible tool for preservation, oering easy duplications, distribution, and digital processing. However, transforming the paper-based music scores and manuscripts into a machine-readable symbolic format (facilitating operations such as search, retrieval and analysis), requires an Optical Music Recognition (OMR) system. Unfortunately, the actual state-of-the-art of handwritten music recognition is far from providing a satisfactory solution. The project Automatic recognition of handwritten music scores initiated in 2007 by Instituto de Engenharia de Sistemas e Computadores do Porto (INESC Porto) and Escola Superior de Música e das Artes do Espectáculo (ESMAE) was the starting point for creating an OMR 1 http://www.unesco.org/bpi/eng/unescopress/2001/01-112e.shtml http://www.mic.pt 3 http://patrimonio.dgartes.pt/?lang=pt 2 3 Chapter 1. Introduction 4 system that addresses some of the identied problems which will be described in detail in the following sections. It is the aim of this project to overcome the problem of musical symbol recognition in handwritten scores through the research and application of the most recent techniques of machine learning and articial intelligence. Moreover, there is also the intention of creating a web-based system providing generalized access to a wide corpus of handwritten unpublished music en- coded in digital format. The database will not only centralize as much information as possible but will also serve to preserve the musical heritage in an innovative way with a wide range of possibilities [CCRG08a, CRCG08]. The architecture for the proposed system is represented in Figure 1.1. Figure 1.1: Generic system architecture. The system is composed by three dierent entities: repository, web server and web browser. Briey, the Repository module stores the original scanned score, the digital counterpart in MusicXML and all the descriptive metadata inserted by the user. All the remaining system contents, such as the user information, are also stored in this entity. The Web Server is the user access point to the system as well as to all of its processing modules run on the server, encompassing the search engine and the optical recognition engine for the musical scores. The Web Server interacts with the Repository and with the Web Browser, which establishes the interface between the user and the system. The user interface on a Web Browser allows the complete management of the musical scores and associated metadata, as well as carrying out the system administration. The project presented here is innovative for several reasons: 1. It has the ambition to develop a formalism to model, in a consistency way, the heterogeneous knowledge about the language and musical notation. 2. Aims to explore the wealth and the potential from the most recent techniques of machine learning and articial intelligence, not only for representing and merging knowledge, but also to make decisions. 1.1. OMR Architecture 3. Intends to digitalize and preserve a wide 5 corpus of handwritten scores in an unprecedented way. 4. Will include an OMR engine integrated with an archiving system and a user-friendly interface for searching, browsing and edition. The digitized scores will be stored in MusicXML, a recent and expanding music interchange format designed for notation, analysis, retrieval, and performance applications. 5. Intends to make accessible online the repository of handwritten scores for enjoyment, educational and musicological purposes. The work of this thesis, as carried out in this project, aims to develop new OMR algorithms for the web-based sytem. 1.1 OMR Architecture The principal aims of an OMR application are the recognition, the representation and the storage of musical scores in a machine-readable format. An OMR program should thus be able to recognize the musical content and make the semantic analysis of each musical symbol of a musical work. In the end, all the musical information should be saved in an output format that is easily readable by a computer. In fact, the architecture of an OMR system is dependent on the methods used in the segmentation and recognition steps. Generally, the OMR process can be divided into three principal modules (see Figure 1.2): 1. Recognition of musical symbols from a music sheet. 2. Reconstruction of the musical information to build a logical description of musical notation. 3. Construction of a musical notation model for its representation as a symbolic description of the musical sheet. It is important to note that this is a sequential architecture. In other words, the results obtained in one phase are the input to the next phase. The rst module is typically further divided into three stages: Ê Image pre-processing: it consists in the application of several techniques (e.g. binarization, noise removal, blurring, deskewing, among others) to make the recognition process more robust and ecient. Ë Sta lines detection and removal to obtain an image containing only the musical symbols. Ì Objects segmentation and basic symbol recognition. The second and third modules (Musical Notation Reconstruction and Final Representation Construction) are intrinsically intertwined. In the second module of musical notation reconstruction, the symbols primitives are merged to form musical symbols. Usually, in this step Chapter 1. Introduction 6 graphical and syntactic rules are used. This is important for the introduction of context information to validate and solve ambiguities from the last module of music symbol recognition. It is also produced a document where detected symbols are interpreted to assign them a musical meaning. In the third module of nal representation construction this document is converted into a format of musical description, such as MusicXML, that allows for storage. In this thesis I focused in the step of music symbol recognition. The system is composed by a preliminary stage of segmentation and a stage of classication of musical symbols. More specically, the segmentation phase is composed by sta line detection and removal, followed by the stage of musical symbol extraction. In the classication step four distinct classiers are used and an elastic matching method see Figure 1.2. This method is applied to all classiers to simulate the possibility of deformations in the musical symbols. Moreover, it also permits the enhancing of the test set. Figure 1.2: OMR architecture. 1.2 Objectives In order to overcome the limitations from the actual methods in musical symbol recognition on handwritten musical scores and to organize them hierarchically, I researched the application of the most recent techniques of machine learning and articial intelligence. I also developed algorithms for the musical symbol detection and recognition. The musical notation is one of the languages most widely known. It has had a continuous development over time, with requirements of consistency and precision. For this reason, the proposed methodology should be 1.3. Contributions of this Project and Related Publications 7 naturally adaptable to handwritten scores and to dierent standard musical notations. In general, the rst step in the process of handwritten musical scores recognition is sta lines detection. This operation is one of the most important phases in musical symbol recognition, because it determines the results for subsequent proceedings. With this purpose in mind we had the following objectives: Ê To test thoroughly the proposed algorithm for the sta lines detection using an appropriate database and error metrics, and compare this algorithm with others from the state of the art. Ë To create a new database of ideal musical scores from an existing one, where several controled deformations are applied to the ideal musical scores to simulate feasible problems of handwritten musical scores. Ì To use the proposed algorithm for the sta lines detection as a rst step in some state of the art sta removal algorithms and conduct a series of experiments using appropriate error metrics. The musical symbol extraction is the next step before the classication. process is based in existing algorithms [DDCF08]. recognize were split into four dierent types: The segmentation In this work, the symbols we want to the symbols that were featured by a vertical segment, the symbols that link the notes, the remaining symbols connected to sta lines and the symbols above and under sta lines. Since we are working with handwritten scores many situations may occur. The score degradation and highly dierent handwritten styles are possible undesirable situations that may occur. The objectives for this phase were: ¬ Research and development of an algorithm to extract the handwritten musical symbols. Simulate the variability with an elastic matching method in the music symbols, with the aim of preparing the classier for several dierent situations that may occur in handwritten music symbols. ® Compare and study dierent classiers for the musical symbols recognition. 1.3 Contributions of this Project and Related Publications This dissertation presents the following contributions for the preservation and the general access to musical and cultural heritage: 1. The introduction to the music analysis community of the algorithm of the sta lines detection based in the Stable Paths approach. 2. The creation of a database of real scores with its references: detection and removal references. 3. New algorithms for the automatic detection and classication of musical symbols. In this way, analysis and search of the score by the user will be more easily. Chapter 1. Introduction 8 4. Integration of the new algorithms in the web-based system being developed under the project. The work related with the Project where the dissertation belongs, already resulted in the publication of the followings papers: A Connected Path Approach for Sta Detection on a Music Score , Jaime S. Cardoso, Artur Capela, Ana Rebelo, Carlos Guedes, in the Image Processing (ICIP 2008). Integrated Recognition System for Music Scores , Artur Capela, Jaime S. Cardoso, Ana Rebelo, Carlos Guedes, in the IEEE International Conference on International Computer Music Conference (ICMC 2008). Sta Line Detection and Removal with Stable Paths , Artur Capela, Ana Rebelo, Jaime International Conference on Signal Processing and Multimedia Applications (SIGMAP 2008), pages 263270, 2008. S. Cardoso, Carlos Guedes, in Proceedings of the And it waits for the result of the submission of the: Optical Recognition of Music Symbols: a comparative study, Ana Rebelo, Artur Capela and Jaime S. Cardoso, in (IJDAR 2008). International Journal of Document Analysis and Recognition Sta Detection with Stable Paths, Jaime S. Cardoso, Artur Capela, Ana Rebelo, Carlos Guedes, Joaquim Pinto da Costa, in Intelligence (TPAMI 2008). IEEE Transaction on Pattern Analysis and Machine During the second semester of the academic year 06/07 one more paper was written. The rst version of the proposed algorithm has already been presented and published: A Shortest Path Approach for Sta Line Detection , Ana Rebelo, Artur Capela, Joaquim F. Pinto da Costa, Carlos Guedes, Eurico Carrapatoso, Jaime S. Cardoso, in Proceedings International Conference on Automated Production of Cross Media Content for Multi-channel Distribution (AxMedis 2007), pages 7985, Nov. 2007. of the 1.4 Dissertation Structure This dissertation is organized in 5 chapters that describe the work developed in the last year. It also has a set of appendices with complementary information that will help the description of the work done. After this introductory Chapter, the description of the works related with this project will be made in Chapter 2. It is constituted by three states: a presentation of the state of the art of the algorithms in the sta lines detection area and the already existing algorithms in the eld of sta lines removal; a presentation of the state of the art of the algorithms in the musical symbols classication area; and a background presentation of the several methods used in the classication phase. In Chapter 3, the proposed stable paths algorithm to detect the sta lines is described. The database used, the error metrics applied in the several experiments 1.4. Dissertation Structure 9 done to test the stable paths algorithm, and the results obtained are also presented. In chapter 4, the segmentation and classication phase is presented. A brief description of the musical symbols that the algorithm tries to recognize is also described. Finally, conclusions are drawn and future work is suggested in chapter 5. Chapter 2 Related Works The investigation in the OMR eld began with Pruslin [Pru66] and Prerau [Pre70]. However, it was only in the decade of 1980's, when the equipment of digitalization became accessible, that work in this area has expanded [Car89, Ng95, Coü96, Bai97]. Over the years, there have been appearing several commercial OMR software, but no one has a satisfactory performance in terms of precision and robustness. The complexity of the OMR task caused by the bidimensional structure of the musical notation and also by the existence of several combined symbols organized around the note heads have been conditioning the progress in this area. Until now, even the most developed recognition systems (PIANOSCAN, NOTESCAN in Nightingale, MIDISCAN in Finale, PHOTOSCORE in Sibelius, SMARTSCORE, SHARPEYE, etc) can not identify all music notations. Besides that, classic OMR is more focused in regular printed music sheets; so, a good eciency is only obtained when this type of scores are processed. In this thesis, the recognition of standard handwritten music is the goal. With these manuscripts emerge new and dierent additional problems comparatively with printed music. Handwritten musical scores tend to be rather irregular and determined by the authors' own writing style. Moreover, if we consider that most of these works are old, the quality of paper in which it is written might have degraded throughout the years, making it a lot harder to correctly identify its contents. Handwritten musical scores are likely to have changes of size, shape and intensity of handwritten symbols by the same author into the same score. and recognition process is even more complicated. As a result, the detection Furthermore, problems exist not only in this level but also in the detection of the handwritten sta lines. These are rarely straight and horizontal, and are not parallel to each other. For example, some staves may be tilted one way or another on the same page or they may be curved. Depending on the level of the paper degradation there may also exist discontinuities in the sta lines and in the remaining symbols. In general, these scores have many years of existence, and therefore there is a sharp decay in the quality of the paper and ink. Many works are also photocopies from photocopies which adds noise through the degree of copy. 2.1 State of the Art in Sta Lines Detection One important stage in OMR is sta line detection. The reason for that is the possibility to isolate the musical symbols present in the score enabling their extraction. Sta lines detection 10 2.1. State of the Art in Sta Lines Detection 11 is, consequently, one of the fundamental steps in all OMR processes, being the following proceedings very dependently on the performance and results obtained in this initial phase. In doing so, any OMR algorithm begins with this operation. The problem of sta line detection is often considered simultaneously with the goal of their removal, although exceptions exist [Mat85, MO07, Szw05, Pug06]. The simplest approach consists in nding local maxima in the horizontal projection of the black pixels of the image + [KI90, Fuj04, RCF 93, MN96, BBN01, TSM06]. These local maxima represent line positions, assuming straight and horizontal lines. Several horizontal projections can be made with different image rotation angles, keeping the image in which the local maxima are bigger. This eliminates the assumption that the lines are always horizontal. identifying sta lines is to use vertical scan lines [Car89]. An alternative strategy for This process is based on a Line Adjacency Graph (LAG). LAG searches for potential sections of lines: sections that satisfy criteria related to aspect ratio, connectedness and curvature. More recent works present a nearly sophisticated use of a combination of projection techniques in order to improve the basic approach [Bai97, BBN01, RB05]. Fujinaga [Fuj04] incorporates a set of image processing techniques in the algorithm, including run-length coding (RLC), connected-component analysis, and projections. After applying the RLC to nd the thickness of sta lines and the space between the sta lines, any vertical black run that is more than twice the sta line height is removed from the original. Then, the connected components are scanned in order to eliminate any component whose width is less than the sta space height. After a global deskewing, taller components, such as slurs and dynamic wedges are removed. Other techniques for nding sta lines include the grouping of vertical columns based on their spacing, thickness and vertical position on the image [RP96], rule-based classication of thin horizontal line segments [Mah82], and line tracing [Pre70, RT88]. The methods proposed in [MO07, Szw05] operate on a set of sta segments, with methods for linking two segments horizontally and vertically and merging two segments with overlapping position into one. Dalitz [DDCF08] is an improvement on these methods. In spite of the variety of methods available, they all suer from some limitations. In particular, lines with some curvature or discontinuities are inadequately resolved. The dash detector [LCL93] is one of a few works that try to handle discontinuities. The dash detector is an algorithm that searches the image, pixel by pixel, nding black pixel regions that it classies as stains or dashes. Then, it tries to unite the dashes to construct lines. A common problem to all the above-mentioned techniques is that they try to build sta lines from local information, without properly incorporating global information in the detection process. To our knowledge, none of the proposed methods in the literature tries to dene a reasonable process from the intrinsic properties of sta lines, namelly the fact that they are the only extensive black objects on the music score. Usually, we argue that the most interesting Chapter 2. Related Works 12 techniques arise when one denes the detection process as the result of optimizing some global function. In this thesis, we suggest a graph-theoretic framework where sta lines are the solutions of a global optimization process. 2.2 State of the Art in Musical Symbol Extraction and Classication The segmentation task has been the object of study by some authors [Mah82, RT88, Car89, RB05, TSM06]. This process deserves some careful attention because it interferes with the classication stage. Major problems result from the diculty to obtain individual meaningful objects. This is due to the printing and digitalization as well as to the paper degradation and the lack of a standard notation. In addition there are distortions caused by sta lines, broken and touching symbols as well as high density of symbols, with dierent sizes and shapes. Besides, few research works have been done in handwritten scores [FLS05]. The most usual approach consists in extracting elementary graphic symbols: note heads, rests, dots, etc, that can be composed to build music notation. Usually, the primitive segmentation step is made along with the classication task [Mah82, RT88, Car89, KI90, RB05, TSM06], however exceptions exist [Fuj04, BBN01, Pug06]. In [Mah82] and [Car89] it is applied the same technique used in sta lines detection to the detection of musical symbols. Mahoney [Mah82] builds a set of candidates to one or more symbols types and then use descriptors to select the matching candidates. This technique is not feasible for handwritten scores, because the manuscript symbols do not have stable shapes and sizes. the LAG to extract symbols. Carter [Car89], in his turn, uses The objects resulting from this are classied according to the bounding box size, to the number and organization of their constituent sections. Once again this process of classication is not promissing for handwritten symbols, for the reasons presented above. Other authors [Fuj04, BBN01] have also chosen to apply projections to detect symbols primitives. The recognition is done using features extracted from the projection proles. In [Fuj04] the k-nearest neighbor rule is used in the classication phase, whether neural networks is the classier used in [BBN01]. + Randriamahefa [RCF 93] proposed a structural method based in construction of graphs for each symbol. These are isolated by using a growing region method and thinning. Template matching is adopted in [Mat85, MN96, Ros02, RB05, TSM06]. In [Ros02, RB05] a fuzzy model that depends on a robust symbol detection and template matching results was developed. This method aims to deal with uncertainty, exibility and fuzziness at a symbol level. In fact in this approach the segmentation process has two steps: individual analysis of musical symbols and fuzzy model construction. In the rst step, the vertical segments are detected by region growing method and template matching. Then, beams are detected by a region growing algorithm and a modied Hough transform. The remaining symbols are extracted again by template matching. From this rst step results three recognition hypotheses; each assigning the pattern to a possible class. The fuzzy model is used to make a consistent decision. The proposed process 2.3. Background Knowledge 13 incorporates graphical and syntactic rules. Besides, it allows the possibility of applying learning procedures when potential errors occur, in an eort to gain robustness. Other techniques for extracting and classifying musical symbols include rule-based systems to represent all the musical information [RT88], a collection of processing modules that communicate by a common working memory [KI90] and pixel tracking with template matching [TSM06]. Toyama [TSM06] checks for coherency in the primitive symbols detected by estimating touching positions. This evaluation is done by music writing rules. Coüasnon [CC93, Coü96] proposed a recognition process entirely controlled by grammar which formalizes the musical knowledge. In [RP96] the segmentation process involves three stages: line and curves detection by LAG, accidentals, rests and clefs detection by a character prole method and note heads recognition by template matching. The contextual recognition is done by graph grammars. In [Pug06] the segmentation task is based in Hidden Markov Models. This process performs segmentation and classication simultaneously. However, this technique results only for very simple scores, that is, scores without slurs or more that one symbol in the same column and sta. We can conclude, based on this brief description of the state of the art, that although the OMR systems already contribute to the analysis of musical scores by several methods, it is still far from solved the problems imposed by handwritten scores. Therefore, new research and new algorithms are important and necessary. By all the works already presented, only the fuzzy model proposed by [RB05] lead us to deduce that maybe it is a good method to take into account in a future work. It will be enriching for this project to do a comparative study of this method with elastic matching and also with a segmentation process proposed in this thesis. 2.3 Background Knowledge In this Section we review some concepts necessary for a better understanding of the work presented in the course of this thesis. 2.3.1 Graphs A graph is a pair subsets of V. G G = (V, E) of sets such that is composed of two sets the set of edges (or lines) (p, q), p, q ∈ V V and E ⊆ [V ]2 1 E. V the elements of E are 2-element is the set of vertices (or nodes), and E is [Die05]. on V . V (G) is the vertex set of a graph G and E(G) its edge The number of vertices of a graph G is its order, |G|, and ||G|| is the number of edges. A vertex v is incident with an edge e if v ∈ e; then e is an edge at v . Endvertices or ends are two vertices incident with an edge, and an edge joins its ends. If x ∈ X and A graph with vertex set V is called a graph 2 set . y ∈ Y, then xy is an set of all the edges in 1 X−Y E edge. at a vertex E(X, Y ) v is the set of all is denoted by E(v). X−Y edges in a set Two vertices x, y E. The of a graph G [V ]k ≡ the set of all k-element of V . Note that these conventions are independent of any names, for instance if we have a graph H = (W, F ) the vertex set w is still referred to as V (H), not as W (H). 2 Chapter 2. Related Works 14 are adjacent, if xy is an edge of G. The graph is weighted if a weight the edges are directed, i.e., Two edges e 6= f w(p, q) is associated to each edge, (p, q) 6= (q, p). e an initial vertex init(e) A path is a (non-empty) graph xi where the x1 ,. . .,xk−1 are all distinct. P = (V, E) are the inner The init : E → V and a terminal vertex V = {x0 , x1 , . . . , xk } end vertices of P. and it is called a In other words, a digraph is a pair sets of vertices and edges together with two maps every edge are adjacent if they have an end in common. and ter(e). (V, E) ter : E → V digraph if of disjoint assigning to of the form E = {x0 x1 , x1 x2 , . . . , xk−1 xk } , x0 vertices The and path cost xk are linked by P and the vertices is the sum of each arc weight in the path. The length of a path is the number of its edges, and the path of length k is denoted by connected graph G is a graph where two of its vertices are linked by a path graph D is oriented with an orientation of a graph G if V (D) = V (G) and P k . A non-empty in G. A direct E(D) = E(G), and if {init(e), ter(e)} = (x, y) for every edge e = xy . In graph theory, the shortest-path problem seeks the shortest path connecting two nodes; ecient methods are available to solve this problem, such as the dynamic programming algorithms. Dynamic Programming Before presenting the operations of dynamic programming techniques let us take a look to the following example [SDV06]. Figure 2.1 represents a directed acyclic graphs and its linearization (topological ordering the nodes are arranged on a line so that all edges go from left to right). This linearization process is important for the shortest path. 4 A C 1 E start 3 5 1 2 6 B D 6 3 E start 2 B 5 4 A C 1 D 1 Figure 2.1: A directed acyclic graphs and its linearization. Suppose we want to gure out distances from node let's focus on node C. E to the other nodes. For concreteness, The only way to get to it is through its predecessors, B or A; so to nd 2.3. Background Knowledge the shortest path to C, 15 we need only compare these two routes: dist(C) = min {dist(B) + 3, dist(A) + 4} . A similar relation can be written for every node. If we compute these to-right order of Figure 2.1, we can certainly get to a node information needed to compute dist(v). v, dist values in the left- and then we already have the We are therefore able to compute all distances in a single pass: initialize all dist(·) values to ∞ dist(s) = 0 for each v ∈ V {s}, in linearized order: dist(v) = min(u,v)∈E {dist(u) + l(u, v)} In face of it, we can note that this algorithm is solving a collection of subproblems, it starts with the smallest of the distances, dist(e), {dist(u) : u ∈ V }: since it immediately knows its answer to be 0; then, it proceeds with progressively larger subproblems distances to vertices that are further and further along in the linearization where it is thinking of a subproblem as large if it needs to have solved a lot of other subproblems before it can get to it. This is a very general technique. At each node, the algorithm computes some function of the values of the node's predecessors. In this case, the particular function is a minimum of sums. This is dynamic programming. This is a powerful algorithmic paradigm for eciently solving a wide range of search and optimization problems which exhibit the characteristics of overlapping subproblems (the problem can be broken down into subproblems which are reused several times) and optimal substructure (optimal solutions of subproblems can be used to nd the optimal solutions of the overall problem). Now, it is crucial to know what the subproblems are when we are solving a problem by dynamic programming. Each node will represent a subproblem, and each edge will represent a precedence constraint, of the form (i − 1, j) → (i, j), (i, j − 1) → (i, j), and (i − 1, j − 1) → (i, j), on the order in which the subproblems are tackled. We can also put weights on the edges. 2.3.2 Classication Methods Dierent approaches were evaluated in this work for the classication problem: hidden markov models, support vector machine, neural networks and k-nearest neighbor. Besides that, concerning the fact that we are trying to classify handwritten musical symbols, we state one main problem: the enormous variations in symbols. Therefore, the classication process needs to take into account the variability in the writing style. An Elastic Matching method [Nis95, JZ97] was used to expand our database to achieve a better approach for these several variances. In the next subsections a brief explanation of these methods will be provided. Chapter 2. Related Works 16 Hidden Markov Models Hidden Markov Models (HMMs) have rarely been used in OMR except in experiences made by [KPM96, MMM04, Pug06]. The application of this technique in musical symbol classication had its origins on optical character recognition. One of the reasons for the use of HMM lies in its capability to perform segmentation and recognition at the same time. Limited by time, for now, HMM were only used to recognize the symbols. a doubly stochastic process, that generates symbols sequence, with an underlying stochastic process that is hidden and can only be detected through another process whose realizations are observable [BS00]. The hidden process consists of a set of states connected to each A HMM is other by transition probability. Left-right HMMs were used here. Transitions probabilities from a state i to another state i, j ≤ N . j are given by A = {aij }, where aij = P [qt+1 = Sj |qt = Si ] , 1≤ The observed process consists of a set of outputs or observations. Each observation is contained in a state with some probability density function. The set of observations probabilities is given by bj (k) B = bj (k), where bj (k) = P [ot = xk |qt = Sj ], 1 ≤ k ≤ M , j = 1, 2, ..., N . represents the probability of the observation in time where t π and qt represents the state in time t. xk in state Sj , oj denotes the observation HMM can now be formulated as λ = (A; B; π), is a set of initial probabilities of states [WFJC01]. Support Vector Machines Support Vector Machines (SVMs), pioneered by Vapnik [Vap98], deal with the problem of classication as a problem of quadratic optimization. This technique has as main idea the construction of a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized. Support vector machines classify the data using support vectors [Hay98]. Without loose of generalization, SVMs try to maximize the margin of the optimal hyperplane which separates the data. This is typically done in a much higher dimension than that of the original feature space. Formally, given the training set and corresponding binary class labels dened by g(x) = bias parameter. have di solving [wt ϕ(x) x wt ϕ(x) + b where w,b s.t 1 or to if denotes a xed-feature space transformation and g(x) > 0 −1 if g(x) < 0. b a This is equivalent to Summarizing, maximizing the margin is equivalent to 1 t ww 2 di xi ∈ RP the linear separable optimal hyperplane is + b] ≥ 1, i = 1, . . . , N . min with input data di ∈ {−1, 1}, ϕ(x) is assigned to class {xi , yi }N i=1 [wt ϕ(x) (2.1) + b] ≥ 1, i = 1, . . . , N If the training classes are not linearly separable, it means that above conditions and problem formulation can not be sustained. For this reason, slack variables ξi , i = 1, . . . , N were added. These allowed to have a penalty for the data points wrongly classied. Finally, the objective 2.3. Background Knowledge 17 is to minimize the error. That is, N X 1 t w w+C ξi 2 min w,b,C,ξi i=1 di s.t where the parameter C>0 [wt ϕ(x) ξi ≥ 0 (2.2) + b] ≥ 1 − ξi , i = 1, . . . , N controls the trade-o between the slack variables and the margin. In the feature space it is easier to solve the dual problem, and sometimes it is the only way to train the support vector machines. It is possible to formulate the dual problem 3 for a sample N of training {xi , yi }i=1 not separable as follows: N X max α i=1 PN N N 1 XX αi − αi αj di dj k(xi , xj ) 2 i=1 j=1 (2.3) i=1 αi di = 0 s.t 0 ≤ αi ≤ C where k(xi , xj ) = ϕT (xi )ϕ(xj ) = is the l Pm1 i = 1, 2, ..., N l=0 ϕl (xi )ϕl (xj ), component in the application ϕ(xi ) of i = 1, 2, ..., N xi ; m1 and j = 1, 2, ..., N . ϕl (xi ) is the dimension of the feature space. Above a binary classier was described whereas in this work a multiclass problem is presented. In fact, several works have been suggested as an extension to the binary problem classication that originally was proposed [eeFc02, HL02, MA98]. Usually, there are two types of approach for the multiclass classication problems. One is to build and to combine several binary classiers one-against-one and one-against-all - and another approach is to consider directly a resolution of the quadratic problem. The one-against-one methodology was used in this work. This method consists of training the j th k th and classes in the following binary problem N min wjk ,bjk ,C,ξijk s.t with l training data class of xi ; X jk 1 jk t jk (w ) w + C ξi (wjk )t 2 i=1 yi [(wjk )t ϕ(x) ξijk ≥ 0 (x1 , y1 ), . . . , (xl , yl ) the training data xi (2.4) + b] ≥ 1 − ξijk , i = 1, . . . , N where xi ∈ <n , i = 1, . . . , l, and yi ∈ {1, . . . , k} are mapped to an high dimensional space by the funtion The three most common types of inner-product kernels for SVMs are: machine, radial-basis function network and tangent hyperbolic. is the ϕ. polynomial learning In this work a radial-basis function network was used, given by: k(x, xi ) = exp(−γ||x − xi ||2 ), γ ≥ 0 3 See Appendix A.1 to see the demonstration. (2.5) Chapter 2. Related Works 18 Neural Networks The term neural network was initially studied with the aim to represent the information pro- cessing in biological systems [MP88, WH88, RHW88]. The attempts were to produce intelligent perception and cognition machines by simulating the physical structure of human brains. In our days, the principles and algorithms of neural networks have found several applications in diverse elds including pattern recognition and signal processing. Neural networks are composed by interconnected neurons. These are the information processing units. The neural model is form by three basic elements: a set of connecting links (each one is multiplied by a weight), an adder (for summing the input signals) and an activation function (for limiting the amplitude of the output of a neuron) [Hay98]. Formally, we can dene the output of a neuron by m X yk = ϕ( ωkj xj + bk ) (2.6) j=1 where x1 , x2 , . . . , xm are the input layers, is the activation function. ω are the weights of neuron k , bk is the bias and ϕ(.) There are three common types of activation functions: threshold function, piecewise-linear function and sigmoid function. In this work, the last one was used , whose graph is s-shaped. This way, the function accepts input values between and return values between 0 and 1. Hence, it is dened by f (x) = 1 , where a > 0 1 + exp(−ax) −∞ and represents the slope ∞, (2.7) A multilayer feedforward neural network was used. Typically, this network consists of input and output layers of neurons and one or more hidden layers that are not part of the input or output of the network. The input signal propagates through the network in a forward direction. 4 The learning algorithm used to train this network was the backpropagation algorithm . This algorithm is based on the gradient descending technique to minimize the cost function. In a simple explanation, the error backpropagation learning consists of two passes through the layers of the network: a forward pass and a backward pass. In the forward pass the link weights of the neurons are all xed. The input vector is applied to the sensory nodes of the networks and it is produced a set of outputs. In the backward pass the link weights are all adjusted in accordance with an error-correction rule. Basically, the output values of the network are subtracted from a desired response (targets) to produce an error signal. This error signal is then propagated backward through the network to all neurons. A network with K outputs was used, one corresponding to each class, and target values of 1 for the correct class and 0 otherwise. K-nearest Neighbour K-nearest neighbour is a supervised learning method for classifying objects based on training examples in the feature space [Fuk90]. 4 See Appendix A.2 for more details. This algorithm belongs to a set of techniques called 2.3. Background Knowledge 19 Instance-based Learning. The K-nearest neighbour algorithm is very simple and basically it does not have train. It starts by extending the local region around a data point k th nearest neighbour is found. The most represented class in the k -closest the predicted class. Data training lies only in the estimation of the best k. x until the samples denes In this work the Euclidean distance was used : deuclidean (a, b) = Elastic Matching qP d i=1 (ai − bi )2 Several research methods exist in deformable template eld applied on handwritten digits and handprinted characters recognition (e.g. [LS88, Wak94, Nis95, JZ97]). Lam [LS88], one of the rst works in this area, proposed a method of recognition in two-stages. The images are rst recognized by a tree classier. Those which cannot be satisfactory assigned to a class are passed to a matching algorithm, which deforms the image to match with a template. In Nishida [Nis95] a grammar-like model for applying deformations in primitive strokes was developed. A new and robust shape-matching approach to recognize numbers manuscripts was proposed by Wakahara [Wak94]. The method uses successive local ane transformation (LAT) operations to gradually deform the image. The aim is to yield the best match to an input binary image. LAT on each point at one location is optimized using locations of other points by means of least-squares data tting using Gaussian window functions. The deformation and matching technique used in this work to classify the musical symbol is based in [JZL96, JZ97]. In this approach, the image is mapped on a unit square The points in this square are mapped by the function displacement functions is given by S = [0, 1]×[0, 1]. (x, y) → (x, y) + D(x, y). The space of exmn (x, y) = (2 sin(πnx) cos(πmy), 0) (2.8) eymn (x, y) = (0, 2 sin(πny) cos(πmx)) (2.9) Specically, the deformation function is chosen as follows: M X N x ex + ξ y ey X ξmn mn mn mn D(x, y) = λmn (2.10) m=1 n=1 where x , ξ y ), m, n = 1, 2, . . .} ξ = {(ξmn mn orthogonal basis. Because coecients of density on ξmn 2 variance σ . can represent complex deformations by choosing dierent and dierent values of M and N, it is important to impose a probability D(x, y). pendent along the D(x, y) are the projections of the deformation function on the x Therefore, the and y ξmn 's are assumed to be independent of each other, inde- directions and identically Gaussian distributed with zero mean and Part II Detection 21 Chapter 3 Staff Lines Detection and Removal∗ Sta line detection and removal are the rst fundamental stages in many systems of optical music recognition, with subsequent processes relying heavily on their performance. The reasons for detecting and removing the sta lines lie on the need to isolate the musical symbols for a more ecient and correct detection of each symbol present on the score. For the sta lines to be properly removed and the symbols correctly detected it is necessary to make a correct detection of the sta lines. When we are dealing with printed and regular scores the process is much simpler. However, in the case of handwritten music sheets, which is the case of study of this dissertation, several vicissitudes, already listed in Chapter 2, may occur. For this reason, despite the multitude of attempts to treat the problem of sta lines detection, when we are working with old and handwritten music scores, the results are not completely satisfactory yet. In this dissertation a new conceptualization to detect the sta lines is presented. The proposed + paradigm begins with the work done in [RCC 07]. The main idea is to consider the sta lines as the result of the shortest path between the two margins of the music sheet, giving preference to black pixels. The initial version only implemented a heuristic of this basic idea, without a principle properly reasoned and tested. The proposed new learning methodology is extended and explored in various directions in this thesis. Firstly, the concept of stable paths is introduced in order to improve the computational performance of the method. Secondly, the design of the weights on the graph resulting from the music score is generalized to dierentiate black pixels belonging to the sta lines from black pixels resulting from the music symbols. Finally, the post processing is rened, improving the overall performance. A further development is the study of new sta removal algorithms, by incorporating the proposed sta line detection on standard sta removal algorithms. It is important to state that despite this Project being in the eld of recognition of standard musical notation it is not necessary, for the sta line detection and removal phase, to be conned only to this type of notation. Hence, a database was built. This database contains not only several symbols with dissimilar shapes and intensities, but also a varying number of lines per sta. ∗ Some portions of this chapter appears in [CCR+ 08, CCRG08b]. 23 Chapter 3. Sta Lines Detection and Removal 24 3.1 Basic Concepts First of all, it is necessary to indicate that the paradigm described considers that a sta line can be seen as a connected path from the left side of the music score to the right side. As sta lines are almost the only extensive black objects on the music score, the path we are looking for is the shortest path between the two margins if paths (almost) entirely through black pixels are favoured. Consequently, the image grid the music sheet is considered as a graph with pixels as nodes (vertices) and edges connecting neighbouring pixels. The weight w(p, q) of each arc is a function v1 of the pixels values and their relative positions. A path from vertex (pixel) vn v1 , v2 , . . . , vn , is a list of unique vertices with vi and vi+1 to vertex (pixel) corresponding to neighbour pixels. The total cost of a path is the sum of each arc weight in the path and is given by n X w(vi−1 , vi ). (3.1) i=2 A path from a source vertex minimum among all u on a graph, v -to-u d(v, u), v to a target vertex is the total cost of a shortest path between v to a sub-graph if its total cost is minimum among all to a sub-graph path if its total cost is v paths. The distance between a source vertex A path from a source vertex v u is said to be a shortest Ω, d(v, Ω), v -to-u Ω v and and a target vertex u. is said to be a shortest path between paths, where u ∈ Ω. v and Ω The distance from a node is the total cost of a shortest path between v and Ω and it is represented by: d(v, Ω) = min d(v, u). (3.2) u∈Ω A path from a sub-graph Ω2 to a sub-graph if its total cost is minimum among all from a sub-graph Ω1 Ω1 and Ω2 , Ω1 to a sub-graph Ω2 is said to be a shortest path between v -to-u paths, Ω2 , d(Ω1 , Ω2 ), where u ∈ Ω1 and v ∈ Ω2 . Ω1 and The distance is the total cost of a shortest path between and is given by: d(Ω1 , Ω2 ) = min v∈Ω1 ,u∈Ω2 d(v, u). (3.3) 3.2 Underlying Principle As mentioned before, sta lines can be considered as the only extensive objects made from black pixels in the music score, connected paths of black pixels from the left side to the right side of the music score. Assuming that paths through black pixels are preferred over paths through white pixels, sta lines can then be found among the shortest paths from the left to the right margin of the music score. Sta lines can be modelled as paths between two regions Ω1 and Ω2 ; these represent, respectively, the left and right margins of the score. Besides that, one may assume that sta lines do not zigzag back and forth, left and right. Therefore, one may restrict the search among connected paths containing one, and only one, pixel in each 3.3. Algorithm Outline 25 2 column of the image . Formally, let I be an N1 × N2 1 , s.t. s = {(x, y(x))}Nx=1 where y is a mapping image and dene an admissible sta to be ∀x |y(x) − y(x − 1)| ≤ 1, y : [1, · · · , N1 ] → [1, · · · , N2 ]. That is, a sta line is an 8-connected path of pixels in the image from left to right, containing one, and only one, pixel in each column of the image. Given the weight function w(p, q), the cost of a sta can be dened as C(s) = N1 X w(vi−1 , vi ) (3.4) i=2 The optimal sta line that minimizes this cost can be found using dynamic programming. The rst step is to traverse the image from the second column to the last column and compute the cumulative minimum cost C for all possible connected sta lines for each entry C(i, j) = min where (l, m). w(pi,j ; pl,m ) (i, j): C(i − 1, j − 1) + w(pi−1,j−1 ; pi,j ) C(i − 1, j) + w(pi−1,j ; pi,j ) , C(i − 1, j + 1) + w(p i−1,j+1 ; pi,j ) represents the weight of the edge incident with pixels at positions (i, j) and At the end of this process, min j∈{1,··· ,N2 } C(N1 , j) indicates the end of the minimal connected sta. Hence, in the second step, one backtrack from this minimum entry on C to nd the path of the optimal sta. 3.3 Algorithm Outline Assume one wants to nd all sta lines present in a score. This can be approached by successively nding and erasing the shortest path from the left to the right margin of the score. The 3 erase operation is crucial to ensure that a sta is not detected more than once . Consider the music score presented in Figure 3.1(a); in Figure 3.1(b) the rst eleven shortest paths are traced. From this example it is possible to conclude that music symbols placed on top of sta lines do not interfere with the detection of the sta lines. Moreover, the example also makes clear that slightly skewed scores do not pose any problem to the proposed approach. Nonetheless, two main issues need to be properly addressed. On the one hand, a criterion is needed to stop the iterative detection of the shortest paths, that is, the sta lines. On the 2 These assumptions, 8-connectivity and one pixel per column, impose a maximum detectable 45 rotation degree. However, a higher rotation degree, in a real case, is unlikely to happen, but may be previously corrected. 3 We implemented the erase operation by setting to white the pixels on the detected sta; image resizing could be a valid alternative [AS07]. Chapter 3. Sta Lines Detection and Removal 26 (a) Skewed sta lines with music symbols. (b) The rst 11 shortest paths between left and right margins. Figure 3.1: An exemplicative example of the methodology. other hand, the initial and the nal parts of a path should also be trimmed, since these (almost) completely white pixels do not belong to the sta line, despite belonging to the shortest path of an opposite margin. Another subject deserving attention is the computational complexity. The sequential computation of the shortest path may be prohibitive for some applications. It would be interesting to be able to compute several 'shortest paths' simultaneously. Thus, before presenting the complete algorithm, we introduce the concept of a stable path in a graph, which will allow computing multiple sta lines in a single iteration, instead of sequentially computing them one at a time. 3.4 Stable Paths on a Graph Before presenting the formal denition of a stable path on a graph, it is necessary to consider a hypothetical example of a simplied music score to motivate this concept. Therefore, lets consider Figure 3.2 where a sta with only four sta lines (rows 1, 4, 6, and 8) in a 8 × 9 image is presented. The sta lines are the black elements, as in the real case. The presence of noise on some of the sta lines is simulated by discontinuities. The design of the weight function will be considered in the Section 3.5; for now, it suces to know that the presented graph was constructed to favour paths through black pixels. In fact, we can see in Figure 3.2 that the shortest path between the left and right margins between the sub-graphs Ω1 and Ω2 is the path corresponding to the rst row, entirely through black pixels. By following the strategy just delineated, one could nd the four sta lines in four iterations, sequentially. Nonetheless, although only one sta line corresponds to the shortest path, they all constitute a sort of (almost) optimal paths. The stable paths concept provides a means to nd all of such paths simultaneously. Denition 1. A path Ps,t is a stable path between regions Ω1 and Ω2 if Ps,t is the shortest path between s ∈ Ω1 and the whole region Ω2 , and Ps,t is the shortest path between t ∈ Ω2 and the whole region Ω1 . The naming of stable paths has its roots in dynamical systems, as it resembles stable xed points. If one considers the function FΩ1 →Ω2 (), mapping a node s ∈ Ω1 to a node t ∈ Ω2 by 3.4. Stable Paths on a Graph 27 Figure 3.2: Stable paths on a toy example. nding the shortest path Ps,t of such shortest path, then between s ∈ Ω1 and Ω2 , with t = FΩ1 →Ω2 (s) as the end node GΩ1 →Ω1 (s) = FΩ2 →Ω1 (FΩ1 →Ω2 (s)) = s if and only if Ps,t is a stable path. Note that the concept of stable path is valid for any graph and any two sub-graphs in general. The computation of the stable paths on the toy example of Figure 3.2 provides the three paths yellow-highlighted in the gure. As a second example, in Figure 3.3(a) the shortest paths between and the whole each point on the left margin right margin are traced for the score in Figure 3.1(a). As seen, the paths got attracted by the staines. Likewise, Figure 3.3(b) shows the shortest paths between on the right margin and the whole left margin. each point The set of stable paths between both margins result as the set of paths present in both gures. With the concept of Stable Paths, the computation of all the stable paths in the graph derived from an image has only roughly twice the complexity of the shortest path computation. Noticing that the procedure delineated in Section 3.3 actually gives the shortest path between the whole left margin Ω1 and each point on the right margin Ω2 . The rst step on the computation of the stable paths corresponds verbatim to the computation of the shortest path presented on Section 3.3. In a second step one repeats the same procedure, traversing now the graph from the right column to the left. At the end of this process, if the two endpoints of a direct and reverse path coincide, we are in the presence of a stable path. Chapter 3. Sta Lines Detection and Removal 28 (a) Shortest paths from each pixel in the left column and the whole right column, superimposed on the original image. (b) Shortest paths from each pixel in the right column and the whole left column, superimposed on the original image. Figure 3.3: Exemplication of stable paths for Figure 3.1(a). On the toy example of Figure 3.2, the computation of the stable paths provides only three out of the four sta lines, with the last stable path following partially through a segment of 4 the third sta line and a segment of the fourth sta line . As a consequence, it appears that the computation of the stable paths does not guarantee the discovery of all paths of interest. Applying, in a second time, the stable paths procedure, after erasing the paths found on the rst iteration, we get a new set of paths, including a path joining the remaining segments of the 3rd and 4th sta lines. A search based on stable paths in preference to a sequential search by the shortest path is related with the number of stable paths found simultaneously. For instance, while a score with 60 sta lines would require requires a few (typically between 4 to 6) 60 iterations of the shortest path algorithm, it iterations with the stable paths method. Next, the complete proposed algorithm for sta line detection is detailed. 3.5 Proposed Algorithm The proposed algorithm can be implemented as a sequence of a few high-level operations see Listing 1 that will be described in this Section. Preprocessing With the objective of detecting the sta lines, the proposed algorithm starts by estimating the sta space height, staspaceheight, and sta line height, staineheight. These lengths will both be used as reference measures for the subsequent operations. There are already, in common use, robust estimators for its calculation. Usually, the most used technique starts by computing the vertical run-lengths representation of the image. After doing that to a bit-mapped page of music, the most common black-runs represents the sta line height and the most common white-runs represents the sta space height [Fuj04]. After estimating the reference lengths, the edges' weights are determined as explained in Section 3.6. 4 Note that the sequential search of the shortest path would suer from the same limitations. 3.5. Proposed Algorithm 29 BEGIN PreProcessing compute staffspaceheight and stafflineheight compute weights of the graph END PreProcessing CYCLE compute stable paths validate paths with blackness and shape remove valid paths from image add valid paths to list of stafflines END OF CYCLE if no valid path is found BEGIN PostProcessing uncross stafflines organize stafflines in staves smooth and trim stafflines END PostProcessing Listing 1: Main operations of the proposed method. Main Cycle The preprocessing is followed by the main cycle of the methodology that incorpo- rates successively nding the stable paths between the left and right margins. Moreover, it is in this step that the paths found are added to the list of sta lines and then erased from the image. The erase operation sets to white the pixels on a vertical strip of height empirically xed at staspaceheight, centred on the detected sta line. As explained previ- ously, the erase operation is necessary to ensure that a line is not detected multiple times, even if its height is higher than one pixel. Stopping Rule To stop the iterative sta line search, a sequence of (arguably) sensible rules is used to validate the stable paths found; if none of them passes the checking, the iterative search is stopped. Two validation rules were applied, both assessing features with respect to the median values obtained during the rst iteration. If a path does not have a percentage of black pixels above a xed threshold, it is discarded. The median percentage of blackness of all lines found in the rst iteration of the main cycle provides the necessary reference: a threshold of 80% of the median value was empirically selected. Likewise, if the shape of the detected path diers too much from the shape of the line with median blackness, it is discarded. A dissimilarity measured as the average between both paths, after removing the means above y − distance shapedi = 4×staspaceheight was selected as threshold. PostProcessing After the main search step, valid detected sta lines are post-processed. Although true sta lines never intersect, the above algorithm may occasionally create intersecting lines, detected on dierent iterations. That may be due to a local low quality of a line, leading a stable path to jump between consecutive lines. In doing so, this local discontinuities can be the reason for possible zigzag back and forth between consecutive lines, because the detected path is likely to follow and connect the remaining segments, and consequently intersect with the previous detected path. To preclude such nal, undesired state, lines are post-processed to remove intersections. That is easily and eciently accomplished by, for each image column, sorting on y the pixels of the detected lines and Chapter 3. Sta Lines Detection and Removal 30 assigning the i-pixel to the i-line. After this simple process, lines may touch but they do not intersect. After removing the intersections it is now possible to eliminate spurious sta lines and to cluster them in staves. Since the lines are ordered, these operations require only iterating through the list of lines and starting a new sta whenever the distance between two consecutive lines is above a xed threshold. old was = 2× staspaceheight. The value considered for this thresh- Subsequently, spurious staves are eliminated by simply discarding those with only single sta line. However more robust rules can be created, because staves with only one sta line exist (e.g. percussion), although they are uncommon. Finally, each retained line is trimmed at the beginning and at the ending and smoothed. As visible in the example of Figure 3.1(b), before meeting with a sta line, a path travels through a sequence of white pixels. Likewise, after the end of the sta line, the path goes again through a sequence of white pixels until it meets the right margin of the image. In order to ignore all of these white pixels (undesirable segments), the trimming operation works per sta. Thus, for each sta, a sequence of median colours is computed as follows: for each column, the median of the colours of the lines is added to the sequence. Next, the trimming points are found on this sequence: starting on the centre, we traverse the sequence to the left and to the right until a run of whiterun = 2×staspaceheight white pixels is found (value obtained experimentally). The pixels between the left and right runs are kept in the sta lines. At the end, lines are smoothed with a standard average low-pass lter. Considering a sta line as a sequence dimensional averaging lter is applied. A window size of y(x) of y-positions, a one- 2×staspaceheight was selected on the experiments. 3.6 Design of the Weight Function An immediate approach is to support the design of the weight function solely on the values of the incident nodes, that is, if any of the corresponding pixels are black then a low cost is assigned to the edge, otherwise the edge assumes a high cost. We call this the baseWeight in Listing 2. Nevertheless, the weight function can be generalized taking into account other factors. It is very useful to consider the prior knowledge about a music score when we want to nd the sta lines. In order to incorporate this idea in the shortest path process, we consider dierent alternatives to modify the weight function of the graph. In doing so, this weight function will be modied by a term that codies that information about the music sheet. Furthermore, two more attributes were considered in each pixel with the aim to inuence the main contribution to the edge's weight resulting from the values of the incident pixels. The prime intention of these additional features is to discriminate black pixels in the sta lines from black pixels in the music symbols, penalizing the latter and favouring the former. With this in mind, if a black pixel is part of a short vertical run of black pixels, then it is more 3.7. Sta Line Removal 31 likely to be part of a sta line rather than of a symbol. edges is included in the weight function. have another sta line at roughly Therefore, a term beneting such The other attribute is that a sta line is likely to staspaceheight pixels, assuming that staves have at least two lines. Hence, if the nearest vertical run of black pixels on the same column is excessively far from the vertical run of black pixels containing the current black pixel, then this pixel is more likely to belong to a symbol (for instance, it may belong to a ligature) rather than to sta line. Consequently, a penalising term is incorporated in the weight function for these cases. The pseudo-code for the weight function is provided in Listing 2. WeightFunction(pixelValue1, pixelValue2, vRun1, vRun2, nearestVRun1, nearestVRun2, NeighbourhoodType) { value = min(pixelValue1, pixelValue2); weight = baseWeight(value, NeighbourhoodType); if( (vRun1<=STAFFLINEHEIGHT) OR(vRun2<=STAFFLINEHEIGHT)) weight = weight - delta; if( (nearestVRun1>=STAFFSPACEHEIGHT+STAFFLINEHEIGHT) OR(nearestVRun2>=STAFFSPACEHEIGHT+STAFFLINEHEIGHT)) weight = weight + delta; return weight; } Listing 2: Pseudo-code for the weight Function. The base weight was set to 8 on white delta penalizing and pixels for 4-neighbourhoods and to 6 term in the weight function was set to and 1. 12 4 on black pixels on for 8-neighbourhoods. The For eciency, weights were designed with integer values. 3.7 Sta Line Removal The sta lines removal algorithms used in this work were adopted from [DDCF08]. Line Track Height, Line Track Chord, Roach/Tatem and Skeleton were the algorithms chosen for the experiments. A brief description of these algorithms will be provided now: Line Track Height The algorithm tracks the sta lines and checks when a vertical black run is longer than a threshold (experimentally set at Line Track Height Modied 2×staineheight). The version modied of the Line Track Height algorithm also track the sta lines positions obtained by a detection algorithm and removes vertical run sequences of black pixels that have a value lower than a specied threshold (chosen experimentally as 2×staineheight). In this version, a carefully attention to the deformations sta lines may have discontinuities, be curved or inclined that may occur in the music scores are given. These problems will inuence the success to achieve a correct detection of lines contained on the score. The positions of the sta lines obtained by a sta line detection algorithm may pass slightly above or under the real sta lines positions. Therefore, if we are in presence of a white pixel when the sta lines are tracked, we search Chapter 3. Sta Lines Detection and Removal 32 vertically for the closest black pixel. If that distance is lower than a specied tolerance experimentally chosen as 1+ceil( staineheight/3.0) we move the reference position of 5 the sta line to the position of the black pixel found . Line Track Chord This algorithm computes, for a xed angle resolution of three degrees, the chord length through the skeleton point (see Figure 3.4). This results in a function chordlenght(ϕ), where ϕ is the chord angle, for each skeleton point. When the sta line pixel also belongs to a crossing music symbol, the function should have a second distinct peak. To detect this peak the following thresholds are used: 1. There is a local maximum when chordlength is greater than angle below than 30 5×staineheight at an degrees and another local maximum when chordlength is greater 1.75×staineheight× sin(ϕ) at an angle ϕ > 30 degrees. 2. The valley between two maxima must have a depth greater than 1.5×staineheight. Concluding, this algorithm removes the sta line through the angles peaks of the chord lengths. There are two distinct peaks depending if the pixels belong to a sta line or a music symbol. Figure 3.4: Length of a chord through a skeleton point at some angle Roach/Tatem ϕ. This algorithm uses a labelling scheme based on the angle information and pixel adjacency to identify the sta line pixels [RT88]. The chord length and the angle function described in the Line Track Chord are computed for every pixel. Consequently, the original image can be transformed into two-dimensional vector eld by picking the angle and length of the longest chord for each black pixel. This will assign pixels on sta lines a high length value and an angle value of zero. To avoid the removal of symbol pixels on the sta lines, some horizontal line pixels are iteratively relabelled as non-horizontal pixels, depending on the labels of their neighboring pixels. Skeleton This method consists of the following steps: 1. The skeleton is split at branching point and corner points with an angle below 135 degrees. Around each spliting point a number of pixels are removed see Figure 3.5(a). 5 See Appendix A.6 for more details. 3.8. Database of Music Scores 33 2. Sta line segment candidates are picked as skeleton segments if the orientation angle (least square tted line) is below 25 degrees, the segment is wider than tall straightness staineheigth2 /2. and the (mean square deviation from least square tted line) is bellow 3. Apply a sta-nding algorithm to the sta segment candidates. Two sta segments are horizontally linked when their extrapolations from the end points with the least square tted angle come closer than staineheigth/2. 4. Remove false positives: from each ovelapping sta segment group on the same line the one that is closest to its least square tted neighborhood is picked and the others are discarded; non-sta segments that have the same branching point as a sta segment are extrapollated by a parametric parabola: if this parabola is approxi- mately tangential to the sta segment, the latter is considered a false positive see Figure 3.5(b). 5. Remove sta lines: all vertical black runs around the detected sta skeleton are removed. (a) Pixels within the distance transform radius around each splitting point are removed. Figure 3.5: (b) A falsely detected sta segment that can be identied as belonging to a music symbol because it is appoximately tangential to an extrapolated parabola from a non-sta segment. Example (from [DDCF08]). 3.8 Database of Music Scores For the purpose of testing the proposed algorithm, a database with a sucient number of scores to produce signicant results on which we can draw conclusions and test improvements, was created. Although the assessment of a new sta detection algorithm may be done by visually + inspecting the output on a set of scores as adopted on [RCC 07] , here we support the comparison with two test sets adopted for the qualitative evaluation of the proposed method: the test set of synthetic scores from [DDCF08] and a new set of real score. The synthetic database consists of 32 ideal scores see Figure 3.6(a) with dierent musical notation to which several known deformations were applied. In total 2688 images were generated. The simulation of distortions in a controlled manner made possible to build many feasible problems that can be found in the handwritten music scores. Despite this we also test the performance of Chapter 3. Sta Lines Detection and Removal 34 the stable path algorithm with a test set of real scores in total 50 images from the Portuguese composer Fernando Lapa with other old handwritten scores (see Figure 3.6(b)). (a) Music score from the test set of 32 ideal scores. (b) Music score from the test set of 50 real scores. Figure 3.6: Two examples of music scores from the test set used on the experimental evaluation. 3.8.1 Deformations in the Synthetic Scores6 7 The Gamera framework (Generalized Algorithms and Methods for Enhancement and Restora- tion of Archives) was the base platform used to create the deformation database with intention to development and test the proposed algorithm. In a brief description, this framework is a toolkit for building document image recognition systems. It has a programming library with a set of graphics tools for experimentation and training. Gamera provides a wealth of tools for image processing. In this work, in order to complete the Gamera framework, for the concrete problem of building a syntactic database, the MusicStaves plug-in was included. This toolkit appears to be ideal for the problem, because ¶ It is a specic platform for the development and test of the detection and removal sta lines algorithms. · It is an open source toolkit. ¸ It allows using handwritten and printed scores with the algorithms already implemented in Gamera. The distortions adopted from [DDCF08] applied to the test set of ideal scores can be classied in two categories: 1. Deterministic deformations, which depend of certain parameters, as for instance the tation. ro- 2. Random defects, which use various parameters about the deformation and a pseudorandom number generator. In both cases it is necessary to apply the deformation in parallel with the original score and the ground-truth sta image. Despite some deformations being easy to understand by the name, there are others where this is not true (the defects 6 7 resolution, rotation and line interruption are self explanatory). A brief Consult [DDCF08] for more details. See http://ldp.library.jhu.edu/vhost-base/gamera and [Cap08] for more details. 3.8. Database of Music Scores 35 Deformation Type Parameter description Rotation Deterministic Rotation angle Curvature Deterministic Height ratio: width of sine curve Typeset Emulation Both Range width, maximal height and variance of vertical shift Sta Line Interruptions Random Interruption frequency, maximal width and variance of range width Sta Line Thickness Variations Random Markov chain stacionary distribution and inertia factor Sta Line y-variation Random Markov chain stacionary distribution and inertia factor Degradation After Kanungo Random (η, α0 , α, β0 , β, κ), White speckles Random Speckles frequency, randow walk + see [KHB 00] length and smoothing factor Table 3.1: Deformations in the images. explanation of these distortions will be done. In Table 3.1 the deformations considered, which are available in the MusicStaves toolkit, are listed. In Figure 3.7 it is possible to see the eects caused in the images. (a) (b) (c) (d) (e) (f) (g) (h) (i) Figure 3.7: Some examples of applied deformations from the original image: a) Original; b) Curvature c) Degradation after Kanungo; d) Sta line thickness variation; e) Sta line yvariation; f ) Typeset emulation; g) Rotation; h) White Speckles; i) Sta line Interruptions. Curvature The curvature is obtained through a half sine wave over the entire score area. The intensity of the resulting curvature can be measure as the ratio of amplitude (height) and the width of the wave. Chapter 3. Sta Lines Detection and Removal 36 Typeset emulation This deformation tries to reproduce the 16th-century prints, which are similar, in some ways, to typewriters, causing, therefore, interruptions in the lines between the symbols and a random vertical shift in each portion containing a symbol. Sta line thickness variation and sta line y-variation sta line y -variation Sta line thickness variation and are obtained by a Markov chain describing the evolution of the sta line thickness from left to right. These deformations are achieved by that process because, usually, the thickness at a particular x-position. depends on the thickness at the previous The parameter is the transition probability matrix of transition from thickness or or x-position y -deviation can be of n y -deviation i to thickness or P with pij := y -deviation j . probability The thickness dierent values (states). To the stationary distribution of the individual states a symmetric binomial distribution is assumed, that is: πi = The mean value (n − 1)/2 n−1 i−1 1 2n−1 of this distribution is associated with the original value in the staine_height image without the deformation ( tion from the original ! y -position). Hastings algorithm [Has70]. for the thickness or zero for the devia- The Markov chain is generated with the Metropolis- The transition probability matrix Q, to obtain candidate transition points, is chosen to be: where the probability tions: the closer c c c for j = i qij = 1 − c/2 for j = i ± 1 0 otherwise can be considered as an inertia factor that allows smooth transi- is to one, the slower is the state variation. Degradation after Kanungo This deformation tries to imitate the local distortions caused during printing and scanning. The model has six parameters (η, α0 , α, β0 , β, κ) with dierent meanings: Each black pixel in the original image is ipped with probability where d 2 α0 exp−αd +η , is the distance to the closest background pixel. Each background pixel is ipped with probability 2 β0 exp−βd +η , where d is the distance to the closest foreground pixel. κ is the diameter of a disk of a morphological closing operation. White speckles This degradation model has three parameters (p, n, k) with the following meaning: Each black pixel in the original image is taken with probability for a random walk of length k p as a starting point n. is a rectangle of a morphological closing operation that will smooth an image containing the random walk. 3.8. Database of Music Scores 37 Deformation Parameter range Rotation angle = −5 : 2.5 : 5 Curvature Typeset Emulation Sta Line Interruptions Sta Line Thickness Variation Sta Line y -variation Degradation After Kanungo White Speckles Table 3.2: = 0.02 : 0.02 : 0.10 n_gap = 1 : 3 : 13, n_shif t = 1 : 3 : 13, p_gap = 0.5 frequency α = 0.01 : 0.02 : 0.10, binomial parameter for width: n = 6, p = 0.5 inertia c = 0.8, maximum deviation = 2 : 1 : 6 inertia c = 0.8, maximum deviation = 2 : 1 : 6 η = 0, α0 = 0.5, α = 0.25 : 0.3 : 1.5, β0 = 0.5, β = 0.25 : 0.3 : 1.5, κ = 2 smoothing factor k = 2, random walk length n = 10, speckle frequency p = 0.03 : 0.02 : 0.11 amplitude/stawidth Ranges of deformation parameters used in the tests: min:step:max. The image with the random walks is subtracted from the original: an image with white speckles at the random walk positions is obtained. Therefore, and k p can be interpreted as the speckle frequency, n as a measure for the speckle as a smoothing factor. The range of deformations parameters in our test set is given in Table 3.2. The values were restricted with the aim of obtaining realistic deformations. Even so, when real scores are treated, generally, the deformations do not occur in its pure form; a combination from several deformations is more prone to happen. However, because the evaluation of the detection algorithms is our objective it is more adequate to research the performance in each isolated deformation. 3.8.2 Ground-truth Information In order to establish a qualitative comparison between the various sta lines detection and removal algorithms a ground-truth information is necessary. Hence the black pixels of the images need to be labelled as being sta lines or otherwise. The manual process has two disadvantages: it is very time consuming and it is prone to occur errors (e.g. the existence of dubious pixels that may belong to the line or to the symbol that crosses the sta line). Therefore, ground-truth information of the syntactic images released by the authors of the MusicStaves toolkit was used. In Figure 3.8 the deformation process and the ground-truth generation are displayed. In the code of the postscript images (scores) the macros of drawing lines were removed and the resulting images were converted to one-bit raster images. In this manner, real musical scores without lines can be obtained. The distortions are applied in the bitmap images. When we are dealing with handwritten music sheets we can only achieve the ground-truth information manually, that is, we need to delete the symbols present in the score in order to Chapter 3. Sta Lines Detection and Removal 38 Figure 3.8: Generation of the deformed images and the respective Ground-Truth. retain just the sta lines segments. In order to evaluate the performance of the sta lines detection algorithm the scores with only lines will be used as references, while the sta-less images will be used to test the sta lines removal algorithms. Note that it is possible to directly store the same references for the deformations synthetics images through skeleton generated in each distortion. However, once again, in the handwritten music scores this process must be manual. 3.9 Evaluation Metrics and Results This section presents the error metrics used to evaluate the performance of the proposed algorithm for the detection and removal of sta lines in comparison with the shortest path al- 8 gorithm 9 and the algorithms present in [DDCF08]. However, since the Dalitz's algorithm has a signicantly better performance than the other two proposed in [DDCF08], the comparison results presented here will be only for the Daltiz's algorithm. 3.9.1 Detection Error Metrics With the purpose of evaluating the performance of the stable path algorithm and to study their behaviour with respect to the several deformations applied to the images, two dierent error metrics were considered: the percentage of false positive sta lines and the percentage of sta lines missed to detect. The process starts by computing the average Euclidian distance between each reference sta line given by the ground-truth information; see Subsection 3.8.2 and each sta line actually detected. Next, the matching problem on the resulting bipartite graph is solved by minimizing 10 . the assignment cost, that is, the distance Only pairs with average error-distance below staineheight were assumed correctly matched. The remaining pairs were assumed to originate from a false positive sta line being matched to an undetected true sta line and were therefore unmatched. In the end, the two metrics result as the number of unmatched detected sta lines, 8 Description of the algorithm in [CCRG08b]. See Appendix A.3 for a brief description. 10 See Appendix A.4 for a brief description of the method. 9 3.9. Evaluation Metrics and Results 39 that is, false positive, and unmatched reference sta lines, that is, false negative. It should be noted that these metrics only measure whether sta lines are found, not how good the match is. Results From the results presented in Table 3.3 we can conclude that the stable path based approach 11 , (and the shortest path approach) outperforms the Dalitz algorithm for both error metrics dened. In some deformations a visible/considerable dierence between stable path and Dalitz algorithms does not exist. However, there are not cases where the proposed algorithm is in disadvantage with respect to Dalitz algorithm. Besides that, the performance of the stable path approach is almost independent of intensity of the deformation, for the range of values considered. This performance gain is even more noteworthy as the Dalitz algorithm is receiving as input the correct number of lines per sta. Had this not been the case, the dierence between both would have been much larger. The current implementation of the stable path algorithm run as fast as the Dalitz algorithm (and about ve times faster than the shortest path version), as available at the Sta Removal Toolkit [DDCF07]. By generalizing the design of the weight function and enhancing the postprocessing, we were able to improve the detection performance on the initial results in [CCRG08b]; the use of the stable paths allowed to improve the detection speed. Since all the parameters of the method are scaled by one of these two values estimates staine- height and staspaceheight, a simple analysis of the sensitivity of the approach to these parameters was done. We re-run the experiments with values for staineheight and staspaceheight one pixel above and below the true estimated values. In both cases, the stable path method presented a good performance and continued to yield the best results (the strongest degradation occurred for the rotation degradation, with angle=5°, where both error rates increased to 1.7%). In a second similar experiment we evaluated the sta line detection methods on a set of 50 real music scores, for which reference sta lines were manually outlined as referenced in Subsection 3.8.2. 12 , Images were previously binarized with the Otsu threshold algorithm plemented in the Gamera 13 project . as im- The evaluation of both detection algorithms yielded the results presented in Table 3.4. Again, and although the correct number of sta lines per sta was inputted into the Dalitz' algorithm, the stable path approach obtained the best performance. 11 For the deformations not shown, the stable path is not signicantly better than Dalitz. See Appendix A.5 for a brief description of the method. 13 http://gamera.sourceforge.net 12 0.04 0.8 (3.6); 0.8 (3.6) 0.8 (3.6); 0.8 (3.6) 53.8 (44.5); 55.7 (45.2) 4 0.6 (3.5); 0.6 (3.5) 0.6 (3.5); 0.6 (3.5) 21.4 (27.4); 15.4 (17.3) 4 0.7 (3.8); 0.7 (3.8) 0.7 (3.8); 0.7 (3.8) 15.9 (27.1); 12.0 (17.1) 0.02 0.7 (3.5); 0.7 (3.5) 0.7 (3.5); 0.7 (3.5) 12.7 (29.0); 12.6 (28.7) 0.03 0.7 (3.5); 0.7 (3.5) 0.7 (3.5); 0.7 (3.5) 0.0 (0.0); 0.0 (0.0) 2 0.7 (3.5); 0.7 (3.5) 0.7 (3.5); 0.7 (3.5) 31.0 (50.7); 31.7 (46.1) 1 0.6 (3.5); 0.6 (3.5) 0.6 (3.5); 0.6 (3.5) 19.7 (34.1); 13.7 (21.6) 1 0.6 (3.5); 0.6 (3.5) 0.6 (3.5); 0.6 (3.5) 3.3 (10.4); 2.3 (7.3) Amplitude/stawidth Stable path Sortest Path Dalitz Rate whitened pixels Stable path Sortest Path Dalitz Max deviation, n Stable path Sortest Path Dalitz Max gap width, ngap Stable path Sortest Path Dalitz Max vertical shift, nshift Stable path Sortest Path Dalitz Error Error Error Error Error 3 0.7 (3.5); 0.7 (3.5) 0.7 (3.5); 0.7 (3.5) 22.1 (38.9); 32.6 (45.6) rotation 5 0.8 (3.6); 0.8 (3.6) 0.8 (3.6); 0.8 (3.6) 13.0 (20.1); 33.7 (45.0) 0.09 1.2 (3.8); 1.2 (3.8) 1.7 (4.0); 1.9 (4.3) 89.3 (54.6); 86.9 (25.6) 0.08 0.8 (3.6); 0.8 (3.6) 0.8 (3.6); 0.8 (3.6) 95.8 (17.5); 100.0 (0.0) 2.5 0.7 (3.5); 0.7 (3.5) 0.7 (3.5); 0.7 (3.5) 4.2 (19.6); 9.8 (29.0) 10 0.6 (3.5); 0.6 (3.5) 0.6 (3.5); 0.6 (3.5) 24.2 (38.9); 16.7 (22.0) 7 0.7 (3.8); 0.7 (3.8) 0.7 (3.8); 0.7 (3.8) 39.0 (48.7); 24.6 (27.3) 10 0.7 (3.8); 0.7 (3.8) 0.7 (3.8); 0.7 (3.8) 48.7 (60.8); 29.3 (30.7) typeset emulation II 7 0.6 (3.5); 0.6 (3.5) 0.6 (3.5); 0.6 (3.5) 22.3 (30.0); 17.4 (19.0) typeset emulation I 4 0.7 (3.5); 0.7 (3.5) 0.7 (3.5); 0.7 (3.5) 15.7 (27.2); 33.7 (45.0) line y-variation 0.07 0.9 (3.7); 0.9 (3.7) 0.9 (3.7); 0.9 (3.7) 26.7 (25.3); 29.9 (27.2) white speckle 0.06 0.7 (3.5); 0.7 (3.5) 0.7 (3.5); 0.7 (3.5) 83.3 (30.6); 83.9 (30.4) curvature 0 0.6 (3.5); 0.6 (3.5) 0.6 (3.5); 0.6 (3.5) 0.0 (0.0); 0.0 (0.0) 13 0.7 (3.8); 0.7 (3.8) 0.7 (3.8); 0.7 (3.8) 58.7 (54.9); 37.3 (29.0) 13 0.6 (3.5); 0.6 (3.5) 0.6 (3.5); 0.6 (3.5) 31.4 (42.3); 19.2 (20.3) 6 1.1 (3.8); 1.1 (3.8) 1.1 (3.8); 1.1 (3.8) 12.8 (18.6); 34.2 (44.7) 0.11 2.1 (4.6); 2.3 (4.8) 5.3 (7.4); 7.0 (9.6) 54.5 (55.9); 95.2 (17.0) 0.10 1.2 (4.0); 1.2 (4.0) 1.2 (4.0); 1.2 (4.0) 96.0 (17.5); 100.0 (0.0) 5 1.2 (4.0); 1.2 (4.0) 1.2 (4.0); 1.2 (4.0) 5.5 (9.3); 37.5 (41.9) Runtime 886 sec. 6106 sec. 842 sec. Runtime 739 sec. 5085 sec. 703 sec. Runtime 767 sec. 5122 sec. 768 sec. Runtime 809 sec. 5122 sec. 872 sec. Runtime 822 sec. 5554 sec. 639 sec. Runtime 858 sec. 6006 sec. 612 sec. rate and miss detection rate. See [DDCF08] for parameter details. Table 3.3: Eect of dierent deformations on the overall sta detection error rates in percentage: average (standard deviation) of the false detection Error 0.05 0.8 (3.6); 0.8 (3.6) 0.8 (3.6); 0.8 (3.6) 0.3 (1.4); 0.3 (1.4) -2.5 0.7 (3.5); 0.7 (3.5) 0.7 (3.5); 0.7 (3.5) 8.6 (14.0); 15.5 (28.7) -5 0.7 (3.5); 0.7 (3.5) 0.7 (3.5); 0.7 (3.5) 17.8 (22.0); 51.7 (38.1) Angle Stable path Sortest Path Dalitz 40 Chapter 3. Sta Lines Detection and Removal 3.9. Evaluation Metrics and Results 41 False detection rate miss detection rate Runtime Dalitz 5.2% (10.4) 5.9% (11.3) 112 sec. Shortest path 1.4% (3.5) 2.5% (7.3) 612 sec. Stable path 1.3% (5.7) 1.4% (6.4) 115 sec. Table 3.4: Detection performance on real music scores in percentage: average (standard deviation). 3.9.2 Removal Error Metrics Sta line detection algorithms can be used as a rst step in many sta removal algorithms. To understand the potential of our algorithm to leverage the performance of existing sta removal algorithms, we conducted a series of experiments, comparing existing versions of sta line removal algorithms with modied versions of them, making use of the stable path algorithm at the sta line detection step. The quantitative comparison of the dierent algorithms is totally in line with the comparison presented in [DDCF08]. Adopting the naming convention from [DDCF08], the following algorithms were adapted: LineTrack Height, LineTrack Chord, Roach/Tatem, as already described in Section 3.7. The original version of the algorithms was considered as available in [DDCF07], making use of the Dalitz algorithm in the detection phase. The modied versions use instead the stable path for detecting lines. We also adopted the same error metrics (individual pixels, sta-segment regions and sta interruption location) presented in [DDCF08] and conducted the comparative study on the same test set see Section 3.8. Individual Pixels This metric considers the sta line removal as a two-class classication problem at the pixel level, that is, one pixel can belong to a sta line or not. Therefore, a natural performance measure is the error rate for this classication, given by Pixel error rate x =Number y =Number = x+y z of misclassied sta pixels of misclassied non sta pixels z =Number of all black pixels However, this metric has one problem: little information is given about how well the sta removal algorithm separates symbols that are otherwise connected by sta lines; this error only indicates how badly the symbols are distorted when compared to the ideal sta-less images. Segmentation Region Level This error metric has as base the regions of the line segments. The sta line removal can be considered as a segmentation problem where the sta lines segments need to be separated from the symbol segments. Chapter 3. Sta Lines Detection and Removal 42 In an OMR application the sta lines segments are considered background and the remaining symbols are taken as being the segments of interest. Nevertheless, when we are trying to evaluate the quality of the sta line removal the situation is reversed: our interest lies on the sta segments and the rest constitutes background. Following the notation given in [TMD99], we have two segmentations for the set of black pixels present in the test image: 1. The ground-truth segmentation G = Gobj ∪ {gnoise } 2. The segmentation detected from the algorithm {s1 , . . . , sN }, gi where each and gnoise segment respectively, and sj and with G = {g1 , . . . , gM } S = Sobj ∪ {Snoise } with S = contains the black pixels of a contiguous sta snoise contain the remaining background black pixels, respectively. In the set of all sta line segments from both segmentations an equivalence classes of overlapping segments is built(two segments are considered equivalent quence class r c1 , c2 , . . . , cn exists with c1 = a, cn = b and ci ∩ ci+1 6= 0). a'b when a se- For each equivalence we count the number of the contained numbers of segments G and S and thus detect recognition errors. All possible cases are listed in Table 3.5. The formula is given by Segmentation error rate x =Number y =Number x−y z of all classes r of classes representing a correct recognition z =Number Classes number n1 n2 n3 n4 n5 n6 = Segments from Gobj 1 1 0 1 >1 >1 of all classes Segments from Sobj 1 0 1 >1 1 >1 r Error description Correct Missed segment Falsely detected segment Segment split Segments merged Both splitting and merging occurred Table 3.5: Sta segment extraction errors based on the number of segments in an equivalence class r. Sta Line Interruption To obtain this metric a comparison between the ground-truth im- ages, containing only the segments removed from the ideal case, with the image that contains exactly the pixels that were in fact removed by the removal algorithm under evaluation, is made. In doing so, each sta line is followed, from left to right, in the images containing only the removed sta segments; and interruptions in the sta line are looked at. Each interruption represents a detected music symbol that crosses the sta line. It follows two sets of intervals: the interrupting intervals ground-truth data and those in the algorithm output G = {g1 , . . . , gM } S = {s1 , . . . , sN }. in the 3.9. Evaluation Metrics and Results 43 With the purpose to establish an error metric, a bipartite graph is created by adding links between intervals intervals from G gi to and S sj that overlap. In this manner, two types of error are revealed: without a link and intervals with more than one link. In order to count the number of errors of the second type, the maximum cardinality matching in this graph is computed [Gal86].This cardinality matching also removes the minimal number of links leading to the second type error. As a result error rate we have Sta line interruption error rate n1 =Number min {n3 , n1 + n2 } n3 of interruptions without link n2 =Number n3 =Number = of removed links of ground-truth interruptions Results From the results obtained we can conclude that the qualitative eects of the deformations and the insertion of the stable path as the detection algorithm are similar for all three error metrics. In Tables 3.6, 3.7, 3.8 and 3.9, we present the results for all three error metrics (individual pixels, sta-segment regions and sta interruption location) for the Line Track Height algorithm (modied) (LTH), plus the Skeleton algorithm, which exhibited a competitive performance in [DDCF08]. This algorithm does not clearly separate the sta lines detection from the sta lines removal, doing both tasks simultaneously. So, this algorithm was used alone, without providing the detection performed by stable paths algorithm. Stable path + LTH Dalitz + LTH Skeleton Pixel Error Rate 2.8 (1.2) 3.8 (2.6) 6.5 (8.2) Segmentation Error Rate 0.3 (0.1) 0.3 (0.1) 0.3 (0.1) Sta Line Interruption Error Rate 0.3 (0.1) 0.3 (0.1) 0.3 (0.1) Table 3.6: Removal performance on real music scores (in percentage): average (standard deviation). A rst observation is that, overall, the replacement of the Dalitz method by the stable path approach as the sta detection step improved the results in the algorithms under comparison. Additionally, the LineTrack Height algorithm with the stable path consistently outperformed the other algorithms. Nevertheless, the skeleton method [DDCF08], which does not have a clear line detection step, continues to present a competitive performance. It is worth to nalize by noticing that the skeleton algorithm is about two times slower than the modied Line Tracking Height algorithm. In Figure 3.9 it is possible to see some results obtained when we apply the LineTrack Height algorithm with stable path approach to our test set. Chapter 3. Sta Lines Detection and Removal 44 rotation Angle Stable path + LTH Dalitz + LTH Skeleton -5 1.7 (0.7) 19.4 (18.4) 1.9 (0.9) -2.5 1.5 (0.7) 5.2 (8.7) 1.7 (0.8) 0 1.4 (0.7) 1.4 (0.8) 1.5 (0.7) 2.5 1.4 (0.7) 4.4 (8.8) 1.6 (0.7) 5 1.6 (0.7) 17.5 (18.9) 1.7 (0.8) Amplitude/stawidth Stable path + LTH Dalitz + LTH Skeleton 0.02 1.4 (0.7) 3.8 (5.8) 2.6 (2.4) 0.04 1.4 (0.7) 14.0 (12.2) 5.2 (5.1) 0.06 1.4 (0.7) 22.8 (13.7) 8.1 (7.2) 0.08 1.5 (0.7) 31.1 (11.0) 11.9 (8.6) 0.10 1.6 (0.7) 35.0 (10.6) 15.4 (10.4) Rate whitened pixels Stable path + LTH Dalitz + LTH Skeleton 0.03 11.9 (3.1) 11.5 (3.2) 14.6 (3.2) 0.05 17.2 (4.9) 16.8 (4.9) 21.5 (4.6) 0.07 21.1 (5.9) 26.7 (8.0) 27.1 (5.6) 0.09 24.0 (6.7) 53.3 (14.9) 35.2 (12.8) 0.11 26.1 (7.2) 73.3 (14.6) 46.9 (18.7) Max deviation, n Stable path + LTH Dalitz + LTH Skeleton 2 1.2 (0.7) 9.0 (13.2) 1.5 (0.8) 3 1.3 (0.7) 10.4 (14.1) 1.7 (0.8) 4 1.3 (0.6) 10.9 (14.5) 2.2 (0.9) 5 1.4 (0.6) 10.9 (14.5) 3.7 (1.7) 6 1.4 (0.6) 11.0 (14.6) 5.2 (2.2) Max gap width, ngap Stable path + LTH Dalitz + LTH Skeleton 1 1.4 (0.7) 2.6 (1.8) 26.4 (9.8) 4 1.4 (0.7) 2.9 (2.0) 27.3 (10.1) 7 1.4 (0.7) 3.2 (1.7) 27.2 (11.3) 10 1.4 (0.7) 2.9 (1.7) 25.5 (9.8) 13 1.4 (0.7) 3.0 (1.8) 26.4 (10.3) Max vert. shift, nshift Stable path + LTH Dalitz + LTH Skeleton 1 1.4 (0.7) 1.5 (0.8) 7.9 (8.9) 4 1.4 (0.7) 2.8 (1.6) 24.1 (9.1) 7 1.4 (0.7) 3.3 (2.5) 26.7 (11.0) 10 1.5 (0.7) 3.8 (2.4) 26.1 (9.6) 13 1.6 (0.7) 4.7 (3.7) 29.1 (10.7) curvature white speckle line y-variation typeset emulation I typeset emulation II Table 3.7: Eect of dierent deformations on the overall sta removal error rates in percentage: average (standard deviation) Individual pixels error. rotation Angle Stable path + LTH Dalitz + LTH Skeleton -5 0.3 (0.2) 0.6 (0.3) 0.3 (0.2) -2.5 0.2 (0.2) 0.3 (0.3) 0.2 (0.2) 0 0.2 (0.2) 0.2 (0.2) 0.2 (0.2) 2.5 0.2 (0.2) 0.3 (0.3) 0.2 (0.2) 5 0.3 (0.2) 0.5 (0.3) 0.3 (0.2) Amplitude/stawidth Stable path + LTH Dalitz + LTH Skeleton 0.02 0.2 (0.2) 0.3 (0.2) 0.6 (0.2) 0.04 0.2 (0.2) 0.4 (0.2) 0.3 (0.2) 0.06 0.2 (0.2) 0.5 (0.2) 0.3 (0.2) 0.08 0.3 (0.2) 0.6 (0.2) 0.3 (0.2) 0.10 0.3 (0.2) 0.7 (0.1) 0.4 (0.2) Rate whitened pixels Stable path + LTH Dalitz + LTH Skeleton 0.03 0.2 (0.1) 0.2 (0.1) 0.3 (0.1) 0.05 0.1 (0.1) 0.2 (0.1) 0.3 (0.1) 0.07 0.1 (0.1) 0.2 (0.1) 0.3 (0.1) 0.09 0.1 (0.04) 0.6 (0.2) 0.3 (0.1) 0.11 0.1 (0.04) 0.9 (0.2) 0.4 (0.2) Max deviation, n Stable path + LTH Dalitz + LTH Skeleton 2 0.2 (0.2) 0.5 (0.4) 0.2 (0.2) 3 0.3 (0.2) 0.5 (0.4) 0.3 (0.2) 4 0.3 (0.2) 0.5 (0.4) 0.3 (0.2) 5 0.3 (0.2) 0.5 (0.3) 0.3 (0.2) 6 0.4 (0.2) 0.5 (0.3) 0.3 (0.2) Max gap width, ngap Stable path + LTH Dalitz + LTH Skeleton 1 0.1 (0.1) 0.1 (0.1) 0.6 (0.1) 4 0.1 (0.1) 0.1 (0.1) 0.6 (0.1) 7 0.1 (0.1) 0.2 (0.1) 0.6 (0.1) 10 0.1 (0.1) 0.1 (0.1) 0.6 (0.1) 13 0.1 (0.1) 0.2 (0.1) 0.6 (0.1) Max vert. shift, nshift Stable path + LTH Dalitz + LTH Skeleton 1 0.1 (0.1) 0.1 (0.1) 0.3 (0.2) 4 0.1 (0.1) 0.2 (0.1) 0.6 (0.1) 7 0.1 (0.1) 0.2 (0.1) 0.6 (0.1) 10 0.1 (0.1) 0.2 (0.1) 0.6 (0.1) 13 0.1 (0.1) 0.2 (0.1) 0.7 (0.1) curvature white speckle line y-variation typeset emulation I typeset emulation II Table 3.8: Eect of dierent deformations on the overall sta removal error rates in percentage: average (standard deviation) Sta line interruption error. 3.9. Evaluation Metrics and Results 45 Angle Stable path + LTH Dalitz + LTH Skeleton -5 0.2 (0.1) 0.6 (0.3) 0.2 (0.1) -2.5 0.2 (0.1) 0.3 (0.3) 0.2 (0.1) Amplitude/stawidth Stable path + LTH Dalitz + LTH Skeleton 0.02 0.2 (0.1) 0.3 (0.2) 0.2 (0.1) 0.04 0.2 (0.1) 0.5 (0.2) 0.3 (0.1) Rate whitened pixels Stable path + LTH Dalitz + LTH Skeleton 0.03 0.6 (0.1) 0.6 (0.1) 0.6 (0.1) 0.05 0.5 (0.1) 0.5 (0.1) 0.6 (0.1) Max deviation, n Stable path + LTH Dalitz + LTH Skeleton 2 0.2 (0.1) 0.5 (0.3) 0.2 (0.1) 3 0.2 (0.1) 0.5 (0.3) 0.2 (0.1) Max gap width, ngap Stable path + LTH Dalitz + LTH Skeleton 1 0.1 (0.1) 0.1 (0.1) 0.6 (0.1) Max vert. shift, nshift Stable path + LTH Dalitz + LTH Skeleton 1 0.1 (0.1) 0.1 (0.1) 0.3 (0.2) rotation 0 0.2 (0.1) 0.2 (0.2) 0.2 (0.1) 2.5 0.2 (0.1) 0.3 (0.3) 0.2 (0.1) 5 0.2 (0.1) 0.5 (0.3) 0.2 (0.1) 0.06 0.2 (0.1) 0.6 (0.2) 0.3 (0.1) 0.08 0.3 (0.1) 0.8 (0.1) 0.4 (0.2) 0.10 0.3 (0.1) 0.8 (0.1) 0.5 (0.2) 0.07 0.5 (0.1) 0.6 (0.1) 0.6 (0.1) 0.09 0.5 (0.1) 0.8 (0.1) 0.6 (0.1) 0.11 0.4 (0.1) 0.9 (0.1) 0.7 (0.2) 4 0.2 (0.1) 0.5 (0.3) 0.2 (0.1) 5 0.2 (0.1) 0.5 (0.3) 0.3 (0.1) 6 0.3 (0.1) 0.5 (0.3) 0.3 (0.1) 4 0.1 (0.1) 0.2 (0.1) 0.6 (0.1) 7 0.1 (0.1) 0.2 (0.1) 0.6 (0.1) 10 0.1 (0.1) 0.2 (0.1) 0.6 (0.1) 13 0.1 (0.1) 0.2 (0.1) 0.6 (0.1) 4 0.1 (0.1) 0.1 (0.1) 0.6 (0.1) 7 0.1 (0.1) 0.2 (0.1) 0.6 (0.1) 10 0.1 (0.1) 0.2 (0.1) 0.6 (0.1) 13 0.1 (0.1) 0.2 (0.1) 0.7 (0.1) curvature white speckle line y-variation typeset emulation I typeset emulation II Table 3.9: Eect of dierent deformations on the overall sta removal error rates in percentage: average (standard deviation) Segmentation region level. Figure 3.9: Examples of the results obtained using the sta line removal algorithm in our test set. Part III Segmentation and Classication 47 Chapter 4 Segmentation and Classification Process This chapter begins with a description of the dierent music symbols that can be found in a music score. Then proceeds with the description of the segmentation and classication steps. Music notation, like any other written language, did not appear as the invention of one person. It emerged from the combined and prolonged eorts of hundreds of musicians. They all hoped to express by written symbols the essence of their musical ideas [Rea69]. Improvement in those symbols came about because it was necessary. Hence, the musical notation is very extensive if we consider all the existing possibilities and their variations over time. The result notation is a kind of alphabet, shaped by a general consensus of opinion to serve as a general expressive technique. Music notation is, thus, the visual manifestation of the interrelated properties of musical sound (as pitch, intensity, time, timbre and pace). Symbols indicating the choice of tones, their duration and their manner of performance are important because they form this written language that we call music notation. Several symbols can be confused with each other because of their shape similarity. Besides that, in some cases, complex symbols can appear in the score, as for instance the guitar symbols (diagram with the positions of the notes in the instrument). And if we consider that we are working with handwritten music scores several additional problems come about. One with more evidence is the variation in the symbols notation caused by each person. In this work we decided to study and recognize the standard music notation. In this manner, variants of types of specic notation will not be recognized, as drums notation. Furthermore, since tablature is something specic to certain instruments, it will also not be recognized. In the following Section the symbols that we treat in this thesis will be described. 4.1 Musical Symbols Clefs The term Clef is derived, logically enough, from the Latin word clavis, meaning key. The three clef signs in common use today had their origins in pitch letters: the earliest F related with the tone f (fa ), when a second line was latter drawn above the text over the F the letter C was axed upon it, representing the pitch of c' (do ), the last letter was G which represents the tone g' (sol ). The F clef became form was the Latin letter the bass clef because of its lower register; the C clef became either 49 tenor or alto clef ; and Chapter 4. Segmentation and Classication Process 50 the G clef became the soprano or treble clef. The clef symbols are the rst symbols that appears at the beginning of every music sta. It is very important because they tell us which note is found on each line or space. Table 4.1: Clefs. Treble clef Alto clef Bass clef The modern bass clef is used for all voices and instruments of essentially a bass register; for instance baritone voices, keyboard and orchestra instruments as piano and organ. The alto clef is used in English horn, trombones and violas. It is important to stress that none of these last instruments uses the alto clef exclusively; only the viola employs it with consistent regularity. The treble clef is the highest of all the clef signs, beginning below the sta and extending above it [Rea69]. All instruments and voices of high range employ this symbol, as for instance lyric soprano, orchestra instruments as piccolo and ute, English horn, saxophone, among others. Noteheads and stems A musical note indicates two important aspects. On the one hand, the symbol indicates, by its position on the sta and by the clef used, a pitch to be played or sung. On the other hand, the note establishes, by the exact appearance of its three integral parts (notehead, stem and ag), the relative time duration of this musical sound. Table 4.2: Notes. Minim Crochet The notehead is somewhat oval in shape, and is either open Minim or closed Crochet. 4.1. Musical Symbols The stem 51 is the thin vertical line joined to the side of the note head (closed or open). down, it is placed on the left side of the notehead; when the stem is to go up, it is put on the right side of the notehead. This stem direction is measured as follows: when the notehead is above the centre, line of the sta, the stem goes down see Figure 4.1(a); when the note is below the middle line of the sta, the stem goes up see Figure 4.1(b); and when the note is centred on the sta, the stem may go in either direction see Figure 4.1(c), although it is more common practice to draw it down. The stem length is proportional, that is, it is measure interms of the note position on the sta. When the stem is to go (a) Stem down. (b) Stem up. Figure 4.1: Flags and beams (c) Note positioned in the centre of sta the stem may go in either direction. Example. Flags are employed to indicate the relative time values of the notes with black (closed) noteheads. These symbols are featured by a thin curved stroke that joins to the end of the stem, and always goes to the right of this symbol, regardless of whether the stem goes up or down. When a dot follows the notehead, the end of the curved line is often shortened so as not to conict with the dot. On a downward stem the curved line barely touches the bottom of the notehead. Table 4.3: Flags and beams. Quaver Semiquaver Demisemiquaver Hemidemisemiquaver Beams are compound ags used to connect notes in note-groups. The number of beams always equals the number of ags appropriate for each note. These symbols demonstrate the metrical and the rhythmic divisions within the measure and also the note-groupings Chapter 4. Segmentation and Classication Process 52 in oppositions to the normal patterns of the measure. categories: primary Beams are considered in two (beams that link an entire group; without breaks) and secondary (beams that are interrupted or partially broken) see Figure 4.2. The position of the beams is solved by the same general principles that govern stem direction. That is, if the up in the group to be beamed, then all the stems will go up and the beams will be placed above the notes; on the other hand, if most of the notes require a downward stem (most of the noteheads lie above the sta centre), then the stems of all the notes must be written down and the beams placed below the noteheads. The direction of all beams should approximately parallel to the main body of the notes majority of stems normally go they connect. Figure 4.2: Example. Rests The rests were invented to indicate the exact duration of silence in the music. Therefore, each note value has its corresponding rest sign. The written position of a rest between two barlines is determined by its location in the meter, that is, a rest always occupies the same position in a measure that an equivalent note value would ll, except the whole rest, which is always centred. The whole rest is solid, oblong symbol and placed just beneath the fourth line of the sta (regardless of the clef used). It is incorrect to place it elsewhere unless two voices or instruments share common sta. When this happens, the whole rest is usually written beneath the The half rest top line for the upper part, and beneath the bottom line for the lower part. is an oblong mark of the same length as the whole rest, located on the third sta line. Like the whole rest it is incorrect to place this symbol on any other line unless two voices or instrumental parts share the sta. In this case, the half rest goes above the top The line for the upper part, and above the quarter rest bottom line for the lower part. is the most dicult symbol to write. The sign is centred on the sta, with the top tail ending in the fourth space and the bottom hook resting in the rst space. When two parts share a common sta, this rest goes somewhat the upper part, and just below it for the lower part. above the sta centre for 4.1. Musical Symbols 53 Table 4.4: Rests. Rest Note Whole rest Half rest Quarter rest Eighth rest Sixteenth rest thirty-second rest Sixty-fourth rest The eighth rest has half of the value of the quarter rest. This symbol is written as a slanting stem with a single hook. The stems rest occupies the hook occupies the The sixteenth rest third has half of the value of the eighth rest. This symbol has another hook be extended to touch the thirty-second rest bottom second sta space and the stem must line. has half of the value of the sixteenth rest. hooks and a further extended stem. The three hooks occupy the spaces, with the stem resting on the The sta line and the space. to the slanting stem. This second hook occupies the The second sixty-fourth rest bottom It consists of three second, third, and fourth line. has half of the values of the thirty-second rest. structure with its four hooks. Each hook must occupy one of the four It is a elaborate sta spaces, and the slating stem must accordingly be extended until the distance of one space below the Chapter 4. Segmentation and Classication Process 54 sta. Ties and slurs Ties are a notational device used to prolong the time value of a written note into the following beat. This symbol is a curved line that connects two successive noteheads indicating, together, the total time value desired. The tie appears to be identical to slur, however, while tie almost touches the notehead centre, the slur is set somewhat above or below the notehead. Besides that, ties are normally employed to join the time value of two notes of identical pitch, while slurs have a totally dierent function. The tie always loops in the direction opposite that of the stem. In other words, when the tie notes have their stems going up, the tie mark loops below the noteheads see Figure 4.3(b), when the two stems go down, the tie is curved above the noteheads see Figure 4.3(a). Slurs aect note-groups as entities indicating that the two notes are to be played in one physical stroke, without break between them. When the slur is used in conjunction with tied notes, it may be placed over or under the noteheads tied together, but the tie always loop according to the note position on the sta. Table 4.5: Ties and Slurs. Tie Slur (a) Figure 4.3: Accidentals (b) Example: a) Tie above noteheads; b) Tie under noteheads. The signs that are placed before the note to designate changes in sounding pitch are called accidentals. These signs are ve in number: the the double at and the double sharp. natural, the at, the sharp, However, in this work, only the rst three symbols will be treated. The f. at [ is literally the rounded small letter b found in Guido's soft hexachord on In modern practice, the shape of the at is somewhat more slanted upward than the 4.1. Musical Symbols 55 ordinary small letter b. the second line above. line, the top of the stem extends to When the symbol is notated in a space, the stem extends to the when the at is placed on a center of the second space above. b found in Guido's hard hexachord. This sign represented the raised form of the pitch b. By adding a short tail to the square b, and slightly slanting the horizontal strokes, we get the natural sing. The The form of the sign natural \ has its origin in the square two tails of this symbol are to be extended quite as far on the sta as the stem of the at. When the sign is on a line, the upper tail of the natural extends only to the higher space and the lower tail to the next lower next space. If the sign is in a space, the tails go respectively to the adjacent lines above and below. The sharp ] was written in a variety of ways, beginning with what looks like and continuing with modied forms of the St. Andrew's cross. In our days, it resembles more closely an elaborated natural sign with an extension of both vertical lines and the horizontal dashes. Accents Accents marks are symbols for special or exaggerated stress upon any beat, or portion of a beat. These indications are sometimes called ancillary signs because they are not such basic symbols as the sta, clefs and notes. The accents are divided into categories: those for levels). percussive attack (higher dynamic levels) and those for pressure attack (lower The two principal signs for percussive accent are 4.6 Accent and 4.6 Marcato. The rst symbol is made identically whether it is placed above or below the notehead. The second sign is inverted when it is placed below any notehead, stem or beam. Choice between the two is governed by the degree of force and intensity desired. While the sign 4.6 Marcato is for a stronger attack, the sign 4.6 Accent is for a moderately sharp attack. The principal symbol for pressure accent is a short, heavy dash placed over or below the notehead - see Figure 4.6 Tenuto. In our work only the Accent and Staccatissimo accents was used to facilitate the classication process. Table 4.6: Accents. Percussive accents Accent Marcato Staccatissimo Tenuto Pressure accents Chapter 4. Segmentation and Classication Process 56 Time signatures Meter is a recurring pattern of stress, and an established arrangement of strong and weak pulsations. These pulsations are also known as beats. Meter is inextricably related to time signatures. This symbol is the vertical arrangement of two gures placed on the sta following the clef sign and any key signature. When placed on the sta, in the normal position, the numerator, which indicates the number of beats or the number of inner pulsations, occupies the two top spaces and the denominator, which indicates the unit of time, occupies the two bottom spaces. Despite of these terms, a time signature is not to be regarded as a fraction. There are several possible symbols for the representation of time signatures. Some are fairly common in practice, some are very rare. In this work we chose the common time and the cut time. Table 4.7: Time signatures. Common time Cut time 4.2 Segmentation Process The segmentation process used in this work is based on already existent algorithms [Fuj04, BBN01]. This step is a hard one because of several reasons. One of the main problems deals with the variability of the symbols. This is obvious when we consider handwritten music scores see Figure 4.4(a). However this variability is also found in printed scores, when we have scores from dierent editors see Figure 4.4(b). This variability will cause troubles of inconsistency in the size and shape of each object. Besides that, the number of possible arrangements of symbols primitives is very high. These will bring complexity and ambiguity in the segmentation process, because we do not know all the probable groups of symbols. Another problem is related with the sta line removal approach. Sta lines connect most of the objects and when we eliminate them we may cause breaks in the music symbols. These will bring undesirable imperfections that will dicult the segmentation process. Finally, a symbolic high density in the music score will also not help in this phase. With this brief description of the several problems that we can nd in this step we may conclude that it is clear that not all problems can be solved at this level, and the imperfections of the segmentation have to be taken into account during the next processing steps. The segmentation process architecture is presented in Figure 4.5 and an example of the process on a real music score is presented in Figure 4.6. This process consists in localizing and isolating 4.2. Segmentation Process 57 (a) Variability in the real scores. (b) Variability between dierent publishings in printed scores. Figure 4.4: An exemplicative example of the variability in the music symbols. the symbols in order to identify them. In this work, the symbols we want to recognize can be split into four dierent types: the symbols that are featured by a vertical segment with height greater than a threshold (notes, notes with ags and open notes), the symbols that link the notes (beams), the remaining symbols connected to sta lines (clefs, rests, accidentals and time signature), and the symbols above and under sta lines (notes, ties, slurs and accents). As a result, the segmentation method proposed was based on a hierarchical decomposition of the music image. sta lines removal, 1 This music sheet was therefore analyzed and split by stas, after the as being the rst step. Subsequently, the connected components1 If you note by the Figure 4.6 there are symbols that are not linked together but connected. This is due because in our connected component method we use a threshold of 5 pixels for the distance between objects. This is necessary because some music symbols became separate in the sta line removal phase. Chapter 4. Segmentation and Classication Process 58 Figure 4.5: Segmentation process I. Figure 4.6: Segmentation process II. were identied. To extract the symbols with appropriate size, a selection of the connected components detected in the previous step was done. The thresholds used for the height and width of the symbols were experimentally chosen as for the height and staspaceheight of the music symbols. staspaceheight and 16×staspaceheight for the width. These values took into account the features In each bounding box of the connected components repeated objects (e.g. when we have a group of notes and a sharp; in this case we have two distinct connected components: notes and sharp; however when we extract the bounding box of the notes the remove them to In the end, we are ready to nd and extract all sharp is also included see Figure 4.7) can exist. Thus, it was necessary to avoid multiple extractions of the same object. the music symbols. Next we will present the methods used to extract the symbols. Beam Detection It is reasonable to think that beams are one of the symbols with harder detection process. The vicissitude inherent in their action in the musical score is the principle reason for this diculty. Their shape and size, and the way they connect to each other and to other symbols are combined in a huge number of dierent ways. They are also prone to present inconsistency in the thickness and in the link with the stems see Figure 4.8. Thus, we propose a solution that just checks the presence of a segment of adequate height which connects the extremities of notes see Figure 4.9. Firstly, in the connected component detected, the stems are found 4.2. Segmentation Process 59 (a) Original image. (b) Image with two connect components: dierents shades of gray. Figure 4.7: An exemplicative example of two dierent connected components in the same bounding box. and removed to separate the beams from the notes. When we nd all the remaining objects we select those who have a width bigger than lower than 4×staspaceheight. 2×staspaceheight-staineheight and a height These values were selected experimentally and in accordance with the geometric features of the object in cause. The height value needs to take into account the inclination of the beams that may have. In the end, a check of the presence of a stem is made to validate the objects selected. It is important to state that the beams are grouped if the distance between them is lower than a half staspaceheight. Figure 4.8: An example of the existence of inconsistency in the beam thickness and in the link with the stems. Figure 4.9: Beam segmentation process. Notes, Notes with Flags and Notes Open Detection In this work, we decided to consider stems and note heads as one. In this way, we dened the geometric features of the notes we want to extract as the objects with a height bigger than a threshold, experimentally selected as twice staspaceheight, and a width limited by two values, Chapter 4. Segmentation and Classication Process 60 also experimentally chosen as half staspaceheight and 3×staspaceheight. To simplify this task the beams detected were removed before applying this algorithm see Figure 4.10. Figure 4.10: Notes segmentation process. Accidentals, Rests, Accents and Time Signature Detection Generally, these symbols have similar values for width and height. The procedure used to ex- 2 tract them was based on the combination of X-Y projection proles technique see Figures 4.11 and 4.12. The heuristics used were experimentally chosen and these values are arguably sensible. On the one hand, we have symbols that have vertical sequence of black pixels, for instance sharps, naturals and rests. On the other hand, we need to take into account the symbols topological position, because in this case we are trying to detect accents and time signature. Figure 4.11: An example of sharp detection on a real score. 2 Horizontal (Y) projection has as result a vector which the component i is the sum of all black pixels from row i in the image; Vertical (X) projection has as result a vector which the component j is the sum of all black pixels from column j in the image. 4.2. Segmentation Process 61 Figure 4.12: An example of sharp detection on a ideal score. Clefs, Ties and Slurs Detection These symbols have their own attributes, like a large width for the ties and slurs and a bigger height for the clefs. In both of them we have no presence of stems. With these in mind, staspaceheight < 2× numberstaspace × the projection proles procedure was used with specic heuristics. For clefs a < width < 4× staspaceheight and 2× staspaceheight < height staspaceheight + numberstaine × staineheight were used in the experiments. These values take into account the fact that clef symbols are the largest of all the signs, beginning below the sta and extending above it, as we said in the description 4.1. On the other hand, for the relations symbols (ties and slurs) the rules for extracting them were based in a large width. 4.2.1 Results With the purpose of evaluating the performance of the segmentation process and to study its behaviour with respect to some deformations curvature and rotation applied to the images, two dierent error metrics were considered: the percentage of false positive symbols and the percentage of symbols missed to detect. A false positive (failed symbol) happens when the algorithm falsely identies a musical symbol which is not one; and a false negative (missed symbol) is when the algorithm fails to detect a music symbol present in the score. These percentages are computed using the symbols position reference and the symbols position obtained by the segmentation algorithm. The references for the symbols position were partially done manually, that is, the music symbols that are in the ideal scores were manually splited in order to separate the composed existing symbols and the junction of the beams signs see Figure 4.13(b). In the resulting images the deformations described in Subsection 3.8.1 are applied. In the end, the music symbols positions are extracted using a Matlab © function that gives their bounding box. It is important to stress that this experiment was only made with ideal scores where deformations were applied, because it was our intention to see the behaviour Chapter 4. Segmentation and Classication Process 62 of the method with respect to some deformations and besides that, this was the only quickest method of obtaining a database of reasonable number with the symbols position references in 3 a semi-automatic way . (a) Music score before split the composed symbols. (b) Music score after split the composed symbols. Figure 4.13: An example of how the music symbols that are in the ideal scores were manually splited. An overview from the results from the Figure 4.14 we can conclude that the segmentation process has a good performance, since the percentage of correct symbols are largely over 80%. The high percentage of failed symbols that sometimes we nd are related with extra symbols or noise; that is, in the score that we are segmenting we have more symbols than in the reference score. For example numbers or symbols that are not our objective to classify. Besides this result, the performance of segmentation process is almost independent of intensity of the deformation, for the range of values considered see Figure 4.16. Figure 4.14: Results of the error metrics. 3 We have only split the composed symbols in 18 ideal scores with standard music notation; the deformations was obtained from these. In total we have 306 images. 4.3. Classication Process 63 Figure 4.15: Results of the error metrics for the rotation deformation. Figure 4.16: Results of the error metrics for the curvature deformation. 4.3 Classication Process In this Section several results obtained with the dierent classiers adopted are presented. These classiers were already presented in Section 2.3. For the neural network, k -nearest neighbour and support vector machines methods, each image of a symbol was initially resized to 20 × 20 pixels and then converted to a vector of 400 binary values. This process is sig- Chapter 4. Segmentation and Classication Process 64 nicantly dierent for the case of the hidden markov model. normalised in a sta height and width of In the latter, the images were 150 and 30 pixels, respectively. The segmentation-free 2-pixel approach was used to extract features from images [Pug06]. For this reason, a sliding window mechanism was used to produce a sequence of observations. In doing so, dependent observations are replaced by observations depending on the horizontal position of the window. We extract the same features that in [Pug06]: the largest black element (a(ni )/S , where S = height × width), n distinct connected components of black pixels, a(ni ) is the area of the largest black element and the area of the smallest white pixel (a(nj )/S , where the smallest white element and S = height × width) a(nj ) is the area of and the gravity centres. To assign classes to the images of music symbols the process was as follows: for the case of neural network, for each of the images converted in to vectors they were assigned to another one of length 14 this value correspond to the correct class and value 0 14 classes of symbols with value 1 in the otherwise. For the other three classiers the images were assigned to a vector numbering, between 1 and 14, depending on the class to which they belong. In the training phase, a generalization control capability on the classier was made. Putting dierently, the available data was divided into training, validation and test set, towards the assessment of the network topology with better performance. By that, through a cross validation we selected the number of neurons in the hidden layer. In other words, the cross validation consists in optimizing the model by changing the number of neurons in the hidden layer of the network using the training data; for a further evaluation of performance the validation set was tuned. The network model with better result was used to compute the expected error in the music symbols classication. Thus, the training data was added to the validation data to conduct a nal training for this network and therefore obtaining the nal model. The expected error was obtained computing the dierence between the target values and the values obtained by the network, using the test set. This process is slightly modied when we apply elastic matching to our database. The elastic matching method is adopted in the training data and training data with validation set. We increased the dataset by causing variability in our images in order to simulate distortions that can happen in real scores. Our objective is to prepare the classier for several possible situations that may occur in handwritten music symbols. The neural network structure used to classify a real score without elastic matching is illustrated in Figure 4.17. The input layer is composed by 400 units which output values are the vector values that represent the image of the music symbol. The output layer has respond to the 14 14 units that cor- classes of the symbols. When a pattern belonging to the class the target value for the unit i is presented, in the output layer is the maximum value. The range for the number of hidden neurons chosen in this work was hidden layer, for this case, was i 9 [1, 2, . . . , 9]. The number of neurons in the neurons see Table 4.8. This value was estimated during the training of the neural network. To each neuron is associated a log-sigmoid activation function. Model discriminant HMM was adopted to construct a model for each class [AYV01]. training of the HMM to learn the parameters of the models (λ = (A; B; π)) The was done by 4.3. Classication Process 65 Figure 4.17: Neural network topology. Handwritten symbols printed symbols With EM 9 8 Without EM 9 9 Table 4.8: Number of neurons in the hidden layer (not)using elastic matching method (EM) in the dataset. 4 the Baum-Welch algorithm . The goal of classication is to decide which class the unknown sequence belongs to, based on the model obtained in the training phase. These symbols are 5 classied on the basis of the maximum likelihood ratio which is obtained by Viterbi algorithm . In Table 4.9 the number of states obtained by cross validation are represented. The range was 4 5 See Appendix A.7 for more details. See Appendix A.8 for more details. Chapter 4. Segmentation and Classication Process 66 from 3 to 8. Handwritten symbols Printed symbols 8 8 7 8 With EM Without EM Table 4.9: The number of states in the hidden Markov model with and without the elastic matching method (EM) in the dataset. As in the previous methods, the value of C of and γ, k, in the k -nearest neigbor classier, and the values in the support vector machines, were also obtained by cross validation and the expected error was computed with the test set. The k value needs to be suciently big to minimize the probability of error occurrence in the classication and suciently tiny for the closest k points becoming closer to the point to classify. In doing so, for a obtained a k = 1, 2, . . . , 10, we k = 1 for a handwritten and printed dataset with and without the application of the elastic matching. The parameters for the support vector machines are presented in Table 4.10. The range was 23 for γ 2, 21.25 , 21.5 ,. . ., 24.75 , 25 for C value estimation, and 2−7 , 2−6.75 , 2−6.5 ,. . ., 22.75 , value estimation. Handwritten symbols Printed symbols C γ C γ With EM C = 23.5 = 11.3137 γ = 2−7 = 0.0078 C = 21.25 = 2.3784 γ = 2−5.75 = 0.0186 Without EM C = 24.5 = 22.6274 γ = 2−6 = 0.0156 C = 21.25 = 2.3784 γ = 2−5.75 = 0.0186 Table 4.10: The values of the parameters C and γ in the support vector machines with and without the elastic matching method (EM) in the dataset. 4.3.1 Results Towards musical handwritten symbol classication, several sets of symbols were extracted from dierent musical scores to train the classiers. In Table 4.11 the number of handwritten printed music symbols per class used in the training phase of the classication structures is represented. The symbols were grouped according to their shape. In doing so, the rests signs were divided in two groups RestI and RestII; see Table 4.11. Besides that we include the unknown class to classify those symbols that do not t into any of the other classes. In total we have 3222 handwritten music symbols and 2521 printed music symbols. 4.3. Classication Process Symbol 67 Class Total number Accent Class Total number 189 AltoClef 201 BassClef 26 Beam 291 Beam 438 Flat 155 Flat 230 Natural 127 Natural 317 Note 304 Note 466 NoteFlag 120 NoteFlag 122 NoteOpen 309 NoteOpen 208 TieSlur 67 RestI 135 RestI 63 RestII 401 RestII 321 Sharp 345 Sharp 13 Staccatissimo 21 Time 122 TrebleClef 99 TrebleClef 305 Unknown 404 Unknown 404 Handwritten Music Symbol Printed Symbols Music Symbols Table 4.11: Classes list of handwritten and printed music symbols. Chapter 4. Segmentation and Classication Process 68 Results obtained Without Elastic Matchingk Accent BassClef Beam Flat Natural Note NoteFlag NoteOpen RestI RestII Sharp Staccatissimo TrebleClef Unknown 99% CI for the Expected performed in percentage: average (standard deviation) Neural network Nearest neighbor 100% 100% 98,17% 100% 100% 97,41% 83,33% 66,67% 100% 100% 96,51% 100% 100% 76,24% Support vector machines 89,36% 100% 96,33% 100% 100% 95,69% 93,33% 50% 96,97% 100% 98,84% 100% 83,33% 93,01% Hidden Markov model 95,83% 55,56% 91,89% 81,36% 97,50% 69,49% 62,50% 22,22% 100% 93,07% 85,23% 100% 96,30% 40,59% 89,36% 0% 86,24% 78,95% 83,54% 84,48% 70% 0% 78,79% 99% 79.07% 60% 58.33% 58.42% [78.85 (1.21); 82.88 (4.47)] [93.41 (0.34); 94.53 (1.24)] [94.59 (0.07); 96.90 (0.27)] [76.58 (0.57); 81.04 (2.09)] Table 4.12: Results obtained with the test set in the classication process for the handwritten music symbols. AltoClef Beam Flat Natural Note NoteFlag NoteOpen TieSlur RestI RestII Sharp Time TrebleClef Unknown 99% CI for the Expected performed in percentage: average (standard deviation) Neural network Nearest neighbor 100% 100% 100% 100% 98,68% 93,33% 100% 93,33% 100% 100% 100% 100% 100% 82,89% Support vector machines 100% 100% 97,37% 100% 100% 90% 100% 86,67% 100% 100% 100% 100% 100% 97,37% Hidden Markov model 78% 97,26% 92,68% 100% 93,42% 63,33% 88,31% 83,33% 100% 86,67% 100% 33,33% 36,67% 73,68% 94% 93,15% 100% 100% 88,16% 96% 90,67% 63.33% 100% 60% 98.75% 0% 93.33% 67,11% [87.60 (0.41); 88.97 (1.53)] [95.93 (0.24); 96.74 (0.90)] [97.30 (0.30); 98.29 (1.10)] [84.17 (0.32); 86.71 (1.19)] Table 4.13: Results obtained with the test set in the classication process for the printed music symbols. From the results obtained for the handwritten music symbols we can conclude that the classier with better performance was the support vector machines with a the expected performance [94.59%; 96.90%]. 99% condence interval for Besides that, Hidden Markov Model performed better than neural network since the latter had an higher standard deviation than the other and it also had two classes (BassClef and NoteOpen) of symbols with 0%. k See the tables of confusion in the Appendix B. 4.3. Classication Process 69 The results obtained for the printed music symbols show us that the support vector machine is again the better classier compared with the others with a expected performance 99% condence interval for the [97.30%; 98.29%]. Results obtained With Elastic Matching∗∗ The database during the cross validation process was divided into train, validation and test sets. The deformations caused by the deformation function 2.10 of the elastic matching with M = 1, 2, 3 and N = 1, 2, 3 were applied in the training data and training data with validation set. Accent BassClef Beam Flat Natural Note NoteFlag NoteOpen RestI RestII Sharp Staccatissimo TrebleClef Unknown 99% CI for the Expected performed in percentage: average (standard deviation) Neural network Nearest neighbor 100% 100% 95,41% 100% 100% 96,55% 86,67% 66,67% 100% 100% 98,84% 100% 100% 63,37% Support vector machines 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% Hidden Markov model 91,66% 66,67% 96,40% 71,19% 92,50% 77,97% 28,13% 22,22% 100% 85,15% 81,82% 100% 77,78% 30,69% 91,50% 0% 86,24% 71,93% 89,87% 74,14% 40% 0% 45,45% 93% 88,37% 20% 58.33% 44,55% [74.88 (0.84); 77.67 (3.09)] [92.24 (0.32); 93.29 (1.17)] [87.79 (1.95); 96.00 (8.82)] [70.89 (0.67); 76.17 (2.47)] Table 4.14: Results obtained with the test set in the classication process for the handwritten music symbols. The results lead us to conclude that the application of the elastic matching to the music symbols does not outperform the performance of the classiers without elastic matching. However, in handwritten music symbols with higher similarities in their shapes, the process had a better behaviour than the classiers without elastic matching see Table 4.16. This feature lead us to suggest a study, in a future work, where the application of the deformation function used only in these types of symbols to determine if the expected performed is augmented. The same does not happen with printed music symbols. Our explanation resides on the fact that these signs do not have high distortions in their shapes. Thus the elastic matching method, in presence of such music symbols, makes the performance of the classiers slightly lower than with handwritten music symbols. The best classier for the test set of real scores was the support vector machines with a condence interval for the expected performance [87.79%; 96.00%]. 99% However, the standard deviation was very high [1.95%; 8.82%]. Consequently, I sugest the nearest neighbor classier as ∗∗ See the tables of confusion in the Appendix B. Chapter 4. Segmentation and Classication Process 70 AltoClef Beam Flat Natural Note NoteFlag NoteOpen TieSlur RestI RestII Sharp Time TrebleClef Unknown 99% CI for the Expected performed in percentage: average (standard deviation) Neural network Nearest neighbor 98% 100% 100% 100% 100% 90% 98,70% 90% 100% 100% 98,77% 100% 100% 78,95% Support vector machines 90% 90,41% 94,74% 96,77% 97,37% 93,55% 90,79% 86,67% 93,75% 93,33% 97,5% 100% 100% 100% Hidden Markov model 78% 100% 97,37% 100% 93,33% 50% 80,52% 70% 100% 86,67% 100% 33,33% 100% 43,42% 86% 97,26% 94,74% 87.10% 84.21% 40% 92.21% 13.33% 100% 0% 97.5% 0% 93,33% 56,57% [77.66 (1.34); 82.11 (4.93)] [95.34 (0.22); 96.06 (0.80)] [93.13 (0.26); 94.23 (1.18)] [73.00 (1.00) 88.71 (3.67)] Table 4.15: Results obtained with the test set in the classication process for the printed music symbols. Neural network Nearest neighbor Support vector machines With EM Without EM With EM Without EM With EM Without EM Natural 89.87% 83.54% 100% 100% 100% 100% Sharp 88.37% 79.07% 98.84% 96.51% 100% 98.84% Table 4.16: The performed of the natural and sharp symbols. the best classier with a 99% condence interval for the expected performance [92.24%; 93.29%], because with a lower standard deviation ([0.32%; 1.17%]) the results will not largely span in the classier performance. The best classier for the synthetic scores was the nearest neighbor classier with a 99% condence interval for the expected performance [95.34%; 96.06%]. Part IV Conclusions and Future Work 71 Chapter 5 Conclusion This dissertation had the aim to overcome the predicament of musical symbol recognition in handwritten scores through research and application of recent techniques of machine learning and articial intelligence to this problem. In fact, despite advances made in the area of research in optical musical recognition algorithms, as we have seen in the brief description of state of the art, several open problems still exist with handwritten music sheets. On the one hand, the scores tend to be rather irregular and conditioned by the authors' own writing style, and the quality of the paper in which it is written might have degraded throughout the years, making it a lot harder to correctly identify its contents. On the other hand, they are likely to have changes of size, shape and intensity of handwritten symbols by the same author into the same score and the sta lines may be tilted one way or another on the same page. They also may be curved and may have discontinuities. As a result, the detection and recognition process in the handwritten music scores becomes more complicated than in printed music sheets. The rst challenge faced by an OMR system was at sta lines detection and removal. These operations were one of the most important phases of musical symbol recognition, because they determine the results for subsequent proceedings. It was seen that the reasons for detecting and removing the sta lines lie on the need to isolate the musical symbols for a more ecient and correct detection of each symbol present on the score. The work of this dissertation resulted in a robust algorithm for the automatic detection of sta lines in music scores based in stable paths. The proposed method used a very simple but fundamental principle to assist detection and avoids the diculties typically asserted by symbols superimposed on sta lines. The main idea was to consider the sta lines as the result of the shortest path between the two margins of the music sheet, giving preference to black pixels. This approach for the sta line detection algorithm was adapted to a wide range of image conditions, thanks to its intrinsic robustness to skewed images, discontinuities, and curved sta lines. The proposed method was also robust to discontinuities in sta lines (due to low-quality digitalization or low-quality originals) or sta lines as thin as one pixel. In order to take full advantage of the method, existing sta line removal algorithms were enhanced by using the stable paths method as its rst processing step. Several tests that enabled improvements in this proposed approach that induced better results were also performed. The encouraging results obtained led us to consider that investigation in the detection of music symbols would be beneted from the improved sta line detection and removal. In doing so, the next step was the application of already existent algorithms for the 73 Chapter 5. Conclusion 74 segmentation process using in the previous step the stable paths approach for the detection of the sta lines. Since in this work we dealt with handwritten musical scores the segmentation and classication steps were hard. On one hand, we had a huge variability in the music symbols that caused inconsistency problems in the size and shape of each object. On the other hand, the sta lines removal may cause breaks in the music symbols that can insert imperfections into them. Besides that, we also had complexity and ambiguity problems related with the number of possible arrangements of music primitives. The segmentation method was based on a hierarchical decomposition of the music image and the symbols that we wanted to recognize were split into four dierent types: the symbols that are featured by a vertical segment, the symbols that link the notes, the remaining symbols connected to sta lines and the symbols above and under sta lines. In the music symbol classication a profound comparative study with some classiers, as the hidden Markov model and support vector machine, was presented. A new methodology based on elastic matching was used in our dataset with conjunction with others classiers. Our aim was to simulate controlled distortions, in the database, that can happen in real scores to prepare the classier for several undesirable situations that may occur in handwritten music sheets. With the study done, we concluded that the application of the elastic matching to the music symbols did not outperform the performance of the classiers without this method. However, we also saw that the results for classiers with elastic matching presented a better behaviour for the handwritten music symbols with higher similarities in their shapes. The best classier for handwritten music symbols was support vector machines with a performance higher than 94.59% for the test set without elastic matching and 93.10% for the test set with elastic matching. The best classier for printed music symbols without elastic matching was support vector machines with a performance higher than 97.30% and nearest neighbor classier for the test set with elastic macthing with a performance higher than 95.34%. 5.1 Future Work The various approaches in OMR to musical symbols segmentation and classication are still below the expectations for handwritten musical scores. It is intention of this project, in a future work, that the proposed methodology incorporates in a natural way the prior knowledge of the musical rules in the recognition of symbols, in order to overcome the limitations of the current approaches that are incapable of dealing, in a robust form, with specicities of the handwritten music. The new proposed methodology should also be naturally adaptable to manuscript images and to dierent music notations. Specically, we intend to explore our initial work in several directions: Continue to investigate new methods of optical music recognition for handwritten musical scores and for dierent music notations. This line of research involves a study and a profound understanding of the latest techniques of pattern recognition, machine learning and inductive logic programming. The merger of rules and techniques from dierent areas should help overcome the problems of the existing algorithms. 5.1. Future Work 75 Integration of the algorithms developed in an OMR system with remote access via the internet to a wide corpus of unpublished handwritten music encoded in a adequate format; The creation of a system like this will not only centralize as much information as possible but will also serve to preserve this corpus in a way that is easily accessible for browsing, analysing and downloading, while keeping the scores in their original format along with their digital counterpart. The availability of a system with these features will contribute for the preservation of the musical heritage. Specically, one of the objectives for a future work, is the investigation of new methods for the segmentation phase. It is our intention to incorporate Hidden Markov Models in this step. It will be necessary to perform an intense study and investigation in this eld. Besides that, a simple detection of the music symbols is not sucient. It is also necessary to perform their interpretation and to understand the connections between them to have a recognition of the analyzed score. Therefore, the denition of these relations are needed before they are implemented in a system. In this manner, the integration of syntactic music rules is required in the recognition process to acquire semantic information. Thus, this project has as another goal the research and application of inductive logic programming techniques. References [AS07] Shai Avidan and Ariel Shamir. Seam carving for content-aware image resizing. ACM Trans. Graph., 26(3):10, 2007. [AYV01] N. Arica and F.T. Yarman-Vural. An overview of character recognition focused Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 31(2):216233, May 2001. on o-line handwriting. [Bai97] D. Bainbridge. Extensible Optical Music Recognition. PhD thesis, Department of Computer Science, University of Canterbury, Christchurch, NZ, 1997. [BBN01] P. Bellini, I. Bruno, and P. Nesi. Optical music sheet segmentation. of Music, 2001. Proceedings. First International Conference on, Web Delivering pages 183190, Nov. 2001. [BS00] Marija Bojovic and Milan D. Savic. Training of hidden markov models for cursive handwritten word recognition. In ICPR '00: Proceedings of the International Conference on Pattern Recognition, page 1973, Washington, DC, USA, 2000. IEEE Computer Society. [Cap08] Artur Capela. Reconhecimento de símbolos musicais manuscritos na framework gamera. Master's thesis, Faculdade de Engenharia da Universidade do Porto, 2008. [Car89] N. P. Carter. Publishing. [CC93] Automatic Recognition of Printed Music in the Context of Electronic PhD thesis, Departments of Physics and Music, 1989. B. Coüasnon and J. Camillerapp. Using grammars to segment and recognize music Proc. of DAS-94: International Association for Pattern Recognition Workshop on Document Analysis Systems, pages 1527, Kaiserslautern, 1993. scores. + [CCR 08] In Jaime S. Cardoso, Artur Capela, Ana Rebelo, Carlos Guedes, and Joaquim Pinto da Costa. Sta detection with stable paths, 2008. IEEE Transaction on Pattern Analysis and Machine Intelligence (TPAMI), Submited. [CCRG08a] Artur Capela, Jaime S. Cardoso, Ana Rebelo, and Carlos Guedes. Integrated recognition system for music scores, 2008. International Computer Music Conference (ICMC), Accepted. [CCRG08b] Jaime S. Cardoso, Artur Capela, Ana Rebelo, and Carlos Guedes. A connected path approach for sta detection on a music score, 2008. Conference on Image Processing (ICIP), Accepted. 76 IEEE International References [Coü96] 77 Segmentation et reconnaissance de documents guidées par la connaissance a priori: application aux partitions musicales. PhD thesis, Université de B. Coüasnon. Rennes, 1996. [CRCG08] Artur Capela, Ana Rebelo, Jaime S. Cardoso, and Carlos Guedes. Sta line de- Proceedings of the International Conference on Signal Processing and Multimedia Applications (SIGMAP 2008), pages tection and removal with stable paths. In 263270, 2008. [DDCF07] Christoph Dalitz, Michael Droettboom, Bastian Czerwinski, and Ichiro Fujigana. Sta removal toolkit for gamera, 2005-2007. http://music-staves. sourceforge.net. [DDCF08] Christoph Dalitz, Michael Droettboom, Bastian Czerwinski, and Ichiro Fujigana. A comparative study of sta removal algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:753766, 2008. [Die05] Reinhard Diestel. Graph Theory. Graduate Texts in Mathematics. Springer-Verlag, third edition, 2005. [eeFc02] Vojt ech ech Franc and Václav Hlavá c. Multi-class support vector machine. Technical report, 2002. [FLS05] Alicia Fornés, Josep Lladós, and Gemma Sánchez. Primitive segmentation in old handwritten music scores. In Wenyin Liu and Josep Lladós, editors, 3926 of [Fuj04] GREC, volume Lecture Notes in Computer Science, pages 279290. Springer, 2005. Ichiro Fujinaga. Sta detection and removal. In Susan George, editor, Perception of Music Notation: On-Line and O-Line Recognition, Visual pages 139. Idea Group Inc., 2004. [Fuk90] Keinosuke Fukunaga. Introduction to statistical pattern recognition (2nd ed.). Aca- demic Press Professional, Inc., San Diego, CA, USA, 1990. [Gal86] Zvi Galil. Ecient algorithms for nding maximum matching in graphs. Comput. Surv., 18(1):2338, 1986. [GWE04] Rafael C. Gonzalez, Richard E. Woods, and Steven L. Eddins. age processing using MATLAB, pages 405407. ACM Digital Im- Upper Saddle River, NJ : Pearson/Prentice-Hall, 2004. [Has70] W. K. Hastings. applications. [Hay98] Monte carlo sampling methods using markov chains and their Biometrika, 57(1):97109, April 1970. Simon Haykin. Neural Networks: A Comprehensive Foundation (2nd Edition). Prentice Hall, July 1998. [HL02] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415425, 2002. References 78 [JZ97] Anil K. Jain and Douglas Zongker. Representation and recognition of handwritten digits using deformable templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(12):13861391, 1997. [JZL96] Anil K. Jain, Yu Zhong, and Sridhar Lakshmanan. formable templates. Object matching using de- IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 18(3):267278, 1996. + [KHB 00] Tapas Kanungo, Robert M. Haralick, Henry S. Baird, Werner Stuezle, and David Madigan. A statistical, nonparametric methodology for document degradation IEEE Transactions on Pattern Analysis and Machine Intelli- model validation. gence, 22(11):12091223, 2000. [KI90] H. Kato and S. Inokuchi. A recognition system for printed piano music using Proceedings of the International Association for Pattern Recognition Workshop on Syntactic and Structural Pattern Recognition, musical knowledge and constraints. In pages 231248, 1990. [KPM96] Gary E. Kopec, Phil A. Chouxerox Parc, and David A. Maltzcarnegie. Markov source model for printed music decoding. Journal of Electronic Imaging, pages 714, 1996. [LCL93] I. Leplumey, J. Camillerapp, and G. Lorette. A robust detector for music staves. Proceedings of the International Conference on Document Analysis and Recognition, pages 902905, 1993. In [LS88] L. Lam and Ching Y. Suen. Structural classication and relaxation matching of totally unconstrained handwritten zip-code numbers. Pattern Recogn., 21(1):19 32, 1988. [MA98] E. Mayoraz and E. Alpaydim. Support vector machines for multiclass classication. Technical report, 1998. [Mah82] J. V. Mahoney. Automatic analysis of music score images. B.Sc thesis, 1982. [Mat85] T. Matsushima. vision system). Automated high speed recognition of printed music (wabot2 Advanced Robotics 1985. ICAR 1985. International Conference on, pages 477 482, 1985. [MMM04] Youichi Mitobe, Hidetoshi Miyao, and Minoru Maruyama. A fast hmm algo- rithm based on stroke lengths for on-line recognition of handwritten music scores. IWFHR '04: Proceedings of the Ninth International Workshop on Frontiers in Handwriting Recognition, pages 521526, Washington, DC, USA, 2004. IEEE In Computer Society. [MN96] H. Miyao and Y. Nakano. Note symbol extraction for printed piano scores using neural networks. D:548554, 1996. IEICE TRANSACTIONS on Information and Systems, E79- References [MO07] 79 Hidetoshi Miyao and Masayuki Okamoto. Stave extraction for printed music scores using DP matching. Journal of Advanced Computational Intelligence and Intelli- gent Informatics, 8:208215, 2007. [MP88] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. pages 1527, 1988. [Ng95] Kia Ng. Automated computer recognition of music score. PhD thesis, University of Leeds, 1995. [Nis95] Hirobumi Nishida. A structural model of shape deformation. Pattern Recognition, 28(10):16111620, 1995. [Pre70] D. Prerau. Computer pattern recognition of standard engraved music notation. PhD thesis, Massachusetts Institute of Technology, 1970. [Pru66] D. Pruslin. Automatic recognition of sheet music. PhD thesis, Massachusetts Institute of Technology, 1966. [Pug06] Laurent Pugin. Optical music recognition of early typographic prints using hidden markov models. In [RB05] ISMIR, pages 5356, 2006. F. Rossant and I. Bloch. Optical music recognition based on a fuzzy modeling of symbol classes and music writing rules. Image Processing, 2005. ICIP 2005. IEEE International Conference on, 2:II53841, Sept. 2005. + [RCC 07] Ana Rebelo, Artur Capela, Joaquim F. Pinto da Costa, Carlos Guedes, Eurico Carrapatoso, and Jaime S. Cardoso. A shortest path approach for sta line detection. Automated Production of Cross Media Content for Multi-Channel Distribution, 2007. AXMEDIS '07. Third International Conference on, pages 7985, Nov. 2007. + [RCF 93] R. Randriamahefa, J.P. Cocquerez, C. Fluhr, F. Pepin, and S. Philipp. Printed Document Analysis and Recognition, 1993., Proceedings of the Second International Conference on, pages 898901, Oct 1993. music recognition. [Rea69] Gardner Read. Music Notation: A Manual of Modern Practice (2nd ed.). Ta- plinger, New York, 1969. [RHW88] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. pages 673695, 1988. [Ros02] Florence Rossant. A global method for music symbol recognition in typeset music sheets. [RP96] Pattern Recognition Letters, 23(10):11291141, 2002. K.T. Reed and J.R. Parker. Automatic computer recognition of printed music. Pattern Recognition, 1996., Proceedings of the 13th International Conference on, 3:803807 vol.3, Aug 1996. References 80 [RT88] J. W. Roach and J. E. Tatem. Using domain knowledge in low-level visual processing to interpret handwritten music: an experiment. Pattern Recognition, 21(1):33 44, 1988. [SDV06] C.H. Papadimitriou S. Dasgupta and U.V. Vazirani. Algorithms, pages 169179. McGraw-Hill Higher Education, 2006. [Szw05] Mariusz Szwoch. A robust detector for distorted music staves. In Computer Anal- ysis of Images and Patterns, pages 701708. Springer-Verlag, Heidelberg, 2005. [TK03] Sergios Theodoridis and Konstantinos Koutroumbas. Pattern recognition, chapter 4.6. Academic Press, second edition, 2003. [TMD99] Michael Thulke, Volker Märgner, and Andreas Dengel. A general approach to DAS '98: Selected Papers from the Third IAPR Workshop on Document Analysis Systems, pages 4357, quality evaluation of document segmentation results. In London, UK, 1999. Springer-Verlag. [TSM06] Fubito Toyama, Kenji Shoji, and Juichi Miyamichi. Symbol recognition of printed piano scores with touching symbols. pages 480483, 2006. [Vap98] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, September 1998. [Wak94] T. Wakahara. Shape matching using lat and its application to handwritten numeral recognition. [WFJC01] IEEE Trans. Pattern Anal. Mach. Intell., 16(6):618629, 1994. Yuan-Kai Wang, Kuo-Chin Fan, Yau-Tarng Juang, and Tai-Hong Chen. hidden markov model for chinese business card recognition. In ICIP (1), Using pages 11061109, 2001. [WH88] Bernard Widrow and Marcian E. Ho. Adaptive switching circuits. pages 123134, 1988. Part V Appendix 81 Appendix A Fundamentals A.1 Primal Problem vs Dual Problem The primal problem for the case of nonseparable training classes space is given by {(xi , di )}N i=1 in the feature N X 1 T w w+C ξi 2 min w,b,ξi (A.1) i=1 di (wT xi + b) ≥ 1 − ξi s.a where the non negatives slack variables, {ξi }N i=1 , i = 1, 2, ..., N measure the deviation of a data point from the ideal condition of pattern separability and the parameter C controls the relation between the complexity of the classier and the number of nonseparable points. It is possible to nd the solution of the quadratic problem by determining the saddle point of the Lagrangian function that is given by: J(w, b, ξ, α, µ) = N N N h i X X X 1 T w w+C ξi − αi di (wT xi + b) − (1 − ξi ) − µi ξ i 2 i=1 i=1 i=1 w, b which has to be minimized with respect to respect to the nonnegative multipliers J(w, b, ξ, α, µ) αi µi . e and ξi ; (A.2) it also has to be maximized with The parameters that minimize the function meet the following conditions: N X ∂J =w− αi di xi = 0 ∂w (A.3) i=1 N X ∂J =− αi di = 0 ∂b (A.4) i=1 ∂J = C − µi − αi = 0 i = 1, 2, . . . , N ∂ξi (A.5) From these conditions we have w= N X i=1 83 αi di xi (A.6) Appendix A. Fundamentals 84 N X αi di = 0 (A.7) i=1 µi + α i = C i = 1, 2, . . . , N (A.8) If we expand the equation A.2, term by term, we have: J(w, b, ξ, α, µ) = N N N N N N X X X X X X 1 T w w+C ξi − αi di wT xi + b αi di + αi − αi ξi − µi ξi 2 i=1 i=1 i=1 i=1 i=1 i=1 (A.9) The fourth term in the right-hand side is zero by the optimality condition A.7 obtained above. Besides that, from the condition A.6 we have N X wT w = N αi di wT xi = i=1 N 1 XX αi αj di dj xTi xj 2 i=1 j=1 Replacing the condition A.8 and the result obtained above in the equation A.9, we obtain the following dual objective function Lagrangiana: JD = N X i=1 N αi − N 1 XX αi αj di dj xTi xj 2 (A.10) i=1 j=1 which gives the inferior limit of the function objective A.1 to any admissible point. In doing so, it is possible to formulate the dual problem for a non separable training sample{(xi , di )}i=1 N as follows: max α N X i=1 s.a N αi − w= PN PN N 1 XX αi αj di dj xTi xj 2 i=1 j=1 i=1 αi di xi i=1 αi di = 0 (A.11) C − µi − αi = 0 i = 1, 2, ..., N αi ≥ 0, µi ≥ 0 i = 1, 2, ..., N Adding the complementary conditions of the Karush-Kuhn-Tucker αi di (wT xi + b) − (1 − ξi ) = 0 µi ξi = 0 to the conditions A.6A.8 for follows: αi di (wT xi + b) − (1 − ξi ) ≥ 0 i = 1, . . . , N (A.12) (A.13) (A.14) it is possible reformulate the dual problem A.11 as A.2. Error Backpropagation Algorithm max α s.a N X i=1 85 N αi − PN N 1 XX αi αj di dj xTi xj 2 i=1 j=1 (A.15) i=1 αi di = 0 0 ≤ αi ≤ C i = 1, 2, ..., N A.2 Error Backpropagation Algorithm The error backpropagation method is a neural network train without feedback. This tech- nique requires a dierentiable transfer function to be possible the minimization of the cost function [TK03, Hay98]. Since the multilayer perceptron architectures apply the activation function degree: 1, f (x) = 0, se x>0 se x<0 (A.16) which is discontinues in the point 0, this algorithm uses the family of continuous dierentiable functions which is the family of sigmoid function: f (x) = 1 , where a > 0 1 + exp(−ax) is the slope parameter (A.17) A.2.1 Mathematic explanation The learning method of using the error backpropagation uses the gradient descending technique to minimize the cost function. This function expresses the mean square error dened as follows: E(i) = kL N X 1X i=1 where ym (i) and ŷm (i) the output neuron 2 m=1 (ym (i) − ŷm (i))2 (A.18) represent respectively the output obtained and the desired output for m. kL is the total number of the outputs neurons and of training samples available (y(i), x(i)). N is the total number The updating of the weights of the edges is made from wjr (new) = wjr (old) − µ ∂E(i) ∂wjr wjr (old) is the current estimate of the unknow weights, ∂E(i) ∂wjr is the gradient of error function. where Let (A.19) µ is the learning coecients and υjr be the sum of the input weights in the j th neuron in the rth layer and yjr the corresponding output after the activation function. If in the r− f (.) j th neuron in the rth of the latter neuron is layer,j neuron, k = 1, 2, ..., kr−1 , r wjk is the current estimate of the corresponding is the output of the 1th layer for the ith training pair and weight in the function ykr−1 = 1, 2, ..., kr , k th then the argument of the activation Appendix A. Fundamentals 86 kr−1 υjr (i) = X kr−1 X r r−1 r wjk yk (i) + wj0 = k=1 r where y0 (i) = 1, ∀r, i. (A.20) k=0 For the output layer, we have and for the network input we have r r−1 wjk yk (i) r= 1, ykr (i) r = L, ykr (i) = ŷk (i), = xk (i), with with k = 1, 2, ..., k0 . k = 1, 2, ..., kL By the chain rule in dierentiation, we have ∂E(i) ∂E(i) ∂υjr (i) = ∂wjr ∂υjr (i) ∂wjr (A.21) from A.20 we obtain ∂ r υ (i) ≡ ∂wjr j Let us dene, ∂ r r υj (i) ∂wj0 . . . ∂ r ∂wjk r−1 υjr (i) = y r−1 (i) (A.22) ∂E(i) = δjr (i) ∂υjr (i) (A.23) then A.19 becomes wjr (new) = wjr (old) −µ N X δjr (i)y r−1 (i) (A.24) i=1 Summarizing the error backpropagation algorithm: 1. Inicialization : 2. Forward computations : Initialize all the weights with small random values. r pute υj (i) and yjr (i) For each of the training feature vectors = f (υjr (i)), for x(i), i = 1, 2, ..., N j = 1, 2, ..., kr , r = 1, 2, ..., L com- from the sigmoid f . Compute the cost function for the current estimate of weights from E(i) = PkL 1 PkL 1 PkL 2 2 L 2 m=1 (ym (i) − ŷm (i)) and E(i) ≡ 2 m=1 em (i) ≡ 2 m=1 (f (vm (i)) − ym (i)) . function PN 1 i=1 2 3. Backward computations : δjL (i) for r For each i = 1, 2, ..., N and j = 1, 2, ..., kr compute Pkr δjL (i) from r f 0 (υ r−1 (i)), = ej (i)f 0 (υjL (i)) and in the end compute δjr−1 (i) from δjr−1 (i) = k=1 δjr (i)wkj j = L, L − 1, ..., 2 and j = 1, 2, ..., kr ; f 0 (.) represents the dierentiation with respect to argument. 4. Update the weights : wjr (new) = wjr (old) r = 1, 2, ..., L and j = 1, 2, ..., kr PN r − µ i=1 δj (i)y r−1 (i). For A.2.2 Graphic explanation To facilitate the layout of this method we will only consider a neural network with three layers see Figure A.1. Each neuron is composed by the sum of products of the weights with the input signals and the sigmoid activation function. The signal e represents the output signal of A.2. Error Backpropagation Algorithm 87 Figure A.1: Neural network architecture with three layers, two inputs and one output. the sum and y = f (e) represents the output signal of the neuron. To prepare the network for the classication it is necessary to train a data set constituted by the input signals (x1 and x2) associated with their targets z. This training is an interac- tive process, because for each interaction the node weights are modied using a new training dataset, in order to minimize the cost function. This modication is computed using the error backpropagation algorithm. In doing so, each learning step starts by forcing both input signals of the training set to determine the values of the output signals for each neuron in each network layer. Figure A.2 shows how the signal propagates through the network. The symbols xm are the weights of the edges between the network input the symbols yn represent the output signal in the neuron In the next step, the output signal y and the neuron w(xm) n n in the input layer; n. is compared with the value of the desired output (z ) see Figure A.3. The dierence is called error signal d of the neuron output layer. Since the output values of the neurons in the hidden layers are unknown, it is possible to compute the error signal. In this manner, the aim is to propagate the error signal the neurons - see Figure A.4. The value of the weights wmn d in backward mode to all is equal to the value of the weights to calculate the output value. Only the direction of the sense of the data is changed, that is, the signals propagate itself, for all layers of network, from the output to the input, one behind the others. When the error signal for each neuron has been computed, the weights coecients for each 1 input neuron are modied see Figura A.5 . 1 df (e) de represents the derivative of the activation function of the neuron whose weights are modied. Appendix A. Fundamentals 88 (a) Forward propagation of the signal in the input layer. (b) Forward propagation of the signal in the input layer. [ (c) Forward propagation of the signal in the input (d) Forward propagation of the signal in the hidden layer. layer. (e) Forward propagation of the signal in the hidden layer. Figure A.2: (f) Forward propagation of the signal in the output layer. Forward propagation of the signal in the neural network. Figure A.3: Comparison between the signal output and the target. A.2. Error Backpropagation Algorithm 89 (a) Backward propagation of the signal in the output layer. (b) Backward propagation of the signal in the output layer. (c) Backward propagation of the signal in the hidden layer. (d) Backward propagation of the signal in the hidden layer. (e) Backward propagation of the signal in the hidden layer. Figure A.4: Propagation of the error signal d in backward mode in the neural network. Appendix A. Fundamentals 90 (a) The calculation of weight in the input layer. (b) The calculation of weight in the input layer. (c) The calculation of weight in the input layer. (d) The calculation of weight in the hidden layer. (e) The calculation of weight in the hidden layer. (f) The calculation of weight in the output layer. Figure A.5: The calculation of weights in the neural network. A.3. Dalitz's Algorithm 91 A.3 Dalitz's Algorithm Dalitz algorithms is a generalization of the method described in [MO07]. It stars by estimating the values for the sta line thickness and the sta space height. These values are estimated by the technique used in Fujinaga [Fuj04]: the most frequent black-runs represents the sta line staine_height ) and the most common white-runs represents the vertical line distance within the same sta (staspace_height ). This technique starts by computing the vertical runheight ( lengths representation of the image. The process of Dalitz for nding the sta lines operates on a set of stasegments and requires methods not only for linking two of those stasegments horizontally and vertically, but also for merging two segments with overlapping positions into one. In the following steps the algorithm will be described: 1. Add vertical links between stasegments with a vertical distance around staine_height +staspace_height. 2. Add horizontal links between adjacent stasegments possibly belonging to the same sta line. 3. Partition the resulting graph into connected subgraphs; each subgraph that is wide and high enough corresponds to a sta. 4. All stasegments within a system are labelled as belonging to a certain sta line. Segments of the same line at the same horizontal position are merged into one segment. 5. Due to ledger lines, ties and beams, some subgraphs will contain too many sta lines. To reduce them to a predened number of lines per sta (typically ve for modern notation, four for chant and six for tablature), the outer sta lines of each sta are subsequently removed until the predened number of sta lines remains. Dalitz developed the following algorithm to nd the stasegments: 1. Extract horizontal runs with more than 60 percent black pixels within a window of width staspace_height. 2. The resulting laments are vertically thinned by replacing each vertical black run with its middle pixel. For black runs higher than 2*staine_height, more than one skeleton point is extracted. 3. The resulting skeleton segments wider than 2*staine_height are the stasegments. A.4 Matching in Bipartite Graphs Before moving to the denition of the Kuhn-Munkres Algorithm it is necessary to give some denitions. Let has weight and G = (X ∪ Y, X × Y ) w(xy). X and Y be a weighted complete bipartite graph, where edge have the same size, n, and can be written as Y = {y1 , y2 , . . . , yn }. Denition 1: A feasible vertex labelling l(y) = w(xy) for all x ∈ X and y∈Y for N is a function X = {x1 , x2 , . . . , xn } l : V (N ) → Z . We dene the size of l, by xy such that size(l) = P l(x) + v∈V (N ) l(v). Appendix A. Fundamentals 92 Denition 2: N Suppose w(e). P Let l be a feasible vertex labelling of N. equality subgraph Gl Then the l in l(x) + l(y) = w(xy). is the spanning subgraph of N P G = (X∪Y, X×Y ) by giving each edge e an integer weight is a network obtained from containing all edges xy for for which The algorithm iteratively constructs a sequence of feasible vertex labelling such that size(li+1 ) < size(li ), and a sequence of matchings matching in the equality subgraph labelling li for which Mi G(li ), i ≥ 1. for all is a perfect matching in Mi 1. If M matched 2. If G, is complete for x ∈ X. JGl (S) 6= T , Set Gl , then S = {x} where Tc Note that by min x∈S,y∈T c denotes the complement of αl > 0 S ∪ {z} and y in and T Stop. JGl (S) = T . JG0l (S) 6= T . T in Y, Replace JGl (S), not in T . T ∪ {y}, If y l by l0 and Gl by l0 by Gl 0 . M , say with z ∈ X , replace and go to step 2. Otherwise, there will be an 0 by M and go to step 1. M Find is matched in x Replace y, Gl . and construct a new labeling M 0 in to in Otherwise, there is some un- l(v) − αl for v ∈ S 0 l (v) = l(v) + αl for v ∈ T l(v) otherwise by M {l(x) + l(y) − w(xy} alternating path from Gl . Start with an arbitrary feasi- T = ∅. go to step 3. Otherwise, 3. Choose a vertex S is optimal. and αl = is a maximum G(li ). and choose an arbitrary matching M Mi for It stops when it nds a feasible vertex The Kuhn-Munkres Algorithm (or Hungarian method) ble vertex labeling l, determine such that l1 , l2 , . . . M and we may use this path to nd a larger matching A.5 Otsu Threshold Algorithm The segmentation process divides the image into sub- homogeneous regions. Homogeneity is the existence of compliance of some features as intensity of colour or levels of gray. The Otsu algorithm is used to nd the parameter of intensity T that divides the initial image into two parts - classes of white pixels and classes of black pixels. This method considers that any point that presents an intensity equal or greater than T belongs to the region of interest, and all the others are considered background of the image. In this manner, a new image generated as follows: g(x, y) = ( 1 se f (x, y) ≥ T 0 se f (x, y) < T g(x, y) is A.5. Otsu Threshold Algorithm f (x, y) where 93 is the original image. After the segmentation, the matrix of the new image will be constituted by 0's black pixels and 1's white pixels [GWE04]. Otsu considers the image normalized histogram as a function of density of discrete probability, as follows: pr (rq ) = where rq nq , n q = 0, 1, 2, . . . , L − 1 n is the total number of pixels in the image, nq and L is the number of pixels that have intensity is the total number of possible levels of intensity in the image. chooses the threshold k, such that C1 = [k, k + 1, . . . , L − 1], k is the intensity level where that maximize the 2 = ω0 (µ0 − µT )2 + ω1 (µ1 − µT )2 σB where, ω0 = ω1 = µ0 = µ1 = µT = C0 = [0, 1, . . . , k − 1] variance between the classes as Pk−1 q=0 PL−1 q=k Pk−1 q=0 PL−1 q=k PL−1 q=0 pq (rq ) pq (rq ) qpq (rq )/ω0 qpq (rq )/ω1 qpq (rq ) The Otsu method and 2 , that is dened σB Appendix A. Fundamentals 94 A.6 Sta Line Removal Algorithm staffLineRemoval(IMAGE,STAVES) { threshold = 2*staffHeight; tolerance = 1+ceil(staffHeight/3.0); IMAGE\_REMOVE = copy(IMAGE); For(nvalid = 0 to STAVES size) { Point2D staff = validStaves[nvalid]; For (i = 0 to staff size) { col = staff[i].x; refRow = staff[i].y; row = refRow; pel = valuePixel(IMAGE, IMAGE\_REMOVE); decrement/increase the reference row until one pixel different from white pixel (dist1/dist2) is found; If ( dist1 <= max(1, min (dist2, tolerance)) ) { refRow-=dist1; Else If ( dist2 <= max(1, min (dist1, tolerance)) ) refRow+=dist2; Else continue; } Count the number of decrements/increase on the reference row until the black pixel changes to white pixel (run); If ( run >= threshold ) continue; remove the vertical black sequences on the IMAGE; } } } Listing 3: Sta lines removal algorithm. A.7 Baum-Welch Algorithm The Baum-Welch algorithm or the Forward-Backward procedure consider the forward variable αt (i) dened as αt (i) = P (o1 o2 . . . t , qt = Si |λ) that is, the probability of the partial observation sequence, given the model λ and the backward variable βt (i) o1 o2 . . . t , at time t, to the end, given state Si and state Si dened as βt (i) = P (ot+1 ot+2 . . . T , qt = Si , λ) that is, the probability of the partial observation sequence from at time t and the model λ. t+1 Besides these variables the Baum-Welch algorithm also needs two more auxiliary variables that can be expressed in terms of the forward and backward variables: ξ(i, j) = P (qt = Si , qt+1 = Sj |o, λ) A.7. Baum-Welch Algorithm 95 that is, the probability of being in state Si at time t, and state Sj at time t + 1, given the model and the observation sequence. This is the same as, ξ(i, j) = P (qt = Si , qt+1 = Sj , o|λ) P (o|λ) Using forward and backward variables this can be expressed as, αt (i)aij bj (ot+1 )βt+1 (j) ξ(i, j) = PN PN i=1 j=1 αt iaij (i)bj (ot+1 )βt+1 (j) (A.25) The second variable is the a posteriori probability, N X γt (i) = ξt (i, j) (A.26) j=1 In forward and backward variables this can be expressed by, Assuming a starting model αt (i)βt (i) γt (i) = PN i=1 αt (i)βt (i) λ = (A; B; π), this algorithm starts by computing the α's and β 's as follows. The initialization of the forward probabilities as the joint probability of state and initial observation o1 : α1 (i) = πi bi (o1 ), the next step consists of the product over all the αt+1 (j) = N X 1≤i≤N N [αt (j)aij ] bj (ot+1 ), This result in the probability of αt+1 (j) Si at time possible states t+1 at time t: 1≤j≤N with all the accompanying previous partial is obtained by accounting for observation multiplying the summed quantity by the probability by the initialization of the Si , 1 ≤ i ≤ N 1 ≤ t ≤ T − 1, i=1 observations. The Si bj (ot+1 ).The ot+1 in state j, that is, by backward procedure starts βT (i): βT (i) = 1, 1≤i≤N and the calculation of βt (j) = N X aij bj (ot+1 )βt+1 (j), j=1 Then the algorithm computes the 1 ≤ i ≤ N, t = T − 1, T − 2, . . . , 1. ξ 's and γ 's using Equations A.25 and A.26, respectively. The next step is to update the HMM parameters according to π̄i = γ1 (i), PT −1 ξt (i, j) āij = Pt=1 , T −1 t=1 γt (i) PT t=1 γt (j) s.to =v b̄j (k) = PTt k , t=1 γt (j) 1≤i≤N (A.27) 1 ≤ i ≤ N, 1 ≤ j ≤ N (A.28) 1 ≤ j ≤ N, (A.29) 1≤k≤M Appendix A. Fundamentals 96 A.8 Viterbi Algorithm Let δt (i) = maxq1 ,q2 ,...,qt−1 P [q1 , q2 , . . . , qt = i, o1 o2 . . .t |λ] the highest probability along a single path, at time t, which accounts for the rst we have t observations and ends in state Si . By induction δt+1 (j) = max δt (i)aij bj (Ot+1 ) The array ψt (i) has the track of δt+1 (j) i for each t and j. The procedure is described below: 1. Initialization: δ1 (i) = πi bi (o1 ), 1≤i≤N ψ1 (i) = 0 2. Recursion: δt (j) = max [δt−1 (i)aij ] bj (ot ), 1≤i≤N ψt (j) = arg max [δt−1 (i)aij ] , 1≤i≤N 2 ≤ t ≤ T, 1 ≤ j ≤ N 2 ≤ t ≤ T, 1 ≤ j ≤ N 3. Termination: P ∗ = max [δT (i)] 1≤i≤N qT∗ = arg max [δT (i)] 1≤i≤N 4. Path backtracking: qT∗ = ψt+1 (qt+1 )∗ , t = T − 1, T − 2, . . . , 1 Appendix B Table of confusion B.1 Results Obtained Without Elastic Matching for the Handwritten Music Symbols Accent BassClef Beam Flat Natural Note NoteFlag NoteOpen RestI RestII Sharp Staccatissimo TrebleClef Unknown 47 0 0 0 0 0 0 0 0 0 0 0 0 0 BassClef 0 6 0 0 0 0 0 0 0 0 0 0 0 0 Beam 1 0 107 0 0 0 0 0 0 0 0 0 0 1 Flat 0 0 0 57 0 0 0 0 0 0 0 0 0 0 Natural 0 0 0 0 79 0 0 0 0 0 0 0 0 0 Note 0 0 0 1 0 113 2 0 0 0 0 0 0 0 NoteFlag 0 0 0 1 0 4 25 0 0 0 0 0 0 0 NoteOpen 0 0 0 0 0 1 0 4 0 0 0 0 0 1 RestI 0 0 0 0 0 0 0 0 33 0 0 0 0 0 RestII 0 0 0 0 0 0 0 0 0 100 0 0 0 0 Sharp 0 0 0 1 0 1 0 0 0 0 83 0 0 1 Staccatissimo 0 0 0 0 0 0 0 0 0 0 0 5 0 0 TrebleClef 0 0 0 0 0 0 0 0 0 0 0 0 24 0 Unknown 0 0 4 3 3 6 4 1 0 2 1 0 0 77 Accent Table B.1: Table of confusion of the nearest neighbor classier. Accent BassClef Beam Flat Natural Note NoteFlag NoteOpen RestI RestII Sharp Staccatissimo TrebleClef Unknown 42 0 0 1 0 0 2 0 0 0 0 0 0 2 BassClef 0 0 0 0 0 3 0 0 1 0 0 0 0 2 Beam 1 0 94 0 1 1 1 0 0 0 1 0 0 10 Flat 2 0 0 45 0 0 3 0 0 0 0 0 0 7 Natural 1 0 0 0 66 3 0 0 0 0 0 0 0 9 Note 0 0 0 1 0 98 3 0 1 2 1 0 1 9 NoteFlag 0 0 0 2 0 4 21 0 1 0 0 0 0 2 NoteOpen 0 0 0 0 0 5 0 0 0 0 0 0 0 1 RestI 0 0 0 0 0 1 0 0 26 2 2 0 0 2 RestII 0 0 0 0 0 1 0 0 0 99 0 0 0 0 Sharp 0 0 0 2 0 0 4 0 3 0 68 0 0 9 Staccatissimo 0 0 0 0 0 0 0 0 0 0 2 3 0 0 TrebleClef 0 0 1 0 0 3 0 0 0 0 1 0 14 5 Unknown 0 0 4 2 0 17 9 0 4 3 3 0 0 59 Accent Table B.2: Table of confusion of the neural network. 97 Appendix B. Table of confusion 98 Accent BassClef Beam Flat Natural Note NoteFlag NoteOpen RestI RestII Sharp Staccatissimo TrebleClef Unknown 42 0 0 0 0 2 0 0 0 0 0 0 0 3 BassClef 0 6 0 0 0 0 0 0 0 0 0 0 0 0 Beam 0 0 105 0 0 0 0 0 0 0 0 0 0 4 Flat 0 0 0 57 0 0 0 0 0 0 0 0 0 0 Natural 0 0 0 0 79 0 0 0 0 0 0 0 0 0 Note 0 0 0 0 0 111 0 0 0 0 0 0 0 5 NoteFlag 0 0 0 0 0 0 28 0 0 0 0 0 0 2 NoteOpen 0 0 0 0 0 1 0 3 0 0 0 0 0 2 RestI 0 0 0 0 0 0 0 0 32 0 0 0 0 1 RestII 0 0 0 0 0 0 0 0 0 100 0 0 0 0 Sharp 0 0 0 0 0 0 0 0 0 0 85 0 0 1 Staccatissimo 0 0 0 0 0 0 0 0 0 0 0 5 0 0 TrebleClef 0 0 0 0 0 0 0 0 0 0 0 0 20 4 Unknown 0 0 2 2 1 1 1 0 0 0 0 0 0 94 Accent Table B.3: Table of confusion of the support vector machines. Accent BassClef Beam Flat Natural Note NoteFlag NoteOpen RestI RestII Sharp Staccatissimo TrebleClef Unknown 46 0 0 0 0 1 0 0 0 0 0 0 0 1 BassClef 0 5 0 0 0 2 0 0 0 0 0 0 0 2 Beam 3 0 102 0 0 3 0 0 0 0 0 0 0 3 Flat 0 0 0 48 11 0 0 0 0 0 0 0 0 0 Natural 0 0 0 1 78 1 0 0 0 0 0 0 0 0 Note 0 0 4 1 12 82 0 0 1 4 3 0 0 11 NoteFlag 3 0 0 0 2 2 20 0 0 1 1 0 0 3 NoteOpen 0 0 0 0 3 2 2 2 0 0 0 0 0 0 RestI 0 0 0 0 0 0 0 0 36 0 0 0 0 0 RestII 2 0 0 2 2 0 0 0 0 94 0 0 0 1 Sharp 0 0 0 0 10 1 0 0 1 1 75 0 0 0 Staccatissimo 0 0 0 0 0 0 0 0 0 0 0 6 0 0 TrebleClef 0 0 0 0 0 0 0 0 0 0 0 0 26 1 Unknown 2 0 17 2 7 23 1 0 1 0 3 0 4 41 Accent Table B.4: Table of confusion of the hidden Markov model. B.2. Results Obtained Without Elastic Matching for the Printed Music Symbols 99 B.2 Results Obtained Without Elastic Matching for the Printed Music Symbols AltoClef Beam Flat Natural Note NoteFlag NoteOpen TieSlur RestI RestII Sharp Time TrebleClef Unknown 50 0 0 0 0 0 0 0 0 0 0 0 0 0 Beam 0 73 0 0 0 0 0 0 0 0 0 0 0 0 Flat 0 0 38 0 0 0 0 0 0 0 0 0 0 0 Natural 0 0 0 31 0 0 0 0 0 0 0 0 0 0 Note 0 0 0 0 75 1 0 0 0 0 0 0 0 0 NoteFlag 0 0 0 0 1 28 0 1 0 0 0 0 0 0 NoteOpen 0 0 0 0 0 0 77 0 0 0 0 0 0 0 TieSlur 0 0 0 0 0 0 0 28 0 0 0 0 0 2 RestI 0 0 0 0 0 0 0 0 16 0 0 0 0 0 RestII 0 0 0 0 0 0 0 0 0 15 0 0 0 0 Sharp 0 0 0 0 0 0 0 0 0 0 80 0 0 0 Time 0 0 0 0 0 0 0 0 0 0 0 3 0 0 TrebleClef 0 0 0 0 0 0 0 0 0 0 0 0 30 0 Unknown 0 0 0 0 1 0 3 7 0 2 0 0 0 63 AltoClef Table B.5: Table of confusion of the nearest neighbor classier. AltoClef Beam Flat Natural Note NoteFlag NoteOpen TieSlur RestI RestII Sharp Time TrebleClef Unknown 47 3 0 0 0 0 0 0 0 0 0 0 0 0 Beam 1 68 0 0 0 0 2 0 0 0 0 0 0 2 Flat 0 0 38 0 0 0 0 0 0 0 0 0 0 0 Natural 0 0 0 31 0 0 0 0 0 0 0 0 0 0 Note 0 3 0 0 67 1 1 0 0 0 0 0 0 4 NoteFlag 0 0 0 0 0 24 0 0 0 0 0 0 0 1 NoteOpen 0 0 0 0 4 1 68 1 0 0 0 0 0 1 TieSlur 0 0 0 0 4 1 0 19 0 0 0 0 0 6 RestI 0 0 0 0 0 0 0 0 16 0 0 0 0 0 RestII 0 0 0 0 1 0 0 1 0 9 0 2 0 2 Sharp 0 0 0 0 0 0 0 0 0 0 79 0 0 1 Time 0 0 0 0 0 0 0 1 0 0 0 0 0 2 TrebleClef 0 0 0 0 0 0 0 0 0 0 0 0 28 2 Unknown 3 0 0 0 4 4 5 7 0 1 0 1 0 51 AltoClef Table B.6: Table of confusion of the neural network. AltoClef Beam Flat Natural Note NoteFlag NoteOpen TieSlur RestI RestII Sharp Time TrebleClef Unknown 50 0 0 0 0 0 0 0 0 0 0 0 0 0 Beam 0 73 0 0 0 0 0 0 0 0 0 0 0 0 Flat 0 0 37 0 0 0 0 0 0 0 0 0 0 1 Natural 0 0 0 31 0 0 0 0 0 0 0 0 0 0 Note 0 0 0 0 76 0 0 0 0 0 0 0 0 0 NoteFlag 0 0 0 0 0 27 0 0 0 0 0 0 0 3 NoteOpen 0 0 0 0 0 0 77 0 0 0 0 0 0 0 TieSlur 0 0 0 0 0 0 0 26 0 0 0 0 0 4 RestI 0 0 0 0 0 0 0 0 16 0 0 0 0 0 RestII 0 0 0 0 0 0 0 0 0 15 0 0 0 0 Sharp 0 0 0 0 0 0 0 0 0 0 80 0 0 0 Time 0 0 0 0 0 0 0 0 0 0 0 3 0 0 TrebleClef 0 0 0 0 0 0 0 0 0 0 0 0 30 0 Unknown 0 0 0 0 0 0 1 1 0 0 0 0 0 74 AltoClef Table B.7: Table of confusion of the support vector machines. Appendix B. Table of confusion 100 AltoClef Beam Flat Natural Note NoteFlag NoteOpen TieSlur RestI RestII Sharp Time TrebleClef Unknown 39 0 0 0 0 0 0 0 0 0 0 0 0 11 Beam 0 71 0 0 0 0 0 0 0 0 0 0 0 2 Flat 3 0 38 0 0 0 0 0 0 0 0 0 0 0 Natural 0 0 0 31 0 0 0 0 0 0 0 0 0 0 Note 0 0 0 0 71 0 0 0 0 0 0 0 0 5 NoteFlag 0 0 0 0 1 19 2 0 0 1 2 0 0 5 NoteOpen 2 0 0 0 0 1 68 0 0 0 0 0 0 6 TieSlur 0 0 1 0 0 0 0 25 0 0 0 0 0 4 RestI 0 0 0 0 0 0 0 0 16 0 0 0 0 0 RestII 0 0 0 0 0 0 0 0 0 13 0 0 0 2 Sharp 0 0 0 0 0 0 0 0 0 0 80 0 0 0 Time 0 0 0 0 0 0 0 0 0 0 0 1 0 2 TrebleClef 0 0 0 0 0 4 0 0 0 0 0 0 11 15 Unknown 1 4 0 0 6 0 3 6 0 0 0 0 0 56 AltoClef Table B.8: Table of confusion of the hidden Markov Model. B.3. Results Obtained With Elastic Matching for the Handwritten Music Symbols 101 B.3 Results Obtained With Elastic Matching for the Handwritten Music Symbols Accent BassClef Beam Flat Natural Note NoteFlag NoteOpen RestI RestII Sharp Staccatissimo TrebleClef Unknown Accent 47 0 0 0 0 0 0 0 0 0 0 0 0 0 BassClef 0 6 0 0 0 0 0 0 0 0 0 0 0 0 Beam 2 0 104 0 0 0 0 0 1 0 0 0 0 2 Flat 0 0 0 57 0 0 0 0 0 0 0 0 0 0 Natural 0 0 0 0 79 0 0 0 0 0 0 0 0 0 Note 0 0 0 0 0 112 1 1 0 0 0 0 0 2 NoteFlag 0 0 0 0 0 4 26 0 0 0 0 0 0 0 NoteOpen 0 0 0 0 0 2 0 4 0 0 0 0 0 0 RestI 0 0 0 0 0 0 0 0 33 0 0 0 0 0 RestII 0 0 0 0 0 0 0 0 0 100 0 0 0 0 Sharp 0 0 0 0 1 0 0 0 0 0 85 0 0 0 Staccatissimo 0 0 0 0 0 0 0 0 0 0 0 5 0 0 TrebleClef 0 0 0 0 0 0 0 0 0 0 0 0 24 0 Unknown 1 0 7 14 0 7 2 0 1 2 3 0 0 64 Table B.9: Table of confusion of the nearest neighbor classier. Accent BassClef Beam Flat Natural Note NoteFlag NoteOpen RestI RestII Sharp Staccatissimo TrebleClef Unknown 43 0 0 1 1 0 0 0 1 0 0 0 1 0 BassClef 0 0 0 0 0 3 1 0 0 0 0 0 0 2 Beam 0 0 94 0 0 3 0 0 0 0 7 0 0 5 Flat 1 0 0 41 6 1 0 0 3 0 3 0 0 2 Natural 2 0 1 3 71 0 0 0 0 0 1 0 0 1 Note 0 0 2 0 1 86 13 0 1 3 1 0 2 7 NoteFlag 0 0 1 0 0 8 12 0 0 0 1 0 1 7 NoteOpen 0 0 0 0 0 3 1 0 0 2 0 0 0 0 RestI 6 0 0 0 3 0 0 0 15 0 7 0 0 2 RestII 0 0 1 0 0 0 0 0 0 93 4 0 0 2 Sharp 2 0 0 2 1 2 0 0 0 1 76 0 1 1 Staccatissimo 0 0 0 0 0 0 0 0 0 2 2 1 0 0 TrebleClef 0 0 0 0 0 2 0 0 1 1 0 0 14 6 Unknown 3 0 8 3 6 17 10 0 0 4 3 0 2 45 Accent Table B.10: Table of confusion of the neural network. Accent BassClef Beam Flat Natural Note NoteFlag NoteOpen RestI RestII Sharp Staccatissimo TrebleClef Unknown 47 0 0 0 0 0 0 0 0 0 0 0 0 0 BassClef 0 6 0 0 0 0 0 0 0 0 0 0 0 0 Beam 0 0 109 0 0 0 0 0 0 0 0 0 0 0 Flat 0 0 0 57 0 0 0 0 0 0 0 0 0 0 Natural 0 0 0 0 79 0 0 0 0 0 0 0 0 0 Note 0 0 0 0 0 116 0 0 0 0 0 0 0 0 NoteFlag 0 0 0 0 0 0 30 0 0 0 0 0 0 0 NoteOpen 0 0 0 0 0 0 0 6 0 0 0 0 0 0 RestI 0 0 0 0 0 0 0 0 33 0 0 0 0 0 RestII 0 0 0 0 0 0 0 0 0 100 0 0 0 0 Sharp 0 0 0 0 0 0 0 0 0 0 86 0 0 0 Staccatissimo 0 0 0 0 0 0 0 0 0 0 0 5 0 0 TrebleClef 0 0 0 0 0 0 0 0 0 0 0 0 24 0 Unknown 0 0 0 0 0 0 0 0 0 0 0 0 0 101 Accent Table B.11: Table of confusion of the support vector machines. Appendix B. Table of confusion 102 Accent BassClef Beam Flat Natural Note NoteFlag NoteOpen RestI RestII Sharp Staccatissimo TrebleClef Unknown 44 0 0 0 1 0 0 0 1 1 0 0 0 1 BassClef 0 6 1 0 2 0 0 0 0 0 0 0 0 0 Beam 0 0 107 0 0 3 0 0 0 0 1 0 0 0 Flat 0 0 2 42 11 0 0 1 3 0 0 0 0 0 Natural 0 0 0 2 74 0 0 0 3 0 0 0 0 1 Note 0 0 7 2 4 92 2 0 4 6 0 1 0 0 NoteFlag 1 0 1 0 3 4 9 1 1 4 6 0 0 2 NoteOpen 0 0 0 0 4 2 1 2 0 0 0 0 0 0 RestI 0 0 0 0 0 0 0 0 36 0 0 0 0 0 RestII 4 0 0 0 3 1 0 1 6 86 0 0 0 0 Sharp 0 0 2 0 8 3 0 0 0 3 72 0 0 0 Staccatissimo 0 0 0 0 0 0 0 0 0 0 0 6 0 0 TrebleClef 0 1 0 0 1 0 0 0 0 0 4 0 21 0 Unknown 3 1 16 8 8 18 4 0 2 0 7 2 1 31 Accent Table B.12: Table of confusion of the hidden Markov model. B.4. Results Obtained With Elastic Matching for the Printed Music Symbols 103 B.4 Results Obtained With Elastic Matching for the Printed Music Symbols AltoClef Beam Flat Natural Note NoteFlag NoteOpen TieSlur RestI RestII Sharp Time TrebleClef Unknown 49 0 0 0 0 0 0 0 0 0 0 0 0 1 Beam 0 73 0 0 0 0 0 0 0 0 0 0 0 0 Flat 0 0 38 0 0 0 0 0 0 0 0 0 0 0 Natural 0 0 0 31 0 0 0 0 0 0 0 0 0 0 Note 0 0 0 0 76 0 0 0 0 0 0 0 0 0 NoteFlag 0 0 0 0 3 27 0 0 0 0 0 0 0 0 NoteOpen 0 0 0 0 1 0 76 0 0 0 0 0 0 0 TieSlur 0 0 0 0 0 0 2 27 0 0 0 0 0 1 RestI 0 0 0 0 0 0 0 0 16 0 0 0 0 0 RestII 0 0 0 0 0 0 0 0 0 15 0 0 0 0 Sharp 0 0 0 0 1 0 0 0 0 0 80 0 0 0 Time 0 0 0 0 0 0 0 0 0 0 0 3 0 0 TrebleClef 0 0 0 0 0 0 0 0 0 0 0 0 30 0 Unknown 1 1 2 0 3 0 3 4 0 0 1 0 1 60 AltoClef Table B.13: Table of confusion of the nearest neighbor classier. AltoClef Beam Flat Natural Note NoteFlag NoteOpen TieSlur RestI RestII Sharp Time TrebleClef Unknown 43 0 0 0 0 0 0 0 0 0 0 0 0 7 Beam 0 71 0 0 0 0 1 0 0 0 1 0 0 0 Flat 0 0 36 0 0 0 0 0 0 0 0 0 0 2 Natural 0 1 0 27 0 0 0 2 0 0 0 0 0 1 Note 0 0 0 0 64 0 7 2 0 0 1 0 2 0 NoteFlag 0 0 0 0 3 12 4 0 0 0 2 0 3 6 NoteOpen 0 0 1 0 1 0 71 0 0 0 0 0 2 2 TieSlur 2 1 0 0 0 0 5 4 0 0 1 0 3 14 RestI 0 0 0 0 0 0 0 0 16 0 0 0 0 0 RestII 0 0 0 0 2 2 0 3 0 0 3 0 1 4 Sharp 1 0 0 1 0 0 0 0 0 0 78 0 0 0 Time 0 0 0 0 0 0 0 2 0 0 1 0 0 0 TrebleClef 0 0 0 0 0 0 0 0 0 0 2 0 28 0 Unknown 3 1 2 0 7 3 8 3 0 0 5 0 1 43 AltoClef Table B.14: Table of confusion of the neural network. AltoClef Beam Flat Natural Note NoteFlag NoteOpen TieSlur RestI RestII Sharp Time TrebleClef Unknown 45 0 0 0 0 0 0 0 0 0 0 0 0 5 Beam 0 66 0 0 0 0 0 0 0 0 0 0 0 7 Flat 0 0 37 0 0 0 0 0 0 0 0 0 0 1 Natural 0 0 0 29 0 0 0 0 0 0 0 0 0 2 Note 0 0 0 0 69 0 0 0 0 0 0 0 0 7 NoteFlag 0 0 0 0 0 29 0 0 0 0 0 0 0 1 NoteOpen 0 0 0 0 0 0 72 0 0 0 0 0 0 5 TieSlur 0 0 0 0 0 0 0 26 0 0 0 0 0 4 RestI 0 0 0 0 0 0 0 0 15 0 0 0 0 1 RestII 0 0 0 0 0 0 0 0 0 14 0 0 0 1 Sharp 0 0 0 0 0 0 0 0 0 0 78 0 0 2 Time 0 0 0 0 0 0 0 0 0 0 0 3 0 0 TrebleClef 0 0 0 0 0 0 0 0 0 0 0 0 30 0 Unknown 0 0 0 0 0 0 0 0 0 0 0 0 0 76 AltoClef Table B.15: Table of confusion of the support vector machines. Appendix B. Table of confusion 104 AltoClef Beam Flat Natural Note NoteFlag NoteOpen TieSlur RestI RestII Sharp Time TrebleClef Unknown 39 0 0 0 0 1 0 0 0 0 0 0 0 10 Beam 0 73 0 0 0 0 0 0 0 0 0 0 0 0 Flat 0 0 37 1 0 0 0 0 0 0 0 0 0 0 Natural 0 0 0 31 0 0 0 0 0 0 0 0 0 0 Note 0 0 1 0 70 0 1 3 0 0 0 0 0 0 NoteFlag 0 0 0 0 11 15 0 1 0 0 2 0 0 1 NoteOpen 0 0 0 5 0 8 62 0 0 0 0 0 0 2 TieSlur 0 0 1 0 6 0 0 21 0 0 0 0 0 2 RestI 0 0 0 0 0 0 0 0 16 0 0 0 0 0 RestII 0 0 0 0 0 0 0 0 0 13 2 0 0 0 Sharp 0 0 0 0 0 0 0 0 0 0 80 0 0 0 Time 0 0 0 0 0 0 2 0 0 0 0 1 0 0 TrebleClef 0 0 0 0 0 0 0 0 0 0 0 0 30 0 Unknown 2 6 1 0 9 7 5 3 6 3 1 0 0 33 AltoClef Table B.16: Table of confusion of the hidden Markov model. Appendix C Articles Submited to Conferences 105 A Shortest Path Approach for Staff Line Detection Ana Rebelo FCUP and INESC Porto Portugal Artur Capela FEUP and INESC Porto Portugal Joaquim F. Pinto da Costa FCUP Portugal [email protected] [email protected] [email protected] Carlos Guedes ESMAE Portugal Eurico Carrapatoso FEUP and INESC Porto Portugal Jaime S. Cardoso FEUP and INESC Porto Portugal [email protected] [email protected] [email protected] Abstract Many music works produced in the past still exist only as original manuscripts or as photocopies. Preserving them entails their digitalization and consequent accessibility in a digital format easy-to-manage. The manual process to carry out this task is very time consuming and error prone. Optical music recognition (OMR) is a form of structured document image analysis where music symbols are isolated and identified so that the music can be conveniently processed. While OMR systems perform well on printed scores, current methods for reading handwritten musical scores by computers remain far from ideal. One of the fundamental stages of this process is the staff line detection. In this paper a new method for the automatic detection of music stave lines based on a shortest path approach is presented. Lines with some curvature, discontinuities, and inclination are robustly detected. The proposed algorithm behaves favourably when compared experimentally with wellestablished algorithms. 1. Introduction The impact of music in our lives can hardly be overestimated. Music is a pivotal part of our cultural heritage and its preservation, in all of its forms, must be pursued. Frequently, the preservation of many music works entails the digitalization of these works and consequent accessibility in a format that encourages browsing, analysis and retrieval. In fact, many music works produced during the last centuries still exist only as original manuscripts or as photocopies. The digitalization of these works is therefore a highly desirable goal. Unfortunately, the ambitious goal of providing generalized access to handwritten scores that were never published has been severely hampered by the actual state-of-the-art of handwritten music recognition. The manual process required to recognize handwritten musical symbols in scores and to put them in relationship with the spine structure is very time consuming. Despite the fact that OMR systems dealing with machine printed scores exhibit good performance, handwritten music recognition introduces several additional difficulties. Outstanding problems include notation varying from writer to writer, and possibly varying in the same score: symbols and staff lines written with different sizes, shapes or intensity. Despite the continued research on OMR, with the availability of several commercial OMR systems, we are still lacking a satisfactory performance in terms of precision and reliability. Most of the existing work provides a real efficiency only when quite regular, printed music sheets are processed. This condition is exacerbated with handwritten music scores. This justifies the research around the definition of reliable OMR algorithms. Staff line detection is one of the fundamental stages of the OMR process, with subsequent processes relying heavily on its performance. The reasons for detecting and removing the staff lines lie on the need to isolate the musical symbols for a more efficient and correct detection of each symbol presented on the score. The detection of staves is complicated by a variety of reasons. The handwritten staff lines are rarely straight and horizontal, and are not parallel to each other. For example, some staves may be tilted one way or another on the same page or they may be curved. This is especially true for handwritten scores. Since these scores tend to be rather irregular and determined by a person’s own writing style, the staff lines might be twisted, being curved and not really horizontal at all. It depends on how regular the author writes the symbols in his scores. There might also be big- ger or smaller gaps along a staff line. And if we consider that most of these works are old, the quality of the paper in which it is written might have degraded throughout the years, making it a lot harder to correctly identify its contents. In this paper a method for the automatic detection of staff lines based on a shortest path approach is presented. The proposed paradigm uses the image as a graph, where the staff lines result as the shortest path between the two margins of the image. This introduction is concluded with a brief review of the work done in this area. In section 2 the proposed algorithm is described. In section 3, the proposed algorithm is experimentally evaluated using real music scores. Finally, conclusions are drawn and future work is outlined in section 4. 1.1. Related Works Different methods for staff line detection have been researched. The simplest approach consists on finding local maxima in the horizontal projection of the black pixels of the image [2]. These local maxima represent line positions. This method assumes straight and horizontal lines. Several horizontal projections can be made with different image rotation angles, keeping the image in which the local maxima are bigger. This eliminates the assumption that the lines are always horizontal. An alternative strategy for identifying staff lines is to use vertical scan lines [3, 7, 8, 13]. More recent works present a more or less sophisticated use of a combination of projection techniques to improve on the basic approach [1]. Fujinaga [5] incorporates a set of image processing techniques in the algorithm, including run-length coding (RLC), connected-component analysis, and projections. After applying the RLC to find the thickness of staff lines and the space between the staff lines, any vertical black runs that are more than twice the staff line height are removed from the original. Then, the connected components are scanned in order to eliminate any component whose width is less than the staffspace height. After a global deskewing, taller components, such as slurs and dynamic wedges are removed. Other techniques for finding stave lines include the application of mathematical morphology algorithms [11, 15, 6], rule-based classification of thin horizontal line segments [10], and line tracing [12, 14]. In spite of the variety of methods available, they all suffer from some limitations. In particular, lines with some curvature or discontinuities are inadequately resolved. The dash detector [9] is one of few works that try to handle discontinuities. The dash detector is an algorithm that searches the image, pixel by pixel, finding black pixel regions that it classifies as stains or dashes. Then, it tries to unite the dashes to construct lines. 2. A Shortest Path Approach for Staff Line Detection A staff line can be considered as a path from the left side of the music score to the right side. As staff lines are almost the only extensive black objects on the music score, the path we are looking for is the shortest path between the two margins if paths (almost) entirely through black pixels are favoured. In the work to be detailed, the image grid is considered as a graph with pixels as nodes and edges connecting neighbouring pixels. Therefore, some graph concepts are in order. 2.1. Definitions and Notation A graph G = (V, A) is composed of two sets V and A. V is the set of nodes, and A the set of arcs (p, q), p, q ∈ V . The graph is weighted if a weight w(p, q) is associated to each arc, and it is called a digraph if the arcs are directed, i.e., (p, q) 6= (q, p). A path from p1 to pn is a list of unique nodes p1 , p2 , . . . , pn , (pi , pi+1 ) ∈ A. The path cost is the sum of each arc weight in the path. In graph theory, the shortest-path problem seeks the shortest path connecting two nodes; efficient algorithms are available to solve this problem, such as the well-known Dijkstra algorithm [4]. 2.2. General Framework Description As mentioned before, a staff line corresponds to a path from (almost) the left margin of the image to (almost) the right side of the image, (almost) always through black pixels. Starting by modelling the edge image as a graph, match a node to each pixel. Connect two nodes with an arc on the graph iff the corresponding pixels are neighbours (8connected neighbourhoods) on the image. The weight of each arc is a function of pixels values and pixels relative positions (see Figure 1): w2 w3 w1 w8 w4 w5 w7 w6 Figure 1. Arc weight between two pixels. ( f (p, qi ) if qi ∈ 4-connected neighbourhood of p wi = h(p, qi ) if qi 6∈ 4-connected neighbourhood of p √ In this work we set h(., .) = 2f (., .). The objective is then to design the weights wi such that the weights are low for the pixels of interest (black) and high otherwise (this will lead to small weighted distances from left to right for paths through staff lines and large for the rest). Therefore this setting will favour paths through black pixels, as required. We set ( c1 if p or q are black pixels f (p, q) = c2 otherwise with c2 > c1 . In this work c1 and c2 were experimentally determined as 2 and 6, respectively. Note that c1 must be set greater than zero in order to also favour the smallest path, when more than one exists through black pixels only. Finally, the solution to the shortest path problem will yield the intended staff line. (a) A pair of staff lines. s1 e1 s8 e8 (b) The shortest path between some pairs of points. Figure 3. A second exemplificative example. (a) Skewed staff lines with music (b) The shortest path between symbols. some pairs of points. Figure 4. A third exemplificative example. 2.3. Illustrative Examples In order to get a better intuition of the general result, it is instructive to first explore some basic examples. Consider a music score with a single staff line as represented in Figure 2(a). In Figure 2(b) the shortest paths between starting s1 e1 s5 e5 clear that slight skewed scores do not pose any problem to the proposed approach. Nonetheless, some issues are visible and need to be conveniently addressed. Due to the skew of the staff lines, some of the shortest paths jump between consecutive staff lines. The text on the top of the score also constitutes a false low weight path between the two margins, inducing some paths to go through it. Finally, even when a staff line is correctly followed by the path, the initial and the final parts of the path should be ignored. 2.4. Proposed Algorithm (a) A single staff line. (b) The shortest path between some pairs of points. Figure 2. A first exemplificative example. points si on the left margin and ending points ei on the right margin, at the same row, are traced. All paths get attracted by the staff line. A similar condition is verified when we have more than one staff line (see Figure 3). Now paths get attracted to the nearest staff line. Half of the paths inbetween two consecutive staff lines goes along the top staff line; the other half follows the bottom staff line. The last example, in Figure 4, shows that music symbols placed on top of staff lines do not interfere with the detection of the staff lines. Moreover, the example also makes To detect the staff lines, the proposed overall algorithm starts by estimating the staffspace height. This length will be used as a reference length for the subsequent operations. Robust estimators of the staffspace height are already in common use. The technique starts by computing the vertical run-lengths representation of the image. If a bit-mapped page of music is converted to vertical run-lengths coding, the most common black-runs represents the staff line height and the most common white-runs represents the staffspace height [5]. After the estimation of the staffspace height, the proposed approach applies the main step of the framework: for each row of the image, the shortest path between the leftmost pixel and the rightmost pixel is found, using the Dijkstra algorithm [4]. Instead of considering the whole image when computing the shortest path, only a strip centred on the row of interest is used— see Figure 5. This allows to constrain the complexity of the algorithm. On the experiments we have set STRIP HEIGHT=STAFFSPACE HEIGHT. 2 x STRIP HEIGHT Figure 5. Vertical strip around the row of interest. Now, because the search for the shortest path is constrained to stay in a vertical strip (or because the nearest staff line is far enough from the current row), the shortest path may not follow a staff line. Therefore, the main step is followed by a sequence of (arguably) sensible rules, aimed at discarding false staff lines. First, all paths without a percentage of black pixels above a threshold BLACK PERC are discarded (a threshold of BLACK PERC = 0.75 was used on the experiments). Next, each retained path is trimmed at the beginning and at the end. As visible in the previous examples (refer to Figure 4), before meeting with a staff line, a path travels through a sequence of white pixels. Likewise, after the end of the staff line, the path goes again through a sequence of white pixels until it meets the right margin of the image. In order to ignore all these white pixels, the initial pixels of the path are discarded until a run of at least BLACK RUN black pixels are found in the path. In the same way, all pixels of the path after the last occurrence of a run of at least BLACK RUN black pixels are discarded. A threshold of BLACK RUN = 2 × STAFFSPACE HEIGHT was used on the experiments. Finally, the proposed overall algorithm ends with the validation of the preserved paths. Because the staff lines are expected to be straight lines, the linear correlation coefficient of the x and y components of the pixels of the path is used to reject paths that do not meet the linearity criterion, with a correlation coefficient bellow a threshold CORRCOEF. The threshold used here is conservative, since for manuscript lines the (perfect) linearity is not guaranteed. This precaution is needed because staves on a page are often distorted in different ways. 2.4.1 Summation The shortest path approach has some advantages over standard algorithms presented in the literature for the detection of staff lines: • While almost all current approaches provide only isolated pieces of a staff line, the approach proposed here outputs a complete line, with starting and ending points. • The staff lines are rarely straight and horizontal, and are not parallel to each other. For example, some staves may be tilted one way or another on the same page or they may be curved. While current approaches apply a chain of heuristics to correct these undesired imperfections, the shortest path algorithm is naturally robust to these challenging conditions. • The proposed approach is robust to broken staff lines (due to low-quality digitalization or low-quality originals) or staff lines as thin as one pixel. Missing pieces are automatically ’completed’ by the algorithm. For reference, the overall algorithm is summarized in Listing 1 with pseudo-code. StaffLine_Detection(IMAGE, CORRCOEF) { STAFFSPACE_HEIGHT = computeStaffSpaceHeight(); BLACK_PERC = 0.75; BLACK_RUN = STAFFSPACE_HEIGHT; STRIP_HEIGHT = STAFFSPACE_HEIGHT; for (int row = 0; row < ImageHeight; row++) { Point2D start(0, row); Point2D end(ImageWidth-1, row); Path path = findShortestPath (start, end); if(blackPercentage(path) < BLACK_PERC) continue; path = trimPath(path); if(corrcoef(path) < CORRCOEF) continue; addPathToSetOfStaffLines(path); } } Listing 1: Shortest Path StaffLine Detection. 3. Results We now present additional, complete, examples of the proposed framework for automatic staff lines detection. The code for all the examples in this paper has been written in C++1 . For comparison purposes, the method by Fujinaga [5] was also implemented. Three types of scores were used 1 The source code is available upon request to the authors. in the experiments: machine printed scores—see Figure 6; handwritten music scores sitting on regular staff lines—see Figure 7 and Figure 8; and irregular handwritten scores— see Figure 9. Before applying the staff line detection algorithms, each image was binarized. Analysing the results for Fujinaga’s method, we observe that some information is lost, information that could be important for the recognition tasks which follow. The places where musical symbols are present create gaps on the detected staff lines; in some places the gaps are larger than the space occupied by those symbols—see Figure 7(c). This is mainly due to the low quality of the lines. The shortest path approach is able to overcome these conditions because it follows a continuous path connecting both line ends. The robustness of the proposed approach to possible defects in staff lines (curvature, discontinuities) is apparent in the example of Figure 7(c). Occasionally, the shortest path algorithm retains paths that do not go completely through a staff line. This problem is prone to happen when there’s a large beamed note density—see Figure 9(b)—making it go through the beams for this set of notes (e.g. quavers and semiquavers) or when the curvature of the lines is such that the path jumps between consecutive lines. This condition can be overcome by automatically learning the best threshold on the linearity test of the path for each image being processed, or by the incorporating additional rules to validate the path. Both approaches are currently being investigated. Nevertheless, considering handwritten scores, we can see that the proposed algorithm has good and promising results. 4. Conclusion The first challenge faced by an OMR system is staff line detection. This first task dictates the possibility of success for the recognition of the music score. And when it comes to handwritten music scores, existing solutions are far from presenting satisfactory results. In this paper, a new algorithm for the automatic detection of staff lines in music scores was proposed. The shortest path approach for staff line detection algorithm brings a new and promising approach to the staff line detection task. The technique is adaptable to a wide range of image conditions, thanks to its intrinsic robustness to (slightly) skewed images, discontinuities, and curved staff lines. There are several directions of research to pursue with the framework introduced in this paper. We are currently investigating the computational feasibility of, instead of finding the shortest path between a starting point on the left margin and an ending point on the right margin at the same row, finding the shortest path between a starting point on the left margin and the whole right margin. That will allow coping with severely skewed scores. Another line of investigation is the automatic learning of the parameters of the algorithm. Although most of the parameters values are scaled by the staff height space, the threshold for the test of linearity is still manually set. This improvement could make the algorithm work with a wide range of types of scores, without requiring any parameter tuning by the user. Finally, more than just having spatial continuity of the path— enforced by the shortest path algorithm—, we are also pursuing the continuity of the direction of the path. As staff lines do not suffer abrupt changes of direction, even when manually drawn, it seems sensible to impose this additional constraint. Acknowledgments This work was partially funded by Fundação para a Ciência e Tecnologia (FCT) - Portugal through project PTDC/EIA/71225/2006. References [1] D. Bainbridge. Extensible Optical Music Recognition. PhD thesis, Department of Computer Science, University of Canterbury, Christchurch, NZ, 1997. [2] D. Blostein and H. S. Baird. A critical survey of music image analysis. In Baird, Bunke, and Y. (Eds.), editors, Structured Document Image Analysis, pages 405–434. Springer-Verlag, Heidelberg, 1992. [3] N. P. Carter. Automatic Recognition of Printed Music in the Context of Electronic Publishing. PhD thesis, Departments of Physics and Music, University of Surrey, 1989. [4] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271, 1959. [5] I. Fujinaga. Staff detection and removal. In S. George, editor, Visual Perception of Music Notation: On-Line and OffLine Recognition, pages 1–39. Idea Group Inc., 2004. [6] I. Gawedzki. Optical music scores recognition. Technical report, 2002. [7] S. Glass. Optical music recognition. B.Sc thesis, 1989. [8] H. Kato and S. Inokuchi. A recognition system for printed piano music using musical knowledge and constraints. In Proceedings of the International Association for Pattern Recognition Workshop on Syntactic and Structural Pattern Recognition, pages 231–248, 1990. [9] I. Leplumey, J. Camillerapp, and G. Lorette. A robust detector for music staves. In Proceedings of the International Conference on Document Analysis and Recognition, pages 902–905, 1993. [10] J. V. Mahoney. Automatic analysis of music score images. B.Sc thesis, 1982. [11] B. R. Modayur, V. Ramesh, R. M. Haralick, and L. G. Shapiro. Muser: A prototype musical score recognition system using mathematical morphology. Machine Vision and Applications, 6:140–150, 1993. [12] D. Prerau. Computer pattern recognition of standard engraved music notation. PhD thesis, Department of Computer Science and Engineering, MIT, 1970. (a) Original score. (b) Staff lines detected by our algorithm. (c) Staff lines detected by Fujinaga’s method [5]. Figure 6. Results for a machine-printed score. (a) Original score. (b) Staff lines detected by our algorithm. (c) Staff lines detected by Fujinaga’s method [5]. Figure 7. Results for a handwritten score. [13] T. Reed. Optical music recognition. Master’s thesis, Department of Computer Science, University of Calgary, Canada, 1995. [14] J. W. Roach and J. E. Tatem. Using domain knowledge in low-level visual processing to interpret handwritten music: an experiment. Pattern Recognition, 21(1):33–44, 1988. [15] M. Roth. OMR-optical music recognition. Master’s thesis, Swiss Federal Institute of Tecnology, 1992. (a) Original skewed score. (b) Staff lines detected by our algorithm. (c) Staff lines detected by Fujinaga’s method [5]. Figure 8. Results for a skewed handwritten score. (a) Original score. (b) Staff lines detected by our algorithm. (c) Staff lines detected by Fujinaga’s method [5]. Figure 9. Results for a handwritten score. STAFF LINE DETECTION AND REMOVAL WITH STABLE PATHS Artur Capela, Ana Rebelo INESC Porto, Campus da FEUP, Rua Dr. Roberto Frias 378, 4200-465 Porto, Portugal [email protected], [email protected] Jaime S. Cardoso INESC Porto, Faculdade de Engenharia, Universidade do Porto, Portugal [email protected] Carlos Guedes INESC Porto, Escola Superior de Música e Artes do Espectáculo, Portugal [email protected] Keywords: Music, optical character recognition, document image processing, image analysis. Abstract: Many music works produced in the past are currently available only as original manuscripts or as photocopies. Preserving them entails their digitalization and consequent accessibility in a machine-readable format, which encourages browsing, retrieval, search and analysis while providing a generalized access to the digital material. Carrying this task manually is very time consuming and error prone. While optical music recognition (OMR) systems usually perform well on printed scores, the processing of handwritten music by computers remains below the expectations. One of the fundamental stages to carry out this task is the detection and subsequent removal of staff lines. In this paper we integrate a general-purpose, knowledge-free method for the automatic detection of staff lines based on stable paths, into a recently developed staff line removal toolkit. Lines affected by curvature, discontinuities, and inclination are robustly detected. We have also developed a staff removal algorithm adapting an existing line removal approach to use the stable path algorithm at the detection stage. Experimental results show that the proposed technique outperforms well-established algorithms. The developed algorithm will now be integrated in a web based system providing seamless access to browsing, retrieval, search and analysis of submitted scores. 1 INTRODUCTION The Universal Declaration on Cultural Diversity adopted by the General Conference of UNESCO on 2001 asserts that cultural diversity is as necessary for humankind as biodiversity is for nature, and that policies to promote and protect cultural diversity thus are an integral part of sustainable development. Being music a pivotal part of our cultural heritage, its preservation, in all of its forms, must be pursued. Frequently, the preservation of many music works entails their digitalization and consequent accessibility in a format that encourages browsing, analysis and retrieval. There is a vast amount of invaluable paper-based heritage, including printed and handwritten music scores, which are deteriorating over time due to natural decaying of paper and chemical reaction (i.e., between written ink and paper). Various efforts have been focused on this issue in order to preserve the record of our heritage. Digitisation has been com- monly used as a possible tool for preservation. Although a digital copy may not conserve the original document, it can preserve the most important part: its data. It has also the advantages of easy duplications, distribution, and digital processing. Nevertheless, the output of the digitalization process is not amenable for further analysis or semantic search operations. Thus, an Optical Music Recognition (OMR) process is needed. However, the manual process required to recognize handwritten musical symbols in scores and to put them in relationship with the spine structure of the score is very time consuming. This justifies the research around reliable automatic OMR algorithms as current solutions are still below the expectations. As a concrete example, Portugal has a notorious lack in music publishing from virtually all eras of its musical history. However, whereas most of the known original music manuscripts before the twentieth century are kept at the National Library Archive in Lisbon, there is virtually no national repository for the Portuguese music from the twentieth century. Although there are recent efforts in order to catalogue and preserve in digital form the Portuguese music from the late twentieth century—notably the Music Information Center (MIC, 2008) and the section on musical heritage from the Institute of the Arts website (IOA, 2008)—most of the music pre-dating computer notation software was never published and still exists as manuscripts or photocopies spread out all over the country in inconspicuous places. The risk of irreversibly losing this rich cultural heritage is thus a reality. 1.1 OMR System The project “Optical recognition system for handwritten music scores” initiated in 2007 by INESC Porto and ESMAE is the point of departure for creating a web-based system of music manuscripts of Portuguese composers from the twentieth century. This database will provide generalized access to a wide corpus of unpublished handwritten music encoded in MusicXML, which can be accessed remotely via the Internet. The database will not only centralize as much information as possible but will also serve to preserve this corpus in a way that is easily accessible for browsing, analysis, and ultimately, for performing this repertoire, therefore helping to keep the Portuguese music alive (Capela et al., 2008). Although the aim of this project is Portuguese music, it is equally valid for all printed and handwritten music scores that need to be preserved from all around the world. The ambitious goal of providing generalized access to handwritten scores that have never been published has been severely hampered by the actual stateof-the-art of handwritten music recognition. There are currently various commercial OMR software solutions (Capella software1 , SharpEye Music Reader2 , OMeR3 ) and a few open source solutions (AOMR24 , OpenOMR5 , Audiveris6 ), but they are all offline standalone applications. The existing online archives of music scores (Lester Levi Collection, 2008; Classical Sheet Music Collection, 2008; Mutopia Collection, 2008) usually provide them in inadequate formats— 1 http://www.capella-software.com/capscan. htm 2 http://www.visiv.co.uk 3 http://www.myriad-online.com/en/products/ omer.htm 4 http://www.bzzt.net/ arnouten/wiki/index. ˜ php/Gamera#AOMR2:_omr_toolkit 5 http://sourceforge.net/projects/openomr 6 http://audiveris.dev.java.net usually only as the scanned score image—for retrieval or automatic analysis. These online archives are mere standard websites, without facilities for optical recognition, editing and searching through the scores musical content. The creation of an OMR system, integrating optical recognition, storage, search, browsing and downloading capabilities, while keeping the scores in their original format along with their digital counterpart, would therefore be extremely beneficial. An integrated score editor would be provided in order to view and edit the submitted music scores. In our previous work on this project we have presented a complete OMR System solution— OMRSYS (Capela et al., 2008)—comprising a database driven web application with one or more OMR applications integrated in the proposed system. The proposed architecture successfully attends the stated objectives. At the end of our project we plan on developing a fully functional system according to the specified architecture and integrating a complete OMR package. 1.2 Detection and Removal of Staff Lines Staff line detection and removal are the first fundamental stages on the OMR process, with subsequent processes relying heavily on their performance. The reasons for detecting and removing the staff lines lie on the need to isolate the musical symbols for a more efficient and correct detection of each symbol present on the score. Although their primary application is as a preprocessing step in the recognition of music notation, the line detection problem also occurs in different contexts (e.g., the recognition of bank transfer forms). The detection of staves is complicated due to a variety of reasons. The handwritten staff lines are rarely straight and horizontal, and are not parallel to each other. For example, some staves may be tilted one way or another on the same page or they may be curved. These scores tend to be rather irregular and determined by a person’s own writing style. Moreover, if we consider that most of these works are old, the quality of the paper in which it is written might have degraded throughout the years, making it a lot harder to correctly identify its contents. In (Cardoso et al., 2008a; Cardoso et al., 2008b) we presented a new and robust staff line detection algorithm based on a stable paths approach. The proposed paradigm uses the image as a graph, where the staff lines result as connected paths between the two lateral margins of the image. A staff line can be considered as a connected path from the left side to the right side of the music score. As staff lines are almost the only extensive black objects on the music score, the path to look for is the shortest path between the two margins if paths (almost) entirely through black pixels are favoured. In this paper we present our recent work and results focusing on the implementation of the Stable Paths algorithm as a C++ plugin for the MusicStaves Toolkit (Dalitz et al., 2008; MusicStaves Toolkit, 2008) (based on the Gamera Framework (MacMillan et al., 2002; Gamera Framework, 2008)), as well as on the removal stage by using our detection algorithm in the first stage. We have also adapted a removal algorithm based on the LineTrack Height approach proposed in (Dalitz et al., 2008), which we will present in section 3. In section 2 the Gamera and MusicStaves Toolkit are presented, together with our C++ plugin. In section 4, both our proposed detection and removal algorithms are evaluated experimentally using a wellknown dataset of music scores. Finally, conclusions are drawn and future work is outlined in section 5. 2 STABLE PATHS INTEGRATION In this section we present the platform in which our detection algorithm was integrated for testing and validation. The platform is comprised by its core—the Gamera Framework (MacMillan et al., 2002; Gamera Framework, 2008)—and by a toolkit for the evaluation of detection and removal algorithms—the MusicStaves Toolkit (Dalitz et al., 2008; MusicStaves Toolkit, 2008). This toolkit is a set of Gamera plugins aiming to support the development and test of staff line detection and removal algorithms for music scores, by extending the Gamera functionality. After we describe each part constituting this platform, we present the implementation of our staff line detection algorithm as a C++ plugin on the MusicStaves Toolkit. Finally, we present the integration of our detection algorithm on some removal algorithms in the MusicStaves toolkit. We have integrated our algorithm in those removal algorithms where the detection is processed separately from the removal operation, replacing the Dalitz algorithm (Dalitz et al., 2008) as the detection stage. 2.1 “Generalized Algorithms and Methods for Enhancement and Restoration of Archives”. It combines a programming library with graphical tools for an interactive training and development of recognition systems. This framework tries to be a tool to create custom applications through the domain experts knowledge instead of responding to various requirements with a monolithic application. It aims at providing an efficient test and refinement development cycle. The programming language in its basis is the high-level language Python, although it has many extensions written in C++ to carry out low-level image processing. Nevertheless, due to its nature, Python turns the code writing process more agile and facilitates the use of scripting, which makes Gamera an interactive and batchable framework. Besides the large number of extensions deployed with this framework, it is also possible to customize and extend with plugins or toolkits, written in Python and C++. The Gamera framework is modular and organized in a series of horizontal layers, which can be seen at Figure 1. Gamera Framework Gamera (MacMillan et al., 2002; Gamera Framework, 2008) is a portable and open source framework to create structured documents analysis applications by domain experts. The name Gamera is an acronym to Figure 1: Gamera Architecture. Gamera follows a modular plugin architecture. It is made of modules (plugins), both written in Python and C++, integrated in a high-level scripting environment. Each module executes a task on the recognition process. The framework maintains a toolbox design approach, i.e., a user has access to a large set of tools for the optical recognition stages. 2.2 MusicStaves Toolkit MusicStaves (Dalitz et al., 2008; MusicStaves Toolkit, 2008) is a Gamera toolkit specific for the development and test of staff line detection and removal algorithms. This Phyton toolkit also integrates facilities to create a test set of music scores and to evaluate results with established metrics. As with Gamera, this toolkit is portable and its source code is freely available. In order to use MusicStaves, it must be imported to Gamera, either in the GUI or in a programmatic manner. MusicStaves is structured as seen in Figure 2. The toolkit is composed by a set of main classes where the staff line detection and removal algorithms are implemented, and by a set of plugins in Python and C++. Some plugins are used by the implemented algorithms while others are tools to aid the algorithms testing. It is an extendable toolkit in which one can integrate a new staff line detection and/or removal algorithm. Its plugin system follows the Gamera framework rationale and as such the plugins may be written in Python and C++. In order to write new staff line detection or removal algorithms the toolkit provides two interfaces: StaffFinder (detection) and MusicStaves (removal). Figure 3: Some examples of applied deformations: a) Curvature; b) Rotation; c) White Speckles; d) Staffline Interruptions; e) Staffline y-variation; f) Typeset Emulation. See (Dalitz et al., 2008) for details. The plugin encompasses the main StaffFinder class in the toolkit root, the respective Python plugin in the Python plugins folder and the algorithm implementation in C++ in the C++ plugins folder. In Figure 4 we present a diagram of the algorithm as integrated in MusicStaves. The algorithm processing starts with the call to the method find staves, which receives a binarized image as input. That image is then passed to the function in C++ by the plugin Python code. In the C++ implementation, after the respective staff line detection function is called, the image is converted to the format used internally by our algorithm. After the whole detection task is complete it returns the staff lines skeleton in the MusicStaves format. After receiving the skeleton list on its Python class, it fills the structure self.linelist with the obtained values. Figure 2: MusicStaves Architecture. A test set of 32 synthetic music scores is also provided by the authors of this toolkit. Moreover, the toolkit allows applying a set of deformations (e.g., rotation, curvature, typeset emulation, white speckles) commonly found in the real world to these perfect scores—see Figure 3. The purpose is to be able to measure the performance of the removal algorithms contained in MusicStaves by using three defined error metrics (Dalitz et al., 2008): Pixel Level, Segmentation Region Level and Staffline Interruptions. However, the same set may be used to evaluate staff line detection algorithms alone by defining adequate error metrics (Cardoso et al., 2008a; Cardoso et al., 2008b). We have restricted the range of the parameters controlling the intensity of the deformations to values considered realistic. 2.3 Integration We have integrated our recently proposed algorithm for staff line detection (Cardoso et al., 2008a; Cardoso et al., 2008b) in MusicStaves as a C++ plugin. Figure 4: Overall view. Besides integrating our Stable Paths Approach algorithm into the MusicStaves toolkit as a C++ plugin, we have also integrated the algorithm with staff removal algorithms in order to evaluate the improvements over the original results. However, this is not a standard integration as the toolkit does not provide the means for this kind of integration. This integration was coded in the staff removal algorithms by adding the possibility to choose the staff line detection algorithm through a parameter value. A diagram illustrating this integration can be seen in Figure 5. Some removal algorithms present in MusicStaves detect the staff lines along with the removal process. From those, we have used the one with best results in general (Dalitz et al., 2008)—Skeleton—for comparison purposes. Figure 5: Stable Paths integration with staff removal algorithms. 3 STAFF REMOVAL ALGORITHMS In the process of recognizing music scores the staff removal algorithm is processed after the staff line detection stage takes place. In our current work the removal algorithm is based on the LineTrack Height algorithm presented on (Randriamahefa et al., 1993). The goal on this method is to track the staff lines positions obtained by a detection algorithm and remove vertical run sequences of black pixels that have a value lower than a specified threshold, which was experimentally set at 2*staffHeight. As music scores may suffer from deformations, the staff lines may have descontinuities, be curved or inclined. These problems will influence the success to achieve a correct detection of lines contained on the score to recognize. However, due to the abovementioned problems, the positions of the staff lines obtained by a staff line detection algorithm may pass slightly above or under the real staff lines positions. That way, if we are in presence of a white pixel when the staff lines are tracked, we search vertically for the closest black pixel. If that distance is lower than a specified tolerance—experimentally chosen as 1+ceil(staffHeight/3.0)—we move the reference position of the staff line to the position of the black pixel found. On (Dalitz et al., 2008) a new method is presented—Skeleton—which uses the skeleton information, but performs the staff removal on the original Algorithm 1 Staff Removal Algorithm. procedure STAFF L INE R EMOVAL (IMAGE,STAV ES) threshold = 2∗staffHeight; tolerance = 1 + ceil(staffHeight/3.0); IMAGE REMOV E = copy(IMAGE); for nvalid = 0 to STAVES size do Point2D staff = validStaves[nvalid]; for i = 0 to staff size do col =staff[i].x; re f Row =staff[i].y; row = re f Row; pel = valuePixel(IMAGE, IMAGE REMOV E) decrement/increase the reference row until one pixel different from white pixel (dist1/dist2) is found; if dist1 ≤ max(1, min(dist2,tolerance)) then re f Row− = dist1; else if dist2 ≤ max(1, min(dist1,tolerance)) then re f Row+ = dist2; else continue; end if end if Count the number of decrements/increase on the reference row until the black pixel changes to white pixel (run); if run ≥ threshold then continue; end if remove the vertical black sequences on the IMAGE; end for end for end procedure image instead of the skeleton. The method relies on the fact that symbols on the stafflines lead to junction points or corner points in the skeleton. 4 RESULTS Although the evaluation of new staff detection algorithms may be done by visually inspecting the output on a set of scores—as adopted on (Rebelo et al., 2007)—, our current comparison is supported on quantitative measures. The test set adopted for the qualitative evaluation of the proposed method is the one presented in (Dalitz et al., 2008) and already described. It consists of ideal synthetic scores to which a set of known deformations have been applied—see (Dalitz et al., 2008) for more details. In total we have generated 2688 deformed images originated from 32 perfect scores. In order to conveniently measure the performance of staff line removal algorithms we have adopted two error metrics from (Dalitz et al., 2008): Pixel Level and Segmentation Region Level. Staff line detection algorithms can be used as a first step in many staff removal algorithms. To understand the potential of our algorithm to leverage the performance of existing staff removal algorithms, we conducted a series of experiments, comparing the original version of a staff removal algorithm with the modified version of it, making use of the Stable Paths algorithm at the staff line detection step. The quantitative comparison of the different algorithms is totally in line with the comparison presented in (Dalitz et al., 2008). With respect to the considered distortions, regarding the detection stage, the Stable Paths based approach outperforms the Dalitz algorithm. In Figure 6 we present our results for the removal algorithms: LineTrack Height (with Dalitz and Stable Paths), Skeleton and LineTrack Height Modified (with Stable Paths). We chose the methods that present the best results in (Dalitz et al., 2008), implementing our own removal algorithm with LineTrack Height as a basis. In general we verify that the replacement of the Dalitz method by our Stable Paths Approach algorithm as the staff detection step has improved the final staff line removal results7 . Additionally, the LineTracking Height Modified algorithm presents an overall better performance than the original LineTrack Height algorithm from (Dalitz et al., 2008). Our staff line detection and removal approaches also outperform the Skeleton method, although it continues to present a competitive performance. We have not integrated the Stable Paths algorithm with the Skeleton algorithm as the second performs the lines detection along with their removal instead of using two separate stages. All the parameters, both on the Stable Paths detection algorithm and LineTrack Height Modified, were preliminary tuned over an independent set of images. This performance gain is even more noteworthy as the MusicStaves algorithms are receiving as input the correct number of lines per staff. Had not this been the case, the differential between both would have been much larger. In summary, these experiments show the strength of the algorithms presented here. Despite being based on simple and intuitive underlying principles, the performance of the proposed algorithms is quite competitive. The analysed results have covered the detection and removal accuracy but a brief word on speed is also in order. Comparing different algorithms for speed is notoriously difficult; we are simultaneously judging mathematical properties and specific implementa7 For the deformations not shown, the stable path is not significantly better than Dalitz. tions. In the experimental study, the current implementation of the Stable Paths algorithm run almost as fast as the Dalitz algorithm (20% slower). In respect to the removal algorithms our LineTrack Height version with Stable Paths is significantly faster than the Skeleton algorithm (two times faster). Comparing to the original LineTrack Height algorithm with the Dalitz detection algorithm the runtime difference is not significant. The algorithms were evaluated as available at the Staff Removal Toolkit (Dalitz et al., 2008). 5 CONCLUSIONS This paper presented the integration of our robust Stable Paths Approach algorithm (Cardoso et al., 2008a; Cardoso et al., 2008b) in the MusicStaves Toolkit (Dalitz et al., 2008) as a C++ plugin, an improved version of an existing staff line removal algorithm— LineTrack Height (Dalitz et al., 2008), and the results we have obtained in our staff line removal tests. We have integrated our detection algorithm with existing staff line removal algorithms. Our approach successfully deals with the difficulties posed by the symbols superimposed on the staff lines as well as a wide range of image conditions (e.g., discontinuities, curved lines), frequently found on handwritten scores. The encouraging results lead us now to consider investigating the detection of music symbols benefiting from the improved staff line detection and removal, creating a complete OMR application in order to integrate it on our proposed complete OMR solution—OMRSYS (Capela et al., 2008). Thus, our proposed system offers a complete solution for the preservation of our musical heritage. It includes an optical recognition engine integrated with an archiving system and a user-friendly interface for searching, browsing and edition. The digitized scores are stored in MusicXML, a recent and expanding music interchange format designed for notation, analysis, retrieval, and performance applications. Our proposed algorithms and complete OMR system promote the creation of a full corpus of music documents, promoting its preservation and study. This project will culminate in the creation of a repository of handwritten scores, accessible online. The database will be available for enjoyment, educational and musicological purposes, thus preserving this corpus of music in an unprecedented way. (a) Curvature. (b) Curvature. (c) White Speckles. (d) Rotation. (e) Staffline Y-Variation. (f) Staffline Y-Variation. (g) Staffline Interruptions. (h) Typeset Emulation Part 1. (i) Typeset Emulation Part 2. Figure 6: Effect of different deformations on the overall staff removal error rates. See (Dalitz et al., 2008) for parameter details. ACKNOWLEDGEMENTS This work was partially funded by Fundação para a Ciência e a Tecnologia (FCT) - Portugal through project PTDC/EIA/71225/2006. REFERENCES Capela, A., Cardoso, J. S., Rebelo, A., and Guedes, C. (2008). Integrated recognition system for music scores. In International Computer Music Conference (ICMC) (Accepted). Cardoso, J. S., Capela, A., Rebelo, A., and Guedes, C. (2008a). A connected path approach for staff detection on a music score. In International Conference on Image Processing (ICIP 2008) (Accepted). Cardoso, J. S., Capela, A., Rebelo, A., and Guedes, C. (2008b). Staff detection with stable paths. Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (Submitted). Classical Sheet Music Collection (2008). Classical Sheet Music and MIDI Files, http://www.music-scores. com. Dalitz, C., Droettboom, M., Pranzas, B., and Fujinaga, I. (2008). A comparative study of staff removal algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):753–766. Gamera Framework (2008). Gamera Framework, http:// ldp.library.jhu.edu/projects/gamera/. IOA (2008). Institute of the Arts, http://patrimonio. dgartes.pt/?lang=pt. Lester Levi Collection (2008). The Lester S. Levi Collection of Sheet Music, http://levysheetmusic.mse. jhu.edu/. MacMillan, K., Droettboom, M., and Fujinaga, I. (2002). Gamera: Optical music recognition in a new shell. In International Computer Music Conference (ICMC), pages 482–485. MIC (2008). Music Information Center, http://www.mic. pt. MusicStaves Toolkit (2008). MusicStaves Toolkit, http://lionel.kr.hsnr.de/˜dalitz/data/ projekte/stafflines/. Mutopia Collection (2008). The Mutopia Project, http: //sca.uwaterloo.ca/Mutopia. Randriamahefa, R., Cocquerez, J. P., Fluhr, C., Ppin, F., and Philipp, S. (1993). Printed music recognition. In Proceedings of the 2nd International Conference on Document Analysis and Recognition (ICDAR’93), pages 898–901. Rebelo, A., Capela, A., da Costa, J. F. P., Guedes, C., Carrapatoso, E., and Cardoso, J. S. (2007). A shortest path approach for staff line detection. In AXMEDIS ’07: Proceedings of the Third International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution, pages 79–85, Washington, DC, USA. IEEE Computer Society. INTEGRATED RECOGNITION SYSTEM FOR MUSIC SCORES Artur Capela, Jaime S. Cardoso FEUP and INESC Porto Portugal Ana Rebelo FCUP and INESC Porto Portugal Carlos Guedes ESMAE and INESC Porto Portugal [email protected] [email protected] [email protected] [email protected] ABSTRACT Many music works produced in the last century still exist only as original manuscripts or as photocopies. Preserving them entails their digitalization and consequent accessibility in a digital format easy-to-manage which encourages browsing, retrieval, search and analysis while providing a generalized access to the digital material. The manual process to carry out this task is very time consuming and error prone. Automatic optical music recognition (OMR) has emerged as a partial solution to this problem. However, the full potential of this process only reveals itself when integrated in a system that provides seamless access to browsing, retrieval, search and analysis. We address this demand by proposing a modular, flexible and scalable framework that fully integrates the abovementioned functionalities. A web based system to carry out the automatic recognition process, allowing the creation and management of a music corpus, while providing generalized access to it, is a unique and innovative approach to the problem. A prototype has been implemented and is being used as a test platform for OMR algorithms. 1. INTRODUCTION The impact of music in our lives can hardly be overestimated. Music is a pivotal part of our cultural heritage and its preservation, in all of its forms, must be pursued. Portugal has a notorious lack in music publishing from virtually all eras of its musical history. However, whereas most of the known original music manuscripts before the twentieth century are kept at the National Library Archive in Lisbon, there is virtually no national repository for the Portuguese music from the twentieth century. Although there are recent efforts in order to catalogue and preserve in digital form the Portuguese music from the late twentieth century—notably the Music Information Center [10] and the section on musical heritage from the Institute of the Arts website [8]—most of the music pre-dating computer notation software was never published and still exists as manuscripts or photocopies spread out all over the country in inconspicuous places. For example, all the music composed by Jorge Peixinho (1940-1995), an internationally-renowned composer who epitomized the Portuguese avant-garde in the 1960s and 1970s, was never published in Portugal (few of his scores were published abroad), and almost his entire oeuvre consists of manuscript paper [6]. Almost fourteen years past his death, his music is already catalogued, although not published. Unfortunately, this case is not unique, and this situation is common with other great Portuguese composers from the twentieth century. The risk of irreversibly losing this rich cultural heritage is thus a reality. The project “Optical recognition system for handwritten music scores” initiated in 2007 by INESC Porto and ESMAE is the point of departure for creating a web-based system of music manuscripts of Portuguese composers from the twentieth century. This database will provide generalized access of a wide corpus of handwritten unpublished music encoded in MusicXML that can be accessed remotely via the Internet. The database will not only centralize as much information as possible but will also serve to preserve this corpus in a way that is easily accessible for browsing, analysis, and ultimately, for performing this repertoire, therefore help keeping the Portuguese music alive. The ambitious goal of providing generalized access to handwritten scores that have never been published has been severely hampered by the current state-of-the-art of handwritten music recognition. There are currently various commercial OMR software solutions [4, 16, 14] and a few open source solutions [1, 15, 2], but they are all offline standalone applications. The existing online archives of music scores [9, 5, 13] usually provide them in inadequate formats—usually only as the scanned score image—for retrieval or automatic analysis. These online archives are mere standard websites, without facilities for optical recognition, editing and searching through the scores’ musical content. The creation of an OMR system, integrating optical recognition, storage, search, browsing and downloading capabilities, while keeping the scores in their original format along with their digital counterpart, would therefore be extremely beneficial. Using this background, we present the specification and implementation of a system integrating all the required features. It uniquely combines OMR technology in a system, easing the conversion of scores to the Music Extended Markup Language format [12]—MusicXML—as it is being widely adopted and meets our needs. In Section 2 the proposed system is described. We continue in Section 3 by presenting a usage scenario. Finally, conclusions are drawn in Section 4. 2. SYSTEM ARCHITECTURE AND IMPLEMENTATION The system that we propose on this paper comprises the creation of a database of music scores and a web application mainly featuring: • Addition of music scores to the system, performing their recognition and conversion to MusicXML in an integrated manner, allowing the user to confirm and correct the conversion results at the last stage of this process. • Complete maintenance of a fully navigable music scores archive, including both the original version and the digital version obtained from the optical recognition. • Browsing and searching the database, as well as the MusicXML contents. Visualization, downloading and edition of the selected music scores. • Complete system management. The architecture for the proposed system is based on a client-server model. The system is intended to be accessible through the Internet. There are three different entities present in this system, as it can be seen in Figure 1. Web Browser Client 1 Web Server LAN Repository 1 Repository 2 Repository N OMR engine 1 OMR engine 2 OMR engine N Internet Web Browser Client 2 Search engine Web Browser Client N Figure 1. Generic system architecture The Repository module stores the original scanned score, the digital counterpart in MusicXML and all the descriptive metadata inserted by the user, as detailed latter. All the remaining system contents, such as the user information, are also stored in this entity. The Web Server is the user access point to the system as well as to all of its processing modules run on the server, encompassing the search engine and the optical recognition engine for the music scores. There is support for the inclusion of several OMR Engines, aiming to provide the ability to meet different needs (e.g. different music notation systems). The most adequate OMR Engine can be chosen manually by the user or automatically by the system, by detecting the scores notation and type (i.e. handwritten or printed). Our Search Engine allows not only generic searches throughout all of the system contents, but it also provides the capability of searching throughout the music scores MusicXML information in an innovative manner. The Web Server interacts with the Repository and with the Web Browser, which establishes the interface between the user and the system. The user interface on a Web Browser allows the complete management of the music scores and associated metadata, as well as carrying out the system administration. Generaly speaking, the user interface provides the user the ability to execute all the necessary tasks to fully use the proposed system. On the administration side it is possible to manage the users, as well as the whole system contents and validate new ones. There are four user types which can access the system: General User, Registered User, Privileged User and the Administrator. The General User represents a visitor and may only consult and download contents. The remaining types are registered users and according to their level they may add/edit/remove certain contents with or without restrictions. The Privileged User is similar to the Administrator and has full access to all functionalities, though it cannot manage the users from its own level. The contents added by Registered Users have to be validated by Privileged Users or the Administrator, although the later two are able to add any contents without the need to be validated, they are considered to be trustful users. For content management, functionalities include the addition of music scores to the Repository, their automatic recognition, visualization and edition, searching, and browsing. It is also possible to insert and browse information related to the music scores—name, authors, instruments, musical genres—providing the user with a Repository containing all the necessary information to keep a complete music corpus. This metadata can then be used by the user on search queries or for a more flexible browsing experience. Finally, a music work is organized into sections, where each section is a music score which represents a part from the whole music work. This flexible structure allows accommodating either simple or complex works on the system. Each score can be visualized and its representation in MusicXML edited, side-by-side with the original score directly on the Web Browser. Both the visualization and the edition of the music scores are done in a graphical easy-to-use editor available through the Web Browser. Figure 2 illustrates the information recorded in the system associated with a music work. Figure 2. Content stored in the Repository associated with a music work 2.1. Prototype Implementation – OMRSYS We developed a prototype—OMRSYS—taking the system architecture shown in Figure 1 as a basis. Currently, the prototype supports only a single Repository collocated with the Web Server and a single OMR Engine was used to prove this concept. The Repository is implemented as a PostgreSQL 1 database, an open source Database Management System (DBMS). The main reasons for choosing PostgreSQL were its native XML support, needed for creating a search engine that allows searching on the scores’ MusicXML counterparts stored on the database. This is an important aspect as it is a major feature for the proposed system. It is also a very mature and well documented popular DBMS. Another important feature is the great ease of integration on the framework we have selected to develop the prototype, which we will describe next. MySQL 2 is the default DBMS used on the framework chosen for developing the Web Application, but it lacks on XML support. Other DBMSs were also considered but PostgreSQL was the one that best fitted our needs. The development of the Web Application was supported on Ruby on Rails 3 . Rails is an open source, full-stack framework for developing dynamic database-backed web applications according to the Model-View-Control (MVC) pattern. It is an almost complete platform which requires only a DBMS and a server. These features present us with a suitable choice for supporting the development of our prototype. Ruby is the programming language at its core, a flexible and powerful Object Oriented language, with a vast array of powerful characteristics. Another strong advantage is the database manipulation, as it is greatly simplified, which is ideal to develop a system of this kind. Other frameworks [17, 7] were considered but fell short compared to the features of Ruby on Rails. The Web Server selected for the prototype was the Apache HTTP Server 4 with Mongrel 5 to execute the Web Application. This solution is suggested by Ruby on Rails and was found suitable, being both efficient and open source. The Search Engine developed for this prototype allows the usual queries on the database contents, although the groundbreaking search through the MusicXML contents is still not possible at this stage. The OMR Engine on the Web Server is the module responsible for the automatic recognition of the submitted music scores. We initially adapted it from the open source OpenOMR project [15]. The OMR Engine performs an automatic conversion of a submitted music score to a digital easy-to-use representation, the MusicXML format. This digital format allows representing sheet music by its musically-relevant parts, sections, phrases and motives, thus easing the access to the relevant portions of the score while browsing that score in a computer monitor. 1 http://www.postgresql.org http://www.mysql.com 3 http://www.rubyonrails.org 4 http://httpd.apache.org 5 http://mongrel.rubyforge.org 2 Figure 3. The OMRSYS user interface Simultaneously, MusicXML enables the retrieval of relevant musical information for analysis, thus facilitating certain types of computational analysis to be performed on a corpus of scores. Nevertheless, it also provides an adequate way to restore old sheet music, preventing them from oblivion. The other OMR applications we have analysed were left aside because they were either less complete than OpenOMR at the time or they were commercial solutions. Our main goal at this stage was to prove the concept. The User Interface has a great impact in the user experience and is divided into several sections. There is the authentication, title and quick search sections on the upper portion of the screen, followed by the middle and largest portion which includes the main menu and the contents area. The main menu works as a two-level expansible menu and allows access to all the system’s functionalities by grouping them in a logical manner. Similar functionalities follow similar designs to keep the interface consistent, intuitive and easy to learn. Some of the functionalities are the common Create/Read/Update/Delete (CRUD) and the listing of the database contents based on a chosen criteria, all following a familiar behaviour. The main differences and most unique aspects rely on the submission and the update of music scores, which is discussed in Section 3. As an interface example, Figure 3 shows the Graphical User Interface (GUI) being used for browsing the music scores available in the Repository. Both the Privileged Users and the Administrator have additional options in the main menu for validation purposes. The MusicXML Editor for this prototype is still implemented as a plain text editor and viewer, but already showing side-by-side the original score with its MusicXML counterpart. However, the development of a fully integrated graphical MusicXML editor is being pursued. Such editor would allow a higher level and intuitive edition and could be developed for example in Flash, in the likes of MusicRain [11], an online interactive sheet music viewer. The main purpose for the editor in this initial prototype was to give the end-user the possibility to at least view and edit the music scores in MusicXML. The existing editors and visualizers [11] usually have a high maturity level but they are offline applications. Figure 4. Score submission scenario The Digital Rights Management control is done in two ways: the acceptance of a license agreement at the registration process and the validation of submitted music scores by a Privileged User before they become available on the system. tions. An additional benefit of the automatic conversion of the music score to MusicXML is the possibility of encoding the manuscript score in MX format, an XML-base, multi-layered format for music representation [3]. MX synchronizes several layers belonging to the description of a piece of music, e.g. an audio recording and score of the same piece. A system of this kind promotes the creation of a full corpus of music documents, promoting its preservation and study. This project will culminate in the creation of a repository of the handwritten scores, accessible online. The database will be available for enjoyment, educational and musicological purposes, thus preserving this corpus of music in an unprecedented way. Acknowledgments This work was partially funded by Fundação para a Ciência e a Tecnologia (FCT) - Portugal through project PTDC/EIA/71225/2006. 5. REFERENCES 3. USAGE SCENARIO: SCORE SUBMISSION When the insertion of a music score in the system is requested, the user inserts the metadata associated with the music score, of which some is optional—name, year, description, etc—and associates it with one or more authors and a musical genre. Each section of a music work has to be associated with the instruments present on the music score. In the last submission step, after the insertion of the requested metadata, the user submits the various pages for each section, as illustrated in Figure 4. After validating the inserted data, the user then triggers the automatic recognition process by calling a suitable OMR engine. Afterward, an overview of the submitted score is shown by listing its contents allowing the user to view the result of the automatic process on the built-in editor side-by-side with the original scanned image, offering the user the possibility to manually correct the automatic results. After confirming the results and making the necessary corrections, the user then finalizes the music score submission by accepting it. If the score was submitted by the Administrator or a Privileged User it then becomes immediately available on the system; if it was submitted by a standard Registered User it is kept on queue for validation and will only become available once a user with administration privileges validates it. 4. CONCLUSION The proposed system offers a complete solution for the preservation of our musical heritage. It includes an optical recognition engine integrated with an archiving system and a user-friendly interface for searching, browsing and edition. The digitized scores are stored in MusicXML, a recent and expanding music interchange format designed for notation, analysis, retrieval, and performance applica- [1] AOMR2. [Online]. Available: http://www.bzzt.net/ ∼arnouten/wiki/index.php/Gamera#AOMR2: omr toolkit [2] Audiveris. [Online]. Available: http://audiveris.dev.java. net [3] A. Baraté, G. Haus, and L. Ludovico, “An XML-based format for advanced music fruition,” in Proceedings of the Third Sound and Music Computing Conference, 2006, pp. 141–147. [4] Capella-scan. [Online]. Available: http://www. capella-software.com/capscan.htm [5] Classical Sheet Music and MIDI Files. [Online]. Available: http://www.music-scores.com [6] Delgado, Cristina, Machado, Jorge, and J. Machado, Catálogos da obra de Jorge Peixinho, ser. José Machado (ed.) Jorge Peixinho: In Memoriam. Lisbon: Caminho, 2002. [7] Google Web Toolkit. [Online]. Available: http://code. google.com/webtoolkit [8] Institute of the Arts. [Online]. Available: http://patrimonio. dgartes.pt/?lang=pt [9] The Lester S. Levi Collection of Sheet Music. [Online]. Available: http://levysheetmusic.mse.jhu.edu/ [10] Music Information Center. [Online]. Available: http: //www.mic.pt [11] MusicRain. [Online]. Available: http://musicrain.us [12] The MusicXML Format. [Online]. Available: http: //www.musicxml.org [13] The Mutopia Project. [Online]. Available: http://sca. uwaterloo.ca/Mutopia [14] OMeR. [Online]. Available: http://www.myriad-online. com/en/products/omer.htm [15] OpenOMR. [Online]. Available: http://sourceforge.net/ projects/openomr [16] SharpEye Music Reader. [Online]. Available: http: //www.visiv.co.uk [17] Tacos. [Online]. Available: http://tacos.sourceforge.net A CONNECTED PATH APPROACH FOR STAFF DETECTION ON A MUSIC SCORE Jaime S. Cardoso∗ , Artur Capela† , Ana Rebelo‡ , Carlos Guedes§ ABSTRACT The preservation of many music works produced in the past entails their digitalization and consequent accessibility in an easy-tomanage digital format. Carrying this task manually is very time consuming and error prone. While optical music recognition systems usually perform well on printed scores, the processing of handwritten musical scores by computers remain far from ideal. One of the fundamental stages to carry out this task is the staff line detection. In this paper a new method for the automatic detection of music staff lines based on a connected path approach is presented. Lines affected by curvature, discontinuities, and inclination are robustly detected. Experimental results show that the proposed technique consistently outperforms well-established algorithms. Index Terms— Music, optical character recognition, document image processing, image analysis 1. INTRODUCTION The Universal Declaration on Cultural Diversity adopted by the General Conference of UNESCO on 2001 asserts that cultural diversity is as necessary for humankind as biodiversity is for nature, and that policies to promote and protect cultural diversity thus are an integral part of sustainable development. Being music a pivotal part of our cultural heritage, its preservation, in all of its forms, must be pursued. Frequently, the preservation of many music works entails their digitalization and consequent accessibility in a format that encourages browsing, analysis and retrieval. In fact, many music works produced during the last centuries still exist only as original manuscripts or as photocopies. The digitalization of these works is therefore a highly desirable goal. Unfortunately, the ambitious goal of providing generalized access to handwritten scores that were never published has been severely hampered by the actual state-ofthe-art of handwritten music recognition. The manual process required to recognize handwritten musical symbols in scores and to put them in relationship with the spine structure of the score is very time consuming. This justifies the research around the definition of reliable optical music recognition (OMR) algorithms. Staff line detection is one of the fundamental stages of the OMR process, with subsequent processes relying heavily on its performance. The reasons for detecting and removing the staff lines lie on the need to isolate the musical symbols for a more efficient and correct detection of each symbol presented on the score. The detection of staves is complicated due to a variety of reasons. The handwritten staff lines are rarely straight and horizontal, ∗ INESC Porto, Faculdade de Engenharia, Universidade do Porto, Portugal, email: [email protected] † INESC Porto, Faculdade de Engenharia, Universidade do Porto, Portugal, email: [email protected] ‡ INESC Porto, Faculdade de Ciências, Universidade do Porto, Portugal, email: [email protected] § INESC Porto, Escola Superior de Música e Artes do Espectáculo, Portugal, email: [email protected] and are not parallel to each other. For example, some staves may be tilted one way or another on the same page or they may be curved. These scores tend to be rather irregular and determined by a person’s own writing style. Moreover, if we consider that most of these works are old, the quality of the paper in which it is written might have degraded throughout the years, making it a lot harder to correctly identify its contents. In this paper a method for the automatic detection of staff lines based on a connected path approach is presented. The proposed paradigm uses the image as a graph, where the staff lines result as the connected path between the two margins of the image. Our previous work [1] was a first effort to explore this concept. We now present a complete, principled solution, with a strong experimental validation of the proposed approach. This introduction is concluded with a brief review of the work done in this area. In section 2 the proposed algorithm is described. In section 3, the proposed algorithm is evaluated experimentally using a well-known dataset of music scores. Finally, conclusions are drawn and future work is outlined in section 4. 1.1. Related Works The problem of staff line detection is often considered simultaneously with the goal of their removal, although exceptions exist [2, 3]. The simplest approach consists on finding local maxima in the horizontal projection of the black pixels of the image [4]. These local maxima represent line positions, assuming straight and horizontal lines. Several horizontal projections can be made with different image rotation angles, keeping the image in which the local maxima are bigger. This eliminates the assumption that the lines are always horizontal. An alternative strategy for identifying staff lines is to use vertical scan lines [5]. More recent works present a more or less sophisticated use of a combination of projection techniques to improve on the basic approach [6]. Fujinaga [7] incorporates a set of image processing techniques in the algorithm, including run-length coding (RLC), connectedcomponent analysis, and projections. After applying the RLC to find the thickness of staff lines and the space between the staff lines, any vertical black runs that are more than twice the staff line height are removed from the original. Then, the connected components are scanned in order to eliminate any component whose width is less than the staff space height. After a global deskewing, taller components, such as slurs and dynamic wedges are removed. Other techniques for finding staff lines include the application of mathematical morphology algorithms [8], rule-based classification of thin horizontal line segments [9], and line tracing [10, 11]. The methods proposed in [2, 3] operate on a set of ‘staff segments’, with methods for linking two segments horizontally and vertically and merging two segments with overlapping position into one. In spite of the variety of methods available, they all suffer from some limitations. In particular, lines with some curvature or discontinuities are inadequately resolved. The dash detector [12] is one of few works that try to handle discontinuities. The dash detector is an algorithm that searches the image, pixel by pixel, finding black pixel regions that it classifies as stains or dashes. Then, it tries to unite the dashes to construct lines. 2. A CONNECTED PATH APPROACH FOR STAFF LINE DETECTION A staff line can be considered as a connected path from the left side of the music score to the right side. As staff lines are almost the only extensive black objects on the music score, the path we are looking for is the shortest path between the two margins if paths (almost) entirely through black pixels are favoured. More formally, let s and t be two pixels of the image and Ps,t a path over the image connecting them. We are interested in finding the path P that optimizes some predefined distance d(s, t). This criterion should embed the need to favour black pixels. In the work to be detailed, the image grid is considered as a graph with pixels as nodes and edges connecting neighbouring pixels. The weight of each arc, w(p, q), is a function of pixels values and pixels relative positions. A path from vertex (pixel) v1 to vertex (pixel) vn is a list of unique vertices v1 , v2 , . . . , vn , with vi−1 and vi corresponding to neighbour pixels. The path cost is the sum of each arc weight in the path n i=2 w(vi−1 , vi ). As mentioned before, a staff line corresponds to a path from (almost) the left margin of the image to (almost) the right side of the image, (almost) always through black pixels. If the weight assigned to an edge captures the intensity of the path of the adjacent pixels, finding the best path between a point s on the left margin and a point t on the right margin translates into computing the minimum accumulated weight along all possible connected curves connecting s and t: d(s, t) = min w(p, q). (1) At the end of this process, the minimum value of the last column in C will indicate the end of the minimal connected staff. Hence, in the second step one backtrack from this minimum entry on C to find the path of the optimal staff. 2.1. Algorithm outline Assume one wants to find all staff lines present in a score. This can be approached by successively finding and removing the shortest path from the left to the right margin of the score. The removal operation is required to ensure that a staff is not detected twice2 . Consider the music score presented in Figure 1(a). In Figure 1(b) the first 11 shortest paths are traced. This example shows that music symbols placed on top of staff lines do not interfere with the detection of the staff lines. Moreover, the example also makes clear that slight skewed scores do not pose any problem to the proposed approach. P X Ps,t Staff lines are best modelled as paths between two regions Ω1 and Ω2 , the left and right margins of the score. The shortest path between two regions Ω1 and Ω2 is defined as a path Ps,t , with s ∈ Ω1 and t ∈ Ω2 and cost equal to d(Ω1 , Ω2 ) = min s∈Ω1 ,t∈Ω2 d(s, t). (2) One may assume that staff lines do not zigzag back and forth, left and right. Therefore, one may restrict the search among connected paths containing one, and only one, pixel in each column of the image1 . Formally, let I be a n × m image and define an admissible staff to be s = {(j, y(j))}n j=1 , s.t. ∀j |y(j) − y(j − 1)| ≤ 1, where y is a mapping y : [1, · · · , n] → [1, · · · , m]. That is, a staff line is an 8-connected path of pixels in the image from left to right, containing one, and only one, pixel in each column of the image. Given the weight function w(p, q), one can define the cost of n a staff as C(s) = i=2 w(vi−1 , vi ). The optimal staff line that minimizes this cost can be found using dynamic programming. The first step is to traverse the image from the second column to the last column and compute the cumulative minimum cost C for all possible connected staff lines for each entry (i, j): P 8 > < C(i, j) = min > : 1 This C(i − 1, j − 1) C(i − 1, j) C(i − 1, j + 1) + + + w(pi−1,j−1 ; pi,j ) w(pi−1,j ; pi,j ) w(pi−1,j+1 ; pi,j ) assumption imposes a maximum detectable 45 rotation degree. (a) Skewed staff lines with music (b) The first shortest paths between symbols. left and right margins. Fig. 1. An exemplificative example. Our first naive effort to apply the shortest path foundation to staff line detection in [1] did it by computing the shortest path between two pixels at the same height on the left and right margin. That approach was not robust enough to tilted scores, leading the detected paths to jump between consecutive staff lines. Nonetheless, two main issues are still visible with the current methodology and need to be conveniently addressed. A criterion is needed to stop the iterative detection of the shortest paths (staff lines) and the initial and the final parts of a path should be trimmed. 2.2. Proposed Algorithm To detect the staff lines, the proposed overall algorithm starts by estimating the staff space height, staffspaceheight, and staff line height, stafflineheight. These lengths are used as reference lengths on subsequent operations. Robust estimators are already in common use: the technique starts by computing the vertical runlengths representation of the image. If a bit-mapped page of music is converted to vertical run-lengths coding, the most common blackruns represents the staff line height and the most common white-runs represents the staff space height [7]. After estimating the reference lengths, the proposed approach applies the main step of the framework, by successively finding the shortest path between the left and right margin, adding the path found to the list of staff lines and removing it from the image. The weight w(p, q) was experimentally set to w(p, q) = 2 + (Ip + Iq )/255, √ with Ip , Iq ∈ {0, 255}, for pixels in a 4neighbourhood or 2 times that value for 8-neighbours. The removal operation sets to white the pixels on a vertical strip of height= 2×stafflineheight, centred on the detected staff 2 We implemented the removal operation by setting to white the pixels on the detected staff; image resizing could be a valid alternative [13]. line. To stop the iterative staff line search, a sequence of (arguably) sensible rules is used to validate the last found path; if it does not pass the checking, the iterative search is broken. Two validation rules were applied, both assessing features with respect to the first detected staff line (assumed to be the most perfect one). If the last path does not have a percentage of black pixels above a threshold, the search is broken (a threshold of blackperc = 0.8 of the percentage of black pixels in the first staff line was used in the experiments). Likewise, if the shape of the last detected path differs too much from the shape of the first detected path (measured as the average ydistance between both paths, after removing the means), the iteration is broken. A threshold shapediff = 4×staffspaceheight was experimentally selected. After the main search step, detected staff lines are post-processed. Although true staff lines never intersect, the above algorithm may occasionally create intersecting lines. That may be due to a local low quality of a line, leading the shortest path to jump between consecutive lines; the next iteration will then follow the remaining segments, intersecting with the previous detected line. To preclude such final, undesired state, lines are post-processed to remove intersections. That is easily and efficiently accomplished by, for each image column, sorting on y the pixels of the detected lines and assigning the i-pixel to the i-line. After this simple process, lines may touch but they do not intersect. Each retained line is then trimmed at the beginning and at the ending. As visible in the previous example (refer to Figure 1(b)), before meeting with a staff line, a path travels through a sequence of white pixels. Likewise, after the end of the staff line, the path goes again through a sequence of white pixels until it meets the right margin of the image. In order to ignore all of these white pixels, the initial pixels of the path are discarded until a run of at least blackrun black pixels are found in the path. In the same way, all pixels of the path after the last occurrence of a run of at least blackrun black pixels are discarded. A threshold of blackrun = 2×staffspaceheight was used on the experiments. Finally, lines are smoothed with a standard average low-pass filter. A window size of 2×staffspaceheight was selected on the experiments. 3. RESULTS This section provides experimental results obtained on a set of scores. Although the assessment of a new staff detection algorithm may be done by visually inspecting the output on a set of scores—as adopted on [1]—, here we support the comparison on quantitative measures. The test set adopted for the qualitative evaluation of the proposed method is the one presented in [14]. The test set consists of ideal scores to which known deformations can be applied. The distortions range from rotation and curvature to typeset emulation and staff line thickness variation—see [14, 15] for more details. In total, 630 images were generated from the originally perfect scores. To conveniently measure the performance of a staff line detection algorithm, we considered two different error metrics: the number of false positive staff lines and missed to detect staff lines. To evaluate these metrics, we start by computing the average Euclidian distance between each reference staff line and each actually detected staff line; then we solve the matching problem on the resulting bipartite graph by minimizing the assignment cost (= distance). Only pairs with average error-distance bellow the staff line height were considered correctly matched (the other pairs were assumed to originate from a false positive staff line being matched to an undetected true staff line and were therefore unmatched). Now the two metrics result as the number of unmatched detected staff lines (false positive) and unmatched reference staff lines (missed to detect). The proposed algorithm was compared with the three methods considered in [14] for staff line detection3 . As Dalitz’s algorithm performs significantly better than the two other algorithms evaluated in [14], we have only included Dalitz results in subsequent figures. It is important to state that the comparison reports only to staff line detection algorithms, not to the removal phase. That explains the need to introduce the aforementioned metrics, while not adopting the metrics introduced in [14] for assessing staff line removal. The effects of the different deformations over the respective parameter ranges are shown in Figure 2. With respect to the distortions considered, our connected path based approach is the most robust and clearly outperforms the Dalitz algorithm. In fact, the performance of our approach is almost independent of intensity of the deformation, for the range of values considered. This performance gain is even more noteworthy as the Dalitz algorithm is receiving as input the correct number of lines per staff, while the proposed approach does not rely on that information. Had not this been the case, the differential between both would have been much larger. In summary, these experiments show the strengths of the presented algorithm. Despite being based on a simple and intuitive underlying principle, the performance of the proposed algorithm is quite competitive. Moreover, the results are prone to be improved even further by elaborating the stopping criterion of the iterative search or the post-processing rules, while leaving intact the main principle of the method. 4. CONCLUSION The first challenge faced by an OMR system is staff line detection. This first task dictates the possibility of success for the recognition of the music score. In the case of handwritten music scores, the existing solutions are far from presenting satisfactory results. In this paper, a new algorithm for the automatic detection of staff lines in music scores was proposed. The connected path approach for staff line detection algorithm is adaptable to a wide range of image conditions, thanks to its intrinsic robustness to skewed images, discontinuities, and curved staff lines. The handwritten staff lines are rarely straight and horizontal, and are not parallel to each other. Some staves may be tilted one way or another on the same page or they may be curved. While current approaches apply a chain of heuristics to correct these undesired imperfections, the connected path algorithm is naturally robust to these challenging conditions. The proposed approach is robust to broken staff lines (due to lowquality digitalization or low-quality originals) or staff lines as thin as one pixel. Missing pieces are automatically ‘completed’ by the algorithm. Acknowledgments This work was partially funded by Fundação para a Ciência e a Tecnologia (FCT) - Portugal through project PTDC/EIA/71225/2006. 5. REFERENCES [1] Ana Rebelo, Artur Capela, Joaquim F. Pinto da Costa, Carlos Guedes, Eurico Carrapatoso, and Jaime S. Cardoso, “A shortest path approach for staff line detection,” in Third International Conference on Automated Production of Cross Media Content for Multi-channel Distribution (AXMEDIS 2007), 2007. 3 The source code is available upon request to the authors. false staves (connected path) missed staves (connected path) false staves (Dalitz) missed staves (Dalitz) 1 0.9 0.8 0.8 0.7 0.7 0.6 0.6 error error 0.8 1 0.9 0.5 0.5 0.6 0.5 0.4 0.4 0.4 0.3 0.3 0.3 0.2 0.2 0.2 0.1 0.1 0 −5 0 Angle (degrees) 5 (a) Rotation. 0 0.02 false staves (connected path) missed staves (connected path) false staves (Dalitz) missed staves (Dalitz) 0.7 false staves (connected path) missed staves (connected path) false staves (Dalitz) missed staves (Dalitz) error 1 0.9 0.1 0.03 0.04 0.05 0.06 0.07 Amplitude/staffwidth 0.08 0.09 0.1 0 0.02 (b) Curvature. 0.1 1 false staves (connected path) missed staves (connected path) false staves (Dalitz) missed staves (Dalitz) 0.8 false staves (connected path) missed staves (connected path) false staves (Dalitz) missed staves (Dalitz) 0.9 0.8 0.7 0.7 0.6 0.6 error error 0.06 0.08 Rate of whitened pixels (c) White speckle. 1 0.9 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0 0.04 0.1 2 2.5 3 3.5 4 4.5 Maximum deviation n 5 (d) Line y-variation. 5.5 0 6 (e) Shortest path (left) and Dalitz (right) results for rotation (angle=-4◦ ). 2 4 6 8 10 Maximum vertical shift nshift (for ngap = 10) 12 (f) Typeset emulation. Fig. 2. Effect of different deformations on the overall error rates. See [14] for parameter details. [2] Hidetoshi Miyao and Masayuki Okamoto, “Stave extraction for printed music scores using dp matching,” Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 8, pp. 208–215, 2007. [3] Mariusz Szwoch, “A robust detector for distorted music staves,” in Computer Analysis of Images and Patterns, pp. 701– 708. Springer-Verlag, Heidelberg, 2005. [4] Dorothea Blostein and Henry S. Baird, “A critical survey of music image analysis,” in Structured Document Image Analysis, Baird, Bunke, and Yamamoto (Eds.), Eds., pp. 405–434. Springer-Verlag, Heidelberg, 1992. [5] N. P. Carter, Automatic Recognition of Printed Music in the Context of Electronic Publishing, Ph.D. thesis, Departments of Physics and Music, University of Surrey, 1989. [6] D. Bainbridge, Extensible Optical Music Recognition, Ph.D. thesis, Department of Computer Science, University of Canterbury, Christchurch, NZ, 1997. [7] Ichiro Fujinaga, “Staff detection and removal,” in Visual Perception of Music Notation: On-Line and Off-Line Recognition, Susan George, Ed., pp. 1–39. Idea Group Inc., 2004. [8] Ignacy Gawedzki, “Optical music scores recognition,” Tech. Rep., 2002. [9] J. V. Mahoney, “Automatic analysis of music score images. B.Sc thesis,” 1982. [10] D. Prerau, Computer pattern recognition of standard engraved music notation, Ph.D. thesis, Department of Computer Science and Engineering, MIT, 1970. [11] J. W. Roach and J. E. Tatem, “Using domain knowledge in low-level visual processing to interpret handwritten music: an experiment,” Pattern Recognition, vol. 21, no. 1, pp. 33–44, 1988. [12] I. Leplumey, J. Camillerapp, and G. Lorette, “A robust detector for music staves,” in Proceedings of the International Conference on Document Analysis and Recognition, 1993, pp. 902–905. [13] Shai Avidan and Ariel Shamir, “Seam carving for contentaware image resizing,” in ACM Transactions on Graphics (SIGGRAPH 2007), 2007, vol. 26. [14] Christoph Dalitz, Michael Droettboom, Bastian Czerwinski, and Ichiro Fujigana, “A comparative study of staff removal algorithms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004. [15] Christoph Dalitz, Michael Droettboom, Bastian Czerwinski, and Ichiro Fujigana, “Staff removal toolkit for gamera(20052007,” .