New methodologies towards an automatic optical

Transcrição

New methodologies towards an automatic optical
Ana Maria Rebelo
New methodologies towards an automatic
optical recognition of handwritten musical
scores
Dissertação de Mestrado
Universidade do Porto
Outubro 2008
Universidade do Porto
Departamento de Matemática Aplicada
Tese submetida à Faculdade de Ciências da
Universidade do Porto para obtenção do grau de
Mestre em Engenharia Matemática
New methodologies towards an automatic
optical recognition of handwritten musical
scores
Ana Maria Rebelo
([email protected])
Dissertação realizada sob a supervisão do
Professor Doutor Jaime S. Cardoso (INESC Porto, FEUP)
e sob a co-orientação do
Professor Doutor Joaquim F. Pinto da Costa (DMA, FCUP).
Porto, Outubro 2008
A painter paints pictures on canvas. But musicians paint their pictures on silence.
Leopold Stokowski
Abstract
Many music works produced in the past are currently available only as original manuscripts
or as photocopies. Their preservation has entailed a vast amount of research in the last years.
The digitalization has been commonly used as a possible preservation method which comprises
setting up a digital copy. Despite the fact of an easy accessibility in a machine-readable format,
which encourages browsing, retrieval, search and analysis and, most importantly, the preservation of endangered works, while providing a generalized access to digital content, actually
it only keeps a portion of the information.
Several systems have been presented in order to
fulll the lack of a tool able to analyse and perform semantic search operations.
Carrying
this task manually is very time consuming and error prone. While optical music recognition
systems usually perform well on printed scores, the processing of handwritten musical scores
by computers remains far from ideal.
One of the fundamental stages in the optical music recognition is the detection and subsequent
removal of sta lines. In this project we investigate a general-purpose, knowledge-free method
for the automatic detection of music sta lines based on stable paths approach. Lines aected
by curvature, discontinuities, and inclination are robustly detected. A sta removal algorithm is
also developed by adapting an existing line removal approach to use the stable paths algorithm
at the detection stage.
Experimental results show that the proposed technique consistently
outperforms well-established algorithms. The developed approach will be integrated, as future
work, in a web based system providing seamless access to browsing, retrieval, search and analysis of submitted scores.
Symbol detection is also crucial for a good performance in an optical music recognition system.
In this work we propose a segmentation method based on the hierarchical decomposition of
music sheets. The symbols are split into four dierent types to facilitate their extraction. Since
we are working with handwritten scores many situations may occur. The work degradation,
highly handwritten dierent types as well as other possible undesirable discontinuities are
situations to take into account.
One way to diminish these situations in the classication
process is to create a database where several distortions are simulated. So, a new methodology
is presented where elastic matching is used in conjunction with several classiers, for instance
neural networks and hidden Markov models.
A profound comparative study is made about
classiers with and without the elastic matching applied to handwritten scores. It is expected
that the obtained results outperform the actual state of the art.
I
Resumo
Muitos dos trabalhos musicais produzidos no passado estão actualmente disponíveis apenas
como manuscritos originais ou fotocópias.
de pesquisas nos últimos anos.
A sua preservação tem causado um vasto leque
A digitalização tem sido usualmente utilizada como possível
método de preservação em que se resume à criação de uma cópia digital. Apesar de constituir
uma acessibilidade fácil num formato legível por computador, que permite a consulta, a recuperação, a procura e a análise, e mais importante, a preservação das obras em risco, enquanto
promove um acesso generalizado ao material digital, na verdade apenas guarda uma parte da
informação. Vários sistemas têm sido apresentados para preencher a falta de uma ferramenta
que seja capaz de realizar análises e executar operações de pesquisa semântica. Executar esta
tarefa manualmente é deveras dispendioso e susceptível a erros. Enquanto que os sistemas de
reconhecimento óptico musical apresentam, geralmente, um bom desempenho para partituras
impressas, para partituras manuscritas o processo recorrendo ao computador continua longe do
ideal.
Uma das etapas fundamentais em reconhecimento óptico é a detecção e subsequente remoção
das linhas de pauta. Neste projecto é investigado um método de uso geral e conhecimento livre
para a detecção automática das linhas de música baseado na aproximação do caminho estável.
Linhas afectadas por curvatura, discontinuidades e inclinação são detectadas robustamente.
Também é desenvolvido um algoritmo de remoção de linhas adaptando um processo já existente de remoção para usar o algoritmo do caminho estável na fase de detecção. Os resultados
experimentais obtidos mostram que a técnica proposta supera consistentemente os algoritmos
bem estabelecidos.
A aproximação desenvolvida será agora integrada, num trabalho futuro,
num sistema web, proporcionando um acesso à navegação, recuperação, pesquisa e análise das
partituras submetidas.
A detecção dos símbolos é também crucial para um bom desempenho de um sistema de reconhecimento óptico musical. Neste trabalho é proposto um método de segmentação baseado
numa decomposição hierárquica da folha de música. Os símbolos são divididos em quatro tipos
diferentes para facilitar a sua extracção.
Como estamos a lidar com partituras manuscritas,
diversas situações são susceptíveis de ocorrer. A degradação do trabalho, diferentes tipos de
manuscritos, assim como outras possíveis descontinuidades não desejáveis são situações a ter em
conta. Uma forma de diminuir estes problemas, no processo de classicação, é criar uma base
de dados onde se simulam várias distorções. Neste sentido, uma nova metodologia é apresentada, onde se recorre à modelação elástica em conjunção com vários classicadores, como por
exemplo, redes neuronais e cadeias de Markov escondidas. É apresentado um profundo estudo
comparativo sobre estes classicadores com e sem modelação elástica aplicada às partituras
manuscritas. É esperado que os resultados obtidos superem o estado actual da arte.
Acknowledgements
Was this project successfully accomplished? The period was long and sometimes dicult; however I believe that the answer is yes. I am thankful to a set of persons who made all this work
possible and fun to perform.
I came to INESC Porto to nish my undergraduate in Mathematical Applied to Technology
and I ended up staying for another year for the realization of my Master course in Mathematical Engineering. I am sincerely grateful to Prof. Dr. Jaime S. Cardoso, my thesis supervisor,
for giving me the opportunity to continue to work with him, by the trust in my work, by
the availability and patience always demonstrated, for his guidance given with high rigor and
professionalism and by his wise knowledge.
I also thank my thesis co-supervisor, Prof.
Dr.
Joaquim Pinto da Costa, for the availability to help and guide with all the rigor and erudition.
It was a great pleasure to work in the INESC Porto with a team as fabulous as the UTM that
done everything to integrate me in the group as soon as possible. To all my colleagues who
accompanied me on this journey and in one way or another supported me and encouraged me,
my sincere thanks. I also do a special thank to Artur Capela, my colleague in the OMR project,
for all the patience and support given, and to Prof. Dr. Carlos Guedes, professor from ESMAE
associated with this Project.
INESC Porto for providing the right environment for
(Fundação para a Ciência e Tecnologia ) for nancial support.
I thank
high-quality research and FCT
To my parents who made a huge eort to tolerate and help the realization of my dreams and
aspirations. To my twin sister my greatest appreciation for all that she represents for me, for
her support and the countless hours of complicity available. To D. Lurdes, a very special friend
who made me face the diculties with courage and determination. Last but not least, I thank
a person very special to me, accompanied me on this journey with great patience and spirit
winner, helping me in this work with his critical and rigorous eye. Thank you for your presence
at all times.
I would like to nish these acknowledgments stressing that with determination and humbleness
our dreams can take place.
Ana
Maria
October,
Silva
Rebelo,
2008
III
Contents
Contents
IV
List of Figures
VI
List of Tables
VIII
Glossary
X
I Introduction
1
1 Introduction
3
1.1
OMR Architecture
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5
1.2
Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6
1.3
Contributions of this Project and Related Publications . . . . . . . . . . . . . .
7
1.4
Dissertation Structure
8
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2 Related Works
10
2.1
State of the Art in Sta Lines Detection . . . . . . . . . . . . . . . . . . . . . .
10
2.2
State of the Art in Musical Symbol Extraction and Classication
. . . . . . . .
12
2.3
Background Knowledge
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
13
II Detection
21
3 Sta Lines Detection and Removal
23
3.1
Basic Concepts
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
3.2
Underlying Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
3.3
Algorithm Outline
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
25
3.4
Stable Paths on a Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
26
3.5
Proposed Algorithm
28
3.6
Design of the Weight Function
. . . . . . . . . . . . . . . . . . . . . . . . . . .
30
3.7
Sta Line Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
31
3.8
Database of Music Scores
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
33
3.9
Evaluation Metrics and Results . . . . . . . . . . . . . . . . . . . . . . . . . . .
38
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
IV
IIISegmentation and Classication
47
4 Segmentation and Classication Process
49
4.1
Musical Symbols
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
49
4.2
Segmentation Process
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
56
4.3
Classication Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
63
IVConclusions and Future Work
71
5 Conclusion
73
5.1
Future Work
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
74
References
76
V Appendix
81
A Fundamentals
83
A.1
Primal Problem vs Dual Problem . . . . . . . . . . . . . . . . . . . . . . . . . .
83
A.2
Error Backpropagation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . .
85
A.3
Dalitz's Algorithm
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
91
A.4
Matching in Bipartite Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . .
91
A.5
Otsu Threshold Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
92
A.6
Sta Line Removal Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . .
94
A.7
Baum-Welch Algorithm
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
94
A.8
Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
96
B Table of confusion
97
B.1
Results Obtained Without Elastic Matching for the Handwritten Music Symbols
97
B.2
Results Obtained Without Elastic Matching for the Printed Music Symbols
99
B.3
Results Obtained With Elastic Matching for the Handwritten Music Symbols
B.4
Results Obtained With Elastic Matching for the Printed Music Symbols
C Articles Submited to Conferences
. .
.
101
. . . .
103
105
List of Figures
1.1
Generic system architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4
1.2
OMR architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6
2.1
A directed acyclic graphs and its linearization.
. . . . . . . . . . . . . . . . . . . .
14
3.1
An exemplicative example of the methodology. . . . . . . . . . . . . . . . . . . . .
26
3.2
Stable paths on a toy example.
27
3.3
Exemplication of stable paths for Figure 3.1(a).
3.4
Length of a chord through a skeleton point at some angle
3.5
Example (from [DDCF08]).
3.6
Two examples of music scores from the test set used on the experimental evaluation.
3.7
Some examples of applied deformations from the original image: a) Original; b)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
ϕ.
28
. . . . . . . . . . . . .
32
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
33
34
Curvature c) Degradation after Kanungo; d) Sta line thickness variation; e) Sta
line y-variation; f ) Typeset emulation; g) Rotation; h) White Speckles; i) Sta line
Interruptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
35
3.8
Generation of the deformed images and the respective Ground-Truth. . . . . . . . .
38
3.9
Examples of the results obtained using the sta line removal algorithm in our test set. 45
4.1
Example.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
51
4.2
Example.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
52
4.3
Example: a) Tie above noteheads; b) Tie under noteheads.
. . . . . . . . . . . . .
54
4.4
An exemplicative example of the variability in the music symbols. . . . . . . . . .
57
4.5
Segmentation process I.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
58
4.6
Segmentation process II. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
58
4.7
An exemplicative example of two dierent connected components in the same
bounding box.
4.8
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
59
An example of the existence of inconsistency in the beam thickness and in the link
with the stems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
59
Beam segmentation process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
59
4.10 Notes segmentation process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
60
4.11 An example of sharp detection on a real score.
. . . . . . . . . . . . . . . . . . . .
60
4.12 An example of sharp detection on a ideal score. . . . . . . . . . . . . . . . . . . . .
61
4.9
4.13 An example of how the music symbols that are in the ideal scores were manually
splited.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
62
4.14 Results of the error metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
62
VI
4.15 Results of the error metrics for the rotation deformation. . . . . . . . . . . . . . . .
63
4.16 Results of the error metrics for the curvature deformation. . . . . . . . . . . . . . .
63
4.17 Neural network topology.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
A.1
Neural network architecture with three layers, two inputs and one output. . . . . .
87
A.2
Forward propagation of the signal in the neural network. . . . . . . . . . . . . . . .
88
A.3
Comparison between the signal output and the target. . . . . . . . . . . . . . . . .
88
A.4
Propagation of the error signal
in backward mode in the neural network. . . . . .
89
A.5
The calculation of weights in the neural network. . . . . . . . . . . . . . . . . . . .
90
d
List of Tables
3.1
Deformations in the images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2
Ranges of deformation parameters used in the tests:
3.3
Eect of dierent deformations on the overall sta detection error rates in percent-
min:step:max.
. . . . . . . .
35
37
age: average (standard deviation) of the false detection rate and miss detection rate.
See [DDCF08] for parameter details.
3.4
. . . . . . . . . . . . . . . . . . . . . . . . . .
Detection performance on real music scores in percentage: average (standard deviation). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.5
r. .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . .
44
Eect of dierent deformations on the overall sta removal error rates in percentage:
average (standard deviation) Sta line interruption error.
3.9
43
Eect of dierent deformations on the overall sta removal error rates in percentage:
average (standard deviation) Individual pixels error.
3.8
42
Removal performance on real music scores (in percentage): average (standard deviation). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.7
41
Sta segment extraction errors based on the number of segments in an equivalence
class
3.6
40
. . . . . . . . . . . . .
44
Eect of dierent deformations on the overall sta removal error rates in percentage:
average (standard deviation) Segmentation region level.
. . . . . . . . . . . . . .
45
4.1
Clefs.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
50
4.2
Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
50
4.3
Flags and beams. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
51
4.4
Rests.
53
4.5
Ties and Slurs.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
54
4.6
Accents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
55
4.7
Time signatures.
56
4.8
Number of neurons in the hidden layer (not)using elastic matching method (EM) in
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
the dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.9
65
The number of states in the hidden Markov model with and without the elastic
matching method (EM) in the dataset. . . . . . . . . . . . . . . . . . . . . . . . . .
4.10 The values of the parameters
C
and
γ
66
in the support vector machines with and
without the elastic matching method (EM) in the dataset. . . . . . . . . . . . . . .
66
4.11 Classes list of handwritten and printed music symbols. . . . . . . . . . . . . . . . .
67
4.12 Results obtained with the test set in the classication process for the handwritten
music symbols.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
VIII
68
4.13 Results obtained with the test set in the classication process for the printed music
symbols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
68
4.14 Results obtained with the test set in the classication process for the handwritten
music symbols.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
69
4.15 Results obtained with the test set in the classication process for the printed music
symbols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
70
4.16 The performed of the natural and sharp symbols. . . . . . . . . . . . . . . . . . . .
70
B.1
Table of confusion of the nearest neighbor classier.
. . . . . . . . . . . . . . . . .
97
B.2
Table of confusion of the neural network. . . . . . . . . . . . . . . . . . . . . . . . .
97
B.3
Table of confusion of the support vector machines.
. . . . . . . . . . . . . . . . . .
98
B.4
Table of confusion of the hidden Markov model. . . . . . . . . . . . . . . . . . . . .
98
B.5
Table of confusion of the nearest neighbor classier.
. . . . . . . . . . . . . . . . .
99
B.6
Table of confusion of the neural network. . . . . . . . . . . . . . . . . . . . . . . . .
99
B.7
Table of confusion of the support vector machines.
. . . . . . . . . . . . . . . . . .
99
B.8
Table of confusion of the hidden Markov Model. . . . . . . . . . . . . . . . . . . . .
100
B.9
Table of confusion of the nearest neighbor classier.
. . . . . . . . . . . . . . . . .
101
B.10 Table of confusion of the neural network. . . . . . . . . . . . . . . . . . . . . . . . .
101
B.11 Table of confusion of the support vector machines.
. . . . . . . . . . . . . . . . . .
101
B.12 Table of confusion of the hidden Markov model. . . . . . . . . . . . . . . . . . . . .
102
B.13 Table of confusion of the nearest neighbor classier.
. . . . . . . . . . . . . . . . .
103
B.14 Table of confusion of the neural network. . . . . . . . . . . . . . . . . . . . . . . . .
103
B.15 Table of confusion of the support vector machines.
. . . . . . . . . . . . . . . . . .
103
B.16 Table of confusion of the hidden Markov model. . . . . . . . . . . . . . . . . . . . .
104
Glossary
OMR Optical Music Recognition
SVM Support Vector Machine
HMM Hidden Markov Model
EM Elastic Matching
LTH Line Track Height
X
Part I
Introduction
1
Chapter 1
Introduction
Music, from Greek
µυσική (τ ´χνη )
musiké (téchne),
which means the art of the muses, can
be dened as an organized sequence of sounds and silences so as to produce aesthetic pleasure in the listener. There are evidences, by Pictographs, that music is known and practiced
since prehistory.
Over the years, the music expanded in many several music styles and for
many dierent purposes, like educational or therapy. All known cultures have their own musical practice. Music is a pivotal part in the cultural heritage of any society. In this way, its
preservation, in all of its forms, must be pursued. On the face of it, the Universal Declaration
on Cultural Diversity adopted by the General Conference of UNESCO on 2001 asserts that
cultural diversity is as necessary for humankind as biodiversity is for nature, and that policies
to promote and protect cultural diversity thus are an integral part of sustainable development1 .
Portugal has a notorious lack in music publishing from virtually all eras of its musical history.
In spite of most of the original manuscripts of music known before twentieth century being
kept in the national library in Lisbon, there is not any repository of musical information from
the last century. Although there are recent eorts to catalogue and to preserve in digital form
2
the Portuguese music from the twentieth century notably the Music Information Center
3
and the section on musical heritage from the Institute of the Arts website
most of the mu-
sic pre-dating computer notation software was never published and still exists in the form of
manuscripts or photocopies spread out all over the country in discreet places. The risk of irreversibly losing this rich cultural heritage is a reality that must be taken seriously and dealt with
accordingly. Digitization has been commonly used as a possible tool for preservation, oering
easy duplications, distribution, and digital processing. However, transforming the paper-based
music scores and manuscripts into a machine-readable symbolic format (facilitating operations
such as search, retrieval and analysis), requires an Optical Music Recognition (OMR) system.
Unfortunately, the actual state-of-the-art of handwritten music recognition is far from providing a satisfactory solution.
The project
Automatic recognition of handwritten music scores
initiated in 2007 by Instituto
de Engenharia de Sistemas e Computadores do Porto (INESC Porto) and Escola Superior de
Música e das Artes do Espectáculo (ESMAE) was the starting point for creating an OMR
1
http://www.unesco.org/bpi/eng/unescopress/2001/01-112e.shtml
http://www.mic.pt
3
http://patrimonio.dgartes.pt/?lang=pt
2
3
Chapter 1. Introduction
4
system that addresses some of the identied problems which will be described in detail in the
following sections.
It is the aim of this project to overcome the problem of musical symbol recognition in handwritten scores through the research and application of the most recent techniques of machine
learning and articial intelligence. Moreover, there is also the intention of creating a web-based
system providing generalized access to a wide
corpus
of handwritten unpublished music en-
coded in digital format. The database will not only centralize as much information as possible
but will also serve to preserve the musical heritage in an innovative way with a wide range of
possibilities [CCRG08a, CRCG08]. The architecture for the proposed system is represented in
Figure 1.1.
Figure 1.1: Generic system architecture.
The system is composed by three dierent entities: repository, web server and web browser.
Briey, the Repository module stores the original scanned score, the digital counterpart in
MusicXML and all the descriptive metadata inserted by the user. All the remaining system
contents, such as the user information, are also stored in this entity. The Web Server is the
user access point to the system as well as to all of its processing modules run on the server,
encompassing the search engine and the optical recognition engine for the musical scores. The
Web Server interacts with the Repository and with the Web Browser, which establishes the
interface between the user and the system. The user interface on a Web Browser allows the
complete management of the musical scores and associated metadata, as well as carrying out
the system administration.
The project presented here is innovative for several reasons:
1. It has the ambition to develop a formalism to model, in a consistency way, the heterogeneous knowledge about the language and musical notation.
2. Aims to explore the wealth and the potential from the most recent techniques of machine
learning and articial intelligence, not only for representing and merging knowledge, but
also to make decisions.
1.1. OMR Architecture
3. Intends to digitalize and preserve a wide
5
corpus of handwritten scores in an unprecedented
way.
4. Will include an OMR engine integrated with an archiving system and a user-friendly
interface for searching, browsing and edition. The digitized scores will be stored in MusicXML, a recent and expanding music interchange format designed for notation, analysis,
retrieval, and performance applications.
5. Intends to make accessible online the repository of handwritten scores for enjoyment,
educational and musicological purposes.
The work of this thesis, as carried out in this project, aims to develop new OMR algorithms
for the web-based sytem.
1.1 OMR Architecture
The principal aims of an OMR application are the recognition, the representation and the storage of musical scores in a machine-readable format.
An OMR program should thus be able
to recognize the musical content and make the semantic analysis of each musical symbol of a
musical work. In the end, all the musical information should be saved in an output format that
is easily readable by a computer.
In fact, the architecture of an OMR system is dependent on the methods used in the segmentation and recognition steps. Generally, the OMR process can be divided into three principal
modules (see Figure 1.2):
1. Recognition of musical symbols from a music sheet.
2. Reconstruction of the musical information to build a logical description of musical notation.
3. Construction of a musical notation model for its representation as a symbolic description
of the musical sheet.
It is important to note that this is a sequential architecture. In other words, the results obtained
in one phase are the input to the next phase. The rst module is typically further divided into
three stages:
Ê
Image pre-processing: it consists in the application of several techniques (e.g. binarization, noise removal, blurring, deskewing, among others) to make the recognition process
more robust and ecient.
Ë
Sta lines detection and removal to obtain an image containing only the musical symbols.
Ì
Objects segmentation and basic symbol recognition.
The second and third modules (Musical Notation Reconstruction and Final Representation
Construction) are intrinsically intertwined. In the second module of musical notation reconstruction, the symbols primitives are merged to form musical symbols.
Usually, in this step
Chapter 1. Introduction
6
graphical and syntactic rules are used. This is important for the introduction of context information to validate and solve ambiguities from the last module of music symbol recognition. It
is also produced a document where detected symbols are interpreted to assign them a musical
meaning. In the third module of nal representation construction this document is converted
into a format of musical description, such as MusicXML, that allows for storage.
In this thesis I focused in the step of music symbol recognition. The system is composed by
a preliminary stage of segmentation and a stage of classication of musical symbols.
More
specically, the segmentation phase is composed by sta line detection and removal, followed
by the stage of musical symbol extraction. In the classication step four distinct classiers are
used and an elastic matching method see Figure 1.2. This method is applied to all classiers
to simulate the possibility of deformations in the musical symbols. Moreover, it also permits
the enhancing of the test set.
Figure 1.2: OMR architecture.
1.2 Objectives
In order to overcome the limitations from the actual methods in musical symbol recognition
on handwritten musical scores and to organize them hierarchically, I researched the application
of the most recent techniques of machine learning and articial intelligence. I also developed
algorithms for the musical symbol detection and recognition. The musical notation is one of
the languages most widely known. It has had a continuous development over time, with requirements of consistency and precision. For this reason, the proposed methodology should be
1.3. Contributions of this Project and Related Publications
7
naturally adaptable to handwritten scores and to dierent standard musical notations.
In general, the rst step in the process of handwritten musical scores recognition is sta lines
detection. This operation is one of the most important phases in musical symbol recognition,
because it determines the results for subsequent proceedings. With this purpose in mind we
had the following objectives:
Ê
To test thoroughly the proposed algorithm for the sta lines detection using an appropriate database and error metrics, and compare this algorithm with others from the state of
the art.
Ë
To create a new database of ideal musical scores from an existing one, where several controled deformations are applied to the ideal musical scores to simulate feasible problems
of handwritten musical scores.
Ì
To use the proposed algorithm for the sta lines detection as a rst step in some state
of the art sta removal algorithms and conduct a series of experiments using appropriate
error metrics.
The musical symbol extraction is the next step before the classication.
process is based in existing algorithms [DDCF08].
recognize were split into four dierent types:
The segmentation
In this work, the symbols we want to
the symbols that were featured by a vertical
segment, the symbols that link the notes, the remaining symbols connected to sta lines and
the symbols above and under sta lines. Since we are working with handwritten scores many
situations may occur. The score degradation and highly dierent handwritten styles are possible
undesirable situations that may occur. The objectives for this phase were:
¬
Research and development of an algorithm to extract the handwritten musical symbols.
­
Simulate the variability with an elastic matching method in the music symbols, with the
aim of preparing the classier for several dierent situations that may occur in handwritten music symbols.
®
Compare and study dierent classiers for the musical symbols recognition.
1.3 Contributions of this Project and Related Publications
This dissertation presents the following contributions for the preservation and the general access
to musical and cultural heritage:
1. The introduction to the music analysis community of the algorithm of the sta lines
detection based in the Stable Paths approach.
2. The creation of a database of real scores with its references:
detection and removal
references.
3. New algorithms for the automatic detection and classication of musical symbols. In this
way, analysis and search of the score by the user will be more easily.
Chapter 1. Introduction
8
4. Integration of the new algorithms in the web-based system being developed under the
project.
The work related with the Project where the dissertation belongs, already resulted in the
publication of the followings papers:
ˆ
A Connected Path Approach for Sta Detection on a Music Score , Jaime S. Cardoso,
Artur Capela, Ana Rebelo, Carlos Guedes, in the
Image Processing (ICIP 2008).
ˆ
Integrated Recognition System for Music Scores , Artur Capela, Jaime S. Cardoso, Ana
Rebelo, Carlos Guedes, in the
ˆ
IEEE International Conference on
International Computer Music Conference (ICMC 2008).
Sta Line Detection and Removal with Stable Paths , Artur Capela, Ana Rebelo, Jaime
International Conference on Signal
Processing and Multimedia Applications (SIGMAP 2008), pages 263270, 2008.
S. Cardoso, Carlos Guedes, in Proceedings of the
And it waits for the result of the submission of the:
ˆ
Optical Recognition of Music Symbols: a comparative study, Ana Rebelo, Artur Capela
and Jaime S. Cardoso, in
(IJDAR 2008).
ˆ
International Journal of Document Analysis and Recognition
Sta Detection with Stable Paths, Jaime S. Cardoso, Artur Capela, Ana Rebelo, Carlos
Guedes, Joaquim Pinto da Costa, in
Intelligence (TPAMI 2008).
IEEE Transaction on Pattern Analysis and Machine
During the second semester of the academic year 06/07 one more paper was written. The rst
version of the proposed algorithm has already been presented and published:
ˆ
A Shortest Path Approach for Sta Line Detection , Ana Rebelo, Artur Capela, Joaquim
F. Pinto da Costa, Carlos Guedes, Eurico Carrapatoso, Jaime S. Cardoso, in Proceedings
International Conference on Automated Production of Cross Media Content for
Multi-channel Distribution (AxMedis 2007), pages 7985, Nov. 2007.
of the
1.4 Dissertation Structure
This dissertation is organized in 5 chapters that describe the work developed in the last year.
It also has a set of appendices with complementary information that will help the description of
the work done. After this introductory Chapter, the description of the works related with this
project will be made in Chapter 2. It is constituted by three states: a presentation of the state
of the art of the algorithms in the sta lines detection area and the already existing algorithms
in the eld of sta lines removal; a presentation of the state of the art of the algorithms in the
musical symbols classication area; and a background presentation of the several methods used
in the classication phase.
In Chapter 3, the proposed stable paths algorithm to detect the
sta lines is described. The database used, the error metrics applied in the several experiments
1.4. Dissertation Structure
9
done to test the stable paths algorithm, and the results obtained are also presented. In chapter
4, the segmentation and classication phase is presented.
A brief description of the musical
symbols that the algorithm tries to recognize is also described. Finally, conclusions are drawn
and future work is suggested in chapter 5.
Chapter 2
Related Works
The investigation in the OMR eld began with Pruslin [Pru66] and Prerau [Pre70]. However,
it was only in the decade of 1980's, when the equipment of digitalization became accessible,
that work in this area has expanded [Car89, Ng95, Coü96, Bai97].
Over the years, there
have been appearing several commercial OMR software, but no one has a satisfactory performance in terms of precision and robustness. The complexity of the OMR task caused by the
bidimensional structure of the musical notation and also by the existence of several combined
symbols organized around the note heads have been conditioning the progress in this area. Until
now, even the most developed recognition systems (PIANOSCAN, NOTESCAN in Nightingale,
MIDISCAN in Finale, PHOTOSCORE in Sibelius, SMARTSCORE, SHARPEYE, etc) can not
identify all music notations. Besides that, classic OMR is more focused in regular printed music
sheets; so, a good eciency is only obtained when this type of scores are processed.
In this thesis, the recognition of standard handwritten music is the goal. With these manuscripts
emerge new and dierent additional problems comparatively with printed music. Handwritten
musical scores tend to be rather irregular and determined by the authors' own writing style.
Moreover, if we consider that most of these works are old, the quality of paper in which it is
written might have degraded throughout the years, making it a lot harder to correctly identify
its contents. Handwritten musical scores are likely to have changes of size, shape and intensity
of handwritten symbols by the same author into the same score.
and recognition process is even more complicated.
As a result, the detection
Furthermore, problems exist not only in
this level but also in the detection of the handwritten sta lines. These are rarely straight and
horizontal, and are not parallel to each other.
For example, some staves may be tilted one
way or another on the same page or they may be curved. Depending on the level of the paper
degradation there may also exist discontinuities in the sta lines and in the remaining symbols.
In general, these scores have many years of existence, and therefore there is a sharp decay in
the quality of the paper and ink.
Many works are also photocopies from photocopies which
adds noise through the degree of copy.
2.1 State of the Art in Sta Lines Detection
One important stage in OMR is sta line detection. The reason for that is the possibility to
isolate the musical symbols present in the score enabling their extraction. Sta lines detection
10
2.1. State of the Art in Sta Lines Detection
11
is, consequently, one of the fundamental steps in all OMR processes, being the following proceedings very dependently on the performance and results obtained in this initial phase.
In
doing so, any OMR algorithm begins with this operation.
The problem of sta line detection is often considered simultaneously with the goal of their
removal, although exceptions exist [Mat85, MO07, Szw05, Pug06].
The simplest approach
consists in nding local maxima in the horizontal projection of the black pixels of the image
+
[KI90, Fuj04, RCF 93, MN96, BBN01, TSM06]. These local maxima represent line positions,
assuming straight and horizontal lines. Several horizontal projections can be made with different image rotation angles, keeping the image in which the local maxima are bigger. This
eliminates the assumption that the lines are always horizontal.
identifying sta lines is to use vertical scan lines [Car89].
An alternative strategy for
This process is based on a Line
Adjacency Graph (LAG). LAG searches for potential sections of lines: sections that satisfy
criteria related to aspect ratio, connectedness and curvature.
More recent works present a
nearly sophisticated use of a combination of projection techniques in order to improve the basic
approach [Bai97, BBN01, RB05].
Fujinaga [Fuj04] incorporates a set of image processing techniques in the algorithm, including
run-length coding (RLC), connected-component analysis, and projections. After applying the
RLC to nd the thickness of sta lines and the space between the sta lines, any vertical black
run that is more than twice the sta line height is removed from the original. Then, the connected components are scanned in order to eliminate any component whose width is less than
the sta space height. After a global deskewing, taller components, such as slurs and dynamic
wedges are removed.
Other techniques for nding sta lines include the grouping of vertical columns based on
their spacing, thickness and vertical position on the image [RP96], rule-based classication
of thin horizontal line segments [Mah82], and line tracing [Pre70, RT88]. The methods proposed in [MO07, Szw05] operate on a set of
sta segments,
with methods for linking two
segments horizontally and vertically and merging two segments with overlapping position into
one. Dalitz [DDCF08] is an improvement on these methods.
In spite of the variety of methods available, they all suer from some limitations. In particular, lines with some curvature or discontinuities are inadequately resolved. The dash detector [LCL93] is one of a few works that try to handle discontinuities. The dash detector is an
algorithm that searches the image, pixel by pixel, nding black pixel regions that it classies
as stains or dashes. Then, it tries to unite the dashes to construct lines.
A common problem to all the above-mentioned techniques is that they try to build sta lines
from local information, without properly incorporating global information in the detection
process. To our knowledge, none of the proposed methods in the literature tries to dene a
reasonable process from the intrinsic properties of sta lines, namelly the fact that they are the
only extensive black objects on the music score. Usually, we argue that the most interesting
Chapter 2. Related Works
12
techniques arise when one denes the detection process as the result of optimizing some global
function.
In this thesis, we suggest a graph-theoretic framework where sta lines are the
solutions of a global optimization process.
2.2 State of the Art in Musical Symbol Extraction and
Classication
The segmentation task has been the object of study by some authors [Mah82, RT88, Car89,
RB05, TSM06].
This process deserves some careful attention because it interferes with the
classication stage. Major problems result from the diculty to obtain individual meaningful
objects. This is due to the printing and digitalization as well as to the paper degradation and
the lack of a standard notation. In addition there are distortions caused by sta lines, broken
and touching symbols as well as high density of symbols, with dierent sizes and shapes. Besides, few research works have been done in handwritten scores [FLS05].
The most usual approach consists in extracting elementary graphic symbols: note heads, rests,
dots, etc, that can be composed to build music notation. Usually, the primitive segmentation
step is made along with the classication task [Mah82, RT88, Car89, KI90, RB05, TSM06],
however exceptions exist [Fuj04, BBN01, Pug06]. In [Mah82] and [Car89] it is applied the same
technique used in sta lines detection to the detection of musical symbols. Mahoney [Mah82]
builds a set of candidates to one or more symbols types and then use descriptors to select
the matching candidates.
This technique is not feasible for handwritten scores, because the
manuscript symbols do not have stable shapes and sizes.
the LAG to extract symbols.
Carter [Car89], in his turn, uses
The objects resulting from this are classied according to the
bounding box size, to the number and organization of their constituent sections. Once again
this process of classication is not promissing for handwritten symbols, for the reasons presented above. Other authors [Fuj04, BBN01] have also chosen to apply projections to detect
symbols primitives. The recognition is done using features extracted from the projection proles. In [Fuj04] the k-nearest neighbor rule is used in the classication phase, whether neural
networks is the classier used in [BBN01].
+
Randriamahefa [RCF 93] proposed a structural method based in construction of graphs for
each symbol. These are isolated by using a growing region method and thinning. Template
matching is adopted in [Mat85, MN96, Ros02, RB05, TSM06]. In [Ros02, RB05] a fuzzy model
that depends on a robust symbol detection and template matching results was developed. This
method aims to deal with uncertainty, exibility and fuzziness at a symbol level. In fact in this
approach the segmentation process has two steps: individual analysis of musical symbols and
fuzzy model construction. In the rst step, the vertical segments are detected by region growing
method and template matching. Then, beams are detected by a region growing algorithm and
a modied Hough transform. The remaining symbols are extracted again by template matching. From this rst step results three recognition hypotheses; each assigning the pattern to a
possible class. The fuzzy model is used to make a consistent decision. The proposed process
2.3. Background Knowledge
13
incorporates graphical and syntactic rules. Besides, it allows the possibility of applying learning
procedures when potential errors occur, in an eort to gain robustness.
Other techniques for extracting and classifying musical symbols include rule-based systems to
represent all the musical information [RT88], a collection of processing modules that communicate by a common working memory [KI90] and pixel tracking with template matching [TSM06].
Toyama [TSM06] checks for coherency in the primitive symbols detected by estimating touching
positions. This evaluation is done by music writing rules. Coüasnon [CC93, Coü96] proposed
a recognition process entirely controlled by grammar which formalizes the musical knowledge.
In [RP96] the segmentation process involves three stages: line and curves detection by LAG,
accidentals, rests and clefs detection by a character prole method and note heads recognition
by template matching. The contextual recognition is done by graph grammars. In [Pug06] the
segmentation task is based in Hidden Markov Models. This process performs segmentation and
classication simultaneously. However, this technique results only for very simple scores, that
is, scores without slurs or more that one symbol in the same column and sta.
We can conclude, based on this brief description of the state of the art, that although the OMR
systems already contribute to the analysis of musical scores by several methods, it is still far
from solved the problems imposed by handwritten scores.
Therefore, new research and new
algorithms are important and necessary. By all the works already presented, only the fuzzy
model proposed by [RB05] lead us to deduce that maybe it is a good method to take into
account in a future work. It will be enriching for this project to do a comparative study of this
method with elastic matching and also with a segmentation process proposed in this thesis.
2.3 Background Knowledge
In this Section we review some concepts necessary for a better understanding of the work
presented in the course of this thesis.
2.3.1 Graphs
A
graph
is a pair
subsets of
V. G
G = (V, E)
of sets such that
is composed of two sets
the set of edges (or lines)
(p, q), p, q ∈ V
V
and
E ⊆ [V ]2 1
E. V
the elements of
E
are
2-element
is the set of vertices (or nodes), and
E
is
[Die05].
on V . V (G) is the vertex set of a graph G and E(G)
its edge
The number of vertices of a graph G is its order, |G|, and ||G|| is the number
of edges. A vertex v is incident with an edge e if v ∈ e; then e is an edge at v . Endvertices
or ends are two vertices incident with an edge, and an edge joins its ends. If x ∈ X and
A graph with vertex set
V
is called a graph
2
set .
y ∈ Y,
then
xy
is an
set of all the edges in
1
X−Y
E
edge.
at a vertex
E(X, Y )
v
is the set of all
is denoted by
E(v).
X−Y
edges in a set
Two vertices
x, y
E.
The
of a graph
G
[V ]k ≡ the set of all k-element of V .
Note that these conventions are independent of any names, for instance if we have a graph H = (W, F ) the
vertex set w is still referred to as V (H), not as W (H).
2
Chapter 2. Related Works
14
are
adjacent, if xy is an edge of G.
The graph is
weighted
if a weight
the edges are directed, i.e.,
Two edges
e 6= f
w(p, q) is associated to each edge,
(p, q) 6= (q, p).
e
an initial vertex
init(e)
A path is a (non-empty) graph
xi
where the
x1 ,. . .,xk−1
are all distinct.
P = (V, E)
are the
inner
The
init : E → V
and a terminal vertex
V = {x0 , x1 , . . . , xk }
end
vertices of
P.
and it is called a
In other words, a digraph is a pair
sets of vertices and edges together with two maps
every edge
are adjacent if they have an end in common.
and
ter(e).
(V, E)
ter : E → V
digraph
if
of disjoint
assigning to
of the form
E = {x0 x1 , x1 x2 , . . . , xk−1 xk } ,
x0
vertices
The
and
path cost
xk
are linked by
P
and the vertices
is the sum of each arc weight in the
path. The length of a path is the number of its edges, and the path of length
k
is denoted by
connected graph G is a graph where two of its vertices are linked by a path
graph D is oriented with an orientation of a graph G if V (D) = V (G) and
P k . A non-empty
in
G.
A direct
E(D) = E(G),
and if
{init(e), ter(e)} = (x, y)
for every edge
e = xy .
In graph theory, the shortest-path problem seeks the shortest path connecting two nodes; ecient methods are available to solve this problem, such as the dynamic programming algorithms.
Dynamic Programming
Before presenting the operations of dynamic programming techniques let us take a look to the
following example [SDV06]. Figure 2.1 represents a directed acyclic graphs and its linearization
(topological ordering the nodes are arranged on a line so that all edges go from left to right).
This linearization process is important for the shortest path.
4
A
C
1
E
start
3
5
1
2
6
B
D
6
3
E
start
2
B
5
4
A
C
1
D
1
Figure 2.1: A directed acyclic graphs and its linearization.
Suppose we want to gure out distances from node
let's focus on node
C.
E
to the other nodes. For concreteness,
The only way to get to it is through its predecessors,
B
or
A;
so to nd
2.3. Background Knowledge
the shortest path to
C,
15
we need only compare these two routes:
dist(C) = min {dist(B) + 3, dist(A) + 4} .
A similar relation can be written for every node. If we compute these
to-right order of Figure 2.1, we can certainly get to a node
information needed to compute
dist(v).
v,
dist
values in the left-
and then we already have the
We are therefore able to compute all distances in a
single pass:
initialize all dist(·) values to ∞
dist(s) = 0
for each v ∈ V {s}, in linearized order:
dist(v) = min(u,v)∈E {dist(u) + l(u, v)}
In face of it, we can note that this algorithm is solving a collection of subproblems,
it starts with the smallest of the distances,
dist(e),
{dist(u) : u ∈ V }:
since it immediately knows its answer to
be 0; then, it proceeds with progressively larger subproblems distances to vertices that are
further and further along in the linearization where it is thinking of a subproblem as large if
it needs to have solved a lot of other subproblems before it can get to it. This is a very general
technique.
At each node, the algorithm computes some function of the values of the node's
predecessors. In this case, the particular function is a minimum of sums.
This is dynamic programming. This is a powerful algorithmic paradigm for eciently solving a
wide range of search and optimization problems which exhibit the characteristics of
overlapping
subproblems (the problem can be broken down into subproblems which are reused several times)
and optimal substructure (optimal solutions of subproblems can be used to nd the optimal
solutions of the overall problem).
Now, it is crucial to know what the subproblems are when we are solving a problem by dynamic
programming. Each node will represent a subproblem, and each edge will represent a precedence
constraint, of the form
(i − 1, j) → (i, j), (i, j − 1) → (i, j),
and
(i − 1, j − 1) → (i, j),
on the
order in which the subproblems are tackled. We can also put weights on the edges.
2.3.2 Classication Methods
Dierent approaches were evaluated in this work for the classication problem: hidden markov
models, support vector machine, neural networks and k-nearest neighbor. Besides that, concerning the fact that we are trying to classify handwritten musical symbols, we state one main
problem: the enormous variations in symbols. Therefore, the classication process needs to take
into account the variability in the writing style.
An Elastic Matching method [Nis95, JZ97]
was used to expand our database to achieve a better approach for these several variances. In
the next subsections a brief explanation of these methods will be provided.
Chapter 2. Related Works
16
Hidden Markov Models
Hidden Markov Models (HMMs) have rarely been used in OMR except in experiences made
by [KPM96, MMM04, Pug06]. The application of this technique in musical symbol classication had its origins on optical character recognition. One of the reasons for the use of HMM
lies in its capability to perform segmentation and recognition at the same time. Limited by
time, for now, HMM were only used to recognize the symbols.
a doubly stochastic process, that generates symbols sequence, with an underlying
stochastic process that is hidden and can only be detected through another process whose realizations are observable [BS00]. The hidden process consists of a set of states connected to each
A HMM is
other by transition probability. Left-right HMMs were used here. Transitions probabilities from
a state
i
to another state
i, j ≤ N .
j
are given by
A = {aij },
where
aij = P [qt+1 = Sj |qt = Si ] ,
1≤
The observed process consists of a set of outputs or observations. Each observation
is contained in a state with some probability density function. The set of observations probabilities is given by
bj (k)
B = bj (k),
where
bj (k) = P [ot = xk |qt = Sj ], 1 ≤ k ≤ M , j = 1, 2, ..., N .
represents the probability of the observation
in time
where
t
π
and
qt
represents the state in time
t.
xk
in state
Sj , oj
denotes the observation
HMM can now be formulated as
λ = (A; B; π),
is a set of initial probabilities of states [WFJC01].
Support Vector Machines
Support Vector Machines (SVMs), pioneered by Vapnik [Vap98], deal with the problem of
classication as a problem of quadratic optimization. This technique has as main idea the construction of a hyperplane as the decision surface in such a way that the margin of separation
between positive and negative examples is maximized. Support vector machines classify the
data using support vectors [Hay98].
Without loose of generalization, SVMs try to maximize the margin of the optimal hyperplane
which separates the data. This is typically done in a much higher dimension than that of the
original feature space.
Formally, given the training set
and corresponding binary class labels
dened by
g(x) =
bias parameter.
have
di
solving
[wt ϕ(x)
x
wt ϕ(x) + b where
w,b
s.t
1
or to
if
denotes a xed-feature space transformation and
g(x) > 0
−1
if
g(x) < 0.
b
a
This is equivalent to
Summarizing, maximizing the margin is equivalent to
1 t
ww
2
di
xi ∈ RP
the linear separable optimal hyperplane is
+ b] ≥ 1, i = 1, . . . , N .
min
with input data
di ∈ {−1, 1},
ϕ(x)
is assigned to class
{xi , yi }N
i=1
[wt ϕ(x)
(2.1)
+ b] ≥ 1, i = 1, . . . , N
If the training classes are not linearly separable, it means that above conditions and problem
formulation can not be sustained. For this reason, slack variables
ξi , i = 1, . . . , N
were added.
These allowed to have a penalty for the data points wrongly classied. Finally, the objective
2.3. Background Knowledge
17
is to minimize the error. That is,
N
X
1 t
w w+C
ξi
2
min
w,b,C,ξi
i=1
di
s.t
where the parameter
C>0
[wt ϕ(x)
ξi ≥ 0
(2.2)
+ b] ≥ 1 − ξi , i = 1, . . . , N
controls the trade-o between the slack variables and the margin.
In the feature space it is easier to solve the
dual problem, and sometimes it is the only way to
train the support vector machines. It is possible to formulate the dual problem
3
for a sample
N
of training {xi , yi }i=1 not separable as follows:
N
X
max
α
i=1
PN
N
N
1 XX
αi −
αi αj di dj k(xi , xj )
2
i=1 j=1
(2.3)
i=1 αi di = 0
s.t
0 ≤ αi ≤ C
where
k(xi , xj ) = ϕT (xi )ϕ(xj ) =
is the
l
Pm1
i = 1, 2, ..., N
l=0 ϕl (xi )ϕl (xj ),
component in the application
ϕ(xi )
of
i = 1, 2, ..., N
xi ; m1
and
j = 1, 2, ..., N . ϕl (xi )
is the dimension of the feature space.
Above a binary classier was described whereas in this work a multiclass problem is presented.
In fact, several works have been suggested as an extension to the binary problem classication
that originally was proposed [eeFc02, HL02, MA98]. Usually, there are two types of approach for
the multiclass classication problems. One is to build and to combine several binary classiers one-against-one and one-against-all - and another approach is to consider directly a resolution of
the quadratic problem. The one-against-one methodology was used in this work. This method
consists of training the
j th
k th
and
classes in the following binary problem
N
min
wjk ,bjk ,C,ξijk
s.t
with
l
training data
class of
xi ;
X jk
1 jk t jk
(w ) w + C
ξi (wjk )t
2
i=1
yi
[(wjk )t ϕ(x)
ξijk ≥ 0
(x1 , y1 ), . . . , (xl , yl )
the training data
xi
(2.4)
+ b] ≥ 1 − ξijk , i = 1, . . . , N
where
xi ∈ <n , i = 1, . . . , l,
and
yi ∈ {1, . . . , k}
are mapped to an high dimensional space by the funtion
The three most common types of inner-product kernels for SVMs are:
machine, radial-basis function network and tangent hyperbolic.
is the
ϕ.
polynomial learning
In this work a radial-basis
function network was used, given by:
k(x, xi ) = exp(−γ||x − xi ||2 ), γ ≥ 0
3
See Appendix A.1 to see the demonstration.
(2.5)
Chapter 2. Related Works
18
Neural Networks
The term
neural network was initially studied with the aim to represent the information pro-
cessing in biological systems [MP88, WH88, RHW88]. The attempts were to produce intelligent
perception and cognition machines by simulating the physical structure of human brains. In
our days, the principles and algorithms of neural networks have found several applications in
diverse elds including pattern recognition and signal processing.
Neural networks are composed by interconnected neurons. These are the information processing
units. The neural model is form by three basic elements: a set of connecting links (each one is
multiplied by a weight), an adder (for summing the input signals) and an activation function
(for limiting the amplitude of the output of a neuron) [Hay98]. Formally, we can dene the
output of a neuron by
m
X
yk = ϕ(
ωkj xj + bk )
(2.6)
j=1
where
x1 , x2 , . . . , xm
are the input layers,
is the activation function.
ω
are the weights of neuron
k , bk
is the bias and
ϕ(.)
There are three common types of activation functions: threshold
function, piecewise-linear function and sigmoid function. In this work, the last one was used
, whose graph is s-shaped. This way, the function accepts input values between
and return values between 0 and 1. Hence, it is dened by
f (x) =
1
, where a > 0
1 + exp(−ax)
−∞
and
represents the slope
∞,
(2.7)
A multilayer feedforward neural network was used. Typically, this network consists of input
and output layers of neurons and one or more hidden layers that are not part of the input or
output of the network. The input signal propagates through the network in a forward direction.
4
The learning algorithm used to train this network was the backpropagation algorithm . This
algorithm is based on the gradient descending technique to minimize the cost function. In a
simple explanation, the error backpropagation learning consists of two passes through the layers
of the network: a forward pass and a backward pass. In the forward pass the link weights of
the neurons are all xed.
The input vector is applied to the sensory nodes of the networks
and it is produced a set of outputs.
In the backward pass the link weights are all adjusted
in accordance with an error-correction rule.
Basically, the output values of the network are
subtracted from a desired response (targets) to produce an error signal. This error signal is
then propagated backward through the network to all neurons.
A network with
K
outputs
was used, one corresponding to each class, and target values of 1 for the correct class and 0
otherwise.
K-nearest Neighbour
K-nearest neighbour is a supervised learning method for classifying objects based on training
examples in the feature space [Fuk90].
4
See Appendix A.2 for more details.
This algorithm belongs to a set of techniques called
2.3. Background Knowledge
19
Instance-based Learning. The K-nearest neighbour algorithm is very simple and basically it
does not have train. It starts by extending the local region around a data point
k th nearest neighbour is found. The most represented class in the
k -closest
the predicted class. Data training lies only in the estimation of the best
k.
x
until the
samples denes
In this work the
Euclidean distance was used :
deuclidean (a, b) =
Elastic Matching
qP
d
i=1 (ai
− bi )2
Several research methods exist in deformable template eld applied on handwritten digits and
handprinted characters recognition (e.g.
[LS88, Wak94, Nis95, JZ97]).
Lam [LS88], one of
the rst works in this area, proposed a method of recognition in two-stages. The images are
rst recognized by a tree classier.
Those which cannot be satisfactory assigned to a class
are passed to a matching algorithm, which deforms the image to match with a template. In
Nishida [Nis95] a grammar-like model for applying deformations in primitive strokes was developed. A new and robust shape-matching approach to recognize numbers manuscripts was
proposed by Wakahara [Wak94]. The method uses successive local ane transformation (LAT)
operations to gradually deform the image.
The aim is to yield the best match to an input
binary image. LAT on each point at one location is optimized using locations of other points
by means of least-squares data tting using Gaussian window functions.
The deformation and matching technique used in this work to classify the musical symbol is
based in [JZL96, JZ97]. In this approach, the image is mapped on a unit square
The points in this square are mapped by the function
displacement functions is given by
S = [0, 1]×[0, 1].
(x, y) → (x, y) + D(x, y).
The space of
exmn (x, y) = (2 sin(πnx) cos(πmy), 0)
(2.8)
eymn (x, y) = (0, 2 sin(πny) cos(πmx))
(2.9)
Specically, the deformation function is chosen as follows:
M X
N
x ex + ξ y ey
X
ξmn
mn mn
mn
D(x, y) =
λmn
(2.10)
m=1 n=1
where
x , ξ y ), m, n = 1, 2, . . .}
ξ = {(ξmn
mn
orthogonal basis. Because
coecients of
density on
ξmn
2
variance σ .
can represent complex deformations by choosing dierent
and dierent values of M and N, it is important to impose a probability
D(x, y).
pendent along the
D(x, y)
are the projections of the deformation function on the
x
Therefore, the
and
y
ξmn 's
are assumed to be independent of each other, inde-
directions and identically Gaussian distributed with zero mean and
Part II
Detection
21
Chapter 3
Staff Lines Detection and Removal∗
Sta line detection and removal are the rst fundamental stages in many systems of optical
music recognition, with subsequent processes relying heavily on their performance. The reasons
for detecting and removing the sta lines lie on the need to isolate the musical symbols for a
more ecient and correct detection of each symbol present on the score.
For the sta lines
to be properly removed and the symbols correctly detected it is necessary to make a correct
detection of the sta lines. When we are dealing with printed and regular scores the process is
much simpler. However, in the case of handwritten music sheets, which is the case of study of
this dissertation, several vicissitudes, already listed in Chapter 2, may occur. For this reason,
despite the multitude of attempts to treat the problem of sta lines detection, when we are
working with old and handwritten music scores, the results are not completely satisfactory yet.
In this dissertation a new conceptualization to detect the sta lines is presented. The proposed
+
paradigm begins with the work done in [RCC 07]. The main idea is to consider the sta lines
as the result of the shortest path between the two margins of the music sheet, giving preference
to black pixels. The initial version only implemented a heuristic of this basic idea, without a
principle properly reasoned and tested. The proposed new learning methodology is extended
and explored in various directions in this thesis. Firstly, the concept of stable paths is introduced in order to improve the computational performance of the method. Secondly, the design
of the weights on the graph resulting from the music score is generalized to dierentiate black
pixels belonging to the sta lines from black pixels resulting from the music symbols. Finally,
the post processing is rened, improving the overall performance. A further development is the
study of new sta removal algorithms, by incorporating the proposed sta line detection on
standard sta removal algorithms.
It is important to state that despite this Project being in the eld of recognition of standard
musical notation it is not necessary, for the sta line detection and removal phase, to be conned only to this type of notation. Hence, a database was built. This database contains not
only several symbols with dissimilar shapes and intensities, but also a varying number of lines
per sta.
∗
Some portions of this chapter appears in [CCR+ 08, CCRG08b].
23
Chapter 3. Sta Lines Detection and Removal
24
3.1 Basic Concepts
First of all, it is necessary to indicate that the paradigm described considers that a sta line
can be seen as a connected path from the left side of the music score to the right side. As sta
lines are almost the only extensive black objects on the music score, the path we are looking
for is the shortest path between the two margins
if paths (almost) entirely through black pixels
are favoured.
Consequently, the image grid the music sheet is considered as a graph with pixels as nodes
(vertices) and edges connecting neighbouring pixels. The weight
w(p, q) of each arc is a function
v1
of the pixels values and their relative positions. A path from vertex (pixel)
vn
v1 , v2 , . . . , vn ,
is a list of unique vertices
with
vi
and
vi+1
to vertex (pixel)
corresponding to neighbour pixels.
The total cost of a path is the sum of each arc weight in the path and is given by
n
X
w(vi−1 , vi ).
(3.1)
i=2
A path from a source vertex
minimum among all
u
on a graph,
v -to-u
d(v, u),
v
to a target vertex
is the total cost of a shortest path between
v
to a sub-graph
if its total cost is minimum among all
to a sub-graph
path if its total cost is
v
paths. The distance between a source vertex
A path from a source vertex
v
u is said to be a shortest
Ω, d(v, Ω),
v -to-u
Ω
v
and
and a target vertex
u.
is said to be a shortest path between
paths, where
u ∈ Ω.
v
and
Ω
The distance from a node
is the total cost of a shortest path between
v
and
Ω
and it is
represented by:
d(v, Ω) = min d(v, u).
(3.2)
u∈Ω
A path from a sub-graph
Ω2
to a sub-graph
if its total cost is minimum among all
from a sub-graph
Ω1
Ω1
and
Ω2 ,
Ω1
to a sub-graph
Ω2
is said to be a shortest path between
v -to-u paths,
Ω2 , d(Ω1 , Ω2 ),
where
u ∈ Ω1
and
v ∈ Ω2 .
Ω1
and
The distance
is the total cost of a shortest path between
and is given by:
d(Ω1 , Ω2 ) =
min
v∈Ω1 ,u∈Ω2
d(v, u).
(3.3)
3.2 Underlying Principle
As mentioned before, sta lines can be considered as the only extensive objects made from
black pixels in the music score, connected paths of black pixels from the left side to the right
side of the music score.
Assuming that paths through black pixels are preferred over paths
through white pixels, sta lines can then be found among the shortest paths from the left to
the right margin of the music score. Sta lines can be modelled as paths between two regions
Ω1
and
Ω2 ;
these represent, respectively, the left and right margins of the score. Besides that,
one may assume that sta lines do not zigzag back and forth, left and right. Therefore, one
may restrict the search among connected paths containing one, and only one, pixel in each
3.3. Algorithm Outline
25
2
column of the image . Formally, let I be an
N1 × N2
1
, s.t.
s = {(x, y(x))}Nx=1
where
y
is a mapping
image and dene an admissible sta to be
∀x |y(x) − y(x − 1)| ≤ 1,
y : [1, · · · , N1 ] → [1, · · · , N2 ].
That is, a sta line is an 8-connected path
of pixels in the image from left to right, containing one, and only one, pixel in each column of
the image.
Given the weight function
w(p, q),
the cost of a sta can be dened as
C(s) =
N1
X
w(vi−1 , vi )
(3.4)
i=2
The optimal sta line that minimizes this cost can be found using dynamic programming. The
rst step is to traverse the image from the second column to the last column and compute the
cumulative minimum cost C for all possible connected sta lines for each entry
C(i, j) = min
where
(l, m).
w(pi,j ; pl,m )
(i, j):



C(i − 1, j − 1) + w(pi−1,j−1 ; pi,j )


C(i − 1, j) + w(pi−1,j ; pi,j ) ,



 C(i − 1, j + 1) + w(p
i−1,j+1 ; pi,j )
represents the weight of the edge incident with pixels at positions
(i, j)
and
At the end of this process,
min
j∈{1,··· ,N2 }
C(N1 , j)
indicates the end of the minimal connected sta. Hence, in the second step, one backtrack from
this minimum entry on
C
to nd the path of the optimal sta.
3.3 Algorithm Outline
Assume one wants to nd all sta lines present in a score. This can be approached by successively nding and erasing the shortest path from the left to the right margin of the score. The
3
erase operation is crucial to ensure that a sta is not detected more than once .
Consider the music score presented in Figure 3.1(a); in Figure 3.1(b) the rst eleven shortest
paths are traced. From this example it is possible to conclude that music symbols placed on
top of sta lines do not interfere with the detection of the sta lines. Moreover, the example
also makes clear that slightly skewed scores do not pose any problem to the proposed approach.
Nonetheless, two main issues need to be properly addressed.
On the one hand, a criterion
is needed to stop the iterative detection of the shortest paths, that is, the sta lines. On the
2
These assumptions, 8-connectivity and one pixel per column, impose a maximum detectable 45 rotation
degree. However, a higher rotation degree, in a real case, is unlikely to happen, but may be previously corrected.
3
We implemented the erase operation by setting to white the pixels on the detected sta; image resizing
could be a valid alternative [AS07].
Chapter 3. Sta Lines Detection and Removal
26
(a) Skewed sta lines with music symbols.
(b) The rst 11 shortest paths between left and
right margins.
Figure 3.1: An exemplicative example of the methodology.
other hand, the initial and the nal parts of a path should also be trimmed, since these (almost)
completely white pixels do not belong to the sta line, despite belonging to the shortest path
of an opposite margin. Another subject deserving attention is the computational complexity.
The sequential computation of the shortest path may be prohibitive for some applications. It
would be interesting to be able to compute several 'shortest paths' simultaneously. Thus, before
presenting the complete algorithm, we introduce the concept of a stable path in a graph, which
will allow computing multiple sta lines in a single iteration, instead of sequentially computing
them one at a time.
3.4 Stable Paths on a Graph
Before presenting the formal denition of a stable path on a graph, it is necessary to consider
a hypothetical example of a simplied music score to motivate this concept.
Therefore, lets
consider Figure 3.2 where a sta with only four sta lines (rows 1, 4, 6, and 8) in a
8 × 9 image
is presented. The sta lines are the black elements, as in the real case. The presence of noise
on some of the sta lines is simulated by discontinuities.
The design of the weight function
will be considered in the Section 3.5; for now, it suces to know that the presented graph was
constructed to favour paths through black pixels.
In fact, we can see in Figure 3.2 that the shortest path between the left and right margins between the sub-graphs
Ω1
and
Ω2
is the path corresponding to the rst row, entirely through
black pixels. By following the strategy just delineated, one could nd the four sta lines in four
iterations, sequentially. Nonetheless, although only one sta line corresponds to the shortest
path, they all constitute a sort of (almost) optimal paths. The stable paths concept provides a
means to nd all of such paths simultaneously.
Denition 1. A path Ps,t is a stable path between regions Ω1 and Ω2 if Ps,t is the shortest
path between s ∈ Ω1 and the whole region Ω2 , and Ps,t is the shortest path between t ∈ Ω2 and
the whole region Ω1 .
The naming of stable paths has its roots in dynamical systems, as it resembles stable xed
points. If one considers the function
FΩ1 →Ω2 (),
mapping a node
s ∈ Ω1
to a node
t ∈ Ω2
by
3.4. Stable Paths on a Graph
27
Figure 3.2: Stable paths on a toy example.
nding the shortest path
Ps,t
of such shortest path, then
between
s ∈ Ω1
and
Ω2 ,
with
t = FΩ1 →Ω2 (s)
as the end node
GΩ1 →Ω1 (s) = FΩ2 →Ω1 (FΩ1 →Ω2 (s)) = s
if and only if
Ps,t
is a stable path. Note that the concept of stable path is valid for any graph
and any two sub-graphs in general. The computation of the stable paths on the toy example
of Figure 3.2 provides the three paths yellow-highlighted in the gure.
As a second example, in Figure 3.3(a) the shortest paths between
and the
whole
each point on the left margin
right margin are traced for the score in Figure 3.1(a). As seen, the paths got
attracted by the staines. Likewise, Figure 3.3(b) shows the shortest paths between
on the right margin and the
whole left margin.
each point
The set of stable paths between both margins
result as the set of paths present in both gures.
With the concept of Stable Paths, the computation of all the stable paths in the graph derived
from an image has only roughly twice the complexity of the shortest path computation. Noticing that the procedure delineated in Section 3.3 actually gives the shortest path between the
whole left margin
Ω1
and each point on the right margin
Ω2 .
The rst step on the computation
of the stable paths corresponds verbatim to the computation of the shortest path presented on
Section 3.3. In a second step one repeats the same procedure, traversing now the graph from
the right column to the left. At the end of this process, if the two endpoints of a direct and
reverse path coincide, we are in the presence of a stable path.
Chapter 3. Sta Lines Detection and Removal
28
(a) Shortest paths from each pixel in the left column
and the whole right column, superimposed on the original image.
(b) Shortest paths from each pixel in the right column
and the whole left column, superimposed on the original image.
Figure 3.3: Exemplication of stable paths for Figure 3.1(a).
On the toy example of Figure 3.2, the computation of the stable paths provides only three
out of the four sta lines, with the last stable path following partially through a segment of
4
the third sta line and a segment of the fourth sta line . As a consequence, it appears that
the computation of the stable paths does not guarantee the discovery of all paths of interest.
Applying, in a second time, the stable paths procedure, after erasing the paths found on the
rst iteration, we get a new set of paths, including a path joining the remaining segments of the
3rd and 4th sta lines. A search based on stable paths in preference to a sequential search by
the shortest path is related with the number of stable paths found simultaneously. For instance,
while a score with
60
sta lines would require
requires a few (typically between
4
to
6)
60
iterations of the shortest path algorithm, it
iterations with the stable paths method. Next, the
complete proposed algorithm for sta line detection is detailed.
3.5 Proposed Algorithm
The proposed algorithm can be implemented as a sequence of a few high-level operations see
Listing 1 that will be described in this Section.
Preprocessing
With the objective of detecting the sta lines, the proposed algorithm starts
by estimating the sta space height,
staspaceheight,
and sta line height,
staineheight.
These lengths will both be used as reference measures for the subsequent operations.
There are already, in common use, robust estimators for its calculation.
Usually, the
most used technique starts by computing the vertical run-lengths representation of the
image.
After doing that to a bit-mapped page of music, the most common black-runs
represents the sta line height and the most common white-runs represents the sta space
height [Fuj04]. After estimating the reference lengths, the edges' weights are determined
as explained in Section 3.6.
4
Note that the sequential search of the shortest path would suer from the same limitations.
3.5. Proposed Algorithm
29
BEGIN PreProcessing
compute staffspaceheight and stafflineheight
compute weights of the graph
END PreProcessing
CYCLE
compute stable paths
validate paths with blackness and shape
remove valid paths from image
add valid paths to list of stafflines
END OF CYCLE if no valid path is found
BEGIN PostProcessing
uncross stafflines
organize stafflines in staves
smooth and trim stafflines
END PostProcessing
Listing 1: Main operations of the proposed method.
Main Cycle
The preprocessing is followed by the main cycle of the methodology that incorpo-
rates successively nding the stable paths between the left and right margins. Moreover,
it is in this step that the paths found are added to the list of sta lines and then erased
from the image. The erase operation sets to white the pixels on a vertical strip of height
empirically xed at
staspaceheight, centred on the detected sta line.
As explained previ-
ously, the erase operation is necessary to ensure that a line is not detected multiple times,
even if its height is higher than one pixel.
Stopping Rule
To stop the iterative sta line search, a sequence of (arguably) sensible rules
is used to validate the stable paths found; if none of them passes the checking, the
iterative search is stopped.
Two validation rules were applied, both assessing features
with respect to the median values obtained during the rst iteration. If a path does not
have a percentage of black pixels above a xed threshold, it is discarded. The median
percentage of blackness of all lines found in the rst iteration of the main cycle provides
the necessary reference: a threshold of
80%
of the median value was empirically selected.
Likewise, if the shape of the detected path diers too much from the shape of the line with
median blackness, it is discarded. A dissimilarity measured as the average
between both paths, after removing the means above
y − distance
shapedi = 4×staspaceheight
was
selected as threshold.
PostProcessing
After the main search step, valid detected sta lines are post-processed.
Although true sta lines never intersect, the above algorithm may occasionally create
intersecting lines, detected on dierent iterations. That may be due to a local low quality
of a line, leading a stable path to jump between consecutive lines. In doing so, this local
discontinuities can be the reason for possible zigzag back and forth between consecutive
lines, because the detected path is likely to follow and connect the remaining segments,
and consequently intersect with the previous detected path. To preclude such nal, undesired state, lines are post-processed to remove intersections. That is easily and eciently
accomplished by, for each image column, sorting on
y
the pixels of the detected lines and
Chapter 3. Sta Lines Detection and Removal
30
assigning the
i-pixel
to the
i-line.
After this simple process, lines may touch but they do
not intersect.
After removing the intersections it is now possible to eliminate spurious sta lines and
to cluster them in staves. Since the lines are ordered, these operations require only iterating through the list of lines and starting a new sta whenever the distance between
two consecutive lines is above a xed threshold.
old was
= 2× staspaceheight.
The value considered for this thresh-
Subsequently, spurious staves are eliminated by simply
discarding those with only single sta line. However more robust rules can be created, because staves with only one sta line exist (e.g. percussion), although they are uncommon.
Finally, each retained line is trimmed at the beginning and at the ending and smoothed.
As visible in the example of Figure 3.1(b), before meeting with a sta line, a path travels
through a sequence of white pixels. Likewise, after the end of the sta line, the path goes
again through a sequence of white pixels until it meets the right margin of the image. In
order to ignore all of these white pixels (undesirable segments), the trimming operation
works per sta. Thus, for each sta, a sequence of median colours is computed as follows:
for each column, the median of the colours of the lines is added to the sequence. Next,
the trimming points are found on this sequence:
starting on the centre, we traverse
the sequence to the left and to the right until a run of
whiterun = 2×staspaceheight
white pixels is found (value obtained experimentally). The pixels between the left and
right runs are kept in the sta lines.
At the end, lines are smoothed with a standard
average low-pass lter. Considering a sta line as a sequence
dimensional averaging lter is applied. A window size of
y(x)
of y-positions, a one-
2×staspaceheight
was selected
on the experiments.
3.6 Design of the Weight Function
An immediate approach is to support the design of the weight function solely on the values
of the incident nodes, that is, if any of the corresponding pixels are black then a low cost is
assigned to the edge, otherwise the edge assumes a high cost.
We call this the
baseWeight
in Listing 2. Nevertheless, the weight function can be generalized taking into account other
factors. It is very useful to consider the prior knowledge about a music score when we want
to nd the sta lines.
In order to incorporate this idea in the shortest path process, we
consider dierent alternatives to modify the weight function of the graph.
In doing so, this
weight function will be modied by a term that codies that information about the music sheet.
Furthermore, two more attributes were considered in each pixel with the aim to inuence the
main contribution to the edge's weight resulting from the values of the incident pixels. The
prime intention of these additional features is to discriminate black pixels in the sta lines
from black pixels in the music symbols, penalizing the latter and favouring the former. With
this in mind, if a black pixel is part of a short vertical run of black pixels, then it is more
3.7. Sta Line Removal
31
likely to be part of a sta line rather than of a symbol.
edges is included in the weight function.
have another sta line at roughly
Therefore, a term beneting such
The other attribute is that a sta line is likely to
staspaceheight
pixels, assuming that staves have at least
two lines. Hence, if the nearest vertical run of black pixels on the same column is excessively
far from the vertical run of black pixels containing the current black pixel, then this pixel is
more likely to belong to a symbol (for instance, it may belong to a ligature) rather than to
sta line. Consequently, a penalising term is incorporated in the weight function for these cases.
The pseudo-code for the weight function is provided in Listing 2.
WeightFunction(pixelValue1, pixelValue2, vRun1,
vRun2, nearestVRun1, nearestVRun2,
NeighbourhoodType)
{
value = min(pixelValue1, pixelValue2);
weight = baseWeight(value, NeighbourhoodType);
if( (vRun1<=STAFFLINEHEIGHT)
OR(vRun2<=STAFFLINEHEIGHT))
weight = weight - delta;
if( (nearestVRun1>=STAFFSPACEHEIGHT+STAFFLINEHEIGHT)
OR(nearestVRun2>=STAFFSPACEHEIGHT+STAFFLINEHEIGHT))
weight = weight + delta;
return weight;
}
Listing 2: Pseudo-code for the weight Function. The base weight was set to
8 on white
delta penalizing
and
pixels for 4-neighbourhoods and to
6
term in the weight function was set to
and
1.
12
4
on black pixels
on for 8-neighbourhoods.
The
For eciency, weights were designed
with integer values.
3.7 Sta Line Removal
The sta lines removal algorithms used in this work were adopted from [DDCF08]. Line Track
Height, Line Track Chord, Roach/Tatem and Skeleton were the algorithms chosen for the
experiments. A brief description of these algorithms will be provided now:
Line Track Height
The algorithm tracks the sta lines and checks when a vertical black run
is longer than a threshold (experimentally set at
Line Track Height Modied
2×staineheight).
The version modied of the Line Track Height algorithm also
track the sta lines positions obtained by a detection algorithm and removes vertical run
sequences of black pixels that have a value lower than a specied threshold (chosen experimentally as
2×staineheight).
In this version, a carefully attention to the deformations
sta lines may have discontinuities, be curved or inclined that may occur in the music
scores are given. These problems will inuence the success to achieve a correct detection
of lines contained on the score. The positions of the sta lines obtained by a sta line
detection algorithm may pass slightly above or under the real sta lines positions. Therefore, if we are in presence of a white pixel when the sta lines are tracked, we search
Chapter 3. Sta Lines Detection and Removal
32
vertically for the closest black pixel. If that distance is lower than a specied tolerance
experimentally chosen as 1+ceil(
staineheight/3.0) we move the reference position of
5
the sta line to the position of the black pixel found .
Line Track Chord
This algorithm computes, for a xed angle resolution of three degrees,
the chord length through the skeleton point (see Figure 3.4). This results in a function
chordlenght(ϕ),
where
ϕ
is the chord angle, for each skeleton point. When the sta line
pixel also belongs to a crossing music symbol, the function should have a second distinct
peak. To detect this peak the following thresholds are used:
1. There is a local maximum when chordlength is greater than
angle below
than
30
5×staineheight
at an
degrees and another local maximum when chordlength is greater
1.75×staineheight× sin(ϕ)
at an angle
ϕ > 30
degrees.
2. The valley between two maxima must have a depth greater than
1.5×staineheight.
Concluding, this algorithm removes the sta line through the angles peaks of the chord
lengths. There are two distinct peaks depending if the pixels belong to a sta line or a
music symbol.
Figure 3.4: Length of a chord through a skeleton point at some angle
Roach/Tatem
ϕ.
This algorithm uses a labelling scheme based on the angle information and
pixel adjacency to identify the sta line pixels [RT88]. The chord length and the angle
function described in the Line Track Chord are computed for every pixel. Consequently,
the original image can be transformed into two-dimensional vector eld by picking the
angle and length of the longest chord for each black pixel. This will assign pixels on sta
lines a high length value and an angle value of zero. To avoid the removal of symbol pixels
on the sta lines, some horizontal line pixels are iteratively relabelled as non-horizontal
pixels, depending on the labels of their neighboring pixels.
Skeleton
This method consists of the following steps:
1. The skeleton is split at branching point and corner points with an angle below
135 degrees.
Around each spliting point a number of pixels are removed see
Figure 3.5(a).
5
See Appendix A.6 for more details.
3.8. Database of Music Scores
33
2. Sta line segment candidates are picked as skeleton segments if the orientation angle (least square tted line) is below 25 degrees, the segment is wider than tall
straightness
staineheigth2 /2.
and the
(mean square deviation from least square tted line) is bellow
3. Apply a sta-nding algorithm to the sta segment candidates. Two sta segments
are horizontally linked when their extrapolations from the end points with the least
square tted angle come closer than
staineheigth/2.
4. Remove false positives: from each ovelapping sta segment group on the same line
the one that is closest to its least square tted neighborhood is picked and the others are discarded; non-sta segments that have the same branching point as a sta
segment are extrapollated by a parametric parabola:
if this parabola is approxi-
mately tangential to the sta segment, the latter is considered a false positive see
Figure 3.5(b).
5. Remove sta lines:
all vertical black runs around the detected sta skeleton are
removed.
(a) Pixels within the distance transform
radius around each splitting point are removed.
Figure 3.5:
(b) A falsely detected sta segment that can
be identied as belonging to a music symbol
because it is appoximately tangential to an extrapolated parabola from a non-sta segment.
Example (from [DDCF08]).
3.8 Database of Music Scores
For the purpose of testing the proposed algorithm, a database with a sucient number of scores
to produce signicant results on which we can draw conclusions and test improvements, was
created. Although the assessment of a new sta detection algorithm may be done by visually
+
inspecting the output on a set of scores as adopted on [RCC 07] , here we support the
comparison with two test sets adopted for the qualitative evaluation of the proposed method:
the test set of synthetic scores from [DDCF08] and a new set of real score.
The synthetic
database consists of 32 ideal scores see Figure 3.6(a) with dierent musical notation to
which several known deformations were applied.
In total 2688 images were generated.
The
simulation of distortions in a controlled manner made possible to build many feasible problems
that can be found in the handwritten music scores. Despite this we also test the performance of
Chapter 3. Sta Lines Detection and Removal
34
the stable path algorithm with a test set of real scores in total 50 images from the Portuguese
composer Fernando Lapa with other old handwritten scores (see Figure 3.6(b)).
(a) Music score from the test set of 32 ideal scores.
(b) Music score from the test set of 50 real scores.
Figure 3.6: Two examples of music scores from the test set used on the experimental evaluation.
3.8.1 Deformations in the Synthetic Scores6
7
The Gamera framework
(Generalized Algorithms and Methods for Enhancement and Restora-
tion of Archives) was the base platform used to create the deformation database with intention
to development and test the proposed algorithm. In a brief description, this framework is a
toolkit for building document image recognition systems. It has a programming library with a
set of graphics tools for experimentation and training. Gamera provides a wealth of tools for
image processing. In this work, in order to complete the Gamera framework, for the concrete
problem of building a syntactic database, the MusicStaves plug-in was included. This toolkit
appears to be ideal for the problem, because
¶
It is a specic platform for the development and test of the detection and removal sta
lines algorithms.
·
It is an open source toolkit.
¸
It allows using handwritten and printed scores with the algorithms already implemented
in Gamera.
The distortions adopted from [DDCF08] applied to the test set of ideal scores can be classied
in two categories:
1. Deterministic deformations, which depend of certain parameters, as for instance the
tation.
ro-
2. Random defects, which use various parameters about the deformation and a pseudorandom number generator.
In both cases it is necessary to apply the deformation in parallel with the original score and
the ground-truth sta image.
Despite some deformations being easy to understand by the name, there are others where this
is not true (the defects
6
7
resolution, rotation
and
line interruption
are self explanatory). A brief
Consult [DDCF08] for more details.
See http://ldp.library.jhu.edu/vhost-base/gamera and [Cap08] for more details.
3.8. Database of Music Scores
35
Deformation
Type
Parameter description
Rotation
Deterministic
Rotation angle
Curvature
Deterministic
Height ratio: width of sine curve
Typeset Emulation
Both
Range width, maximal height
and variance of vertical shift
Sta Line Interruptions
Random
Interruption frequency, maximal width
and variance of range width
Sta Line Thickness Variations
Random
Markov chain stacionary
distribution and inertia factor
Sta Line y-variation
Random
Markov chain stacionary
distribution and inertia factor
Degradation After Kanungo
Random
(η, α0 , α, β0 , β, κ),
White speckles
Random
Speckles frequency, randow walk
+
see [KHB 00]
length and smoothing factor
Table 3.1:
Deformations in the images.
explanation of these distortions will be done. In Table 3.1 the deformations considered, which
are available in the MusicStaves toolkit, are listed. In Figure 3.7 it is possible to see the eects
caused in the images.
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
(i)
Figure 3.7: Some examples of applied deformations from the original image: a) Original; b)
Curvature c) Degradation after Kanungo; d) Sta line thickness variation; e) Sta line yvariation; f ) Typeset emulation; g) Rotation; h) White Speckles; i) Sta line Interruptions.
Curvature
The curvature is obtained through a half sine wave over the entire score area. The
intensity of the resulting curvature can be measure as the ratio of amplitude (height) and
the width of the wave.
Chapter 3. Sta Lines Detection and Removal
36
Typeset emulation
This deformation tries to reproduce the 16th-century prints, which are
similar, in some ways, to typewriters, causing, therefore, interruptions in the lines between
the symbols and a random vertical shift in each portion containing a symbol.
Sta line thickness variation and sta line y-variation
sta line
y -variation
Sta line thickness variation and
are obtained by a Markov chain describing the evolution of the sta
line thickness from left to right. These deformations are achieved by that process because,
usually, the thickness at a particular
x-position.
depends on the thickness at the previous
The parameter is the transition probability matrix
of transition from thickness or
or
x-position
y -deviation
can be of
n
y -deviation i
to thickness or
P
with
pij :=
y -deviation j .
probability
The thickness
dierent values (states). To the stationary distribution of the
individual states a symmetric binomial distribution is assumed, that is:
πi =
The mean value
(n − 1)/2
n−1
i−1
1
2n−1
of this distribution is associated with the original value in the
staine_height
image without the deformation (
tion from the original
!
y -position).
Hastings algorithm [Has70].
for the thickness or zero for the devia-
The Markov chain is generated with the Metropolis-
The transition probability matrix
Q,
to obtain candidate
transition points, is chosen to be:
where the probability
tions: the closer
c
c


 c for j = i
qij =
1 − c/2 for j = i ± 1


0 otherwise
can be considered as an inertia factor that allows smooth transi-
is to one, the slower is the state variation.
Degradation after Kanungo
This deformation tries to imitate the local distortions caused
during printing and scanning.
The model has six parameters
(η, α0 , α, β0 , β, κ)
with
dierent meanings:
ˆ
Each black pixel in the original image is ipped with probability
where
ˆ
d
2
α0 exp−αd +η ,
is the distance to the closest background pixel.
Each background pixel is ipped with probability
2
β0 exp−βd +η ,
where
d
is the
distance to the closest foreground pixel.
ˆ κ
is the diameter of a disk of a morphological closing operation.
White speckles
This degradation model has three parameters
(p, n, k)
with the following
meaning:
ˆ
Each black pixel in the original image is taken with probability
for a random walk of length
ˆ k
p as a starting point
n.
is a rectangle of a morphological closing operation that will smooth an image
containing the random walk.
3.8. Database of Music Scores
37
Deformation
Parameter range
Rotation
angle = −5 : 2.5 : 5
Curvature
Typeset Emulation
Sta Line Interruptions
Sta Line Thickness Variation
Sta Line
y -variation
Degradation After Kanungo
White Speckles
Table 3.2:
ˆ
= 0.02 : 0.02 : 0.10
n_gap = 1 : 3 : 13, n_shif t = 1 : 3 : 13, p_gap = 0.5
frequency α = 0.01 : 0.02 : 0.10,
binomial parameter for width: n = 6, p = 0.5
inertia c = 0.8, maximum deviation = 2 : 1 : 6
inertia c = 0.8, maximum deviation = 2 : 1 : 6
η = 0, α0 = 0.5, α = 0.25 : 0.3 : 1.5,
β0 = 0.5, β = 0.25 : 0.3 : 1.5, κ = 2
smoothing factor k = 2,
random walk length n = 10,
speckle frequency p = 0.03 : 0.02 : 0.11
amplitude/stawidth
Ranges of deformation parameters used in the tests:
min:step:max.
The image with the random walks is subtracted from the original: an image with
white speckles at the random walk positions is obtained.
Therefore,
and
k
p
can be interpreted as the speckle frequency,
n
as a measure for the speckle
as a smoothing factor.
The range of deformations parameters in our test set is given in Table 3.2. The values were
restricted with the aim of obtaining realistic deformations.
Even so, when real scores are
treated, generally, the deformations do not occur in its pure form; a combination from several
deformations is more prone to happen.
However, because the evaluation of the detection
algorithms is our objective it is more adequate to research the performance in each isolated
deformation.
3.8.2 Ground-truth Information
In order to establish a qualitative comparison between the various sta lines detection and
removal algorithms a ground-truth information is necessary.
Hence the black pixels of the
images need to be labelled as being sta lines or otherwise.
The manual process has two
disadvantages: it is very time consuming and it is prone to occur errors (e.g.
the existence
of dubious pixels that may belong to the line or to the symbol that crosses the sta line).
Therefore, ground-truth information of the syntactic images released by the authors of the
MusicStaves toolkit was used.
In Figure 3.8 the deformation process and the ground-truth
generation are displayed. In the code of the postscript images (scores) the macros of drawing
lines were removed and the resulting images were converted to one-bit raster images. In this
manner, real musical scores without lines can be obtained. The distortions are applied in the
bitmap images.
When we are dealing with handwritten music sheets we can only achieve the ground-truth
information manually, that is, we need to delete the symbols present in the score in order to
Chapter 3. Sta Lines Detection and Removal
38
Figure 3.8: Generation of the deformed images and the respective Ground-Truth.
retain just the sta lines segments.
In order to evaluate the performance of the sta lines detection algorithm the scores with
only lines will be used as references, while the sta-less images will be used to test the sta
lines removal algorithms. Note that it is possible to directly store the same references for the
deformations synthetics images through skeleton generated in each distortion. However, once
again, in the handwritten music scores this process must be manual.
3.9 Evaluation Metrics and Results
This section presents the error metrics used to evaluate the performance of the proposed algorithm for the detection and removal of sta lines in comparison with the shortest path al-
8
gorithm
9
and the algorithms present in [DDCF08]. However, since the Dalitz's algorithm
has
a signicantly better performance than the other two proposed in [DDCF08], the comparison
results presented here will be only for the Daltiz's algorithm.
3.9.1 Detection Error Metrics
With the purpose of evaluating the performance of the stable path algorithm and to study their
behaviour with respect to the several deformations applied to the images, two dierent error
metrics were considered: the percentage of false positive sta lines and the percentage of sta
lines missed to detect.
The process starts by computing the average Euclidian distance between each reference sta
line given by the ground-truth information; see Subsection 3.8.2 and each sta line actually
detected. Next, the matching problem on the resulting bipartite graph is solved by minimizing
10 .
the assignment cost, that is, the distance
Only pairs with average error-distance below
staineheight were assumed correctly matched.
The remaining pairs were assumed to originate
from a false positive sta line being matched to an undetected true sta line and were therefore
unmatched. In the end, the two metrics result as the number of unmatched detected sta lines,
8
Description of the algorithm in [CCRG08b].
See Appendix A.3 for a brief description.
10
See Appendix A.4 for a brief description of the method.
9
3.9. Evaluation Metrics and Results
39
that is, false positive, and unmatched reference sta lines, that is, false negative. It should be
noted that these metrics only measure whether sta lines are found, not how good the match
is.
Results
From the results presented in Table 3.3 we can conclude that the stable path based approach
11 ,
(and the shortest path approach) outperforms the Dalitz algorithm
for both error metrics
dened. In some deformations a visible/considerable dierence between stable path and Dalitz
algorithms does not exist.
However, there are not cases where the proposed algorithm is in
disadvantage with respect to Dalitz algorithm.
Besides that, the performance of the stable path approach is almost independent of intensity of
the deformation, for the range of values considered. This performance gain is even more noteworthy as the Dalitz algorithm is receiving as input the correct number of lines per sta. Had
this not been the case, the dierence between both would have been much larger. The current
implementation of the stable path algorithm run as fast as the Dalitz algorithm (and about ve
times faster than the shortest path version), as available at the Sta Removal Toolkit [DDCF07].
By generalizing the design of the weight function and enhancing the postprocessing, we were
able to improve the detection performance on the initial results in [CCRG08b]; the use of the
stable paths allowed to improve the detection speed.
Since all the parameters of the method are scaled by one of these two values estimates
staine-
height and staspaceheight, a simple analysis of the sensitivity of the approach to these parameters was done. We re-run the experiments with values for staineheight and staspaceheight
one pixel above and below the true estimated values. In both cases, the stable path method
presented a good performance and continued to yield the best results (the strongest degradation
occurred for the rotation degradation, with angle=5°, where both error rates increased to 1.7%).
In a second similar experiment we evaluated the sta line detection methods on a set of 50
real music scores, for which reference sta lines were manually outlined as referenced in Subsection 3.8.2.
12 ,
Images were previously binarized with the Otsu threshold algorithm
plemented in the Gamera
13
project .
as im-
The evaluation of both detection algorithms yielded the
results presented in Table 3.4. Again, and although the correct number of sta lines per sta
was inputted into the Dalitz' algorithm, the stable path approach obtained the best performance.
11
For the deformations not shown, the stable path is not signicantly better than Dalitz.
See Appendix A.5 for a brief description of the method.
13
http://gamera.sourceforge.net
12
0.04
0.8 (3.6); 0.8 (3.6)
0.8 (3.6); 0.8 (3.6)
53.8 (44.5); 55.7 (45.2)
4
0.6 (3.5); 0.6 (3.5)
0.6 (3.5); 0.6 (3.5)
21.4 (27.4); 15.4 (17.3)
4
0.7 (3.8); 0.7 (3.8)
0.7 (3.8); 0.7 (3.8)
15.9 (27.1); 12.0 (17.1)
0.02
0.7 (3.5); 0.7 (3.5)
0.7 (3.5); 0.7 (3.5)
12.7 (29.0); 12.6 (28.7)
0.03
0.7 (3.5); 0.7 (3.5)
0.7 (3.5); 0.7 (3.5)
0.0 (0.0); 0.0 (0.0)
2
0.7 (3.5); 0.7 (3.5)
0.7 (3.5); 0.7 (3.5)
31.0 (50.7); 31.7 (46.1)
1
0.6 (3.5); 0.6 (3.5)
0.6 (3.5); 0.6 (3.5)
19.7 (34.1); 13.7 (21.6)
1
0.6 (3.5); 0.6 (3.5)
0.6 (3.5); 0.6 (3.5)
3.3 (10.4); 2.3 (7.3)
Amplitude/stawidth
Stable path
Sortest Path
Dalitz
Rate whitened pixels
Stable path
Sortest Path
Dalitz
Max deviation, n
Stable path
Sortest Path
Dalitz
Max gap width, ngap
Stable path
Sortest Path
Dalitz
Max vertical shift, nshift
Stable path
Sortest Path
Dalitz
Error
Error
Error
Error
Error
3
0.7 (3.5); 0.7 (3.5)
0.7 (3.5); 0.7 (3.5)
22.1 (38.9); 32.6 (45.6)
rotation
5
0.8 (3.6); 0.8 (3.6)
0.8 (3.6); 0.8 (3.6)
13.0 (20.1); 33.7 (45.0)
0.09
1.2 (3.8); 1.2 (3.8)
1.7 (4.0); 1.9 (4.3)
89.3 (54.6); 86.9 (25.6)
0.08
0.8 (3.6); 0.8 (3.6)
0.8 (3.6); 0.8 (3.6)
95.8 (17.5); 100.0 (0.0)
2.5
0.7 (3.5); 0.7 (3.5)
0.7 (3.5); 0.7 (3.5)
4.2 (19.6); 9.8 (29.0)
10
0.6 (3.5); 0.6 (3.5)
0.6 (3.5); 0.6 (3.5)
24.2 (38.9); 16.7 (22.0)
7
0.7 (3.8); 0.7 (3.8)
0.7 (3.8); 0.7 (3.8)
39.0 (48.7); 24.6 (27.3)
10
0.7 (3.8); 0.7 (3.8)
0.7 (3.8); 0.7 (3.8)
48.7 (60.8); 29.3 (30.7)
typeset emulation II
7
0.6 (3.5); 0.6 (3.5)
0.6 (3.5); 0.6 (3.5)
22.3 (30.0); 17.4 (19.0)
typeset emulation I
4
0.7 (3.5); 0.7 (3.5)
0.7 (3.5); 0.7 (3.5)
15.7 (27.2); 33.7 (45.0)
line y-variation
0.07
0.9 (3.7); 0.9 (3.7)
0.9 (3.7); 0.9 (3.7)
26.7 (25.3); 29.9 (27.2)
white speckle
0.06
0.7 (3.5); 0.7 (3.5)
0.7 (3.5); 0.7 (3.5)
83.3 (30.6); 83.9 (30.4)
curvature
0
0.6 (3.5); 0.6 (3.5)
0.6 (3.5); 0.6 (3.5)
0.0 (0.0); 0.0 (0.0)
13
0.7 (3.8); 0.7 (3.8)
0.7 (3.8); 0.7 (3.8)
58.7 (54.9); 37.3 (29.0)
13
0.6 (3.5); 0.6 (3.5)
0.6 (3.5); 0.6 (3.5)
31.4 (42.3); 19.2 (20.3)
6
1.1 (3.8); 1.1 (3.8)
1.1 (3.8); 1.1 (3.8)
12.8 (18.6); 34.2 (44.7)
0.11
2.1 (4.6); 2.3 (4.8)
5.3 (7.4); 7.0 (9.6)
54.5 (55.9); 95.2 (17.0)
0.10
1.2 (4.0); 1.2 (4.0)
1.2 (4.0); 1.2 (4.0)
96.0 (17.5); 100.0 (0.0)
5
1.2 (4.0); 1.2 (4.0)
1.2 (4.0); 1.2 (4.0)
5.5 (9.3); 37.5 (41.9)
Runtime
886 sec.
6106 sec.
842 sec.
Runtime
739 sec.
5085 sec.
703 sec.
Runtime
767 sec.
5122 sec.
768 sec.
Runtime
809 sec.
5122 sec.
872 sec.
Runtime
822 sec.
5554 sec.
639 sec.
Runtime
858 sec.
6006 sec.
612 sec.
rate and miss detection rate. See [DDCF08] for parameter details.
Table 3.3: Eect of dierent deformations on the overall sta detection error rates in percentage: average (standard deviation) of the false detection
Error
0.05
0.8 (3.6); 0.8 (3.6)
0.8 (3.6); 0.8 (3.6)
0.3 (1.4); 0.3 (1.4)
-2.5
0.7 (3.5); 0.7 (3.5)
0.7 (3.5); 0.7 (3.5)
8.6 (14.0); 15.5 (28.7)
-5
0.7 (3.5); 0.7 (3.5)
0.7 (3.5); 0.7 (3.5)
17.8 (22.0); 51.7 (38.1)
Angle
Stable path
Sortest Path
Dalitz
40
Chapter 3. Sta Lines Detection and Removal
3.9. Evaluation Metrics and Results
41
False detection rate
miss detection rate
Runtime
Dalitz
5.2% (10.4)
5.9% (11.3)
112 sec.
Shortest path
1.4% (3.5)
2.5% (7.3)
612 sec.
Stable path
1.3% (5.7)
1.4% (6.4)
115 sec.
Table 3.4: Detection performance on real music scores in percentage: average (standard deviation).
3.9.2 Removal Error Metrics
Sta line detection algorithms can be used as a rst step in many sta removal algorithms.
To understand the potential of our algorithm to leverage the performance of existing sta removal algorithms, we conducted a series of experiments, comparing existing versions of sta
line removal algorithms with modied versions of them, making use of the stable path algorithm at the sta line detection step. The quantitative comparison of the dierent algorithms
is totally in line with the comparison presented in [DDCF08]. Adopting the naming convention
from [DDCF08], the following algorithms were adapted: LineTrack Height, LineTrack Chord,
Roach/Tatem, as already described in Section 3.7.
The original version of the algorithms was considered as available in [DDCF07], making use of
the Dalitz algorithm in the detection phase. The modied versions use instead the stable path
for detecting lines. We also adopted the same error metrics (individual pixels, sta-segment
regions and sta interruption location) presented in [DDCF08] and conducted the comparative
study on the same test set see Section 3.8.
Individual Pixels
This metric considers the sta line removal as a two-class classication
problem at the pixel level, that is, one pixel can belong to a sta line or not. Therefore,
a natural performance measure is the error rate for this classication, given by
Pixel error rate
x =Number
y =Number
=
x+y
z
of misclassied sta pixels
of misclassied non sta pixels
z =Number
of all black pixels
However, this metric has one problem: little information is given about how well the sta
removal algorithm separates symbols that are otherwise connected by sta lines; this error
only indicates how badly the symbols are distorted when compared to the ideal sta-less
images.
Segmentation Region Level
This error metric has as base the regions of the line segments.
The sta line removal can be considered as a segmentation problem where the sta lines
segments need to be separated from the symbol segments.
Chapter 3. Sta Lines Detection and Removal
42
In an OMR application the sta lines segments are considered background and the remaining symbols are taken as being the segments of interest.
Nevertheless, when we
are trying to evaluate the quality of the sta line removal the situation is reversed: our
interest lies on the sta segments and the rest constitutes background.
Following the notation given in [TMD99], we have two segmentations for the set of black
pixels present in the test image:
1. The ground-truth segmentation
G = Gobj ∪ {gnoise }
2. The segmentation detected from the algorithm
{s1 , . . . , sN },
gi
where each
and
gnoise
segment respectively, and
sj
and
with
G = {g1 , . . . , gM }
S = Sobj ∪ {Snoise }
with
S =
contains the black pixels of a contiguous sta
snoise
contain the remaining background black
pixels, respectively.
In the set of all sta line segments from both segmentations an equivalence classes of
overlapping segments is built(two segments are considered equivalent
quence
class
r
c1 , c2 , . . . , cn
exists with
c1 = a, cn = b
and
ci ∩ ci+1 6= 0).
a'b
when a se-
For each equivalence
we count the number of the contained numbers of segments
G
and
S
and thus
detect recognition errors. All possible cases are listed in Table 3.5. The formula is given
by
Segmentation error rate
x =Number
y =Number
x−y
z
of all classes
r
of classes representing a correct recognition
z =Number
Classes number
n1
n2
n3
n4
n5
n6
=
Segments from Gobj
1
1
0
1
>1
>1
of all classes
Segments from Sobj
1
0
1
>1
1
>1
r
Error description
Correct
Missed segment
Falsely detected segment
Segment split
Segments merged
Both splitting and merging occurred
Table 3.5: Sta segment extraction errors based on the number of segments in an equivalence
class
r.
Sta Line Interruption
To obtain this metric a comparison between the ground-truth im-
ages, containing only the segments removed from the ideal case, with the image that
contains exactly the pixels that were in fact removed by the removal algorithm under
evaluation, is made.
In doing so, each sta line is followed, from left to right, in the
images containing only the removed sta segments; and interruptions in the sta line are
looked at. Each interruption represents a detected music symbol that crosses the sta
line. It follows two sets of intervals: the interrupting intervals
ground-truth data and those in the algorithm output
G = {g1 , . . . , gM }
S = {s1 , . . . , sN }.
in the
3.9. Evaluation Metrics and Results
43
With the purpose to establish an error metric, a bipartite graph is created by adding links
between intervals
intervals from
G
gi
to
and
S
sj
that overlap. In this manner, two types of error are revealed:
without a link and intervals with more than one link. In order to
count the number of errors of the second type, the maximum cardinality matching in this
graph is computed [Gal86].This cardinality matching also removes the minimal number
of links leading to the second type error. As a result error rate we have
Sta line interruption error rate
n1 =Number
min {n3 , n1 + n2 }
n3
of interruptions without link
n2 =Number
n3 =Number
=
of removed links
of ground-truth interruptions
Results
From the results obtained we can conclude that the qualitative eects of the deformations
and the insertion of the stable path as the detection algorithm are similar for all three error
metrics.
In Tables 3.6, 3.7, 3.8 and 3.9, we present the results for all three error metrics
(individual pixels, sta-segment regions and sta interruption location) for the Line Track
Height algorithm (modied) (LTH), plus the Skeleton algorithm, which exhibited a competitive
performance in [DDCF08]. This algorithm does not clearly separate the sta lines detection
from the sta lines removal, doing both tasks simultaneously.
So, this algorithm was used
alone, without providing the detection performed by stable paths algorithm.
Stable path + LTH
Dalitz + LTH
Skeleton
Pixel Error Rate
2.8 (1.2)
3.8 (2.6)
6.5 (8.2)
Segmentation Error Rate
0.3 (0.1)
0.3 (0.1)
0.3 (0.1)
Sta Line Interruption Error Rate
0.3 (0.1)
0.3 (0.1)
0.3 (0.1)
Table 3.6: Removal performance on real music scores (in percentage): average (standard deviation).
A rst observation is that, overall, the replacement of the Dalitz method by the stable path
approach as the sta detection step improved the results in the algorithms under comparison.
Additionally, the LineTrack Height algorithm with the stable path consistently outperformed
the other algorithms. Nevertheless, the skeleton method [DDCF08], which does not have a clear
line detection step, continues to present a competitive performance. It is worth to nalize by
noticing that the skeleton algorithm is about two times slower than the modied Line Tracking
Height algorithm.
In Figure 3.9 it is possible to see some results obtained when we apply the LineTrack Height
algorithm with stable path approach to our test set.
Chapter 3. Sta Lines Detection and Removal
44
rotation
Angle
Stable path + LTH
Dalitz + LTH
Skeleton
-5
1.7 (0.7)
19.4 (18.4)
1.9 (0.9)
-2.5
1.5 (0.7)
5.2 (8.7)
1.7 (0.8)
0
1.4 (0.7)
1.4 (0.8)
1.5 (0.7)
2.5
1.4 (0.7)
4.4 (8.8)
1.6 (0.7)
5
1.6 (0.7)
17.5 (18.9)
1.7 (0.8)
Amplitude/stawidth
Stable path + LTH
Dalitz + LTH
Skeleton
0.02
1.4 (0.7)
3.8 (5.8)
2.6 (2.4)
0.04
1.4 (0.7)
14.0 (12.2)
5.2 (5.1)
0.06
1.4 (0.7)
22.8 (13.7)
8.1 (7.2)
0.08
1.5 (0.7)
31.1 (11.0)
11.9 (8.6)
0.10
1.6 (0.7)
35.0 (10.6)
15.4 (10.4)
Rate whitened pixels
Stable path + LTH
Dalitz + LTH
Skeleton
0.03
11.9 (3.1)
11.5 (3.2)
14.6 (3.2)
0.05
17.2 (4.9)
16.8 (4.9)
21.5 (4.6)
0.07
21.1 (5.9)
26.7 (8.0)
27.1 (5.6)
0.09
24.0 (6.7)
53.3 (14.9)
35.2 (12.8)
0.11
26.1 (7.2)
73.3 (14.6)
46.9 (18.7)
Max deviation, n
Stable path + LTH
Dalitz + LTH
Skeleton
2
1.2 (0.7)
9.0 (13.2)
1.5 (0.8)
3
1.3 (0.7)
10.4 (14.1)
1.7 (0.8)
4
1.3 (0.6)
10.9 (14.5)
2.2 (0.9)
5
1.4 (0.6)
10.9 (14.5)
3.7 (1.7)
6
1.4 (0.6)
11.0 (14.6)
5.2 (2.2)
Max gap width, ngap
Stable path + LTH
Dalitz + LTH
Skeleton
1
1.4 (0.7)
2.6 (1.8)
26.4 (9.8)
4
1.4 (0.7)
2.9 (2.0)
27.3 (10.1)
7
1.4 (0.7)
3.2 (1.7)
27.2 (11.3)
10
1.4 (0.7)
2.9 (1.7)
25.5 (9.8)
13
1.4 (0.7)
3.0 (1.8)
26.4 (10.3)
Max vert. shift, nshift
Stable path + LTH
Dalitz + LTH
Skeleton
1
1.4 (0.7)
1.5 (0.8)
7.9 (8.9)
4
1.4 (0.7)
2.8 (1.6)
24.1 (9.1)
7
1.4 (0.7)
3.3 (2.5)
26.7 (11.0)
10
1.5 (0.7)
3.8 (2.4)
26.1 (9.6)
13
1.6 (0.7)
4.7 (3.7)
29.1 (10.7)
curvature
white speckle
line y-variation
typeset emulation I
typeset emulation II
Table 3.7: Eect of dierent deformations on the overall sta removal error rates in percentage:
average (standard deviation) Individual pixels error.
rotation
Angle
Stable path + LTH
Dalitz + LTH
Skeleton
-5
0.3 (0.2)
0.6 (0.3)
0.3 (0.2)
-2.5
0.2 (0.2)
0.3 (0.3)
0.2 (0.2)
0
0.2 (0.2)
0.2 (0.2)
0.2 (0.2)
2.5
0.2 (0.2)
0.3 (0.3)
0.2 (0.2)
5
0.3 (0.2)
0.5 (0.3)
0.3 (0.2)
Amplitude/stawidth
Stable path + LTH
Dalitz + LTH
Skeleton
0.02
0.2 (0.2)
0.3 (0.2)
0.6 (0.2)
0.04
0.2 (0.2)
0.4 (0.2)
0.3 (0.2)
0.06
0.2 (0.2)
0.5 (0.2)
0.3 (0.2)
0.08
0.3 (0.2)
0.6 (0.2)
0.3 (0.2)
0.10
0.3 (0.2)
0.7 (0.1)
0.4 (0.2)
Rate whitened pixels
Stable path + LTH
Dalitz + LTH
Skeleton
0.03
0.2 (0.1)
0.2 (0.1)
0.3 (0.1)
0.05
0.1 (0.1)
0.2 (0.1)
0.3 (0.1)
0.07
0.1 (0.1)
0.2 (0.1)
0.3 (0.1)
0.09
0.1 (0.04)
0.6 (0.2)
0.3 (0.1)
0.11
0.1 (0.04)
0.9 (0.2)
0.4 (0.2)
Max deviation, n
Stable path + LTH
Dalitz + LTH
Skeleton
2
0.2 (0.2)
0.5 (0.4)
0.2 (0.2)
3
0.3 (0.2)
0.5 (0.4)
0.3 (0.2)
4
0.3 (0.2)
0.5 (0.4)
0.3 (0.2)
5
0.3 (0.2)
0.5 (0.3)
0.3 (0.2)
6
0.4 (0.2)
0.5 (0.3)
0.3 (0.2)
Max gap width, ngap
Stable path + LTH
Dalitz + LTH
Skeleton
1
0.1 (0.1)
0.1 (0.1)
0.6 (0.1)
4
0.1 (0.1)
0.1 (0.1)
0.6 (0.1)
7
0.1 (0.1)
0.2 (0.1)
0.6 (0.1)
10
0.1 (0.1)
0.1 (0.1)
0.6 (0.1)
13
0.1 (0.1)
0.2 (0.1)
0.6 (0.1)
Max vert. shift, nshift
Stable path + LTH
Dalitz + LTH
Skeleton
1
0.1 (0.1)
0.1 (0.1)
0.3 (0.2)
4
0.1 (0.1)
0.2 (0.1)
0.6 (0.1)
7
0.1 (0.1)
0.2 (0.1)
0.6 (0.1)
10
0.1 (0.1)
0.2 (0.1)
0.6 (0.1)
13
0.1 (0.1)
0.2 (0.1)
0.7 (0.1)
curvature
white speckle
line y-variation
typeset emulation I
typeset emulation II
Table 3.8: Eect of dierent deformations on the overall sta removal error rates in percentage:
average (standard deviation) Sta line interruption error.
3.9. Evaluation Metrics and Results
45
Angle
Stable path + LTH
Dalitz + LTH
Skeleton
-5
0.2 (0.1)
0.6 (0.3)
0.2 (0.1)
-2.5
0.2 (0.1)
0.3 (0.3)
0.2 (0.1)
Amplitude/stawidth
Stable path + LTH
Dalitz + LTH
Skeleton
0.02
0.2 (0.1)
0.3 (0.2)
0.2 (0.1)
0.04
0.2 (0.1)
0.5 (0.2)
0.3 (0.1)
Rate whitened pixels
Stable path + LTH
Dalitz + LTH
Skeleton
0.03
0.6 (0.1)
0.6 (0.1)
0.6 (0.1)
0.05
0.5 (0.1)
0.5 (0.1)
0.6 (0.1)
Max deviation, n
Stable path + LTH
Dalitz + LTH
Skeleton
2
0.2 (0.1)
0.5 (0.3)
0.2 (0.1)
3
0.2 (0.1)
0.5 (0.3)
0.2 (0.1)
Max gap width, ngap
Stable path + LTH
Dalitz + LTH
Skeleton
1
0.1 (0.1)
0.1 (0.1)
0.6 (0.1)
Max vert. shift, nshift
Stable path + LTH
Dalitz + LTH
Skeleton
1
0.1 (0.1)
0.1 (0.1)
0.3 (0.2)
rotation
0
0.2 (0.1)
0.2 (0.2)
0.2 (0.1)
2.5
0.2 (0.1)
0.3 (0.3)
0.2 (0.1)
5
0.2 (0.1)
0.5 (0.3)
0.2 (0.1)
0.06
0.2 (0.1)
0.6 (0.2)
0.3 (0.1)
0.08
0.3 (0.1)
0.8 (0.1)
0.4 (0.2)
0.10
0.3 (0.1)
0.8 (0.1)
0.5 (0.2)
0.07
0.5 (0.1)
0.6 (0.1)
0.6 (0.1)
0.09
0.5 (0.1)
0.8 (0.1)
0.6 (0.1)
0.11
0.4 (0.1)
0.9 (0.1)
0.7 (0.2)
4
0.2 (0.1)
0.5 (0.3)
0.2 (0.1)
5
0.2 (0.1)
0.5 (0.3)
0.3 (0.1)
6
0.3 (0.1)
0.5 (0.3)
0.3 (0.1)
4
0.1 (0.1)
0.2 (0.1)
0.6 (0.1)
7
0.1 (0.1)
0.2 (0.1)
0.6 (0.1)
10
0.1 (0.1)
0.2 (0.1)
0.6 (0.1)
13
0.1 (0.1)
0.2 (0.1)
0.6 (0.1)
4
0.1 (0.1)
0.1 (0.1)
0.6 (0.1)
7
0.1 (0.1)
0.2 (0.1)
0.6 (0.1)
10
0.1 (0.1)
0.2 (0.1)
0.6 (0.1)
13
0.1 (0.1)
0.2 (0.1)
0.7 (0.1)
curvature
white speckle
line y-variation
typeset emulation I
typeset emulation II
Table 3.9: Eect of dierent deformations on the overall sta removal error rates in percentage:
average (standard deviation) Segmentation region level.
Figure 3.9: Examples of the results obtained using the sta line removal algorithm in our test
set.
Part III
Segmentation and Classication
47
Chapter 4
Segmentation and Classification Process
This chapter begins with a description of the dierent music symbols that can be found in a
music score. Then proceeds with the description of the segmentation and classication steps.
Music notation, like any other written language, did not appear as the invention of one person.
It emerged from the combined and prolonged eorts of hundreds of musicians. They all hoped
to express by written symbols the essence of their musical ideas [Rea69]. Improvement in those
symbols came about because it was necessary. Hence, the musical notation is very extensive if
we consider all the existing possibilities and their variations over time. The result notation is
a kind of alphabet, shaped by a general consensus of opinion to serve as a general expressive
technique. Music notation is, thus, the visual manifestation of the interrelated properties of
musical sound (as pitch, intensity, time, timbre and pace). Symbols indicating the choice of
tones, their duration and their manner of performance are important because they form this
written language that we call
music notation.
Several symbols can be confused with each other because of their shape similarity.
Besides
that, in some cases, complex symbols can appear in the score, as for instance the guitar symbols (diagram with the positions of the notes in the instrument). And if we consider that we
are working with handwritten music scores several additional problems come about. One with
more evidence is the variation in the symbols notation caused by each person. In this work we
decided to study and recognize the standard music notation. In this manner, variants of types
of specic notation will not be recognized, as drums notation.
Furthermore, since tablature
is something specic to certain instruments, it will also not be recognized.
In the following
Section the symbols that we treat in this thesis will be described.
4.1 Musical Symbols
Clefs
The term Clef is derived, logically enough, from the Latin word
clavis,
meaning key.
The three clef signs in common use today had their origins in pitch letters: the earliest
F related with the tone f (fa ), when a second line was latter
drawn above the text over the F the letter C was axed upon it, representing the pitch
of c' (do ), the last letter was G which represents the tone g' (sol ). The F clef became
form was the Latin letter
the
bass clef
because of its lower register; the C clef became either
49
tenor or alto clef ; and
Chapter 4. Segmentation and Classication Process
50
the G clef became the
soprano or treble clef.
The clef symbols are the rst symbols that
appears at the beginning of every music sta. It is very important because they tell us
which note is found on each line or space.
Table 4.1: Clefs.
Treble clef
Alto clef
Bass clef
The modern bass clef is used for all voices and instruments of essentially a bass register;
for instance baritone voices, keyboard and orchestra instruments as piano and organ.
The alto clef is used in English horn, trombones and violas.
It is important to stress
that none of these last instruments uses the alto clef exclusively; only the viola employs
it with consistent regularity. The treble clef is the highest of all the clef signs, beginning
below the sta and extending above it [Rea69]. All instruments and voices of high range
employ this symbol, as for instance lyric soprano, orchestra instruments as piccolo and
ute, English horn, saxophone, among others.
Noteheads and stems
A musical
note
indicates two important aspects. On the one hand,
the symbol indicates, by its position on the sta and by the clef used, a
pitch to be played
or sung. On the other hand, the note establishes, by the exact appearance of its three
integral parts (notehead, stem and ag), the relative time
duration
of this musical sound.
Table 4.2: Notes.
Minim
Crochet
The
notehead is somewhat oval in shape, and is either open Minim or closed Crochet.
4.1. Musical Symbols
The
stem
51
is the thin vertical line joined to the side of the note head (closed or open).
down, it is placed on the left side of the notehead; when the stem
is to go up, it is put on the right side of the notehead. This stem direction is measured
as follows: when the notehead is above the centre, line of the sta, the stem goes down
see Figure 4.1(a); when the note is below the middle line of the sta, the stem goes up
see Figure 4.1(b); and when the note is centred on the sta, the stem may go in either
direction see Figure 4.1(c), although it is more common practice to draw it down. The
stem length is proportional, that is, it is measure interms of the note position on the sta.
When the stem is to go
(a) Stem down.
(b) Stem up.
Figure 4.1:
Flags and beams
(c) Note positioned in the centre of
sta the stem may go in either direction.
Example.
Flags are employed to indicate the relative time values of the notes with
black (closed) noteheads. These symbols are featured by a thin curved stroke that joins
to the end of the stem, and always goes to the right of this symbol, regardless of whether
the stem goes up or down. When a dot follows the notehead, the end of the curved line is
often shortened so as not to conict with the dot. On a downward stem the curved line
barely touches the bottom of the notehead.
Table 4.3: Flags and beams.
Quaver
Semiquaver
Demisemiquaver
Hemidemisemiquaver
Beams are compound ags used to connect notes in note-groups. The number of beams
always equals the number of ags appropriate for each note. These symbols demonstrate
the metrical and the rhythmic divisions within the measure and also the note-groupings
Chapter 4. Segmentation and Classication Process
52
in oppositions to the normal patterns of the measure.
categories:
primary
Beams are considered in two
(beams that link an entire group; without breaks) and
secondary
(beams that are interrupted or partially broken) see Figure 4.2. The position of the
beams is solved by the same general principles that govern stem direction. That is, if the
up in the group to be beamed, then all the stems will go
up and the beams will be placed above the notes; on the other hand, if most of the notes
require a downward stem (most of the noteheads lie above the sta centre), then the
stems of all the notes must be written down and the beams placed below the noteheads.
The direction of all beams should approximately parallel to the main body of the notes
majority of stems normally go
they connect.
Figure 4.2: Example.
Rests
The rests were invented to indicate the exact duration of silence in the music. Therefore,
each note value has its corresponding rest sign. The written position of a rest between
two barlines is determined by its location in the meter, that is, a rest always occupies
the same position in a measure that an equivalent note value would ll, except the whole
rest, which is always centred.
The
whole rest
is solid, oblong symbol and placed just beneath the fourth line of the
sta (regardless of the clef used). It is incorrect to place it elsewhere unless two voices
or instruments share common sta. When this happens, the whole rest is usually written beneath the
The
half rest
top line for the upper part, and beneath the bottom line for the lower part.
is an oblong mark of the same length as the whole rest, located on the third
sta line. Like the whole rest it is incorrect to place this symbol on any other line unless
two voices or instrumental parts share the sta. In this case, the half rest goes above the
top
The
line for the upper part, and above the
quarter rest
bottom
line for the lower part.
is the most dicult symbol to write. The sign is centred on the sta,
with the top tail ending in the fourth space and the bottom hook resting in the rst space.
When two parts share a common sta, this rest goes somewhat
the upper part, and just
below
it for the lower part.
above
the sta centre for
4.1. Musical Symbols
53
Table 4.4: Rests.
Rest
Note
Whole rest
Half rest
Quarter rest
Eighth rest
Sixteenth rest
thirty-second rest
Sixty-fourth rest
The
eighth rest
has half of the value of the quarter rest.
This symbol is written as a
slanting stem with a single hook. The stems rest occupies the
hook occupies the
The
sixteenth rest
third
has half of the value of the eighth rest. This symbol has another hook
be extended to touch the
thirty-second rest
bottom
second
sta space and the stem must
line.
has half of the value of the sixteenth rest.
hooks and a further extended stem. The three hooks occupy the
spaces, with the stem resting on the
The
sta line and the
space.
to the slanting stem. This second hook occupies the
The
second
sixty-fourth rest
bottom
It consists of three
second, third, and fourth
line.
has half of the values of the thirty-second rest.
structure with its four hooks. Each hook must occupy one of the
four
It is a elaborate
sta spaces, and
the slating stem must accordingly be extended until the distance of one space
below
the
Chapter 4. Segmentation and Classication Process
54
sta.
Ties and slurs
Ties are a notational device used to prolong the time value of a written note
into the following beat. This symbol is a curved line that connects two successive noteheads indicating, together, the total time value desired. The tie appears to be identical
to slur, however, while tie almost touches the notehead centre, the slur is set somewhat
above or below the notehead. Besides that, ties are normally employed to join the time
value of two notes of identical pitch, while slurs have a totally dierent function. The tie
always loops in the direction opposite that of the stem. In other words, when the tie notes
have their stems going up, the tie mark loops below the noteheads see Figure 4.3(b),
when the two stems go down, the tie is curved above the noteheads see Figure 4.3(a).
Slurs aect note-groups as entities indicating that the two notes are to be played in one
physical stroke, without break between them. When the slur is used in conjunction with
tied notes, it may be placed over or under the noteheads tied together, but the tie always
loop according to the note position on the sta.
Table 4.5: Ties and Slurs.
Tie
Slur
(a)
Figure 4.3:
Accidentals
(b)
Example: a) Tie above noteheads; b) Tie under noteheads.
The signs that are placed before the note to designate changes in sounding pitch
are called accidentals. These signs are ve in number: the
the
double at
and the
double sharp.
natural,
the
at,
the
sharp,
However, in this work, only the rst three symbols
will be treated.
The
f.
at [
is literally the rounded small letter
b
found in Guido's soft hexachord on
In modern practice, the shape of the at is somewhat more slanted upward than the
4.1. Musical Symbols
55
ordinary small letter
b.
the second line above.
line, the top of the stem extends to
When the symbol is notated in a space, the stem extends to the
when the at is placed on a
center of the second space above.
b found in Guido's hard
hexachord. This sign represented the raised form of the pitch b. By adding a short tail
to the square b, and slightly slanting the horizontal strokes, we get the natural sing. The
The form of the sign
natural \
has its origin in the square
two tails of this symbol are to be extended quite as far on the sta as the stem of the
at. When the sign is on a line, the upper tail of the natural extends only to the
higher
space and the lower tail to the
next lower
next
space. If the sign is in a space, the tails
go respectively to the adjacent lines above and below.
The
sharp ] was written in a variety of ways, beginning with what looks like and continuing
with modied forms of the St. Andrew's cross. In our days, it resembles more closely an
elaborated natural sign with an extension of both vertical lines and the horizontal dashes.
Accents
Accents marks are symbols for
special or exaggerated stress upon any beat, or portion
of a beat. These indications are sometimes called ancillary signs because they are not
such basic symbols as the sta, clefs and notes. The accents are divided into categories:
those for
levels).
percussive
attack (higher dynamic levels) and those for
pressure
attack (lower
The two principal signs for percussive accent are 4.6 Accent and 4.6 Marcato.
The rst symbol is made identically whether it is placed above or below the notehead.
The second sign is inverted when it is placed below any notehead, stem or beam. Choice
between the two is governed by the degree of force and intensity desired. While the sign 4.6
Marcato is for a stronger attack, the sign 4.6 Accent is for a moderately sharp attack.
The principal symbol for pressure accent is a short, heavy dash placed over or below the
notehead - see Figure 4.6 Tenuto. In our work only the Accent and Staccatissimo accents
was used to facilitate the classication process.
Table 4.6:
Accents.
Percussive accents
Accent
Marcato
Staccatissimo
Tenuto
Pressure accents
Chapter 4. Segmentation and Classication Process
56
Time signatures
Meter is a recurring pattern of stress, and an established arrangement of
strong and weak pulsations. These pulsations are also known as beats. Meter is inextricably related to time signatures. This symbol is the vertical arrangement of two gures
placed on the sta following the clef sign and any key signature. When placed on the
sta, in the normal position, the numerator, which indicates the number of beats or the
number of inner pulsations, occupies the two top spaces and the denominator, which
indicates the unit of time, occupies the two bottom spaces.
Despite of these terms, a
time signature is not to be regarded as a fraction. There are several possible symbols for
the representation of time signatures. Some are fairly common in practice, some are very
rare. In this work we chose the common time and the cut time.
Table 4.7: Time signatures.
Common time
Cut time
4.2 Segmentation Process
The segmentation process used in this work is based on already existent algorithms [Fuj04,
BBN01]. This step is a hard one because of several reasons. One of the main problems deals
with the variability of the symbols. This is obvious when we consider handwritten music scores
see Figure 4.4(a). However this variability is also found in printed scores, when we have scores
from dierent editors see Figure 4.4(b). This variability will cause troubles of inconsistency
in the size and shape of each object.
Besides that, the number of possible arrangements of symbols primitives is very high. These will
bring complexity and ambiguity in the segmentation process, because we do not know all the
probable groups of symbols. Another problem is related with the sta line removal approach.
Sta lines connect most of the objects and when we eliminate them we may cause breaks in
the music symbols. These will bring undesirable imperfections that will dicult the segmentation process. Finally, a symbolic high density in the music score will also not help in this phase.
With this brief description of the several problems that we can nd in this step we may conclude
that it is clear that not all problems can be solved at this level, and the imperfections of the
segmentation have to be taken into account during the next processing steps.
The segmentation process architecture is presented in Figure 4.5 and an example of the process
on a real music score is presented in Figure 4.6. This process consists in localizing and isolating
4.2. Segmentation Process
57
(a) Variability in the real scores.
(b) Variability between dierent publishings in printed scores.
Figure 4.4: An exemplicative example of the variability in the music symbols.
the symbols in order to identify them.
In this work, the symbols we want to recognize can
be split into four dierent types: the symbols that are featured by a vertical segment with
height greater than a threshold (notes, notes with ags and open notes), the symbols that link
the notes (beams), the remaining symbols connected to sta lines (clefs, rests, accidentals and
time signature), and the symbols above and under sta lines (notes, ties, slurs and accents).
As a result, the segmentation method proposed was based on a hierarchical decomposition
of the music image.
sta lines removal,
1
This music sheet was therefore analyzed and split by stas, after the
as being the rst step.
Subsequently, the
connected components1
If you note by the Figure 4.6 there are symbols that are not linked together but connected. This is due
because in our connected component method we use a threshold of 5 pixels for the distance between objects.
This is necessary because some music symbols became separate in the sta line removal phase.
Chapter 4. Segmentation and Classication Process
58
Figure 4.5: Segmentation process I.
Figure 4.6: Segmentation process II.
were identied.
To extract the symbols with appropriate size, a
selection
of the connected
components detected in the previous step was done. The thresholds used for the height and
width of the symbols were experimentally chosen as
for the height and
staspaceheight
of the music symbols.
staspaceheight
and
16×staspaceheight
for the width. These values took into account the features
In each bounding box of the connected components repeated objects
(e.g. when we have a group of notes and a sharp; in this case we have two distinct connected
components: notes and sharp; however when we extract the bounding box of the notes the
remove them to
In the end, we are ready to nd and extract all
sharp is also included see Figure 4.7) can exist. Thus, it was necessary to
avoid multiple extractions of the same object.
the music symbols. Next we will present the methods used to extract the symbols.
Beam Detection
It is reasonable to think that beams are one of the symbols with harder detection process.
The vicissitude inherent in their action in the musical score is the principle reason for this
diculty. Their shape and size, and the way they connect to each other and to other symbols
are combined in a huge number of dierent ways. They are also prone to present inconsistency
in the thickness and in the link with the stems see Figure 4.8. Thus, we propose a solution
that just checks the presence of a segment of adequate height which connects the extremities
of notes see Figure 4.9. Firstly, in the connected component detected, the stems are found
4.2. Segmentation Process
59
(a) Original image.
(b) Image with two connect components: dierents shades of gray.
Figure 4.7: An exemplicative example of two dierent connected components in the same
bounding box.
and removed to separate the beams from the notes. When we nd all the remaining objects
we select those who have a width bigger than
lower than
4×staspaceheight.
2×staspaceheight-staineheight
and a height
These values were selected experimentally and in accordance
with the geometric features of the object in cause. The height value needs to take into account
the inclination of the beams that may have. In the end, a check of the presence of a stem is
made to validate the objects selected. It is important to state that the beams are grouped if
the distance between them is lower than a half
staspaceheight.
Figure 4.8: An example of the existence of inconsistency in the beam thickness and in the link
with the stems.
Figure 4.9: Beam segmentation process.
Notes, Notes with Flags and Notes Open Detection
In this work, we decided to consider stems and note heads as one. In this way, we dened the
geometric features of the notes we want to extract as the objects with a height bigger than a
threshold, experimentally selected as twice
staspaceheight, and a width limited by two values,
Chapter 4. Segmentation and Classication Process
60
also experimentally chosen as half
staspaceheight and 3×staspaceheight.
To simplify this task
the beams detected were removed before applying this algorithm see Figure 4.10.
Figure 4.10: Notes segmentation process.
Accidentals, Rests, Accents and Time Signature Detection
Generally, these symbols have similar values for width and height. The procedure used to ex-
2
tract them was based on the combination of X-Y projection proles technique see Figures 4.11
and 4.12. The heuristics used were experimentally chosen and these values are arguably sensible. On the one hand, we have symbols that have vertical sequence of black pixels, for instance
sharps, naturals and rests.
On the other hand, we need to take into account the symbols
topological position, because in this case we are trying to detect accents and time signature.
Figure 4.11: An example of sharp detection on a real score.
2
Horizontal (Y) projection has as result a vector which the component i is the sum of all black pixels from
row i in the image; Vertical (X) projection has as result a vector which the component j is the sum of all black
pixels from column j in the image.
4.2. Segmentation Process
61
Figure 4.12: An example of sharp detection on a ideal score.
Clefs, Ties and Slurs Detection
These symbols have their own attributes, like a large width for the ties and slurs and a bigger
height for the clefs.
In both of them we have no presence of stems.
With these in mind,
staspaceheight
< 2× numberstaspace ×
the projection proles procedure was used with specic heuristics. For clefs a
<
width
< 4× staspaceheight
and
2× staspaceheight <
height
staspaceheight + numberstaine × staineheight were used in the experiments.
These values
take into account the fact that clef symbols are the largest of all the signs, beginning below
the sta and extending above it, as we said in the description 4.1. On the other hand, for the
relations symbols (ties and slurs) the rules for extracting them were based in a large width.
4.2.1 Results
With the purpose of evaluating the performance of the segmentation process and to study
its behaviour with respect to some deformations curvature and rotation applied to the
images, two dierent error metrics were considered: the percentage of false positive symbols
and the percentage of symbols missed to detect.
A false positive (failed symbol) happens
when the algorithm falsely identies a musical symbol which is not one; and a false negative
(missed symbol) is when the algorithm fails to detect a music symbol present in the score.
These percentages are computed using the symbols position reference and the symbols position
obtained by the segmentation algorithm. The references for the symbols position were partially
done manually, that is, the music symbols that are in the ideal scores were manually splited
in order to separate the composed existing symbols and the junction of the beams signs see Figure 4.13(b). In the resulting images the deformations described in Subsection 3.8.1 are
applied. In the end, the music symbols positions are extracted using a Matlab
©
function that
gives their bounding box. It is important to stress that this experiment was only made with
ideal scores where deformations were applied, because it was our intention to see the behaviour
Chapter 4. Segmentation and Classication Process
62
of the method with respect to some deformations and besides that, this was the only quickest
method of obtaining a database of reasonable number with the symbols position references in
3
a semi-automatic way .
(a) Music score before split the composed symbols.
(b) Music score after split the composed symbols.
Figure 4.13: An example of how the music symbols that are in the ideal scores were manually
splited.
An overview from the results from the Figure 4.14 we can conclude that the segmentation
process has a good performance, since the percentage of correct symbols are largely over 80%.
The high percentage of failed symbols that sometimes we nd are related with extra symbols or
noise; that is, in the score that we are segmenting we have more symbols than in the reference
score. For example numbers or symbols that are not our objective to classify. Besides this result,
the performance of segmentation process is almost independent of intensity of the deformation,
for the range of values considered see Figure 4.16.
Figure 4.14: Results of the error metrics.
3
We have only split the composed symbols in 18 ideal scores with standard music notation; the deformations
was obtained from these. In total we have 306 images.
4.3. Classication Process
63
Figure 4.15: Results of the error metrics for the rotation deformation.
Figure 4.16: Results of the error metrics for the curvature deformation.
4.3 Classication Process
In this Section several results obtained with the dierent classiers adopted are presented.
These classiers were already presented in Section 2.3.
For the neural network,
k -nearest
neighbour and support vector machines methods, each image of a symbol was initially resized
to
20 × 20
pixels and then converted to a vector of
400
binary values.
This process is sig-
Chapter 4. Segmentation and Classication Process
64
nicantly dierent for the case of the hidden markov model.
normalised in a sta height and width of
In the latter, the images were
150 and 30 pixels, respectively.
The segmentation-free
2-pixel
approach was used to extract features from images [Pug06]. For this reason, a
sliding
window mechanism was used to produce a sequence of observations. In doing so, dependent
observations are replaced by observations depending on the horizontal position of the window.
We extract the same features that in [Pug06]:
the largest black element (a(ni )/S , where
S = height × width),
n distinct connected components of black pixels,
a(ni )
is the area of the largest black element and
the area of the smallest white pixel (a(nj )/S , where
the smallest white element and
S = height × width)
a(nj )
is the area of
and the gravity centres.
To assign classes to the images of music symbols the process was as follows: for the case of
neural network, for each of the images converted in to vectors they were assigned to another
one of length
14
this value correspond to the
correct class and value
0
14
classes of symbols with value
1
in the
otherwise. For the other three classiers the images were assigned to
a vector numbering, between
1
and
14,
depending on the class to which they belong.
In the training phase, a generalization control capability on the classier was made. Putting
dierently, the available data was divided into training, validation and test set, towards the assessment of the network topology with better performance. By that, through a cross validation
we selected the number of neurons in the hidden layer. In other words, the cross validation
consists in optimizing the model by changing the number of neurons in the hidden layer of
the network using the training data; for a further evaluation of performance the validation set
was tuned. The network model with better result was used to compute the expected error in
the music symbols classication. Thus, the training data was added to the validation data to
conduct a nal training for this network and therefore obtaining the nal model. The expected
error was obtained computing the dierence between the target values and the values obtained
by the network, using the test set.
This process is slightly modied when we apply elastic
matching to our database. The elastic matching method is adopted in the training data and
training data with validation set. We increased the dataset by causing variability in our images
in order to simulate distortions that can happen in real scores. Our objective is to prepare the
classier for several possible situations that may occur in handwritten music symbols.
The neural network structure used to classify a real score without elastic matching is illustrated
in Figure 4.17. The input layer is composed by
400
units which output values are the vector
values that represent the image of the music symbol. The output layer has
respond to the
14
14
units that cor-
classes of the symbols. When a pattern belonging to the class
the target value for the unit
i
is presented,
in the output layer is the maximum value. The range for the
number of hidden neurons chosen in this work was
hidden layer, for this case, was
i
9
[1, 2, . . . , 9].
The number of neurons in the
neurons see Table 4.8. This value was estimated during the
training of the neural network. To each neuron is associated a log-sigmoid activation function.
Model discriminant HMM was adopted to construct a model for each class [AYV01].
training of the HMM to learn the parameters of the models (λ
= (A; B; π))
The
was done by
4.3. Classication Process
65
Figure 4.17: Neural network topology.
Handwritten symbols
printed symbols
With EM
9
8
Without EM
9
9
Table 4.8: Number of neurons in the hidden layer (not)using elastic matching method (EM) in
the dataset.
4
the Baum-Welch algorithm . The goal of classication is to decide which class the unknown
sequence belongs to, based on the model obtained in the training phase. These symbols are
5
classied on the basis of the maximum likelihood ratio which is obtained by Viterbi algorithm .
In Table 4.9 the number of states obtained by cross validation are represented. The range was
4
5
See Appendix A.7 for more details.
See Appendix A.8 for more details.
Chapter 4. Segmentation and Classication Process
66
from
3
to
8.
Handwritten symbols
Printed symbols
8
8
7
8
With EM
Without EM
Table 4.9: The number of states in the hidden Markov model with and without the elastic
matching method (EM) in the dataset.
As in the previous methods, the value of
C
of
and
γ,
k,
in the
k -nearest
neigbor classier, and the values
in the support vector machines, were also obtained by cross validation and the
expected error was computed with the test set.
The
k
value needs to be suciently big to
minimize the probability of error occurrence in the classication and suciently tiny for the
closest
k
points becoming closer to the point to classify. In doing so, for a
obtained a
k = 1, 2, . . . , 10,
we
k = 1 for a handwritten and printed dataset with and without the application of the
elastic matching. The parameters for the support vector machines are presented in Table 4.10.
The range was
23
for
γ
2, 21.25 , 21.5 ,. . ., 24.75 , 25
for
C
value estimation, and
2−7 , 2−6.75 , 2−6.5 ,. . ., 22.75 ,
value estimation.
Handwritten symbols
Printed symbols
C
γ
C
γ
With EM
C = 23.5 = 11.3137
γ = 2−7 = 0.0078
C = 21.25 = 2.3784
γ = 2−5.75 = 0.0186
Without EM
C = 24.5 = 22.6274
γ = 2−6 = 0.0156
C = 21.25 = 2.3784
γ = 2−5.75 = 0.0186
Table 4.10: The values of the parameters
C
and
γ
in the support vector machines with and
without the elastic matching method (EM) in the dataset.
4.3.1 Results
Towards musical handwritten symbol classication, several sets of symbols were extracted from
dierent musical scores to train the classiers. In Table 4.11 the number of handwritten printed
music symbols per class used in the training phase of the classication structures is represented.
The symbols were grouped according to their shape. In doing so, the rests signs were divided
in two groups RestI and RestII; see Table 4.11. Besides that we include the unknown class
to classify those symbols that do not t into any of the other classes. In total we have 3222
handwritten music symbols and 2521 printed music symbols.
4.3. Classication Process
Symbol
67
Class
Total number
Accent
Class
Total number
189
AltoClef
201
BassClef
26
Beam
291
Beam
438
Flat
155
Flat
230
Natural
127
Natural
317
Note
304
Note
466
NoteFlag
120
NoteFlag
122
NoteOpen
309
NoteOpen
208
TieSlur
67
RestI
135
RestI
63
RestII
401
RestII
321
Sharp
345
Sharp
13
Staccatissimo
21
Time
122
TrebleClef
99
TrebleClef
305
Unknown
404
Unknown
404
Handwritten
Music
Symbol
Printed
Symbols
Music
Symbols
Table 4.11: Classes list of handwritten and printed music symbols.
Chapter 4. Segmentation and Classication Process
68
Results obtained Without Elastic Matchingk
Accent
BassClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
RestI
RestII
Sharp
Staccatissimo
TrebleClef
Unknown
99% CI for the Expected
performed in percentage:
average (standard deviation)
Neural network
Nearest neighbor
100%
100%
98,17%
100%
100%
97,41%
83,33%
66,67%
100%
100%
96,51%
100%
100%
76,24%
Support vector
machines
89,36%
100%
96,33%
100%
100%
95,69%
93,33%
50%
96,97%
100%
98,84%
100%
83,33%
93,01%
Hidden Markov
model
95,83%
55,56%
91,89%
81,36%
97,50%
69,49%
62,50%
22,22%
100%
93,07%
85,23%
100%
96,30%
40,59%
89,36%
0%
86,24%
78,95%
83,54%
84,48%
70%
0%
78,79%
99%
79.07%
60%
58.33%
58.42%
[78.85 (1.21); 82.88 (4.47)]
[93.41 (0.34); 94.53 (1.24)]
[94.59 (0.07); 96.90 (0.27)]
[76.58 (0.57); 81.04 (2.09)]
Table 4.12: Results obtained with the test set in the classication process for the handwritten
music symbols.
AltoClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
TieSlur
RestI
RestII
Sharp
Time
TrebleClef
Unknown
99% CI for the Expected
performed in percentage:
average (standard deviation)
Neural network
Nearest neighbor
100%
100%
100%
100%
98,68%
93,33%
100%
93,33%
100%
100%
100%
100%
100%
82,89%
Support vector
machines
100%
100%
97,37%
100%
100%
90%
100%
86,67%
100%
100%
100%
100%
100%
97,37%
Hidden Markov
model
78%
97,26%
92,68%
100%
93,42%
63,33%
88,31%
83,33%
100%
86,67%
100%
33,33%
36,67%
73,68%
94%
93,15%
100%
100%
88,16%
96%
90,67%
63.33%
100%
60%
98.75%
0%
93.33%
67,11%
[87.60 (0.41); 88.97 (1.53)]
[95.93 (0.24); 96.74 (0.90)]
[97.30 (0.30); 98.29 (1.10)]
[84.17 (0.32); 86.71 (1.19)]
Table 4.13: Results obtained with the test set in the classication process for the printed music
symbols.
From the results obtained for the handwritten music symbols we can conclude that the classier
with better performance was the support vector machines with a
the expected performance
[94.59%; 96.90%].
99%
condence interval for
Besides that, Hidden Markov Model performed
better than neural network since the latter had an higher standard deviation than the other
and it also had two classes (BassClef and NoteOpen) of symbols with 0%.
k
See the tables of confusion in the Appendix B.
4.3. Classication Process
69
The results obtained for the printed music symbols show us that the support vector machine
is again the better classier compared with the others with a
expected performance
99%
condence interval for the
[97.30%; 98.29%].
Results obtained With Elastic Matching∗∗
The database during the cross validation process was divided into train, validation and test
sets. The deformations caused by the deformation function 2.10 of the elastic matching with
M = 1, 2, 3
and
N = 1, 2, 3
were applied in the training data and training data with validation
set.
Accent
BassClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
RestI
RestII
Sharp
Staccatissimo
TrebleClef
Unknown
99% CI for the Expected
performed in percentage:
average (standard deviation)
Neural network
Nearest neighbor
100%
100%
95,41%
100%
100%
96,55%
86,67%
66,67%
100%
100%
98,84%
100%
100%
63,37%
Support vector
machines
100%
100%
100%
100%
100%
100%
100%
100%
100%
100%
100%
100%
100%
100%
Hidden Markov
model
91,66%
66,67%
96,40%
71,19%
92,50%
77,97%
28,13%
22,22%
100%
85,15%
81,82%
100%
77,78%
30,69%
91,50%
0%
86,24%
71,93%
89,87%
74,14%
40%
0%
45,45%
93%
88,37%
20%
58.33%
44,55%
[74.88 (0.84); 77.67 (3.09)]
[92.24 (0.32); 93.29 (1.17)]
[87.79 (1.95); 96.00 (8.82)]
[70.89 (0.67); 76.17 (2.47)]
Table 4.14: Results obtained with the test set in the classication process for the handwritten
music symbols.
The results lead us to conclude that the application of the elastic matching to the music symbols does not outperform the performance of the classiers without elastic matching. However,
in handwritten music symbols with higher similarities in their shapes, the process had a better
behaviour than the classiers without elastic matching see Table 4.16. This feature lead us
to suggest a study, in a future work, where the application of the deformation function used
only in these types of symbols to determine if the expected performed is augmented.
The same does not happen with printed music symbols. Our explanation resides on the fact
that these signs do not have high distortions in their shapes. Thus the elastic matching method,
in presence of such music symbols, makes the performance of the classiers slightly lower than
with handwritten music symbols.
The best classier for the test set of real scores was the support vector machines with a
condence interval for the expected performance [87.79%; 96.00%].
99%
However, the standard
deviation was very high [1.95%; 8.82%]. Consequently, I sugest the nearest neighbor classier as
∗∗
See the tables of confusion in the Appendix B.
Chapter 4. Segmentation and Classication Process
70
AltoClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
TieSlur
RestI
RestII
Sharp
Time
TrebleClef
Unknown
99% CI for the Expected
performed in percentage:
average (standard deviation)
Neural network
Nearest neighbor
98%
100%
100%
100%
100%
90%
98,70%
90%
100%
100%
98,77%
100%
100%
78,95%
Support vector
machines
90%
90,41%
94,74%
96,77%
97,37%
93,55%
90,79%
86,67%
93,75%
93,33%
97,5%
100%
100%
100%
Hidden Markov
model
78%
100%
97,37%
100%
93,33%
50%
80,52%
70%
100%
86,67%
100%
33,33%
100%
43,42%
86%
97,26%
94,74%
87.10%
84.21%
40%
92.21%
13.33%
100%
0%
97.5%
0%
93,33%
56,57%
[77.66 (1.34); 82.11 (4.93)]
[95.34 (0.22); 96.06 (0.80)]
[93.13 (0.26); 94.23 (1.18)]
[73.00 (1.00) 88.71 (3.67)]
Table 4.15: Results obtained with the test set in the classication process for the printed music
symbols.
Neural network
Nearest neighbor
Support vector machines
With EM
Without EM
With EM
Without EM
With EM
Without EM
Natural
89.87%
83.54%
100%
100%
100%
100%
Sharp
88.37%
79.07%
98.84%
96.51%
100%
98.84%
Table 4.16: The performed of the natural and sharp symbols.
the best classier with a
99% condence interval for the expected performance [92.24%; 93.29%],
because with a lower standard deviation ([0.32%; 1.17%]) the results will not largely span in
the classier performance. The best classier for the synthetic scores was the nearest neighbor
classier with a
99%
condence interval for the expected performance [95.34%; 96.06%].
Part IV
Conclusions and Future Work
71
Chapter 5
Conclusion
This dissertation had the aim to overcome the predicament of musical symbol recognition in
handwritten scores through research and application of recent techniques of machine learning
and articial intelligence to this problem. In fact, despite advances made in the area of research
in optical musical recognition algorithms, as we have seen in the brief description of state of
the art, several open problems still exist with handwritten music sheets. On the one hand, the
scores tend to be rather irregular and conditioned by the authors' own writing style, and the
quality of the paper in which it is written might have degraded throughout the years, making
it a lot harder to correctly identify its contents. On the other hand, they are likely to have
changes of size, shape and intensity of handwritten symbols by the same author into the same
score and the sta lines may be tilted one way or another on the same page. They also may
be curved and may have discontinuities. As a result, the detection and recognition process in
the handwritten music scores becomes more complicated than in printed music sheets.
The rst challenge faced by an OMR system was at sta lines detection and removal. These
operations were one of the most important phases of musical symbol recognition, because they
determine the results for subsequent proceedings. It was seen that the reasons for detecting and
removing the sta lines lie on the need to isolate the musical symbols for a more ecient and
correct detection of each symbol present on the score. The work of this dissertation resulted
in a robust algorithm for the automatic detection of sta lines in music scores based in stable
paths. The proposed method used a very simple but fundamental principle to assist detection
and avoids the diculties typically asserted by symbols superimposed on sta lines. The main
idea was to consider the sta lines as the result of the shortest path between the two margins
of the music sheet, giving preference to black pixels. This approach for the sta line detection
algorithm was adapted to a wide range of image conditions, thanks to its intrinsic robustness to
skewed images, discontinuities, and curved sta lines. The proposed method was also robust to
discontinuities in sta lines (due to low-quality digitalization or low-quality originals) or sta
lines as thin as one pixel.
In order to take full advantage of the method, existing sta line
removal algorithms were enhanced by using the stable paths method as its rst processing step.
Several tests that enabled improvements in this proposed approach that induced better results
were also performed. The encouraging results obtained led us to consider that investigation in
the detection of music symbols would be beneted from the improved sta line detection and
removal. In doing so, the next step was the application of already existent algorithms for the
73
Chapter 5. Conclusion
74
segmentation process using in the previous step the stable paths approach for the detection of
the sta lines.
Since in this work we dealt with handwritten musical scores the segmentation and classication
steps were hard.
On one hand, we had a huge variability in the music symbols that caused
inconsistency problems in the size and shape of each object.
On the other hand, the sta
lines removal may cause breaks in the music symbols that can insert imperfections into them.
Besides that, we also had complexity and ambiguity problems related with the number of possible arrangements of music primitives. The segmentation method was based on a hierarchical
decomposition of the music image and the symbols that we wanted to recognize were split into
four dierent types: the symbols that are featured by a vertical segment, the symbols that link
the notes, the remaining symbols connected to sta lines and the symbols above and under sta
lines. In the music symbol classication a profound comparative study with some classiers,
as the hidden Markov model and support vector machine, was presented. A new methodology
based on elastic matching was used in our dataset with conjunction with others classiers. Our
aim was to simulate controlled distortions, in the database, that can happen in real scores to
prepare the classier for several undesirable situations that may occur in handwritten music
sheets.
With the study done, we concluded that the application of the elastic matching to
the music symbols did not outperform the performance of the classiers without this method.
However, we also saw that the results for classiers with elastic matching presented a better
behaviour for the handwritten music symbols with higher similarities in their shapes.
The
best classier for handwritten music symbols was support vector machines with a performance
higher than 94.59% for the test set without elastic matching and 93.10% for the test set with
elastic matching. The best classier for printed music symbols without elastic matching was
support vector machines with a performance higher than 97.30% and nearest neighbor classier
for the test set with elastic macthing with a performance higher than 95.34%.
5.1 Future Work
The various approaches in OMR to musical symbols segmentation and classication are still
below the expectations for handwritten musical scores.
It is intention of this project, in a
future work, that the proposed methodology incorporates in a natural way the prior knowledge
of the musical rules in the recognition of symbols, in order to overcome the limitations of
the current approaches that are incapable of dealing, in a robust form, with specicities of
the handwritten music.
The new proposed methodology should also be naturally adaptable
to manuscript images and to dierent music notations. Specically, we intend to explore our
initial work in several directions:
ˆ
Continue to investigate new methods of optical music recognition for handwritten musical
scores and for dierent music notations.
This line of research involves a study and a
profound understanding of the latest techniques of pattern recognition, machine learning
and inductive logic programming. The merger of rules and techniques from dierent areas
should help overcome the problems of the existing algorithms.
5.1. Future Work
ˆ
75
Integration of the algorithms developed in an OMR system with remote access via the
internet to a wide corpus of unpublished handwritten music encoded in a adequate format;
The creation of a system like this will not only centralize as much information as possible
but will also serve to preserve this corpus in a way that is easily accessible for browsing,
analysing and downloading, while keeping the scores in their original format along with
their digital counterpart. The availability of a system with these features will contribute
for the preservation of the musical heritage.
Specically, one of the objectives for a future work, is the investigation of new methods for
the segmentation phase.
It is our intention to incorporate Hidden Markov Models in this
step. It will be necessary to perform an intense study and investigation in this eld. Besides
that, a simple detection of the music symbols is not sucient. It is also necessary to perform
their interpretation and to understand the connections between them to have a recognition
of the analyzed score. Therefore, the denition of these relations are needed before they are
implemented in a system. In this manner, the integration of syntactic music rules is required
in the recognition process to acquire semantic information. Thus, this project has as another
goal the research and application of inductive logic programming techniques.
References
[AS07]
Shai Avidan and Ariel Shamir.
Seam carving for content-aware image resizing.
ACM Trans. Graph., 26(3):10, 2007.
[AYV01]
N. Arica and F.T. Yarman-Vural. An overview of character recognition focused
Systems, Man, and Cybernetics, Part C: Applications and
Reviews, IEEE Transactions on, 31(2):216233, May 2001.
on o-line handwriting.
[Bai97]
D. Bainbridge.
Extensible Optical Music Recognition.
PhD thesis, Department of
Computer Science, University of Canterbury, Christchurch, NZ, 1997.
[BBN01]
P. Bellini, I. Bruno, and P. Nesi. Optical music sheet segmentation.
of Music, 2001. Proceedings. First International Conference on,
Web Delivering
pages 183190,
Nov. 2001.
[BS00]
Marija Bojovic and Milan D. Savic. Training of hidden markov models for cursive
handwritten word recognition.
In
ICPR '00: Proceedings of the International
Conference on Pattern Recognition, page 1973, Washington, DC, USA, 2000. IEEE
Computer Society.
[Cap08]
Artur Capela. Reconhecimento de símbolos musicais manuscritos na framework
gamera. Master's thesis, Faculdade de Engenharia da Universidade do Porto, 2008.
[Car89]
N. P. Carter.
Publishing.
[CC93]
Automatic Recognition of Printed Music in the Context of Electronic
PhD thesis, Departments of Physics and Music, 1989.
B. Coüasnon and J. Camillerapp. Using grammars to segment and recognize music
Proc. of DAS-94: International Association for Pattern Recognition
Workshop on Document Analysis Systems, pages 1527, Kaiserslautern, 1993.
scores.
+
[CCR 08]
In
Jaime S. Cardoso, Artur Capela, Ana Rebelo, Carlos Guedes, and Joaquim Pinto
da Costa. Sta detection with stable paths, 2008. IEEE Transaction on Pattern
Analysis and Machine Intelligence (TPAMI), Submited.
[CCRG08a] Artur Capela, Jaime S. Cardoso, Ana Rebelo, and Carlos Guedes.
Integrated
recognition system for music scores, 2008. International Computer Music Conference (ICMC), Accepted.
[CCRG08b] Jaime S. Cardoso, Artur Capela, Ana Rebelo, and Carlos Guedes. A connected
path approach for sta detection on a music score, 2008.
Conference on Image Processing (ICIP), Accepted.
76
IEEE International
References
[Coü96]
77
Segmentation et reconnaissance de documents guidées par la connaissance a priori: application aux partitions musicales. PhD thesis, Université de
B. Coüasnon.
Rennes, 1996.
[CRCG08]
Artur Capela, Ana Rebelo, Jaime S. Cardoso, and Carlos Guedes. Sta line de-
Proceedings of the International Conference on Signal Processing and Multimedia Applications (SIGMAP 2008), pages
tection and removal with stable paths. In
263270, 2008.
[DDCF07]
Christoph Dalitz, Michael Droettboom, Bastian Czerwinski, and Ichiro Fujigana.
Sta removal toolkit for gamera, 2005-2007.
http://music-staves.
sourceforge.net.
[DDCF08]
Christoph Dalitz, Michael Droettboom, Bastian Czerwinski, and Ichiro Fujigana.
A comparative study of sta removal algorithms.
IEEE Transactions on Pattern
Analysis and Machine Intelligence, 30:753766, 2008.
[Die05]
Reinhard Diestel.
Graph Theory. Graduate Texts in Mathematics. Springer-Verlag,
third edition, 2005.
[eeFc02]
Vojt ech ech Franc and Václav Hlavá c. Multi-class support vector machine. Technical report, 2002.
[FLS05]
Alicia Fornés, Josep Lladós, and Gemma Sánchez. Primitive segmentation in old
handwritten music scores. In Wenyin Liu and Josep Lladós, editors,
3926 of
[Fuj04]
GREC, volume
Lecture Notes in Computer Science, pages 279290. Springer, 2005.
Ichiro Fujinaga.
Sta detection and removal.
In Susan George, editor,
Perception of Music Notation: On-Line and O-Line Recognition,
Visual
pages 139.
Idea Group Inc., 2004.
[Fuk90]
Keinosuke Fukunaga.
Introduction to statistical pattern recognition (2nd ed.). Aca-
demic Press Professional, Inc., San Diego, CA, USA, 1990.
[Gal86]
Zvi Galil.
Ecient algorithms for nding maximum matching in graphs.
Comput. Surv., 18(1):2338, 1986.
[GWE04]
Rafael C. Gonzalez, Richard E. Woods, and Steven L. Eddins.
age processing using MATLAB,
pages 405407.
ACM
Digital Im-
Upper Saddle River,
NJ :
Pearson/Prentice-Hall, 2004.
[Has70]
W. K. Hastings.
applications.
[Hay98]
Monte carlo sampling methods using markov chains and their
Biometrika, 57(1):97109, April 1970.
Simon Haykin.
Neural Networks: A Comprehensive Foundation (2nd Edition).
Prentice Hall, July 1998.
[HL02]
Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support
vector machines.
IEEE Transactions on Neural Networks, 13(2):415425, 2002.
References
78
[JZ97]
Anil K. Jain and Douglas Zongker. Representation and recognition of handwritten
digits using deformable templates.
IEEE Transactions on Pattern Analysis and
Machine Intelligence, 19(12):13861391, 1997.
[JZL96]
Anil K. Jain, Yu Zhong, and Sridhar Lakshmanan.
formable templates.
Object matching using de-
IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 18(3):267278, 1996.
+
[KHB 00]
Tapas Kanungo, Robert M. Haralick, Henry S. Baird, Werner Stuezle, and David
Madigan.
A statistical, nonparametric methodology for document degradation
IEEE Transactions on Pattern Analysis and Machine Intelli-
model validation.
gence, 22(11):12091223, 2000.
[KI90]
H. Kato and S. Inokuchi.
A recognition system for printed piano music using
Proceedings of the International Association
for Pattern Recognition Workshop on Syntactic and Structural Pattern Recognition,
musical knowledge and constraints. In
pages 231248, 1990.
[KPM96]
Gary E. Kopec, Phil A. Chouxerox Parc, and David A. Maltzcarnegie. Markov
source model for printed music decoding.
Journal of Electronic Imaging,
pages
714, 1996.
[LCL93]
I. Leplumey, J. Camillerapp, and G. Lorette. A robust detector for music staves.
Proceedings of the International Conference on Document Analysis and Recognition, pages 902905, 1993.
In
[LS88]
L. Lam and Ching Y. Suen. Structural classication and relaxation matching of
totally unconstrained handwritten zip-code numbers.
Pattern Recogn.,
21(1):19
32, 1988.
[MA98]
E. Mayoraz and E. Alpaydim. Support vector machines for multiclass classication.
Technical report, 1998.
[Mah82]
J. V. Mahoney. Automatic analysis of music score images. B.Sc thesis, 1982.
[Mat85]
T. Matsushima.
vision system).
Automated high speed recognition of printed music (wabot2
Advanced Robotics 1985. ICAR 1985. International Conference
on, pages 477 482, 1985.
[MMM04]
Youichi Mitobe, Hidetoshi Miyao, and Minoru Maruyama.
A fast hmm algo-
rithm based on stroke lengths for on-line recognition of handwritten music scores.
IWFHR '04: Proceedings of the Ninth International Workshop on Frontiers
in Handwriting Recognition, pages 521526, Washington, DC, USA, 2004. IEEE
In
Computer Society.
[MN96]
H. Miyao and Y. Nakano. Note symbol extraction for printed piano scores using
neural networks.
D:548554, 1996.
IEICE TRANSACTIONS on Information and Systems,
E79-
References
[MO07]
79
Hidetoshi Miyao and Masayuki Okamoto. Stave extraction for printed music scores
using DP matching.
Journal of Advanced Computational Intelligence and Intelli-
gent Informatics, 8:208215, 2007.
[MP88]
Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent
in nervous activity. pages 1527, 1988.
[Ng95]
Kia Ng.
Automated computer recognition of music score.
PhD thesis, University
of Leeds, 1995.
[Nis95]
Hirobumi Nishida. A structural model of shape deformation.
Pattern Recognition,
28(10):16111620, 1995.
[Pre70]
D. Prerau.
Computer pattern recognition of standard engraved music notation.
PhD thesis, Massachusetts Institute of Technology, 1970.
[Pru66]
D. Pruslin.
Automatic recognition of sheet music.
PhD thesis, Massachusetts
Institute of Technology, 1966.
[Pug06]
Laurent Pugin. Optical music recognition of early typographic prints using hidden
markov models. In
[RB05]
ISMIR, pages 5356, 2006.
F. Rossant and I. Bloch. Optical music recognition based on a fuzzy modeling of
symbol classes and music writing rules.
Image Processing, 2005. ICIP 2005. IEEE
International Conference on, 2:II53841, Sept. 2005.
+
[RCC 07]
Ana Rebelo, Artur Capela, Joaquim F. Pinto da Costa, Carlos Guedes, Eurico Carrapatoso, and Jaime S. Cardoso. A shortest path approach for sta line detection.
Automated Production of Cross Media Content for Multi-Channel Distribution,
2007. AXMEDIS '07. Third International Conference on, pages 7985, Nov. 2007.
+
[RCF 93]
R. Randriamahefa, J.P. Cocquerez, C. Fluhr, F. Pepin, and S. Philipp. Printed
Document Analysis and Recognition, 1993., Proceedings of the
Second International Conference on, pages 898901, Oct 1993.
music recognition.
[Rea69]
Gardner Read.
Music Notation: A Manual of Modern Practice (2nd ed.).
Ta-
plinger, New York, 1969.
[RHW88]
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. pages 673695, 1988.
[Ros02]
Florence Rossant. A global method for music symbol recognition in typeset music
sheets.
[RP96]
Pattern Recognition Letters, 23(10):11291141, 2002.
K.T. Reed and J.R. Parker.
Automatic computer recognition of printed music.
Pattern Recognition, 1996., Proceedings of the 13th International Conference on,
3:803807 vol.3, Aug 1996.
References
80
[RT88]
J. W. Roach and J. E. Tatem. Using domain knowledge in low-level visual processing to interpret handwritten music: an experiment.
Pattern Recognition, 21(1):33
44, 1988.
[SDV06]
C.H. Papadimitriou S. Dasgupta and U.V. Vazirani.
Algorithms,
pages 169179.
McGraw-Hill Higher Education, 2006.
[Szw05]
Mariusz Szwoch. A robust detector for distorted music staves. In
Computer Anal-
ysis of Images and Patterns, pages 701708. Springer-Verlag, Heidelberg, 2005.
[TK03]
Sergios Theodoridis and Konstantinos Koutroumbas.
Pattern recognition, chapter
4.6. Academic Press, second edition, 2003.
[TMD99]
Michael Thulke, Volker Märgner, and Andreas Dengel.
A general approach to
DAS '98: Selected Papers
from the Third IAPR Workshop on Document Analysis Systems, pages 4357,
quality evaluation of document segmentation results. In
London, UK, 1999. Springer-Verlag.
[TSM06]
Fubito Toyama, Kenji Shoji, and Juichi Miyamichi. Symbol recognition of printed
piano scores with touching symbols. pages 480483, 2006.
[Vap98]
Vladimir N. Vapnik.
Statistical Learning Theory.
Wiley-Interscience, September
1998.
[Wak94]
T. Wakahara. Shape matching using lat and its application to handwritten numeral
recognition.
[WFJC01]
IEEE Trans. Pattern Anal. Mach. Intell., 16(6):618629, 1994.
Yuan-Kai Wang, Kuo-Chin Fan, Yau-Tarng Juang, and Tai-Hong Chen.
hidden markov model for chinese business card recognition.
In
ICIP (1),
Using
pages
11061109, 2001.
[WH88]
Bernard Widrow and Marcian E. Ho. Adaptive switching circuits. pages 123134,
1988.
Part V
Appendix
81
Appendix A
Fundamentals
A.1 Primal Problem vs Dual Problem
The primal problem for the case of nonseparable training classes
space is given by
{(xi , di )}N
i=1
in the feature
N
X
1 T
w w+C
ξi
2
min
w,b,ξi
(A.1)
i=1
di (wT xi + b) ≥ 1 − ξi
s.a
where the non negatives slack variables,
{ξi }N
i=1 ,
i = 1, 2, ..., N
measure the deviation of a data point from
the ideal condition of pattern separability and the parameter
C
controls the relation between
the complexity of the classier and the number of nonseparable points. It is possible to nd the
solution of the quadratic problem by determining the saddle point of the Lagrangian function
that is given by:
J(w, b, ξ, α, µ) =
N
N
N
h
i X
X
X
1 T
w w+C
ξi −
αi di (wT xi + b) − (1 − ξi ) −
µi ξ i
2
i=1
i=1
i=1
w, b
which has to be minimized with respect to
respect to the nonnegative multipliers
J(w, b, ξ, α, µ)
αi
µi .
e
and
ξi ;
(A.2)
it also has to be maximized with
The parameters that minimize the function
meet the following conditions:
N
X
∂J
=w−
αi di xi = 0
∂w
(A.3)
i=1
N
X
∂J
=−
αi di = 0
∂b
(A.4)
i=1
∂J
= C − µi − αi = 0 i = 1, 2, . . . , N
∂ξi
(A.5)
From these conditions we have
w=
N
X
i=1
83
αi di xi
(A.6)
Appendix A. Fundamentals
84
N
X
αi di = 0
(A.7)
i=1
µi + α i = C
i = 1, 2, . . . , N
(A.8)
If we expand the equation A.2, term by term, we have:
J(w, b, ξ, α, µ) =
N
N
N
N
N
N
X
X
X
X
X
X
1 T
w w+C
ξi −
αi di wT xi + b
αi di +
αi −
αi ξi −
µi ξi
2
i=1
i=1
i=1
i=1
i=1
i=1
(A.9)
The fourth term in the right-hand side is zero by the optimality condition A.7 obtained above.
Besides that, from the condition A.6 we have
N
X
wT w =
N
αi di wT xi =
i=1
N
1 XX
αi αj di dj xTi xj
2
i=1 j=1
Replacing the condition A.8 and the result obtained above in the equation A.9, we obtain the
following dual objective function Lagrangiana:
JD =
N
X
i=1
N
αi −
N
1 XX
αi αj di dj xTi xj
2
(A.10)
i=1 j=1
which gives the inferior limit of the function objective A.1 to any admissible point. In doing
so, it is possible to formulate the dual problem for a non separable training sample{(xi , di )}i=1
N
as follows:
max
α
N
X
i=1
s.a
N
αi −
w=
PN
PN
N
1 XX
αi αj di dj xTi xj
2
i=1 j=1
i=1 αi di xi
i=1 αi di = 0
(A.11)
C − µi − αi = 0 i = 1, 2, ..., N
αi ≥ 0,
µi ≥ 0 i = 1, 2, ..., N
Adding the complementary conditions of the Karush-Kuhn-Tucker
αi di (wT xi + b) − (1 − ξi ) = 0
µi ξi = 0
to the conditions A.6A.8 for
follows:
αi di (wT xi + b) − (1 − ξi ) ≥ 0
i = 1, . . . , N
(A.12)
(A.13)
(A.14)
it is possible reformulate the dual problem A.11 as
A.2. Error Backpropagation Algorithm
max
α
s.a
N
X
i=1
85
N
αi −
PN
N
1 XX
αi αj di dj xTi xj
2
i=1 j=1
(A.15)
i=1 αi di = 0
0 ≤ αi ≤ C
i = 1, 2, ..., N
A.2 Error Backpropagation Algorithm
The error backpropagation method is a neural network train without feedback.
This tech-
nique requires a dierentiable transfer function to be possible the minimization of the cost
function [TK03, Hay98].
Since the multilayer perceptron architectures apply the activation
function degree:

1,
f (x) =
0,
se
x>0
se
x<0
(A.16)
which is discontinues in the point 0, this algorithm uses the family of continuous dierentiable
functions which is the family of sigmoid function:
f (x) =
1
, where a > 0
1 + exp(−ax)
is the slope parameter
(A.17)
A.2.1 Mathematic explanation
The learning method of using the error backpropagation uses the gradient descending technique
to minimize the cost function. This function expresses the mean square error dened as follows:
E(i) =
kL
N
X
1X
i=1
where
ym (i)
and
ŷm (i)
the output neuron
2
m=1
(ym (i) − ŷm (i))2
(A.18)
represent respectively the output obtained and the desired output for
m. kL
is the total number of the outputs neurons and
of training samples available (y(i),
x(i)).
N
is the total number
The updating of the weights of the edges is made
from
wjr (new) = wjr (old) − µ
∂E(i)
∂wjr
wjr (old) is the current estimate of the unknow weights,
∂E(i)
∂wjr is the gradient of error function.
where
Let
(A.19)
µ
is the learning coecients and
υjr be the sum of the input weights in the j th neuron in the rth layer and yjr the corresponding
output after the activation function. If
in the
r−
f (.)
j th
neuron in the
rth
of the latter neuron is
layer,j
neuron, k = 1, 2, ..., kr−1 ,
r
wjk is the current estimate of the corresponding
is the output of the
1th layer for the ith training pair and
weight in the
function
ykr−1
= 1, 2, ..., kr ,
k th
then the argument of the activation
Appendix A. Fundamentals
86
kr−1
υjr (i) =
X
kr−1
X
r r−1
r
wjk
yk (i) + wj0
=
k=1
r
where y0 (i)
= 1, ∀r, i.
(A.20)
k=0
For the output layer, we have
and for the network input we have
r r−1
wjk
yk (i)
r=
1, ykr (i)
r = L, ykr (i) = ŷk (i),
= xk (i),
with
with
k = 1, 2, ..., k0 .
k = 1, 2, ..., kL
By the chain rule
in dierentiation, we have
∂E(i)
∂E(i) ∂υjr (i)
=
∂wjr
∂υjr (i) ∂wjr
(A.21)
from A.20 we obtain


∂ r

υ
(i)
≡

∂wjr j

Let us dene,
∂
r
r υj (i)
∂wj0
.
.
.
∂
r
∂wjk
r−1
υjr (i)



 = y r−1 (i)

(A.22)
∂E(i)
= δjr (i)
∂υjr (i)
(A.23)
then A.19 becomes
wjr (new)
=
wjr (old)
−µ
N
X
δjr (i)y r−1 (i)
(A.24)
i=1
Summarizing the error backpropagation algorithm:
1.
Inicialization :
2.
Forward computations :
Initialize all the weights with small random values.
r
pute υj (i) and
yjr (i)
For each of the training feature vectors
=
f (υjr (i)), for
x(i), i = 1, 2, ..., N
j = 1, 2, ..., kr , r = 1, 2, ..., L
com-
from the sigmoid
f . Compute the cost function for the current estimate of weights from E(i) =
PkL
1 PkL
1 PkL
2
2
L
2
m=1 (ym (i) − ŷm (i)) and E(i) ≡ 2
m=1 em (i) ≡ 2
m=1 (f (vm (i)) − ym (i)) .
function
PN
1
i=1 2
3.
Backward computations :
δjL (i)
for
r
For each
i = 1, 2, ..., N
and
j = 1, 2, ..., kr
compute
Pkr
δjL (i)
from
r f 0 (υ r−1 (i)),
= ej (i)f 0 (υjL (i)) and in the end compute δjr−1 (i) from δjr−1 (i) = k=1 δjr (i)wkj
j
= L, L − 1, ..., 2 and j = 1, 2, ..., kr ; f 0 (.) represents the dierentiation with respect
to argument.
4.
Update the weights :
wjr (new)
=
wjr (old)
r = 1, 2, ..., L and j = 1, 2, ..., kr
PN r
− µ i=1 δj (i)y r−1 (i).
For
A.2.2 Graphic explanation
To facilitate the layout of this method we will only consider a neural network with three layers
see Figure A.1. Each neuron is composed by the sum of products of the weights with the
input signals and the sigmoid activation function. The signal
e
represents the output signal of
A.2. Error Backpropagation Algorithm
87
Figure A.1: Neural network architecture with three layers, two inputs and one output.
the sum and
y = f (e)
represents the output signal of the neuron.
To prepare the network for the classication it is necessary to train a data set constituted
by the input signals (x1 and
x2)
associated with their targets
z.
This training is an interac-
tive process, because for each interaction the node weights are modied using a new training
dataset, in order to minimize the cost function. This modication is computed using the error
backpropagation algorithm. In doing so, each learning step starts by forcing both input signals
of the training set to determine the values of the output signals for each neuron in each network
layer. Figure A.2 shows how the signal propagates through the network. The symbols
xm
are the weights of the edges between the network input
the symbols
yn
represent the output signal in the neuron
In the next step, the output signal
y
and the neuron
w(xm) n
n in the input layer;
n.
is compared with the value of the desired output (z ) see Figure A.3. The dierence is called
error signal d
of the neuron output layer. Since the
output values of the neurons in the hidden layers are unknown, it is possible to compute the
error signal. In this manner, the aim is to propagate the error signal
the neurons - see Figure A.4. The value of the weights
wmn
d in backward mode to all
is equal to the value of the weights
to calculate the output value. Only the direction of the sense of the data is changed, that is,
the signals propagate itself, for all layers of network, from the output to the input, one behind
the others.
When the error signal for each neuron has been computed, the weights coecients for each
1
input neuron are modied see Figura A.5 .
1 df (e)
de
represents the derivative of the activation function of the neuron whose weights are modied.
Appendix A. Fundamentals
88
(a) Forward propagation of the signal in the input
layer.
(b) Forward propagation of the signal in the input
layer.
[
(c) Forward propagation of the signal in the input (d) Forward propagation of the signal in the hidden
layer.
layer.
(e) Forward propagation of the signal in the hidden
layer.
Figure A.2:
(f) Forward propagation of the signal in the output
layer.
Forward propagation of the signal in the neural network.
Figure A.3: Comparison between the signal output and the target.
A.2. Error Backpropagation Algorithm
89
(a) Backward propagation of the signal in the output
layer.
(b) Backward propagation of the signal in the output
layer.
(c) Backward propagation of the signal in the hidden
layer.
(d) Backward propagation of the signal in the hidden
layer.
(e) Backward propagation of the signal in the hidden
layer.
Figure A.4:
Propagation of the error signal
d
in backward mode in the neural network.
Appendix A. Fundamentals
90
(a) The calculation of weight in the input layer.
(b) The calculation of weight in the input layer.
(c) The calculation of weight in the input layer.
(d) The calculation of weight in the hidden layer.
(e) The calculation of weight in the hidden layer.
(f) The calculation of weight in the output layer.
Figure A.5: The calculation of weights in the neural network.
A.3. Dalitz's Algorithm
91
A.3 Dalitz's Algorithm
Dalitz algorithms is a generalization of the method described in [MO07]. It stars by estimating
the values for the sta line thickness and the sta space height. These values are estimated by
the technique used in Fujinaga [Fuj04]: the most frequent black-runs represents the sta line
staine_height ) and the most common white-runs represents the vertical line distance
within the same sta (staspace_height ). This technique starts by computing the vertical runheight (
lengths representation of the image.
The process of Dalitz for nding the sta lines operates on a set of stasegments and requires
methods not only for linking two of those stasegments horizontally and vertically, but also for
merging two segments with overlapping positions into one. In the following steps the algorithm
will be described:
1. Add vertical links between stasegments with a vertical distance around
staine_height +staspace_height.
2. Add horizontal links between adjacent stasegments possibly belonging to the same
sta line.
3. Partition the resulting graph into connected subgraphs; each subgraph that is wide and
high enough corresponds to a sta.
4. All stasegments within a system are labelled as belonging to a certain sta line. Segments of the same line at the same horizontal position are merged into one segment.
5. Due to ledger lines, ties and beams, some subgraphs will contain too many sta lines. To
reduce them to a predened number of lines per sta (typically ve for modern notation,
four for chant and six for tablature), the outer sta lines of each sta are subsequently
removed until the predened number of sta lines remains.
Dalitz developed the following algorithm to nd the stasegments:
1. Extract horizontal runs with more than 60 percent black pixels within a window of width
staspace_height.
2. The resulting laments are vertically thinned by replacing each vertical black run with
its middle pixel. For black runs higher than
2*staine_height,
more than one skeleton
point is extracted.
3. The resulting skeleton segments wider than
2*staine_height
are the stasegments.
A.4 Matching in Bipartite Graphs
Before moving to the denition of the Kuhn-Munkres Algorithm it is necessary to give some
denitions. Let
has weight
and
G = (X ∪ Y, X × Y )
w(xy). X
and
Y
be a weighted complete bipartite graph, where edge
have the same size,
n,
and can be written as
Y = {y1 , y2 , . . . , yn }.
Denition 1:
A
feasible vertex labelling
l(y) = w(xy) for all x ∈ X
and
y∈Y
for
N
is a function
X = {x1 , x2 , . . . , xn }
l : V (N ) → Z
. We dene the size of l, by
xy
such that
size(l) =
P
l(x) +
v∈V (N ) l(v).
Appendix A. Fundamentals
92
Denition 2:
N
Suppose
w(e).
P
Let
l
be a feasible vertex labelling of
N.
equality subgraph Gl
Then the
l
in
l(x) + l(y) = w(xy).
is the spanning subgraph of
N
P
G = (X∪Y, X×Y ) by giving each edge e an integer weight
is a network obtained from
containing all edges
xy
for
for which
The algorithm iteratively constructs a sequence of feasible vertex labelling
such that
size(li+1 ) < size(li ),
and a sequence of matchings
matching in the equality subgraph
labelling li for which
Mi
G(li ),
i ≥ 1.
for all
is a perfect matching in
Mi
1. If
M
matched
2. If
G,
is complete for
x ∈ X.
JGl (S) 6= T ,
Set
Gl ,
then
S = {x}
where
Tc
Note that
by
min
x∈S,y∈T c
denotes the complement of
αl > 0
S ∪ {z}
and
y
in
and
T
Stop.
JGl (S) = T .
JG0l (S) 6= T .
T
in
Y,
Replace
JGl (S), not in T .
T ∪ {y},
If
y
l
by
l0
and
Gl
by
l0
by
Gl 0 .
M , say with z ∈ X , replace
and go to step 2. Otherwise, there will be an
0
by M and go to step 1.
M
Find
is matched in
x
Replace
y,
Gl .
and construct a new labeling
M 0 in
to
in
Otherwise, there is some un-



l(v) − αl for v ∈ S


0
l (v) =
l(v) + αl for v ∈ T




l(v) otherwise
by
M
{l(x) + l(y) − w(xy}
alternating path from
Gl .
Start with an arbitrary feasi-
T = ∅.
go to step 3. Otherwise,
3. Choose a vertex
S
is optimal.
and
αl =
is a maximum
G(li ).
and choose an arbitrary matching
M
Mi
for
It stops when it nds a feasible vertex
The Kuhn-Munkres Algorithm (or Hungarian method)
ble vertex labeling l, determine
such that
l1 , l2 , . . .
M
and we may use this path to nd a larger matching
A.5 Otsu Threshold Algorithm
The segmentation process divides the image into sub- homogeneous regions. Homogeneity is
the existence of compliance of some features as intensity of colour or levels of gray. The Otsu
algorithm is used to nd the parameter of intensity T that divides the initial image into two
parts - classes of white pixels and classes of black pixels. This method considers that any point
that presents an intensity equal or greater than T belongs to the region of interest, and all
the others are considered background of the image.
In this manner, a new image
generated as follows:
g(x, y) =
(
1 se f (x, y) ≥ T
0 se f (x, y) < T
g(x, y)
is
A.5. Otsu Threshold Algorithm
f (x, y)
where
93
is the original image. After the segmentation, the matrix of the new image will
be constituted by 0's black pixels and 1's white pixels [GWE04].
Otsu considers the image normalized histogram as a function of density of discrete probability,
as follows:
pr (rq ) =
where
rq
nq
,
n
q = 0, 1, 2, . . . , L − 1
n is the total number of pixels in the image, nq
and
L
is the number of pixels that have intensity
is the total number of possible levels of intensity in the image.
chooses the threshold
k,
such that
C1 = [k, k + 1, . . . , L − 1],
k
is the intensity level where
that maximize the
2
= ω0 (µ0 − µT )2 + ω1 (µ1 − µT )2
σB
where,
ω0
=
ω1
=
µ0
=
µ1
=
µT
=
C0 = [0, 1, . . . , k − 1]
variance between the classes
as
Pk−1
q=0
PL−1
q=k
Pk−1
q=0
PL−1
q=k
PL−1
q=0
pq (rq )
pq (rq )
qpq (rq )/ω0
qpq (rq )/ω1
qpq (rq )
The Otsu method
and
2 , that is dened
σB
Appendix A. Fundamentals
94
A.6 Sta Line Removal Algorithm
staffLineRemoval(IMAGE,STAVES)
{
threshold = 2*staffHeight;
tolerance = 1+ceil(staffHeight/3.0);
IMAGE\_REMOVE = copy(IMAGE);
For(nvalid = 0 to STAVES size)
{
Point2D staff = validStaves[nvalid];
For (i = 0 to staff size)
{
col = staff[i].x;
refRow = staff[i].y;
row = refRow;
pel = valuePixel(IMAGE, IMAGE\_REMOVE);
decrement/increase the reference row until one pixel
different from white pixel (dist1/dist2) is found;
If ( dist1 <= max(1, min (dist2, tolerance)) )
{
refRow-=dist1;
Else
If ( dist2 <= max(1, min (dist1, tolerance)) )
refRow+=dist2;
Else
continue;
}
Count the number of decrements/increase on the reference row
until the black pixel changes to white pixel (run);
If ( run >= threshold )
continue;
remove the vertical black sequences on the IMAGE;
}
}
}
Listing 3: Sta lines removal algorithm.
A.7 Baum-Welch Algorithm
The Baum-Welch algorithm or the Forward-Backward procedure consider the forward variable
αt (i)
dened as
αt (i) = P (o1 o2 . . . t , qt = Si |λ)
that is, the probability of the partial observation sequence,
given the model
λ
and the backward variable
βt (i)
o1 o2 . . . t ,
at time
t,
to the end, given state
Si
and state
Si
dened as
βt (i) = P (ot+1 ot+2 . . . T , qt = Si , λ)
that is, the probability of the partial observation sequence from
at time
t
and the model
λ.
t+1
Besides these variables the Baum-Welch algorithm also needs two
more auxiliary variables that can be expressed in terms of the forward and backward variables:
ξ(i, j) = P (qt = Si , qt+1 = Sj |o, λ)
A.7. Baum-Welch Algorithm
95
that is, the probability of being in state
Si
at time t, and state
Sj
at time
t + 1, given the model
and the observation sequence. This is the same as,
ξ(i, j) =
P (qt = Si , qt+1 = Sj , o|λ)
P (o|λ)
Using forward and backward variables this can be expressed as,
αt (i)aij bj (ot+1 )βt+1 (j)
ξ(i, j) = PN PN
i=1
j=1 αt iaij (i)bj (ot+1 )βt+1 (j)
(A.25)
The second variable is the a posteriori probability,
N
X
γt (i) =
ξt (i, j)
(A.26)
j=1
In forward and backward variables this can be expressed by,
Assuming a starting model
αt (i)βt (i)
γt (i) = PN
i=1 αt (i)βt (i)
λ = (A; B; π),
this algorithm starts by computing the
α's
and
β 's
as follows. The initialization of the forward probabilities as the joint probability of state
and initial observation
o1 :
α1 (i) = πi bi (o1 ),
the next step consists of the product over all the
αt+1 (j) =
N
X
1≤i≤N
N
[αt (j)aij ] bj (ot+1 ),
This result in the probability of
αt+1 (j)
Si
at time
possible states
t+1
at time
t:
1≤j≤N
with all the accompanying previous partial
is obtained by accounting for observation
multiplying the summed quantity by the probability
by the initialization of the
Si , 1 ≤ i ≤ N
1 ≤ t ≤ T − 1,
i=1
observations. The
Si
bj (ot+1 ).The
ot+1
in state
j,
that is, by
backward procedure starts
βT (i):
βT (i) = 1,
1≤i≤N
and the calculation of
βt (j) =
N
X
aij bj (ot+1 )βt+1 (j),
j=1
Then the algorithm computes the
1 ≤ i ≤ N,
t = T − 1, T − 2, . . . , 1.
ξ 's and γ 's using Equations A.25 and A.26, respectively.
The
next step is to update the HMM parameters according to
π̄i = γ1 (i),
PT −1
ξt (i, j)
āij = Pt=1
,
T −1
t=1 γt (i)
PT
t=1 γt (j)
s.to =v
b̄j (k) = PTt k
,
t=1 γt (j)
1≤i≤N
(A.27)
1 ≤ i ≤ N, 1 ≤ j ≤ N
(A.28)
1 ≤ j ≤ N,
(A.29)
1≤k≤M
Appendix A. Fundamentals
96
A.8 Viterbi Algorithm
Let
δt (i) = maxq1 ,q2 ,...,qt−1 P [q1 , q2 , . . . , qt = i, o1 o2 . . .t |λ] the highest probability along a single
path, at time
t,
which accounts for the rst
we have
t
observations and ends in state
Si .
By induction
δt+1 (j) = max δt (i)aij bj (Ot+1 )
The array
ψt (i)
has the track of
δt+1 (j)
i
for each
t
and
j.
The procedure is described below:
1. Initialization:
δ1 (i) = πi bi (o1 ),
1≤i≤N
ψ1 (i) = 0
2. Recursion:
δt (j) = max [δt−1 (i)aij ] bj (ot ),
1≤i≤N
ψt (j) = arg max [δt−1 (i)aij ] ,
1≤i≤N
2 ≤ t ≤ T, 1 ≤ j ≤ N
2 ≤ t ≤ T, 1 ≤ j ≤ N
3. Termination:
P ∗ = max [δT (i)]
1≤i≤N
qT∗
= arg max [δT (i)]
1≤i≤N
4. Path backtracking:
qT∗ = ψt+1 (qt+1 )∗ ,
t = T − 1, T − 2, . . . , 1
Appendix B
Table of confusion
B.1 Results Obtained Without Elastic Matching for the
Handwritten Music Symbols
Accent
BassClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
RestI
RestII
Sharp
Staccatissimo
TrebleClef
Unknown
47
0
0
0
0
0
0
0
0
0
0
0
0
0
BassClef
0
6
0
0
0
0
0
0
0
0
0
0
0
0
Beam
1
0
107
0
0
0
0
0
0
0
0
0
0
1
Flat
0
0
0
57
0
0
0
0
0
0
0
0
0
0
Natural
0
0
0
0
79
0
0
0
0
0
0
0
0
0
Note
0
0
0
1
0
113
2
0
0
0
0
0
0
0
NoteFlag
0
0
0
1
0
4
25
0
0
0
0
0
0
0
NoteOpen
0
0
0
0
0
1
0
4
0
0
0
0
0
1
RestI
0
0
0
0
0
0
0
0
33
0
0
0
0
0
RestII
0
0
0
0
0
0
0
0
0
100
0
0
0
0
Sharp
0
0
0
1
0
1
0
0
0
0
83
0
0
1
Staccatissimo
0
0
0
0
0
0
0
0
0
0
0
5
0
0
TrebleClef
0
0
0
0
0
0
0
0
0
0
0
0
24
0
Unknown
0
0
4
3
3
6
4
1
0
2
1
0
0
77
Accent
Table B.1: Table of confusion of the nearest neighbor classier.
Accent
BassClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
RestI
RestII
Sharp
Staccatissimo
TrebleClef
Unknown
42
0
0
1
0
0
2
0
0
0
0
0
0
2
BassClef
0
0
0
0
0
3
0
0
1
0
0
0
0
2
Beam
1
0
94
0
1
1
1
0
0
0
1
0
0
10
Flat
2
0
0
45
0
0
3
0
0
0
0
0
0
7
Natural
1
0
0
0
66
3
0
0
0
0
0
0
0
9
Note
0
0
0
1
0
98
3
0
1
2
1
0
1
9
NoteFlag
0
0
0
2
0
4
21
0
1
0
0
0
0
2
NoteOpen
0
0
0
0
0
5
0
0
0
0
0
0
0
1
RestI
0
0
0
0
0
1
0
0
26
2
2
0
0
2
RestII
0
0
0
0
0
1
0
0
0
99
0
0
0
0
Sharp
0
0
0
2
0
0
4
0
3
0
68
0
0
9
Staccatissimo
0
0
0
0
0
0
0
0
0
0
2
3
0
0
TrebleClef
0
0
1
0
0
3
0
0
0
0
1
0
14
5
Unknown
0
0
4
2
0
17
9
0
4
3
3
0
0
59
Accent
Table B.2: Table of confusion of the neural network.
97
Appendix B. Table of confusion
98
Accent
BassClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
RestI
RestII
Sharp
Staccatissimo
TrebleClef
Unknown
42
0
0
0
0
2
0
0
0
0
0
0
0
3
BassClef
0
6
0
0
0
0
0
0
0
0
0
0
0
0
Beam
0
0
105
0
0
0
0
0
0
0
0
0
0
4
Flat
0
0
0
57
0
0
0
0
0
0
0
0
0
0
Natural
0
0
0
0
79
0
0
0
0
0
0
0
0
0
Note
0
0
0
0
0
111
0
0
0
0
0
0
0
5
NoteFlag
0
0
0
0
0
0
28
0
0
0
0
0
0
2
NoteOpen
0
0
0
0
0
1
0
3
0
0
0
0
0
2
RestI
0
0
0
0
0
0
0
0
32
0
0
0
0
1
RestII
0
0
0
0
0
0
0
0
0
100
0
0
0
0
Sharp
0
0
0
0
0
0
0
0
0
0
85
0
0
1
Staccatissimo
0
0
0
0
0
0
0
0
0
0
0
5
0
0
TrebleClef
0
0
0
0
0
0
0
0
0
0
0
0
20
4
Unknown
0
0
2
2
1
1
1
0
0
0
0
0
0
94
Accent
Table B.3: Table of confusion of the support vector machines.
Accent
BassClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
RestI
RestII
Sharp
Staccatissimo
TrebleClef
Unknown
46
0
0
0
0
1
0
0
0
0
0
0
0
1
BassClef
0
5
0
0
0
2
0
0
0
0
0
0
0
2
Beam
3
0
102
0
0
3
0
0
0
0
0
0
0
3
Flat
0
0
0
48
11
0
0
0
0
0
0
0
0
0
Natural
0
0
0
1
78
1
0
0
0
0
0
0
0
0
Note
0
0
4
1
12
82
0
0
1
4
3
0
0
11
NoteFlag
3
0
0
0
2
2
20
0
0
1
1
0
0
3
NoteOpen
0
0
0
0
3
2
2
2
0
0
0
0
0
0
RestI
0
0
0
0
0
0
0
0
36
0
0
0
0
0
RestII
2
0
0
2
2
0
0
0
0
94
0
0
0
1
Sharp
0
0
0
0
10
1
0
0
1
1
75
0
0
0
Staccatissimo
0
0
0
0
0
0
0
0
0
0
0
6
0
0
TrebleClef
0
0
0
0
0
0
0
0
0
0
0
0
26
1
Unknown
2
0
17
2
7
23
1
0
1
0
3
0
4
41
Accent
Table B.4: Table of confusion of the hidden Markov model.
B.2. Results Obtained Without Elastic Matching for the Printed Music Symbols
99
B.2 Results Obtained Without Elastic Matching for the
Printed Music Symbols
AltoClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
TieSlur
RestI
RestII
Sharp
Time
TrebleClef
Unknown
50
0
0
0
0
0
0
0
0
0
0
0
0
0
Beam
0
73
0
0
0
0
0
0
0
0
0
0
0
0
Flat
0
0
38
0
0
0
0
0
0
0
0
0
0
0
Natural
0
0
0
31
0
0
0
0
0
0
0
0
0
0
Note
0
0
0
0
75
1
0
0
0
0
0
0
0
0
NoteFlag
0
0
0
0
1
28
0
1
0
0
0
0
0
0
NoteOpen
0
0
0
0
0
0
77
0
0
0
0
0
0
0
TieSlur
0
0
0
0
0
0
0
28
0
0
0
0
0
2
RestI
0
0
0
0
0
0
0
0
16
0
0
0
0
0
RestII
0
0
0
0
0
0
0
0
0
15
0
0
0
0
Sharp
0
0
0
0
0
0
0
0
0
0
80
0
0
0
Time
0
0
0
0
0
0
0
0
0
0
0
3
0
0
TrebleClef
0
0
0
0
0
0
0
0
0
0
0
0
30
0
Unknown
0
0
0
0
1
0
3
7
0
2
0
0
0
63
AltoClef
Table B.5: Table of confusion of the nearest neighbor classier.
AltoClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
TieSlur
RestI
RestII
Sharp
Time
TrebleClef
Unknown
47
3
0
0
0
0
0
0
0
0
0
0
0
0
Beam
1
68
0
0
0
0
2
0
0
0
0
0
0
2
Flat
0
0
38
0
0
0
0
0
0
0
0
0
0
0
Natural
0
0
0
31
0
0
0
0
0
0
0
0
0
0
Note
0
3
0
0
67
1
1
0
0
0
0
0
0
4
NoteFlag
0
0
0
0
0
24
0
0
0
0
0
0
0
1
NoteOpen
0
0
0
0
4
1
68
1
0
0
0
0
0
1
TieSlur
0
0
0
0
4
1
0
19
0
0
0
0
0
6
RestI
0
0
0
0
0
0
0
0
16
0
0
0
0
0
RestII
0
0
0
0
1
0
0
1
0
9
0
2
0
2
Sharp
0
0
0
0
0
0
0
0
0
0
79
0
0
1
Time
0
0
0
0
0
0
0
1
0
0
0
0
0
2
TrebleClef
0
0
0
0
0
0
0
0
0
0
0
0
28
2
Unknown
3
0
0
0
4
4
5
7
0
1
0
1
0
51
AltoClef
Table B.6: Table of confusion of the neural network.
AltoClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
TieSlur
RestI
RestII
Sharp
Time
TrebleClef
Unknown
50
0
0
0
0
0
0
0
0
0
0
0
0
0
Beam
0
73
0
0
0
0
0
0
0
0
0
0
0
0
Flat
0
0
37
0
0
0
0
0
0
0
0
0
0
1
Natural
0
0
0
31
0
0
0
0
0
0
0
0
0
0
Note
0
0
0
0
76
0
0
0
0
0
0
0
0
0
NoteFlag
0
0
0
0
0
27
0
0
0
0
0
0
0
3
NoteOpen
0
0
0
0
0
0
77
0
0
0
0
0
0
0
TieSlur
0
0
0
0
0
0
0
26
0
0
0
0
0
4
RestI
0
0
0
0
0
0
0
0
16
0
0
0
0
0
RestII
0
0
0
0
0
0
0
0
0
15
0
0
0
0
Sharp
0
0
0
0
0
0
0
0
0
0
80
0
0
0
Time
0
0
0
0
0
0
0
0
0
0
0
3
0
0
TrebleClef
0
0
0
0
0
0
0
0
0
0
0
0
30
0
Unknown
0
0
0
0
0
0
1
1
0
0
0
0
0
74
AltoClef
Table B.7: Table of confusion of the support vector machines.
Appendix B. Table of confusion
100
AltoClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
TieSlur
RestI
RestII
Sharp
Time
TrebleClef
Unknown
39
0
0
0
0
0
0
0
0
0
0
0
0
11
Beam
0
71
0
0
0
0
0
0
0
0
0
0
0
2
Flat
3
0
38
0
0
0
0
0
0
0
0
0
0
0
Natural
0
0
0
31
0
0
0
0
0
0
0
0
0
0
Note
0
0
0
0
71
0
0
0
0
0
0
0
0
5
NoteFlag
0
0
0
0
1
19
2
0
0
1
2
0
0
5
NoteOpen
2
0
0
0
0
1
68
0
0
0
0
0
0
6
TieSlur
0
0
1
0
0
0
0
25
0
0
0
0
0
4
RestI
0
0
0
0
0
0
0
0
16
0
0
0
0
0
RestII
0
0
0
0
0
0
0
0
0
13
0
0
0
2
Sharp
0
0
0
0
0
0
0
0
0
0
80
0
0
0
Time
0
0
0
0
0
0
0
0
0
0
0
1
0
2
TrebleClef
0
0
0
0
0
4
0
0
0
0
0
0
11
15
Unknown
1
4
0
0
6
0
3
6
0
0
0
0
0
56
AltoClef
Table B.8: Table of confusion of the hidden Markov Model.
B.3. Results Obtained With Elastic Matching for the Handwritten Music Symbols
101
B.3 Results Obtained With Elastic Matching for the
Handwritten Music Symbols
Accent
BassClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
RestI
RestII
Sharp
Staccatissimo
TrebleClef
Unknown
Accent
47
0
0
0
0
0
0
0
0
0
0
0
0
0
BassClef
0
6
0
0
0
0
0
0
0
0
0
0
0
0
Beam
2
0
104
0
0
0
0
0
1
0
0
0
0
2
Flat
0
0
0
57
0
0
0
0
0
0
0
0
0
0
Natural
0
0
0
0
79
0
0
0
0
0
0
0
0
0
Note
0
0
0
0
0
112
1
1
0
0
0
0
0
2
NoteFlag
0
0
0
0
0
4
26
0
0
0
0
0
0
0
NoteOpen
0
0
0
0
0
2
0
4
0
0
0
0
0
0
RestI
0
0
0
0
0
0
0
0
33
0
0
0
0
0
RestII
0
0
0
0
0
0
0
0
0
100
0
0
0
0
Sharp
0
0
0
0
1
0
0
0
0
0
85
0
0
0
Staccatissimo
0
0
0
0
0
0
0
0
0
0
0
5
0
0
TrebleClef
0
0
0
0
0
0
0
0
0
0
0
0
24
0
Unknown
1
0
7
14
0
7
2
0
1
2
3
0
0
64
Table B.9: Table of confusion of the nearest neighbor classier.
Accent
BassClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
RestI
RestII
Sharp
Staccatissimo
TrebleClef
Unknown
43
0
0
1
1
0
0
0
1
0
0
0
1
0
BassClef
0
0
0
0
0
3
1
0
0
0
0
0
0
2
Beam
0
0
94
0
0
3
0
0
0
0
7
0
0
5
Flat
1
0
0
41
6
1
0
0
3
0
3
0
0
2
Natural
2
0
1
3
71
0
0
0
0
0
1
0
0
1
Note
0
0
2
0
1
86
13
0
1
3
1
0
2
7
NoteFlag
0
0
1
0
0
8
12
0
0
0
1
0
1
7
NoteOpen
0
0
0
0
0
3
1
0
0
2
0
0
0
0
RestI
6
0
0
0
3
0
0
0
15
0
7
0
0
2
RestII
0
0
1
0
0
0
0
0
0
93
4
0
0
2
Sharp
2
0
0
2
1
2
0
0
0
1
76
0
1
1
Staccatissimo
0
0
0
0
0
0
0
0
0
2
2
1
0
0
TrebleClef
0
0
0
0
0
2
0
0
1
1
0
0
14
6
Unknown
3
0
8
3
6
17
10
0
0
4
3
0
2
45
Accent
Table B.10: Table of confusion of the neural network.
Accent
BassClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
RestI
RestII
Sharp
Staccatissimo
TrebleClef
Unknown
47
0
0
0
0
0
0
0
0
0
0
0
0
0
BassClef
0
6
0
0
0
0
0
0
0
0
0
0
0
0
Beam
0
0
109
0
0
0
0
0
0
0
0
0
0
0
Flat
0
0
0
57
0
0
0
0
0
0
0
0
0
0
Natural
0
0
0
0
79
0
0
0
0
0
0
0
0
0
Note
0
0
0
0
0
116
0
0
0
0
0
0
0
0
NoteFlag
0
0
0
0
0
0
30
0
0
0
0
0
0
0
NoteOpen
0
0
0
0
0
0
0
6
0
0
0
0
0
0
RestI
0
0
0
0
0
0
0
0
33
0
0
0
0
0
RestII
0
0
0
0
0
0
0
0
0
100
0
0
0
0
Sharp
0
0
0
0
0
0
0
0
0
0
86
0
0
0
Staccatissimo
0
0
0
0
0
0
0
0
0
0
0
5
0
0
TrebleClef
0
0
0
0
0
0
0
0
0
0
0
0
24
0
Unknown
0
0
0
0
0
0
0
0
0
0
0
0
0
101
Accent
Table B.11: Table of confusion of the support vector machines.
Appendix B. Table of confusion
102
Accent
BassClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
RestI
RestII
Sharp
Staccatissimo
TrebleClef
Unknown
44
0
0
0
1
0
0
0
1
1
0
0
0
1
BassClef
0
6
1
0
2
0
0
0
0
0
0
0
0
0
Beam
0
0
107
0
0
3
0
0
0
0
1
0
0
0
Flat
0
0
2
42
11
0
0
1
3
0
0
0
0
0
Natural
0
0
0
2
74
0
0
0
3
0
0
0
0
1
Note
0
0
7
2
4
92
2
0
4
6
0
1
0
0
NoteFlag
1
0
1
0
3
4
9
1
1
4
6
0
0
2
NoteOpen
0
0
0
0
4
2
1
2
0
0
0
0
0
0
RestI
0
0
0
0
0
0
0
0
36
0
0
0
0
0
RestII
4
0
0
0
3
1
0
1
6
86
0
0
0
0
Sharp
0
0
2
0
8
3
0
0
0
3
72
0
0
0
Staccatissimo
0
0
0
0
0
0
0
0
0
0
0
6
0
0
TrebleClef
0
1
0
0
1
0
0
0
0
0
4
0
21
0
Unknown
3
1
16
8
8
18
4
0
2
0
7
2
1
31
Accent
Table B.12: Table of confusion of the hidden Markov model.
B.4. Results Obtained With Elastic Matching for the Printed Music Symbols
103
B.4 Results Obtained With Elastic Matching for the Printed
Music Symbols
AltoClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
TieSlur
RestI
RestII
Sharp
Time
TrebleClef
Unknown
49
0
0
0
0
0
0
0
0
0
0
0
0
1
Beam
0
73
0
0
0
0
0
0
0
0
0
0
0
0
Flat
0
0
38
0
0
0
0
0
0
0
0
0
0
0
Natural
0
0
0
31
0
0
0
0
0
0
0
0
0
0
Note
0
0
0
0
76
0
0
0
0
0
0
0
0
0
NoteFlag
0
0
0
0
3
27
0
0
0
0
0
0
0
0
NoteOpen
0
0
0
0
1
0
76
0
0
0
0
0
0
0
TieSlur
0
0
0
0
0
0
2
27
0
0
0
0
0
1
RestI
0
0
0
0
0
0
0
0
16
0
0
0
0
0
RestII
0
0
0
0
0
0
0
0
0
15
0
0
0
0
Sharp
0
0
0
0
1
0
0
0
0
0
80
0
0
0
Time
0
0
0
0
0
0
0
0
0
0
0
3
0
0
TrebleClef
0
0
0
0
0
0
0
0
0
0
0
0
30
0
Unknown
1
1
2
0
3
0
3
4
0
0
1
0
1
60
AltoClef
Table B.13: Table of confusion of the nearest neighbor classier.
AltoClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
TieSlur
RestI
RestII
Sharp
Time
TrebleClef
Unknown
43
0
0
0
0
0
0
0
0
0
0
0
0
7
Beam
0
71
0
0
0
0
1
0
0
0
1
0
0
0
Flat
0
0
36
0
0
0
0
0
0
0
0
0
0
2
Natural
0
1
0
27
0
0
0
2
0
0
0
0
0
1
Note
0
0
0
0
64
0
7
2
0
0
1
0
2
0
NoteFlag
0
0
0
0
3
12
4
0
0
0
2
0
3
6
NoteOpen
0
0
1
0
1
0
71
0
0
0
0
0
2
2
TieSlur
2
1
0
0
0
0
5
4
0
0
1
0
3
14
RestI
0
0
0
0
0
0
0
0
16
0
0
0
0
0
RestII
0
0
0
0
2
2
0
3
0
0
3
0
1
4
Sharp
1
0
0
1
0
0
0
0
0
0
78
0
0
0
Time
0
0
0
0
0
0
0
2
0
0
1
0
0
0
TrebleClef
0
0
0
0
0
0
0
0
0
0
2
0
28
0
Unknown
3
1
2
0
7
3
8
3
0
0
5
0
1
43
AltoClef
Table B.14: Table of confusion of the neural network.
AltoClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
TieSlur
RestI
RestII
Sharp
Time
TrebleClef
Unknown
45
0
0
0
0
0
0
0
0
0
0
0
0
5
Beam
0
66
0
0
0
0
0
0
0
0
0
0
0
7
Flat
0
0
37
0
0
0
0
0
0
0
0
0
0
1
Natural
0
0
0
29
0
0
0
0
0
0
0
0
0
2
Note
0
0
0
0
69
0
0
0
0
0
0
0
0
7
NoteFlag
0
0
0
0
0
29
0
0
0
0
0
0
0
1
NoteOpen
0
0
0
0
0
0
72
0
0
0
0
0
0
5
TieSlur
0
0
0
0
0
0
0
26
0
0
0
0
0
4
RestI
0
0
0
0
0
0
0
0
15
0
0
0
0
1
RestII
0
0
0
0
0
0
0
0
0
14
0
0
0
1
Sharp
0
0
0
0
0
0
0
0
0
0
78
0
0
2
Time
0
0
0
0
0
0
0
0
0
0
0
3
0
0
TrebleClef
0
0
0
0
0
0
0
0
0
0
0
0
30
0
Unknown
0
0
0
0
0
0
0
0
0
0
0
0
0
76
AltoClef
Table B.15: Table of confusion of the support vector machines.
Appendix B. Table of confusion
104
AltoClef
Beam
Flat
Natural
Note
NoteFlag
NoteOpen
TieSlur
RestI
RestII
Sharp
Time
TrebleClef
Unknown
39
0
0
0
0
1
0
0
0
0
0
0
0
10
Beam
0
73
0
0
0
0
0
0
0
0
0
0
0
0
Flat
0
0
37
1
0
0
0
0
0
0
0
0
0
0
Natural
0
0
0
31
0
0
0
0
0
0
0
0
0
0
Note
0
0
1
0
70
0
1
3
0
0
0
0
0
0
NoteFlag
0
0
0
0
11
15
0
1
0
0
2
0
0
1
NoteOpen
0
0
0
5
0
8
62
0
0
0
0
0
0
2
TieSlur
0
0
1
0
6
0
0
21
0
0
0
0
0
2
RestI
0
0
0
0
0
0
0
0
16
0
0
0
0
0
RestII
0
0
0
0
0
0
0
0
0
13
2
0
0
0
Sharp
0
0
0
0
0
0
0
0
0
0
80
0
0
0
Time
0
0
0
0
0
0
2
0
0
0
0
1
0
0
TrebleClef
0
0
0
0
0
0
0
0
0
0
0
0
30
0
Unknown
2
6
1
0
9
7
5
3
6
3
1
0
0
33
AltoClef
Table B.16: Table of confusion of the hidden Markov model.
Appendix C
Articles Submited to Conferences
105
A Shortest Path Approach for Staff Line Detection
Ana Rebelo
FCUP and INESC Porto
Portugal
Artur Capela
FEUP and INESC Porto
Portugal
Joaquim F. Pinto da Costa
FCUP
Portugal
[email protected]
[email protected]
[email protected]
Carlos Guedes
ESMAE
Portugal
Eurico Carrapatoso
FEUP and INESC Porto
Portugal
Jaime S. Cardoso
FEUP and INESC Porto
Portugal
[email protected]
[email protected]
[email protected]
Abstract
Many music works produced in the past still exist only as
original manuscripts or as photocopies. Preserving them
entails their digitalization and consequent accessibility in
a digital format easy-to-manage. The manual process to
carry out this task is very time consuming and error prone.
Optical music recognition (OMR) is a form of structured
document image analysis where music symbols are isolated
and identified so that the music can be conveniently processed. While OMR systems perform well on printed scores,
current methods for reading handwritten musical scores by
computers remain far from ideal. One of the fundamental stages of this process is the staff line detection. In this
paper a new method for the automatic detection of music stave lines based on a shortest path approach is presented. Lines with some curvature, discontinuities, and inclination are robustly detected. The proposed algorithm behaves favourably when compared experimentally with wellestablished algorithms.
1. Introduction
The impact of music in our lives can hardly be overestimated. Music is a pivotal part of our cultural heritage
and its preservation, in all of its forms, must be pursued.
Frequently, the preservation of many music works entails
the digitalization of these works and consequent accessibility in a format that encourages browsing, analysis and
retrieval. In fact, many music works produced during the
last centuries still exist only as original manuscripts or as
photocopies. The digitalization of these works is therefore
a highly desirable goal. Unfortunately, the ambitious goal
of providing generalized access to handwritten scores that
were never published has been severely hampered by the actual state-of-the-art of handwritten music recognition. The
manual process required to recognize handwritten musical
symbols in scores and to put them in relationship with the
spine structure is very time consuming.
Despite the fact that OMR systems dealing with machine printed scores exhibit good performance, handwritten music recognition introduces several additional difficulties. Outstanding problems include notation varying from
writer to writer, and possibly varying in the same score:
symbols and staff lines written with different sizes, shapes
or intensity. Despite the continued research on OMR, with
the availability of several commercial OMR systems, we are
still lacking a satisfactory performance in terms of precision
and reliability. Most of the existing work provides a real efficiency only when quite regular, printed music sheets are
processed. This condition is exacerbated with handwritten
music scores. This justifies the research around the definition of reliable OMR algorithms.
Staff line detection is one of the fundamental stages of
the OMR process, with subsequent processes relying heavily on its performance. The reasons for detecting and removing the staff lines lie on the need to isolate the musical
symbols for a more efficient and correct detection of each
symbol presented on the score.
The detection of staves is complicated by a variety of
reasons. The handwritten staff lines are rarely straight and
horizontal, and are not parallel to each other. For example, some staves may be tilted one way or another on the
same page or they may be curved. This is especially true
for handwritten scores. Since these scores tend to be rather
irregular and determined by a person’s own writing style,
the staff lines might be twisted, being curved and not really horizontal at all. It depends on how regular the author
writes the symbols in his scores. There might also be big-
ger or smaller gaps along a staff line. And if we consider
that most of these works are old, the quality of the paper
in which it is written might have degraded throughout the
years, making it a lot harder to correctly identify its contents.
In this paper a method for the automatic detection of staff
lines based on a shortest path approach is presented. The
proposed paradigm uses the image as a graph, where the
staff lines result as the shortest path between the two margins of the image.
This introduction is concluded with a brief review of the
work done in this area. In section 2 the proposed algorithm
is described. In section 3, the proposed algorithm is experimentally evaluated using real music scores. Finally, conclusions are drawn and future work is outlined in section 4.
1.1. Related Works
Different methods for staff line detection have been researched. The simplest approach consists on finding local
maxima in the horizontal projection of the black pixels of
the image [2]. These local maxima represent line positions.
This method assumes straight and horizontal lines. Several
horizontal projections can be made with different image rotation angles, keeping the image in which the local maxima
are bigger. This eliminates the assumption that the lines are
always horizontal. An alternative strategy for identifying
staff lines is to use vertical scan lines [3, 7, 8, 13]. More
recent works present a more or less sophisticated use of a
combination of projection techniques to improve on the basic approach [1].
Fujinaga [5] incorporates a set of image processing techniques in the algorithm, including run-length coding (RLC),
connected-component analysis, and projections. After applying the RLC to find the thickness of staff lines and the
space between the staff lines, any vertical black runs that are
more than twice the staff line height are removed from the
original. Then, the connected components are scanned in
order to eliminate any component whose width is less than
the staffspace height. After a global deskewing, taller components, such as slurs and dynamic wedges are removed.
Other techniques for finding stave lines include the application of mathematical morphology algorithms [11, 15, 6],
rule-based classification of thin horizontal line segments
[10], and line tracing [12, 14].
In spite of the variety of methods available, they all suffer
from some limitations. In particular, lines with some curvature or discontinuities are inadequately resolved. The dash
detector [9] is one of few works that try to handle discontinuities. The dash detector is an algorithm that searches
the image, pixel by pixel, finding black pixel regions that
it classifies as stains or dashes. Then, it tries to unite the
dashes to construct lines.
2. A Shortest Path Approach for Staff Line Detection
A staff line can be considered as a path from the left
side of the music score to the right side. As staff lines are
almost the only extensive black objects on the music score,
the path we are looking for is the shortest path between the
two margins if paths (almost) entirely through black pixels
are favoured.
In the work to be detailed, the image grid is considered
as a graph with pixels as nodes and edges connecting neighbouring pixels. Therefore, some graph concepts are in order.
2.1. Definitions and Notation
A graph G = (V, A) is composed of two sets V and A.
V is the set of nodes, and A the set of arcs (p, q), p, q ∈ V .
The graph is weighted if a weight w(p, q) is associated to
each arc, and it is called a digraph if the arcs are directed,
i.e., (p, q) 6= (q, p). A path from p1 to pn is a list of unique
nodes p1 , p2 , . . . , pn , (pi , pi+1 ) ∈ A. The path cost is the
sum of each arc weight in the path.
In graph theory, the shortest-path problem seeks the
shortest path connecting two nodes; efficient algorithms are
available to solve this problem, such as the well-known Dijkstra algorithm [4].
2.2. General Framework Description
As mentioned before, a staff line corresponds to a path
from (almost) the left margin of the image to (almost) the
right side of the image, (almost) always through black pixels.
Starting by modelling the edge image as a graph, match
a node to each pixel. Connect two nodes with an arc on
the graph iff the corresponding pixels are neighbours (8connected neighbourhoods) on the image. The weight of
each arc is a function of pixels values and pixels relative
positions (see Figure 1):
w2
w3
w1
w8
w4
w5
w7
w6
Figure 1. Arc weight between two pixels.
(
f (p, qi ) if qi ∈ 4-connected neighbourhood of p
wi =
h(p, qi ) if qi 6∈ 4-connected neighbourhood of p
√
In this work we set h(., .) = 2f (., .).
The objective is then to design the weights wi such that
the weights are low for the pixels of interest (black) and
high otherwise (this will lead to small weighted distances
from left to right for paths through staff lines and large for
the rest). Therefore this setting will favour paths through
black pixels, as required. We set
(
c1 if p or q are black pixels
f (p, q) =
c2 otherwise
with c2 > c1 . In this work c1 and c2 were experimentally
determined as 2 and 6, respectively. Note that c1 must be set
greater than zero in order to also favour the smallest path,
when more than one exists through black pixels only. Finally, the solution to the shortest path problem will yield
the intended staff line.
(a) A pair of staff lines.
s1
e1
s8
e8
(b) The shortest path between
some pairs of points.
Figure 3. A second exemplificative example.
(a) Skewed staff lines with music (b) The shortest path between
symbols.
some pairs of points.
Figure 4. A third exemplificative example.
2.3. Illustrative Examples
In order to get a better intuition of the general result, it is
instructive to first explore some basic examples. Consider
a music score with a single staff line as represented in Figure 2(a). In Figure 2(b) the shortest paths between starting
s1
e1
s5
e5
clear that slight skewed scores do not pose any problem to
the proposed approach.
Nonetheless, some issues are visible and need to be conveniently addressed. Due to the skew of the staff lines, some
of the shortest paths jump between consecutive staff lines.
The text on the top of the score also constitutes a false low
weight path between the two margins, inducing some paths
to go through it. Finally, even when a staff line is correctly
followed by the path, the initial and the final parts of the
path should be ignored.
2.4. Proposed Algorithm
(a) A single staff line.
(b) The shortest path between
some pairs of points.
Figure 2. A first exemplificative example.
points si on the left margin and ending points ei on the right
margin, at the same row, are traced. All paths get attracted
by the staff line. A similar condition is verified when we
have more than one staff line (see Figure 3). Now paths
get attracted to the nearest staff line. Half of the paths inbetween two consecutive staff lines goes along the top staff
line; the other half follows the bottom staff line.
The last example, in Figure 4, shows that music symbols
placed on top of staff lines do not interfere with the detection of the staff lines. Moreover, the example also makes
To detect the staff lines, the proposed overall algorithm
starts by estimating the staffspace height. This length will
be used as a reference length for the subsequent operations.
Robust estimators of the staffspace height are already in
common use. The technique starts by computing the vertical run-lengths representation of the image. If a bit-mapped
page of music is converted to vertical run-lengths coding,
the most common black-runs represents the staff line height
and the most common white-runs represents the staffspace
height [5].
After the estimation of the staffspace height, the proposed approach applies the main step of the framework:
for each row of the image, the shortest path between
the leftmost pixel and the rightmost pixel is found, using the Dijkstra algorithm [4]. Instead of considering the whole image when computing the shortest path,
only a strip centred on the row of interest is used—
see Figure 5. This allows to constrain the complexity of the algorithm. On the experiments we have set
STRIP HEIGHT=STAFFSPACE HEIGHT.
2 x STRIP HEIGHT
Figure 5. Vertical strip around the row of interest.
Now, because the search for the shortest path is constrained to stay in a vertical strip (or because the nearest
staff line is far enough from the current row), the shortest
path may not follow a staff line. Therefore, the main step is
followed by a sequence of (arguably) sensible rules, aimed
at discarding false staff lines. First, all paths without a percentage of black pixels above a threshold BLACK PERC
are discarded (a threshold of BLACK PERC = 0.75 was
used on the experiments).
Next, each retained path is trimmed at the beginning and
at the end. As visible in the previous examples (refer to
Figure 4), before meeting with a staff line, a path travels
through a sequence of white pixels. Likewise, after the end
of the staff line, the path goes again through a sequence of
white pixels until it meets the right margin of the image. In
order to ignore all these white pixels, the initial pixels of
the path are discarded until a run of at least BLACK RUN
black pixels are found in the path. In the same way, all
pixels of the path after the last occurrence of a run of at
least BLACK RUN black pixels are discarded. A threshold
of BLACK RUN = 2 × STAFFSPACE HEIGHT was used
on the experiments.
Finally, the proposed overall algorithm ends with the validation of the preserved paths. Because the staff lines are
expected to be straight lines, the linear correlation coefficient of the x and y components of the pixels of the path
is used to reject paths that do not meet the linearity criterion, with a correlation coefficient bellow a threshold CORRCOEF. The threshold used here is conservative, since for
manuscript lines the (perfect) linearity is not guaranteed.
This precaution is needed because staves on a page are often
distorted in different ways.
2.4.1
Summation
The shortest path approach has some advantages over standard algorithms presented in the literature for the detection
of staff lines:
• While almost all current approaches provide only isolated pieces of a staff line, the approach proposed
here outputs a complete line, with starting and ending
points.
• The staff lines are rarely straight and horizontal, and
are not parallel to each other. For example, some staves
may be tilted one way or another on the same page or
they may be curved. While current approaches apply
a chain of heuristics to correct these undesired imperfections, the shortest path algorithm is naturally robust
to these challenging conditions.
• The proposed approach is robust to broken staff lines
(due to low-quality digitalization or low-quality originals) or staff lines as thin as one pixel. Missing pieces
are automatically ’completed’ by the algorithm.
For reference, the overall algorithm is summarized in
Listing 1 with pseudo-code.
StaffLine_Detection(IMAGE, CORRCOEF)
{
STAFFSPACE_HEIGHT = computeStaffSpaceHeight();
BLACK_PERC = 0.75;
BLACK_RUN = STAFFSPACE_HEIGHT;
STRIP_HEIGHT = STAFFSPACE_HEIGHT;
for (int row = 0; row < ImageHeight; row++)
{
Point2D start(0, row);
Point2D end(ImageWidth-1, row);
Path path = findShortestPath (start, end);
if(blackPercentage(path) < BLACK_PERC)
continue;
path = trimPath(path);
if(corrcoef(path) < CORRCOEF)
continue;
addPathToSetOfStaffLines(path);
}
}
Listing 1: Shortest Path StaffLine Detection.
3. Results
We now present additional, complete, examples of the
proposed framework for automatic staff lines detection. The
code for all the examples in this paper has been written in
C++1 . For comparison purposes, the method by Fujinaga
[5] was also implemented. Three types of scores were used
1 The
source code is available upon request to the authors.
in the experiments: machine printed scores—see Figure 6;
handwritten music scores sitting on regular staff lines—see
Figure 7 and Figure 8; and irregular handwritten scores—
see Figure 9. Before applying the staff line detection algorithms, each image was binarized.
Analysing the results for Fujinaga’s method, we observe
that some information is lost, information that could be important for the recognition tasks which follow. The places
where musical symbols are present create gaps on the detected staff lines; in some places the gaps are larger than the
space occupied by those symbols—see Figure 7(c). This
is mainly due to the low quality of the lines. The shortest
path approach is able to overcome these conditions because
it follows a continuous path connecting both line ends. The
robustness of the proposed approach to possible defects in
staff lines (curvature, discontinuities) is apparent in the example of Figure 7(c).
Occasionally, the shortest path algorithm retains paths
that do not go completely through a staff line. This problem is prone to happen when there’s a large beamed note
density—see Figure 9(b)—making it go through the beams
for this set of notes (e.g. quavers and semiquavers) or when
the curvature of the lines is such that the path jumps between consecutive lines. This condition can be overcome
by automatically learning the best threshold on the linearity test of the path for each image being processed, or by
the incorporating additional rules to validate the path. Both
approaches are currently being investigated. Nevertheless,
considering handwritten scores, we can see that the proposed algorithm has good and promising results.
4. Conclusion
The first challenge faced by an OMR system is staff line
detection. This first task dictates the possibility of success
for the recognition of the music score. And when it comes
to handwritten music scores, existing solutions are far from
presenting satisfactory results.
In this paper, a new algorithm for the automatic detection of staff lines in music scores was proposed. The shortest path approach for staff line detection algorithm brings a
new and promising approach to the staff line detection task.
The technique is adaptable to a wide range of image conditions, thanks to its intrinsic robustness to (slightly) skewed
images, discontinuities, and curved staff lines.
There are several directions of research to pursue with
the framework introduced in this paper. We are currently investigating the computational feasibility of, instead of finding the shortest path between a starting point on the left margin and an ending point on the right margin at the same row,
finding the shortest path between a starting point on the left
margin and the whole right margin. That will allow coping
with severely skewed scores. Another line of investigation
is the automatic learning of the parameters of the algorithm.
Although most of the parameters values are scaled by the
staff height space, the threshold for the test of linearity is
still manually set. This improvement could make the algorithm work with a wide range of types of scores, without requiring any parameter tuning by the user. Finally, more than
just having spatial continuity of the path— enforced by the
shortest path algorithm—, we are also pursuing the continuity of the direction of the path. As staff lines do not suffer
abrupt changes of direction, even when manually drawn, it
seems sensible to impose this additional constraint.
Acknowledgments
This work was partially funded by Fundação para a
Ciência e Tecnologia (FCT) - Portugal through project
PTDC/EIA/71225/2006.
References
[1] D. Bainbridge. Extensible Optical Music Recognition. PhD
thesis, Department of Computer Science, University of Canterbury, Christchurch, NZ, 1997.
[2] D. Blostein and H. S. Baird. A critical survey of music image
analysis. In Baird, Bunke, and Y. (Eds.), editors, Structured
Document Image Analysis, pages 405–434. Springer-Verlag,
Heidelberg, 1992.
[3] N. P. Carter. Automatic Recognition of Printed Music in the
Context of Electronic Publishing. PhD thesis, Departments
of Physics and Music, University of Surrey, 1989.
[4] E. W. Dijkstra. A note on two problems in connexion with
graphs. Numerische Mathematik, 1:269–271, 1959.
[5] I. Fujinaga. Staff detection and removal. In S. George, editor, Visual Perception of Music Notation: On-Line and OffLine Recognition, pages 1–39. Idea Group Inc., 2004.
[6] I. Gawedzki. Optical music scores recognition. Technical
report, 2002.
[7] S. Glass. Optical music recognition. B.Sc thesis, 1989.
[8] H. Kato and S. Inokuchi. A recognition system for printed
piano music using musical knowledge and constraints. In
Proceedings of the International Association for Pattern
Recognition Workshop on Syntactic and Structural Pattern
Recognition, pages 231–248, 1990.
[9] I. Leplumey, J. Camillerapp, and G. Lorette. A robust detector for music staves. In Proceedings of the International
Conference on Document Analysis and Recognition, pages
902–905, 1993.
[10] J. V. Mahoney. Automatic analysis of music score images.
B.Sc thesis, 1982.
[11] B. R. Modayur, V. Ramesh, R. M. Haralick, and L. G.
Shapiro. Muser: A prototype musical score recognition system using mathematical morphology. Machine Vision and
Applications, 6:140–150, 1993.
[12] D. Prerau. Computer pattern recognition of standard engraved music notation. PhD thesis, Department of Computer
Science and Engineering, MIT, 1970.
(a) Original score.
(b) Staff lines detected by our algorithm.
(c) Staff lines detected by Fujinaga’s method [5].
Figure 6. Results for a machine-printed score.
(a) Original score.
(b) Staff lines detected by our algorithm.
(c) Staff lines detected by Fujinaga’s method [5].
Figure 7. Results for a handwritten score.
[13] T. Reed. Optical music recognition. Master’s thesis, Department of Computer Science, University of Calgary, Canada,
1995.
[14] J. W. Roach and J. E. Tatem. Using domain knowledge in
low-level visual processing to interpret handwritten music:
an experiment. Pattern Recognition, 21(1):33–44, 1988.
[15] M. Roth. OMR-optical music recognition. Master’s thesis,
Swiss Federal Institute of Tecnology, 1992.
(a) Original skewed score.
(b) Staff lines detected by our algorithm.
(c) Staff lines detected by Fujinaga’s method [5].
Figure 8. Results for a skewed handwritten score.
(a) Original score.
(b) Staff lines detected by our algorithm.
(c) Staff lines detected by Fujinaga’s method [5].
Figure 9. Results for a handwritten score.
STAFF LINE DETECTION AND REMOVAL WITH STABLE PATHS
Artur Capela, Ana Rebelo
INESC Porto, Campus da FEUP, Rua Dr. Roberto Frias 378, 4200-465 Porto, Portugal
[email protected], [email protected]
Jaime S. Cardoso
INESC Porto, Faculdade de Engenharia, Universidade do Porto, Portugal
[email protected]
Carlos Guedes
INESC Porto, Escola Superior de Música e Artes do Espectáculo, Portugal
[email protected]
Keywords:
Music, optical character recognition, document image processing, image analysis.
Abstract:
Many music works produced in the past are currently available only as original manuscripts or as photocopies.
Preserving them entails their digitalization and consequent accessibility in a machine-readable format, which
encourages browsing, retrieval, search and analysis while providing a generalized access to the digital material.
Carrying this task manually is very time consuming and error prone. While optical music recognition (OMR)
systems usually perform well on printed scores, the processing of handwritten music by computers remains
below the expectations. One of the fundamental stages to carry out this task is the detection and subsequent
removal of staff lines. In this paper we integrate a general-purpose, knowledge-free method for the automatic
detection of staff lines based on stable paths, into a recently developed staff line removal toolkit. Lines
affected by curvature, discontinuities, and inclination are robustly detected. We have also developed a staff
removal algorithm adapting an existing line removal approach to use the stable path algorithm at the detection
stage. Experimental results show that the proposed technique outperforms well-established algorithms. The
developed algorithm will now be integrated in a web based system providing seamless access to browsing,
retrieval, search and analysis of submitted scores.
1
INTRODUCTION
The Universal Declaration on Cultural Diversity
adopted by the General Conference of UNESCO on
2001 asserts that cultural diversity is as necessary for
humankind as biodiversity is for nature, and that policies to promote and protect cultural diversity thus are
an integral part of sustainable development. Being
music a pivotal part of our cultural heritage, its preservation, in all of its forms, must be pursued. Frequently, the preservation of many music works entails their digitalization and consequent accessibility
in a format that encourages browsing, analysis and retrieval.
There is a vast amount of invaluable paper-based
heritage, including printed and handwritten music
scores, which are deteriorating over time due to natural decaying of paper and chemical reaction (i.e., between written ink and paper). Various efforts have
been focused on this issue in order to preserve the
record of our heritage. Digitisation has been com-
monly used as a possible tool for preservation. Although a digital copy may not conserve the original
document, it can preserve the most important part:
its data. It has also the advantages of easy duplications, distribution, and digital processing. Nevertheless, the output of the digitalization process is not
amenable for further analysis or semantic search operations. Thus, an Optical Music Recognition (OMR)
process is needed. However, the manual process required to recognize handwritten musical symbols in
scores and to put them in relationship with the spine
structure of the score is very time consuming. This
justifies the research around reliable automatic OMR
algorithms as current solutions are still below the expectations.
As a concrete example, Portugal has a notorious
lack in music publishing from virtually all eras of
its musical history. However, whereas most of the
known original music manuscripts before the twentieth century are kept at the National Library Archive
in Lisbon, there is virtually no national repository for
the Portuguese music from the twentieth century. Although there are recent efforts in order to catalogue
and preserve in digital form the Portuguese music
from the late twentieth century—notably the Music
Information Center (MIC, 2008) and the section on
musical heritage from the Institute of the Arts website (IOA, 2008)—most of the music pre-dating computer notation software was never published and still
exists as manuscripts or photocopies spread out all
over the country in inconspicuous places. The risk
of irreversibly losing this rich cultural heritage is thus
a reality.
1.1
OMR System
The project “Optical recognition system for handwritten music scores” initiated in 2007 by INESC Porto
and ESMAE is the point of departure for creating
a web-based system of music manuscripts of Portuguese composers from the twentieth century. This
database will provide generalized access to a wide
corpus of unpublished handwritten music encoded in
MusicXML, which can be accessed remotely via the
Internet. The database will not only centralize as
much information as possible but will also serve to
preserve this corpus in a way that is easily accessible for browsing, analysis, and ultimately, for performing this repertoire, therefore helping to keep the
Portuguese music alive (Capela et al., 2008). Although the aim of this project is Portuguese music,
it is equally valid for all printed and handwritten music scores that need to be preserved from all around
the world.
The ambitious goal of providing generalized access to handwritten scores that have never been published has been severely hampered by the actual stateof-the-art of handwritten music recognition. There
are currently various commercial OMR software solutions (Capella software1 , SharpEye Music Reader2 ,
OMeR3 ) and a few open source solutions (AOMR24 ,
OpenOMR5 , Audiveris6 ), but they are all offline standalone applications. The existing online archives of
music scores (Lester Levi Collection, 2008; Classical
Sheet Music Collection, 2008; Mutopia Collection,
2008) usually provide them in inadequate formats—
1 http://www.capella-software.com/capscan.
htm
2 http://www.visiv.co.uk
3 http://www.myriad-online.com/en/products/
omer.htm
4 http://www.bzzt.net/ arnouten/wiki/index.
˜
php/Gamera#AOMR2:_omr_toolkit
5 http://sourceforge.net/projects/openomr
6 http://audiveris.dev.java.net
usually only as the scanned score image—for retrieval
or automatic analysis. These online archives are mere
standard websites, without facilities for optical recognition, editing and searching through the scores musical content. The creation of an OMR system, integrating optical recognition, storage, search, browsing and downloading capabilities, while keeping the
scores in their original format along with their digital
counterpart, would therefore be extremely beneficial.
An integrated score editor would be provided in order
to view and edit the submitted music scores.
In our previous work on this project we have
presented a complete OMR System solution—
OMRSYS (Capela et al., 2008)—comprising a
database driven web application with one or more
OMR applications integrated in the proposed system.
The proposed architecture successfully attends the
stated objectives. At the end of our project we plan
on developing a fully functional system according to
the specified architecture and integrating a complete
OMR package.
1.2
Detection and Removal of Staff
Lines
Staff line detection and removal are the first fundamental stages on the OMR process, with subsequent
processes relying heavily on their performance. The
reasons for detecting and removing the staff lines lie
on the need to isolate the musical symbols for a more
efficient and correct detection of each symbol present
on the score. Although their primary application is as
a preprocessing step in the recognition of music notation, the line detection problem also occurs in different contexts (e.g., the recognition of bank transfer
forms).
The detection of staves is complicated due to a
variety of reasons. The handwritten staff lines are
rarely straight and horizontal, and are not parallel to
each other. For example, some staves may be tilted
one way or another on the same page or they may be
curved. These scores tend to be rather irregular and
determined by a person’s own writing style. Moreover, if we consider that most of these works are old,
the quality of the paper in which it is written might
have degraded throughout the years, making it a lot
harder to correctly identify its contents.
In (Cardoso et al., 2008a; Cardoso et al., 2008b)
we presented a new and robust staff line detection algorithm based on a stable paths approach. The proposed paradigm uses the image as a graph, where the
staff lines result as connected paths between the two
lateral margins of the image. A staff line can be considered as a connected path from the left side to the
right side of the music score. As staff lines are almost
the only extensive black objects on the music score,
the path to look for is the shortest path between the
two margins if paths (almost) entirely through black
pixels are favoured.
In this paper we present our recent work and results focusing on the implementation of the Stable
Paths algorithm as a C++ plugin for the MusicStaves
Toolkit (Dalitz et al., 2008; MusicStaves Toolkit,
2008) (based on the Gamera Framework (MacMillan
et al., 2002; Gamera Framework, 2008)), as well as
on the removal stage by using our detection algorithm
in the first stage. We have also adapted a removal algorithm based on the LineTrack Height approach proposed in (Dalitz et al., 2008), which we will present in
section 3. In section 2 the Gamera and MusicStaves
Toolkit are presented, together with our C++ plugin.
In section 4, both our proposed detection and removal
algorithms are evaluated experimentally using a wellknown dataset of music scores. Finally, conclusions
are drawn and future work is outlined in section 5.
2
STABLE PATHS INTEGRATION
In this section we present the platform in which our
detection algorithm was integrated for testing and validation. The platform is comprised by its core—the
Gamera Framework (MacMillan et al., 2002; Gamera
Framework, 2008)—and by a toolkit for the evaluation of detection and removal algorithms—the MusicStaves Toolkit (Dalitz et al., 2008; MusicStaves
Toolkit, 2008). This toolkit is a set of Gamera plugins aiming to support the development and test of
staff line detection and removal algorithms for music
scores, by extending the Gamera functionality.
After we describe each part constituting this platform, we present the implementation of our staff line
detection algorithm as a C++ plugin on the MusicStaves Toolkit. Finally, we present the integration of
our detection algorithm on some removal algorithms
in the MusicStaves toolkit. We have integrated our
algorithm in those removal algorithms where the detection is processed separately from the removal operation, replacing the Dalitz algorithm (Dalitz et al.,
2008) as the detection stage.
2.1
“Generalized Algorithms and Methods for Enhancement and Restoration of Archives”. It combines a programming library with graphical tools for an interactive training and development of recognition systems.
This framework tries to be a tool to create custom applications through the domain experts knowledge instead of responding to various requirements with a
monolithic application. It aims at providing an efficient test and refinement development cycle. The
programming language in its basis is the high-level
language Python, although it has many extensions
written in C++ to carry out low-level image processing. Nevertheless, due to its nature, Python turns the
code writing process more agile and facilitates the use
of scripting, which makes Gamera an interactive and
batchable framework. Besides the large number of extensions deployed with this framework, it is also possible to customize and extend with plugins or toolkits,
written in Python and C++.
The Gamera framework is modular and organized
in a series of horizontal layers, which can be seen at
Figure 1.
Gamera Framework
Gamera (MacMillan et al., 2002; Gamera Framework,
2008) is a portable and open source framework to create structured documents analysis applications by domain experts. The name Gamera is an acronym to
Figure 1: Gamera Architecture.
Gamera follows a modular plugin architecture. It
is made of modules (plugins), both written in Python
and C++, integrated in a high-level scripting environment. Each module executes a task on the recognition
process. The framework maintains a toolbox design
approach, i.e., a user has access to a large set of tools
for the optical recognition stages.
2.2
MusicStaves Toolkit
MusicStaves (Dalitz et al., 2008; MusicStaves
Toolkit, 2008) is a Gamera toolkit specific for the development and test of staff line detection and removal
algorithms. This Phyton toolkit also integrates facilities to create a test set of music scores and to evaluate
results with established metrics. As with Gamera, this
toolkit is portable and its source code is freely available. In order to use MusicStaves, it must be imported
to Gamera, either in the GUI or in a programmatic
manner.
MusicStaves is structured as seen in Figure 2. The
toolkit is composed by a set of main classes where the
staff line detection and removal algorithms are implemented, and by a set of plugins in Python and C++.
Some plugins are used by the implemented algorithms
while others are tools to aid the algorithms testing. It
is an extendable toolkit in which one can integrate a
new staff line detection and/or removal algorithm. Its
plugin system follows the Gamera framework rationale and as such the plugins may be written in Python
and C++. In order to write new staff line detection
or removal algorithms the toolkit provides two interfaces: StaffFinder (detection) and MusicStaves (removal).
Figure 3: Some examples of applied deformations: a) Curvature; b) Rotation; c) White Speckles; d) Staffline Interruptions; e) Staffline y-variation; f) Typeset Emulation. See
(Dalitz et al., 2008) for details.
The plugin encompasses the main StaffFinder class in
the toolkit root, the respective Python plugin in the
Python plugins folder and the algorithm implementation in C++ in the C++ plugins folder. In Figure 4
we present a diagram of the algorithm as integrated in
MusicStaves.
The algorithm processing starts with the call to the
method find staves, which receives a binarized image
as input. That image is then passed to the function
in C++ by the plugin Python code. In the C++ implementation, after the respective staff line detection
function is called, the image is converted to the format
used internally by our algorithm. After the whole detection task is complete it returns the staff lines skeleton in the MusicStaves format. After receiving the
skeleton list on its Python class, it fills the structure
self.linelist with the obtained values.
Figure 2: MusicStaves Architecture.
A test set of 32 synthetic music scores is also provided by the authors of this toolkit. Moreover, the
toolkit allows applying a set of deformations (e.g., rotation, curvature, typeset emulation, white speckles)
commonly found in the real world to these perfect
scores—see Figure 3. The purpose is to be able to
measure the performance of the removal algorithms
contained in MusicStaves by using three defined error metrics (Dalitz et al., 2008): Pixel Level, Segmentation Region Level and Staffline Interruptions.
However, the same set may be used to evaluate staff
line detection algorithms alone by defining adequate
error metrics (Cardoso et al., 2008a; Cardoso et al.,
2008b). We have restricted the range of the parameters controlling the intensity of the deformations to
values considered realistic.
2.3
Integration
We have integrated our recently proposed algorithm
for staff line detection (Cardoso et al., 2008a; Cardoso et al., 2008b) in MusicStaves as a C++ plugin.
Figure 4: Overall view.
Besides integrating our Stable Paths Approach algorithm into the MusicStaves toolkit as a C++ plugin, we have also integrated the algorithm with staff
removal algorithms in order to evaluate the improvements over the original results. However, this is not
a standard integration as the toolkit does not provide
the means for this kind of integration. This integration was coded in the staff removal algorithms by
adding the possibility to choose the staff line detection algorithm through a parameter value. A diagram
illustrating this integration can be seen in Figure 5.
Some removal algorithms present in MusicStaves detect the staff lines along with the removal process.
From those, we have used the one with best results
in general (Dalitz et al., 2008)—Skeleton—for comparison purposes.
Figure 5: Stable Paths integration with staff removal algorithms.
3
STAFF REMOVAL
ALGORITHMS
In the process of recognizing music scores the staff removal algorithm is processed after the staff line detection stage takes place. In our current work the removal
algorithm is based on the LineTrack Height algorithm
presented on (Randriamahefa et al., 1993). The goal
on this method is to track the staff lines positions obtained by a detection algorithm and remove vertical
run sequences of black pixels that have a value lower
than a specified threshold, which was experimentally
set at 2*staffHeight.
As music scores may suffer from deformations,
the staff lines may have descontinuities, be curved or
inclined. These problems will influence the success
to achieve a correct detection of lines contained on
the score to recognize. However, due to the abovementioned problems, the positions of the staff lines
obtained by a staff line detection algorithm may pass
slightly above or under the real staff lines positions.
That way, if we are in presence of a white pixel
when the staff lines are tracked, we search vertically
for the closest black pixel. If that distance is lower
than a specified tolerance—experimentally chosen as
1+ceil(staffHeight/3.0)—we move the reference position of the staff line to the position of the black pixel
found.
On (Dalitz et al., 2008) a new method is
presented—Skeleton—which uses the skeleton information, but performs the staff removal on the original
Algorithm 1 Staff Removal Algorithm.
procedure
STAFF L INE R EMOVAL (IMAGE,STAV ES)
threshold = 2∗staffHeight;
tolerance = 1 + ceil(staffHeight/3.0);
IMAGE REMOV E = copy(IMAGE);
for nvalid = 0 to STAVES size do
Point2D staff = validStaves[nvalid];
for i = 0 to staff size do
col =staff[i].x;
re f Row =staff[i].y;
row = re f Row;
pel = valuePixel(IMAGE, IMAGE REMOV E)
decrement/increase the reference row until
one pixel different from white pixel (dist1/dist2) is
found;
if dist1 ≤ max(1, min(dist2,tolerance)) then
re f Row− = dist1;
else
if dist2 ≤ max(1, min(dist1,tolerance))
then
re f Row+ = dist2;
else
continue;
end if
end if
Count the number of decrements/increase on
the reference row until the black pixel changes to white
pixel (run);
if run ≥ threshold then
continue;
end if
remove the vertical black sequences on the
IMAGE;
end for
end for
end procedure
image instead of the skeleton. The method relies on
the fact that symbols on the stafflines lead to junction
points or corner points in the skeleton.
4 RESULTS
Although the evaluation of new staff detection algorithms may be done by visually inspecting the output on a set of scores—as adopted on (Rebelo et al.,
2007)—, our current comparison is supported on
quantitative measures. The test set adopted for the
qualitative evaluation of the proposed method is the
one presented in (Dalitz et al., 2008) and already described. It consists of ideal synthetic scores to which
a set of known deformations have been applied—see
(Dalitz et al., 2008) for more details. In total we have
generated 2688 deformed images originated from 32
perfect scores. In order to conveniently measure the
performance of staff line removal algorithms we have
adopted two error metrics from (Dalitz et al., 2008):
Pixel Level and Segmentation Region Level.
Staff line detection algorithms can be used as a
first step in many staff removal algorithms. To understand the potential of our algorithm to leverage
the performance of existing staff removal algorithms,
we conducted a series of experiments, comparing the
original version of a staff removal algorithm with the
modified version of it, making use of the Stable Paths
algorithm at the staff line detection step. The quantitative comparison of the different algorithms is totally
in line with the comparison presented in (Dalitz et al.,
2008).
With respect to the considered distortions, regarding the detection stage, the Stable Paths based approach outperforms the Dalitz algorithm. In Figure 6
we present our results for the removal algorithms:
LineTrack Height (with Dalitz and Stable Paths),
Skeleton and LineTrack Height Modified (with Stable
Paths). We chose the methods that present the best results in (Dalitz et al., 2008), implementing our own
removal algorithm with LineTrack Height as a basis.
In general we verify that the replacement of the Dalitz
method by our Stable Paths Approach algorithm as
the staff detection step has improved the final staff
line removal results7 . Additionally, the LineTracking
Height Modified algorithm presents an overall better
performance than the original LineTrack Height algorithm from (Dalitz et al., 2008). Our staff line detection and removal approaches also outperform the
Skeleton method, although it continues to present a
competitive performance. We have not integrated the
Stable Paths algorithm with the Skeleton algorithm
as the second performs the lines detection along with
their removal instead of using two separate stages. All
the parameters, both on the Stable Paths detection algorithm and LineTrack Height Modified, were preliminary tuned over an independent set of images.
This performance gain is even more noteworthy as
the MusicStaves algorithms are receiving as input the
correct number of lines per staff. Had not this been the
case, the differential between both would have been
much larger. In summary, these experiments show the
strength of the algorithms presented here. Despite being based on simple and intuitive underlying principles, the performance of the proposed algorithms is
quite competitive.
The analysed results have covered the detection
and removal accuracy but a brief word on speed is also
in order. Comparing different algorithms for speed
is notoriously difficult; we are simultaneously judging mathematical properties and specific implementa7 For
the deformations not shown, the stable path is not
significantly better than Dalitz.
tions. In the experimental study, the current implementation of the Stable Paths algorithm run almost
as fast as the Dalitz algorithm (20% slower). In respect to the removal algorithms our LineTrack Height
version with Stable Paths is significantly faster than
the Skeleton algorithm (two times faster). Comparing to the original LineTrack Height algorithm with
the Dalitz detection algorithm the runtime difference
is not significant. The algorithms were evaluated as
available at the Staff Removal Toolkit (Dalitz et al.,
2008).
5
CONCLUSIONS
This paper presented the integration of our robust Stable Paths Approach algorithm (Cardoso et al., 2008a;
Cardoso et al., 2008b) in the MusicStaves Toolkit
(Dalitz et al., 2008) as a C++ plugin, an improved
version of an existing staff line removal algorithm—
LineTrack Height (Dalitz et al., 2008), and the results we have obtained in our staff line removal tests.
We have integrated our detection algorithm with existing staff line removal algorithms. Our approach
successfully deals with the difficulties posed by the
symbols superimposed on the staff lines as well as a
wide range of image conditions (e.g., discontinuities,
curved lines), frequently found on handwritten scores.
The encouraging results lead us now to consider
investigating the detection of music symbols benefiting from the improved staff line detection and removal, creating a complete OMR application in order to integrate it on our proposed complete OMR
solution—OMRSYS (Capela et al., 2008). Thus, our
proposed system offers a complete solution for the
preservation of our musical heritage. It includes an
optical recognition engine integrated with an archiving system and a user-friendly interface for searching, browsing and edition. The digitized scores are
stored in MusicXML, a recent and expanding music
interchange format designed for notation, analysis, retrieval, and performance applications.
Our proposed algorithms and complete OMR system promote the creation of a full corpus of music documents, promoting its preservation and study.
This project will culminate in the creation of a repository of handwritten scores, accessible online. The
database will be available for enjoyment, educational
and musicological purposes, thus preserving this corpus of music in an unprecedented way.
(a) Curvature.
(b) Curvature.
(c) White Speckles.
(d) Rotation.
(e) Staffline Y-Variation.
(f) Staffline Y-Variation.
(g) Staffline Interruptions.
(h) Typeset Emulation Part 1.
(i) Typeset Emulation Part 2.
Figure 6: Effect of different deformations on the overall staff removal error rates. See (Dalitz et al., 2008) for parameter
details.
ACKNOWLEDGEMENTS
This work was partially funded by Fundação para
a Ciência e a Tecnologia (FCT) - Portugal through
project PTDC/EIA/71225/2006.
REFERENCES
Capela, A., Cardoso, J. S., Rebelo, A., and Guedes, C.
(2008). Integrated recognition system for music
scores. In International Computer Music Conference
(ICMC) (Accepted).
Cardoso, J. S., Capela, A., Rebelo, A., and Guedes, C.
(2008a). A connected path approach for staff detection on a music score. In International Conference on
Image Processing (ICIP 2008) (Accepted).
Cardoso, J. S., Capela, A., Rebelo, A., and Guedes, C.
(2008b). Staff detection with stable paths. Transactions on Pattern Analysis and Machine Intelligence
(TPAMI) (Submitted).
Classical Sheet Music Collection (2008). Classical Sheet
Music and MIDI Files, http://www.music-scores.
com.
Dalitz, C., Droettboom, M., Pranzas, B., and Fujinaga, I.
(2008). A comparative study of staff removal algorithms. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 30(5):753–766.
Gamera Framework (2008). Gamera Framework, http://
ldp.library.jhu.edu/projects/gamera/.
IOA (2008). Institute of the Arts, http://patrimonio.
dgartes.pt/?lang=pt.
Lester Levi Collection (2008). The Lester S. Levi Collection of Sheet Music, http://levysheetmusic.mse.
jhu.edu/.
MacMillan, K., Droettboom, M., and Fujinaga, I. (2002).
Gamera: Optical music recognition in a new shell. In
International Computer Music Conference (ICMC),
pages 482–485.
MIC (2008). Music Information Center, http://www.mic.
pt.
MusicStaves Toolkit (2008).
MusicStaves Toolkit,
http://lionel.kr.hsnr.de/˜dalitz/data/
projekte/stafflines/.
Mutopia Collection (2008). The Mutopia Project, http:
//sca.uwaterloo.ca/Mutopia.
Randriamahefa, R., Cocquerez, J. P., Fluhr, C., Ppin, F., and
Philipp, S. (1993). Printed music recognition. In Proceedings of the 2nd International Conference on Document Analysis and Recognition (ICDAR’93), pages
898–901.
Rebelo, A., Capela, A., da Costa, J. F. P., Guedes, C., Carrapatoso, E., and Cardoso, J. S. (2007). A shortest path
approach for staff line detection. In AXMEDIS ’07:
Proceedings of the Third International Conference on
Automated Production of Cross Media Content for
Multi-Channel Distribution, pages 79–85, Washington, DC, USA. IEEE Computer Society.
INTEGRATED RECOGNITION SYSTEM FOR MUSIC SCORES
Artur Capela, Jaime S. Cardoso
FEUP and INESC Porto
Portugal
Ana Rebelo
FCUP and INESC Porto
Portugal
Carlos Guedes
ESMAE and INESC Porto
Portugal
[email protected]
[email protected]
[email protected]
[email protected]
ABSTRACT
Many music works produced in the last century still exist
only as original manuscripts or as photocopies. Preserving
them entails their digitalization and consequent accessibility in a digital format easy-to-manage which encourages
browsing, retrieval, search and analysis while providing
a generalized access to the digital material. The manual
process to carry out this task is very time consuming and
error prone. Automatic optical music recognition (OMR)
has emerged as a partial solution to this problem. However, the full potential of this process only reveals itself
when integrated in a system that provides seamless access
to browsing, retrieval, search and analysis. We address
this demand by proposing a modular, flexible and scalable framework that fully integrates the abovementioned
functionalities. A web based system to carry out the automatic recognition process, allowing the creation and management of a music corpus, while providing generalized
access to it, is a unique and innovative approach to the
problem. A prototype has been implemented and is being
used as a test platform for OMR algorithms.
1. INTRODUCTION
The impact of music in our lives can hardly be overestimated. Music is a pivotal part of our cultural heritage and
its preservation, in all of its forms, must be pursued.
Portugal has a notorious lack in music publishing from
virtually all eras of its musical history. However, whereas
most of the known original music manuscripts before the
twentieth century are kept at the National Library Archive
in Lisbon, there is virtually no national repository for the
Portuguese music from the twentieth century. Although
there are recent efforts in order to catalogue and preserve
in digital form the Portuguese music from the late twentieth century—notably the Music Information Center [10]
and the section on musical heritage from the Institute of
the Arts website [8]—most of the music pre-dating computer notation software was never published and still exists as manuscripts or photocopies spread out all over the
country in inconspicuous places.
For example, all the music composed by Jorge Peixinho (1940-1995), an internationally-renowned composer
who epitomized the Portuguese avant-garde in the 1960s
and 1970s, was never published in Portugal (few of his
scores were published abroad), and almost his entire oeuvre consists of manuscript paper [6]. Almost fourteen
years past his death, his music is already catalogued, although not published. Unfortunately, this case is not unique,
and this situation is common with other great Portuguese
composers from the twentieth century. The risk of irreversibly losing this rich cultural heritage is thus a reality.
The project “Optical recognition system for handwritten music scores” initiated in 2007 by INESC Porto and
ESMAE is the point of departure for creating a web-based
system of music manuscripts of Portuguese composers from
the twentieth century. This database will provide generalized access of a wide corpus of handwritten unpublished music encoded in MusicXML that can be accessed
remotely via the Internet. The database will not only centralize as much information as possible but will also serve
to preserve this corpus in a way that is easily accessible
for browsing, analysis, and ultimately, for performing this
repertoire, therefore help keeping the Portuguese music
alive.
The ambitious goal of providing generalized access to
handwritten scores that have never been published has been
severely hampered by the current state-of-the-art of handwritten music recognition. There are currently various
commercial OMR software solutions [4, 16, 14] and a few
open source solutions [1, 15, 2], but they are all offline
standalone applications. The existing online archives of
music scores [9, 5, 13] usually provide them in inadequate
formats—usually only as the scanned score image—for
retrieval or automatic analysis. These online archives are
mere standard websites, without facilities for optical recognition, editing and searching through the scores’ musical
content. The creation of an OMR system, integrating optical recognition, storage, search, browsing and downloading capabilities, while keeping the scores in their original
format along with their digital counterpart, would therefore be extremely beneficial.
Using this background, we present the specification and
implementation of a system integrating all the required
features. It uniquely combines OMR technology in a system, easing the conversion of scores to the Music Extended Markup Language format [12]—MusicXML—as
it is being widely adopted and meets our needs. In Section 2 the proposed system is described. We continue in
Section 3 by presenting a usage scenario. Finally, conclusions are drawn in Section 4.
2. SYSTEM ARCHITECTURE AND
IMPLEMENTATION
The system that we propose on this paper comprises the
creation of a database of music scores and a web application mainly featuring:
• Addition of music scores to the system, performing
their recognition and conversion to MusicXML in
an integrated manner, allowing the user to confirm
and correct the conversion results at the last stage of
this process.
• Complete maintenance of a fully navigable music
scores archive, including both the original version
and the digital version obtained from the optical
recognition.
• Browsing and searching the database, as well as the
MusicXML contents. Visualization, downloading
and edition of the selected music scores.
• Complete system management.
The architecture for the proposed system is based on a
client-server model. The system is intended to be accessible through the Internet. There are three different entities
present in this system, as it can be seen in Figure 1.
Web Browser
Client 1
Web Server
LAN
Repository 1
Repository 2
Repository N
OMR engine 1
OMR engine 2
OMR engine N
Internet
Web Browser
Client 2
Search engine
Web Browser
Client N
Figure 1. Generic system architecture
The Repository module stores the original scanned score,
the digital counterpart in MusicXML and all the descriptive metadata inserted by the user, as detailed latter. All
the remaining system contents, such as the user information, are also stored in this entity.
The Web Server is the user access point to the system as
well as to all of its processing modules run on the server,
encompassing the search engine and the optical recognition engine for the music scores. There is support for the
inclusion of several OMR Engines, aiming to provide the
ability to meet different needs (e.g. different music notation systems). The most adequate OMR Engine can be
chosen manually by the user or automatically by the system, by detecting the scores notation and type (i.e. handwritten or printed). Our Search Engine allows not only
generic searches throughout all of the system contents,
but it also provides the capability of searching throughout
the music scores MusicXML information in an innovative
manner. The Web Server interacts with the Repository and
with the Web Browser, which establishes the interface between the user and the system.
The user interface on a Web Browser allows the complete management of the music scores and associated metadata, as well as carrying out the system administration.
Generaly speaking, the user interface provides the user the
ability to execute all the necessary tasks to fully use the
proposed system. On the administration side it is possible
to manage the users, as well as the whole system contents
and validate new ones.
There are four user types which can access the system:
General User, Registered User, Privileged User and the
Administrator. The General User represents a visitor and
may only consult and download contents. The remaining types are registered users and according to their level
they may add/edit/remove certain contents with or without
restrictions. The Privileged User is similar to the Administrator and has full access to all functionalities, though
it cannot manage the users from its own level. The contents added by Registered Users have to be validated by
Privileged Users or the Administrator, although the later
two are able to add any contents without the need to be
validated, they are considered to be trustful users.
For content management, functionalities include the addition of music scores to the Repository, their automatic
recognition, visualization and edition, searching, and browsing. It is also possible to insert and browse information
related to the music scores—name, authors, instruments,
musical genres—providing the user with a Repository containing all the necessary information to keep a complete
music corpus. This metadata can then be used by the user
on search queries or for a more flexible browsing experience. Finally, a music work is organized into sections,
where each section is a music score which represents a
part from the whole music work. This flexible structure
allows accommodating either simple or complex works on
the system. Each score can be visualized and its representation in MusicXML edited, side-by-side with the original
score directly on the Web Browser. Both the visualization
and the edition of the music scores are done in a graphical easy-to-use editor available through the Web Browser.
Figure 2 illustrates the information recorded in the system
associated with a music work.
Figure 2. Content stored in the Repository associated
with a music work
2.1. Prototype Implementation – OMRSYS
We developed a prototype—OMRSYS—taking the system architecture shown in Figure 1 as a basis. Currently,
the prototype supports only a single Repository collocated
with the Web Server and a single OMR Engine was used
to prove this concept.
The Repository is implemented as a PostgreSQL 1 database, an open source Database Management System (DBMS).
The main reasons for choosing PostgreSQL were its native XML support, needed for creating a search engine
that allows searching on the scores’ MusicXML counterparts stored on the database. This is an important aspect
as it is a major feature for the proposed system. It is
also a very mature and well documented popular DBMS.
Another important feature is the great ease of integration
on the framework we have selected to develop the prototype, which we will describe next. MySQL 2 is the default DBMS used on the framework chosen for developing
the Web Application, but it lacks on XML support. Other
DBMSs were also considered but PostgreSQL was the one
that best fitted our needs.
The development of the Web Application was supported
on Ruby on Rails 3 . Rails is an open source, full-stack
framework for developing dynamic database-backed web
applications according to the Model-View-Control (MVC)
pattern. It is an almost complete platform which requires
only a DBMS and a server. These features present us with
a suitable choice for supporting the development of our
prototype. Ruby is the programming language at its core,
a flexible and powerful Object Oriented language, with a
vast array of powerful characteristics. Another strong advantage is the database manipulation, as it is greatly simplified, which is ideal to develop a system of this kind.
Other frameworks [17, 7] were considered but fell short
compared to the features of Ruby on Rails.
The Web Server selected for the prototype was the Apache HTTP Server 4 with Mongrel 5 to execute the Web
Application. This solution is suggested by Ruby on Rails
and was found suitable, being both efficient and open source.
The Search Engine developed for this prototype allows
the usual queries on the database contents, although the
groundbreaking search through the MusicXML contents
is still not possible at this stage.
The OMR Engine on the Web Server is the module
responsible for the automatic recognition of the submitted music scores. We initially adapted it from the open
source OpenOMR project [15]. The OMR Engine performs an automatic conversion of a submitted music score
to a digital easy-to-use representation, the MusicXML format. This digital format allows representing sheet music
by its musically-relevant parts, sections, phrases and motives, thus easing the access to the relevant portions of the
score while browsing that score in a computer monitor.
1
http://www.postgresql.org
http://www.mysql.com
3 http://www.rubyonrails.org
4 http://httpd.apache.org
5 http://mongrel.rubyforge.org
2
Figure 3. The OMRSYS user interface
Simultaneously, MusicXML enables the retrieval of relevant musical information for analysis, thus facilitating
certain types of computational analysis to be performed
on a corpus of scores. Nevertheless, it also provides an
adequate way to restore old sheet music, preventing them
from oblivion. The other OMR applications we have analysed were left aside because they were either less complete than OpenOMR at the time or they were commercial
solutions. Our main goal at this stage was to prove the
concept.
The User Interface has a great impact in the user experience and is divided into several sections. There is the
authentication, title and quick search sections on the upper
portion of the screen, followed by the middle and largest
portion which includes the main menu and the contents
area. The main menu works as a two-level expansible
menu and allows access to all the system’s functionalities
by grouping them in a logical manner. Similar functionalities follow similar designs to keep the interface consistent,
intuitive and easy to learn. Some of the functionalities are
the common Create/Read/Update/Delete (CRUD) and the
listing of the database contents based on a chosen criteria,
all following a familiar behaviour. The main differences
and most unique aspects rely on the submission and the
update of music scores, which is discussed in Section 3.
As an interface example, Figure 3 shows the Graphical
User Interface (GUI) being used for browsing the music
scores available in the Repository. Both the Privileged
Users and the Administrator have additional options in the
main menu for validation purposes.
The MusicXML Editor for this prototype is still implemented as a plain text editor and viewer, but already showing side-by-side the original score with its MusicXML
counterpart. However, the development of a fully integrated graphical MusicXML editor is being pursued. Such
editor would allow a higher level and intuitive edition and
could be developed for example in Flash, in the likes of
MusicRain [11], an online interactive sheet music viewer.
The main purpose for the editor in this initial prototype
was to give the end-user the possibility to at least view
and edit the music scores in MusicXML. The existing editors and visualizers [11] usually have a high maturity level
but they are offline applications.
Figure 4. Score submission scenario
The Digital Rights Management control is done in two
ways: the acceptance of a license agreement at the registration process and the validation of submitted music
scores by a Privileged User before they become available
on the system.
tions. An additional benefit of the automatic conversion of
the music score to MusicXML is the possibility of encoding the manuscript score in MX format, an XML-base,
multi-layered format for music representation [3]. MX
synchronizes several layers belonging to the description
of a piece of music, e.g. an audio recording and score of
the same piece.
A system of this kind promotes the creation of a full
corpus of music documents, promoting its preservation
and study. This project will culminate in the creation of
a repository of the handwritten scores, accessible online.
The database will be available for enjoyment, educational
and musicological purposes, thus preserving this corpus
of music in an unprecedented way.
Acknowledgments
This work was partially funded by Fundação para a Ciência
e a Tecnologia (FCT) - Portugal through project
PTDC/EIA/71225/2006.
5. REFERENCES
3. USAGE SCENARIO: SCORE SUBMISSION
When the insertion of a music score in the system is requested, the user inserts the metadata associated with the
music score, of which some is optional—name, year, description, etc—and associates it with one or more authors
and a musical genre. Each section of a music work has
to be associated with the instruments present on the music
score. In the last submission step, after the insertion of the
requested metadata, the user submits the various pages for
each section, as illustrated in Figure 4.
After validating the inserted data, the user then triggers
the automatic recognition process by calling a suitable
OMR engine. Afterward, an overview of the submitted
score is shown by listing its contents allowing the user to
view the result of the automatic process on the built-in editor side-by-side with the original scanned image, offering
the user the possibility to manually correct the automatic
results. After confirming the results and making the necessary corrections, the user then finalizes the music score
submission by accepting it. If the score was submitted
by the Administrator or a Privileged User it then becomes
immediately available on the system; if it was submitted
by a standard Registered User it is kept on queue for validation and will only become available once a user with
administration privileges validates it.
4. CONCLUSION
The proposed system offers a complete solution for the
preservation of our musical heritage. It includes an optical recognition engine integrated with an archiving system
and a user-friendly interface for searching, browsing and
edition. The digitized scores are stored in MusicXML, a
recent and expanding music interchange format designed
for notation, analysis, retrieval, and performance applica-
[1] AOMR2. [Online]. Available:
http://www.bzzt.net/
∼arnouten/wiki/index.php/Gamera#AOMR2: omr toolkit
[2] Audiveris. [Online]. Available: http://audiveris.dev.java.
net
[3] A. Baraté, G. Haus, and L. Ludovico, “An XML-based format for advanced music fruition,” in Proceedings of the
Third Sound and Music Computing Conference, 2006, pp.
141–147.
[4] Capella-scan. [Online]. Available:
http://www.
capella-software.com/capscan.htm
[5] Classical Sheet Music and MIDI Files. [Online]. Available:
http://www.music-scores.com
[6] Delgado, Cristina, Machado, Jorge, and J. Machado,
Catálogos da obra de Jorge Peixinho, ser. José Machado
(ed.) Jorge Peixinho: In Memoriam. Lisbon: Caminho,
2002.
[7] Google Web Toolkit. [Online]. Available: http://code.
google.com/webtoolkit
[8] Institute of the Arts. [Online]. Available: http://patrimonio.
dgartes.pt/?lang=pt
[9] The Lester S. Levi Collection of Sheet Music. [Online].
Available: http://levysheetmusic.mse.jhu.edu/
[10] Music Information Center. [Online]. Available: http:
//www.mic.pt
[11] MusicRain. [Online]. Available: http://musicrain.us
[12] The MusicXML Format. [Online]. Available: http:
//www.musicxml.org
[13] The Mutopia Project. [Online]. Available: http://sca.
uwaterloo.ca/Mutopia
[14] OMeR. [Online]. Available: http://www.myriad-online.
com/en/products/omer.htm
[15] OpenOMR. [Online]. Available: http://sourceforge.net/
projects/openomr
[16] SharpEye Music Reader. [Online]. Available: http:
//www.visiv.co.uk
[17] Tacos. [Online]. Available: http://tacos.sourceforge.net
A CONNECTED PATH APPROACH FOR STAFF DETECTION ON A MUSIC SCORE
Jaime S. Cardoso∗ , Artur Capela† , Ana Rebelo‡ , Carlos Guedes§
ABSTRACT
The preservation of many music works produced in the past entails their digitalization and consequent accessibility in an easy-tomanage digital format. Carrying this task manually is very time consuming and error prone. While optical music recognition systems
usually perform well on printed scores, the processing of handwritten musical scores by computers remain far from ideal. One of the
fundamental stages to carry out this task is the staff line detection. In
this paper a new method for the automatic detection of music staff
lines based on a connected path approach is presented. Lines affected by curvature, discontinuities, and inclination are robustly detected. Experimental results show that the proposed technique consistently outperforms well-established algorithms.
Index Terms— Music, optical character recognition, document
image processing, image analysis
1. INTRODUCTION
The Universal Declaration on Cultural Diversity adopted by the General Conference of UNESCO on 2001 asserts that cultural diversity
is as necessary for humankind as biodiversity is for nature, and that
policies to promote and protect cultural diversity thus are an integral part of sustainable development. Being music a pivotal part
of our cultural heritage, its preservation, in all of its forms, must
be pursued. Frequently, the preservation of many music works entails their digitalization and consequent accessibility in a format that
encourages browsing, analysis and retrieval. In fact, many music
works produced during the last centuries still exist only as original
manuscripts or as photocopies. The digitalization of these works
is therefore a highly desirable goal. Unfortunately, the ambitious
goal of providing generalized access to handwritten scores that were
never published has been severely hampered by the actual state-ofthe-art of handwritten music recognition. The manual process required to recognize handwritten musical symbols in scores and to
put them in relationship with the spine structure of the score is very
time consuming. This justifies the research around the definition of
reliable optical music recognition (OMR) algorithms.
Staff line detection is one of the fundamental stages of the OMR
process, with subsequent processes relying heavily on its performance. The reasons for detecting and removing the staff lines lie
on the need to isolate the musical symbols for a more efficient and
correct detection of each symbol presented on the score.
The detection of staves is complicated due to a variety of reasons. The handwritten staff lines are rarely straight and horizontal,
∗ INESC Porto, Faculdade de Engenharia, Universidade do Porto, Portugal, email: [email protected]
† INESC Porto, Faculdade de Engenharia, Universidade do Porto, Portugal, email: [email protected]
‡ INESC Porto, Faculdade de Ciências, Universidade do Porto, Portugal,
email: [email protected]
§ INESC Porto, Escola Superior de Música e Artes do Espectáculo, Portugal, email: [email protected]
and are not parallel to each other. For example, some staves may be
tilted one way or another on the same page or they may be curved.
These scores tend to be rather irregular and determined by a person’s own writing style. Moreover, if we consider that most of these
works are old, the quality of the paper in which it is written might
have degraded throughout the years, making it a lot harder to correctly identify its contents.
In this paper a method for the automatic detection of staff lines
based on a connected path approach is presented. The proposed
paradigm uses the image as a graph, where the staff lines result as the
connected path between the two margins of the image. Our previous
work [1] was a first effort to explore this concept. We now present
a complete, principled solution, with a strong experimental validation of the proposed approach. This introduction is concluded with a
brief review of the work done in this area. In section 2 the proposed
algorithm is described. In section 3, the proposed algorithm is evaluated experimentally using a well-known dataset of music scores.
Finally, conclusions are drawn and future work is outlined in section 4.
1.1. Related Works
The problem of staff line detection is often considered simultaneously with the goal of their removal, although exceptions exist [2, 3].
The simplest approach consists on finding local maxima in the horizontal projection of the black pixels of the image [4]. These local
maxima represent line positions, assuming straight and horizontal
lines. Several horizontal projections can be made with different image rotation angles, keeping the image in which the local maxima
are bigger. This eliminates the assumption that the lines are always
horizontal. An alternative strategy for identifying staff lines is to use
vertical scan lines [5]. More recent works present a more or less sophisticated use of a combination of projection techniques to improve
on the basic approach [6].
Fujinaga [7] incorporates a set of image processing techniques
in the algorithm, including run-length coding (RLC), connectedcomponent analysis, and projections. After applying the RLC to
find the thickness of staff lines and the space between the staff lines,
any vertical black runs that are more than twice the staff line height
are removed from the original. Then, the connected components
are scanned in order to eliminate any component whose width is
less than the staff space height. After a global deskewing, taller
components, such as slurs and dynamic wedges are removed.
Other techniques for finding staff lines include the application of
mathematical morphology algorithms [8], rule-based classification
of thin horizontal line segments [9], and line tracing [10, 11]. The
methods proposed in [2, 3] operate on a set of ‘staff segments’, with
methods for linking two segments horizontally and vertically and
merging two segments with overlapping position into one.
In spite of the variety of methods available, they all suffer from
some limitations. In particular, lines with some curvature or discontinuities are inadequately resolved. The dash detector [12] is one of
few works that try to handle discontinuities. The dash detector is an
algorithm that searches the image, pixel by pixel, finding black pixel
regions that it classifies as stains or dashes. Then, it tries to unite the
dashes to construct lines.
2. A CONNECTED PATH APPROACH FOR STAFF LINE
DETECTION
A staff line can be considered as a connected path from the left side
of the music score to the right side. As staff lines are almost the only
extensive black objects on the music score, the path we are looking
for is the shortest path between the two margins if paths (almost)
entirely through black pixels are favoured. More formally, let s and t
be two pixels of the image and Ps,t a path over the image connecting
them. We are interested in finding the path P that optimizes some
predefined distance d(s, t). This criterion should embed the need to
favour black pixels.
In the work to be detailed, the image grid is considered as a
graph with pixels as nodes and edges connecting neighbouring pixels. The weight of each arc, w(p, q), is a function of pixels values
and pixels relative positions. A path from vertex (pixel) v1 to vertex
(pixel) vn is a list of unique vertices v1 , v2 , . . . , vn , with vi−1 and
vi corresponding to neighbour pixels. The path cost is the sum of
each arc weight in the path n
i=2 w(vi−1 , vi ).
As mentioned before, a staff line corresponds to a path from (almost) the left margin of the image to (almost) the right side of the
image, (almost) always through black pixels. If the weight assigned
to an edge captures the intensity of the path of the adjacent pixels,
finding the best path between a point s on the left margin and a point
t on the right margin translates into computing the minimum accumulated weight along all possible connected curves connecting s and
t:
d(s, t) = min
w(p, q).
(1)
At the end of this process, the minimum value of the last column in
C will indicate the end of the minimal connected staff. Hence, in the
second step one backtrack from this minimum entry on C to find the
path of the optimal staff.
2.1. Algorithm outline
Assume one wants to find all staff lines present in a score. This
can be approached by successively finding and removing the shortest
path from the left to the right margin of the score. The removal
operation is required to ensure that a staff is not detected twice2 .
Consider the music score presented in Figure 1(a). In Figure 1(b) the first 11 shortest paths are traced. This example shows
that music symbols placed on top of staff lines do not interfere
with the detection of the staff lines. Moreover, the example also
makes clear that slight skewed scores do not pose any problem to
the proposed approach.
P
X
Ps,t
Staff lines are best modelled as paths between two regions Ω1 and
Ω2 , the left and right margins of the score. The shortest path between
two regions Ω1 and Ω2 is defined as a path Ps,t , with s ∈ Ω1 and
t ∈ Ω2 and cost equal to
d(Ω1 , Ω2 ) =
min
s∈Ω1 ,t∈Ω2
d(s, t).
(2)
One may assume that staff lines do not zigzag back and forth,
left and right. Therefore, one may restrict the search among connected paths containing one, and only one, pixel in each column of
the image1 . Formally, let I be a n × m image and define an admissible staff to be
s = {(j, y(j))}n
j=1 , s.t. ∀j |y(j) − y(j − 1)| ≤ 1,
where y is a mapping y : [1, · · · , n] → [1, · · · , m]. That is, a staff
line is an 8-connected path of pixels in the image from left to right,
containing one, and only one, pixel in each column of the image.
Given the weight function w(p, q), one can define the cost of
n
a staff as C(s) =
i=2 w(vi−1 , vi ). The optimal staff line that
minimizes this cost can be found using dynamic programming. The
first step is to traverse the image from the second column to the last
column and compute the cumulative minimum cost C for all possible
connected staff lines for each entry (i, j):
P
8
>
<
C(i, j) = min
>
:
1 This
C(i − 1, j − 1)
C(i − 1, j)
C(i − 1, j + 1)
+
+
+
w(pi−1,j−1 ; pi,j )
w(pi−1,j ; pi,j )
w(pi−1,j+1 ; pi,j )
assumption imposes a maximum detectable 45 rotation degree.
(a) Skewed staff lines with music (b) The first shortest paths between
symbols.
left and right margins.
Fig. 1. An exemplificative example.
Our first naive effort to apply the shortest path foundation to staff
line detection in [1] did it by computing the shortest path between
two pixels at the same height on the left and right margin. That
approach was not robust enough to tilted scores, leading the detected
paths to jump between consecutive staff lines.
Nonetheless, two main issues are still visible with the current
methodology and need to be conveniently addressed. A criterion is
needed to stop the iterative detection of the shortest paths (staff lines)
and the initial and the final parts of a path should be trimmed.
2.2. Proposed Algorithm
To detect the staff lines, the proposed overall algorithm starts by estimating the staff space height, staffspaceheight, and staff line
height, stafflineheight. These lengths are used as reference
lengths on subsequent operations. Robust estimators are already in
common use: the technique starts by computing the vertical runlengths representation of the image. If a bit-mapped page of music
is converted to vertical run-lengths coding, the most common blackruns represents the staff line height and the most common white-runs
represents the staff space height [7].
After estimating the reference lengths, the proposed approach
applies the main step of the framework, by successively finding
the shortest path between the left and right margin, adding the
path found to the list of staff lines and removing it from the image. The weight w(p, q) was experimentally set to w(p, q) =
2 + (Ip + Iq )/255,
√ with Ip , Iq ∈ {0, 255}, for pixels in a 4neighbourhood or 2 times that value for 8-neighbours. The
removal operation sets to white the pixels on a vertical strip of
height= 2×stafflineheight, centred on the detected staff
2 We implemented the removal operation by setting to white the pixels on
the detected staff; image resizing could be a valid alternative [13].
line. To stop the iterative staff line search, a sequence of (arguably)
sensible rules is used to validate the last found path; if it does not
pass the checking, the iterative search is broken. Two validation
rules were applied, both assessing features with respect to the first
detected staff line (assumed to be the most perfect one). If the last
path does not have a percentage of black pixels above a threshold, the
search is broken (a threshold of blackperc = 0.8 of the percentage
of black pixels in the first staff line was used in the experiments).
Likewise, if the shape of the last detected path differs too much
from the shape of the first detected path (measured as the average ydistance between both paths, after removing the means), the iteration
is broken. A threshold shapediff = 4×staffspaceheight
was experimentally selected.
After the main search step, detected staff lines are post-processed.
Although true staff lines never intersect, the above algorithm may
occasionally create intersecting lines. That may be due to a local
low quality of a line, leading the shortest path to jump between
consecutive lines; the next iteration will then follow the remaining
segments, intersecting with the previous detected line. To preclude
such final, undesired state, lines are post-processed to remove intersections. That is easily and efficiently accomplished by, for each
image column, sorting on y the pixels of the detected lines and assigning the i-pixel to the i-line. After this simple process, lines may
touch but they do not intersect. Each retained line is then trimmed at
the beginning and at the ending. As visible in the previous example
(refer to Figure 1(b)), before meeting with a staff line, a path travels
through a sequence of white pixels. Likewise, after the end of the
staff line, the path goes again through a sequence of white pixels
until it meets the right margin of the image. In order to ignore all of
these white pixels, the initial pixels of the path are discarded until
a run of at least blackrun black pixels are found in the path. In
the same way, all pixels of the path after the last occurrence of a run
of at least blackrun black pixels are discarded. A threshold of
blackrun = 2×staffspaceheight was used on the experiments. Finally, lines are smoothed with a standard average low-pass
filter. A window size of 2×staffspaceheight was selected on
the experiments.
3. RESULTS
This section provides experimental results obtained on a set of
scores. Although the assessment of a new staff detection algorithm
may be done by visually inspecting the output on a set of scores—as
adopted on [1]—, here we support the comparison on quantitative
measures. The test set adopted for the qualitative evaluation of the
proposed method is the one presented in [14]. The test set consists
of ideal scores to which known deformations can be applied. The
distortions range from rotation and curvature to typeset emulation
and staff line thickness variation—see [14, 15] for more details. In
total, 630 images were generated from the originally perfect scores.
To conveniently measure the performance of a staff line detection
algorithm, we considered two different error metrics: the number of
false positive staff lines and missed to detect staff lines.
To evaluate these metrics, we start by computing the average Euclidian distance between each reference staff line and each actually
detected staff line; then we solve the matching problem on the resulting bipartite graph by minimizing the assignment cost (= distance).
Only pairs with average error-distance bellow the staff line height
were considered correctly matched (the other pairs were assumed to
originate from a false positive staff line being matched to an undetected true staff line and were therefore unmatched). Now the two
metrics result as the number of unmatched detected staff lines (false
positive) and unmatched reference staff lines (missed to detect).
The proposed algorithm was compared with the three methods
considered in [14] for staff line detection3 . As Dalitz’s algorithm
performs significantly better than the two other algorithms evaluated
in [14], we have only included Dalitz results in subsequent figures.
It is important to state that the comparison reports only to staff line
detection algorithms, not to the removal phase. That explains the
need to introduce the aforementioned metrics, while not adopting
the metrics introduced in [14] for assessing staff line removal.
The effects of the different deformations over the respective parameter ranges are shown in Figure 2. With respect to the distortions
considered, our connected path based approach is the most robust
and clearly outperforms the Dalitz algorithm. In fact, the performance of our approach is almost independent of intensity of the deformation, for the range of values considered. This performance gain
is even more noteworthy as the Dalitz algorithm is receiving as input
the correct number of lines per staff, while the proposed approach
does not rely on that information. Had not this been the case, the
differential between both would have been much larger.
In summary, these experiments show the strengths of the presented algorithm. Despite being based on a simple and intuitive
underlying principle, the performance of the proposed algorithm is
quite competitive. Moreover, the results are prone to be improved
even further by elaborating the stopping criterion of the iterative
search or the post-processing rules, while leaving intact the main
principle of the method.
4. CONCLUSION
The first challenge faced by an OMR system is staff line detection.
This first task dictates the possibility of success for the recognition
of the music score. In the case of handwritten music scores, the
existing solutions are far from presenting satisfactory results.
In this paper, a new algorithm for the automatic detection of
staff lines in music scores was proposed. The connected path approach for staff line detection algorithm is adaptable to a wide range
of image conditions, thanks to its intrinsic robustness to skewed images, discontinuities, and curved staff lines. The handwritten staff
lines are rarely straight and horizontal, and are not parallel to each
other. Some staves may be tilted one way or another on the same
page or they may be curved. While current approaches apply a chain
of heuristics to correct these undesired imperfections, the connected
path algorithm is naturally robust to these challenging conditions.
The proposed approach is robust to broken staff lines (due to lowquality digitalization or low-quality originals) or staff lines as thin
as one pixel. Missing pieces are automatically ‘completed’ by the
algorithm.
Acknowledgments
This work was partially funded by Fundação para a Ciência e a Tecnologia (FCT) - Portugal through project PTDC/EIA/71225/2006.
5. REFERENCES
[1] Ana Rebelo, Artur Capela, Joaquim F. Pinto da Costa, Carlos
Guedes, Eurico Carrapatoso, and Jaime S. Cardoso, “A shortest
path approach for staff line detection,” in Third International
Conference on Automated Production of Cross Media Content
for Multi-channel Distribution (AXMEDIS 2007), 2007.
3 The
source code is available upon request to the authors.
false staves (connected path)
missed staves (connected path)
false staves (Dalitz)
missed staves (Dalitz)
1
0.9
0.8
0.8
0.7
0.7
0.6
0.6
error
error
0.8
1
0.9
0.5
0.5
0.6
0.5
0.4
0.4
0.4
0.3
0.3
0.3
0.2
0.2
0.2
0.1
0.1
0
−5
0
Angle (degrees)
5
(a) Rotation.
0
0.02
false staves (connected path)
missed staves (connected path)
false staves (Dalitz)
missed staves (Dalitz)
0.7
false staves (connected path)
missed staves (connected path)
false staves (Dalitz)
missed staves (Dalitz)
error
1
0.9
0.1
0.03
0.04
0.05
0.06
0.07
Amplitude/staffwidth
0.08
0.09
0.1
0
0.02
(b) Curvature.
0.1
1
false staves (connected path)
missed staves (connected path)
false staves (Dalitz)
missed staves (Dalitz)
0.8
false staves (connected path)
missed staves (connected path)
false staves (Dalitz)
missed staves (Dalitz)
0.9
0.8
0.7
0.7
0.6
0.6
error
error
0.06
0.08
Rate of whitened pixels
(c) White speckle.
1
0.9
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2
0.1
0
0.04
0.1
2
2.5
3
3.5
4
4.5
Maximum deviation n
5
(d) Line y-variation.
5.5
0
6
(e) Shortest path (left) and Dalitz (right) results for rotation
(angle=-4◦ ).
2
4
6
8
10
Maximum vertical shift nshift (for ngap = 10)
12
(f) Typeset emulation.
Fig. 2. Effect of different deformations on the overall error rates. See [14] for parameter details.
[2] Hidetoshi Miyao and Masayuki Okamoto, “Stave extraction
for printed music scores using dp matching,” Journal of Advanced Computational Intelligence and Intelligent Informatics,
vol. 8, pp. 208–215, 2007.
[3] Mariusz Szwoch, “A robust detector for distorted music
staves,” in Computer Analysis of Images and Patterns, pp. 701–
708. Springer-Verlag, Heidelberg, 2005.
[4] Dorothea Blostein and Henry S. Baird, “A critical survey of
music image analysis,” in Structured Document Image Analysis, Baird, Bunke, and Yamamoto (Eds.), Eds., pp. 405–434.
Springer-Verlag, Heidelberg, 1992.
[5] N. P. Carter, Automatic Recognition of Printed Music in the
Context of Electronic Publishing, Ph.D. thesis, Departments of
Physics and Music, University of Surrey, 1989.
[6] D. Bainbridge, Extensible Optical Music Recognition, Ph.D.
thesis, Department of Computer Science, University of Canterbury, Christchurch, NZ, 1997.
[7] Ichiro Fujinaga, “Staff detection and removal,” in Visual Perception of Music Notation: On-Line and Off-Line Recognition,
Susan George, Ed., pp. 1–39. Idea Group Inc., 2004.
[8] Ignacy Gawedzki, “Optical music scores recognition,” Tech.
Rep., 2002.
[9] J. V. Mahoney, “Automatic analysis of music score images.
B.Sc thesis,” 1982.
[10] D. Prerau, Computer pattern recognition of standard engraved
music notation, Ph.D. thesis, Department of Computer Science
and Engineering, MIT, 1970.
[11] J. W. Roach and J. E. Tatem, “Using domain knowledge in
low-level visual processing to interpret handwritten music: an
experiment,” Pattern Recognition, vol. 21, no. 1, pp. 33–44,
1988.
[12] I. Leplumey, J. Camillerapp, and G. Lorette, “A robust detector for music staves,” in Proceedings of the International
Conference on Document Analysis and Recognition, 1993, pp.
902–905.
[13] Shai Avidan and Ariel Shamir, “Seam carving for contentaware image resizing,” in ACM Transactions on Graphics
(SIGGRAPH 2007), 2007, vol. 26.
[14] Christoph Dalitz, Michael Droettboom, Bastian Czerwinski,
and Ichiro Fujigana, “A comparative study of staff removal
algorithms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004.
[15] Christoph Dalitz, Michael Droettboom, Bastian Czerwinski,
and Ichiro Fujigana, “Staff removal toolkit for gamera(20052007,” .