SPEECHDAT
PORTUGUESE
TELEPHONE SPEECH DATA COLLECTION
1. Contact Person and Team
This database was collected by Portugal Telecom within the scope of the SpeechDat European Project. The
task of designing and post-processing the database (together with this report) was subcontracted to
INESC. The design of the collection platform and the speech data collection itself were the
responsibility of INESCTEL.
Contact persons at INESC:
Isabel Trancoso    [email protected]
Luís Oliveira      [email protected]
Institution:  INESC
Address:      R. Alves Redol, 9
              1000 Lisbon
              Portugal
Fax:          +351 1 3145843
Ph.:          +351 1 3100268
Contact persons at INESCTEL:
Joaquim Azevedo    [email protected]
Nuno Beires        [email protected]
Institution:  INESCTEL
Address:      R. Gonçalo Sampaio, 329, 2º Dto.
              4150 Porto
Fax:          +351 2 600 2726
Ph.:          +351 2 607 0615
Contact person at Portugal Telecom:
Rui Chaves         [email protected]
Institution:  Portugal Telecom
Address:      R. Entrecampos, 28, 9º
              1700 Lisbon,
              Portugal
Fax:          +351 1 500 3271
Ph.:          +351 1 500 3215
INESC's annotation team included the following people:
Eloi Jorge
Luís Ramos
Marta Gouveia
1 - Contact Person for this report
For the design of the database, INESC had some linguistic support from Drs. Céu Viana and Isabel
Mascarenhas (Center of Linguistics of the University of Lisbon), and received a small set of sentences
from Dr. Amália Andrade, from the same institution. We gratefully acknowledge their support, as well as
that of the daily newspaper PÚBLICO, from which we selected the majority of the phonetically rich
sentences of our database.
2. Number and Structure of CDs
This database is contained in 3 CDs.
The set of CDs contains the recordings of 1001 calls, sampled at 8 kHz in 8-bit A-law format, made via
a digital line (ISDN).
The file formats and headers follow the SAM recommendations (header files separated from signal
files).
The signal files were compressed using gzip.
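As an illustration of how a signal file could be read back, the following Python sketch decompresses gzip data and expands the 8-bit A-law bytes to 16-bit linear samples using the standard ITU-T G.711 expansion. It assumes that the decompressed file holds raw A-law samples with no header (consistent with the SAM convention of separate header files); the function names are illustrative and are not part of the database.

```python
import gzip


def alaw_to_linear(byte: int) -> int:
    """Expand one 8-bit A-law code (ITU-T G.711) to a 16-bit linear sample."""
    byte ^= 0x55                       # undo the even-bit inversion
    mantissa = (byte & 0x0F) << 4
    segment = (byte & 0x70) >> 4
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # In A-law a set sign bit marks a positive sample.
    return magnitude if byte & 0x80 else -magnitude


def decode_signal(gz_bytes: bytes) -> list[int]:
    """Gunzip a signal file (assumed headerless) and expand every sample."""
    return [alaw_to_linear(b) for b in gzip.decompress(gz_bytes)]
```

With this expansion the extreme sample values are -32256 and 32256, which appears to correspond to the range quoted in the LBR field of the sample orthography file in Section 6.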
The file names are 8 characters long plus a 3-character extension, following the specifications of
SpeechDat Deliverable D.1.4.1:
DDNNNNTT.CCX
where:
DD      Database identifier, e.g. A0 for the fixed telephone database
NNNN    Session number (0000..9999)
TT      Type of utterance, e.g. S1 for the first read phonetically rich sentence
CC      Country code, e.g. PT for Portugal
X       File type marker, e.g. O for Orthography, Z for compressed speech signal
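The naming scheme can be checked mechanically. The sketch below splits a file name into its five fields; it assumes that the database identifier and the utterance type are always one letter followed by one digit, as in the examples A0 and S1, which the specification quoted here does not state explicitly. The class and function names are illustrative.

```python
import re
from typing import NamedTuple


class SpeechDatName(NamedTuple):
    database: str    # DD
    session: str     # NNNN
    item: str        # TT
    country: str     # CC
    file_type: str   # X


# Assumed field shapes: letter+digit, 4 digits, letter+digit, 2 letters, 1 letter.
_PATTERN = re.compile(r"^([A-Z][0-9])([0-9]{4})([A-Z][0-9])\.([A-Z]{2})([A-Z])$")


def parse_name(filename: str) -> SpeechDatName:
    """Split a DDNNNNTT.CCX file name into its fields."""
    match = _PATTERN.match(filename.upper())
    if match is None:
        raise ValueError(f"not a DDNNNNTT.CCX file name: {filename!r}")
    return SpeechDatName(*match.groups())
```

For example, parse_name("A02134S8.PTZ") yields database A0, session 2134, item S8, country PT and file type Z.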
Each call is contained in one directory and comprises 40 compressed speech signal files and 40
orthography files. About 20 calls are missing 1 or 2 items, and about another 200 calls have up to 5
items (most frequently only one) in which the speaker says nothing, so that the file contains only
background noise.
The directory structure is 5 levels deep:
/DB_TYPE/CD_NO/BLOCK_NO/SESSION_NO/SIGNALS
where:
DB_TYPE       Type of database, e.g. FIXED0PT
CD_NO         CD identifier, e.g. CD00
BLOCK_NO      Block number (first 2 digits of session number)
SESSION_NO    4-digit session number (not sequential), e.g. 0005
SIGNALS       Signal file directory
Thus, there is a directory for each call, and no directory has more than 100 entries.
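A minimal sketch of how a session directory could be derived from the session number is given below. The component spellings (CD02, BLOCK21, SES2134) are taken from the DIR field of the sample orthography file in Section 6; the function name and the backslash separator are assumptions made for the illustration.

```python
def session_directory(db_type: str, cd_no: int, session: int) -> str:
    """Build the directory of one call, e.g. \\FIXED0PT\\CD02\\BLOCK21\\SES2134."""
    if not 0 <= session <= 9999:
        raise ValueError("session number must have at most 4 digits")
    session_id = f"{session:04d}"
    parts = [
        db_type,
        f"CD{cd_no:02d}",
        f"BLOCK{session_id[:2]}",   # block = first 2 digits of the session number
        f"SES{session_id}",
    ]
    return "\\" + "\\".join(parts)
```

Because a block is named after the first two digits of the session number, a block directory can hold at most 100 session directories.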
3. Structure of each call
Each call comprises 2 parts: a first part in which the caller should provide spontaneous answers to 9
questions and a second part in which he/she is asked to read a list of 33 items.
3.1 Spontaneous items
The spontaneous speech questions are the following:
Está pronto a começar? (Are you ready to begin?)
Por favor, diga o seu nome: (Please say your name)
Diga o seu número de telefone: (Say your telephone number)
Qual a data do seu nascimento? (What is your date of birth?)
Qual a cidade (ou distrito) em que passou a maior parte da sua infância? (In which city (or district)
did you spend most of your childhood?)
É do sexo masculino? (Are you male?)
Está a usar um telemóvel? (Are you using a mobile phone?)
Está a usar um telefone sem fios? (Are you using a cordless phone?)
Que horas são? (What time is it?)
Two of these items (name and telephone number) were not included in the CD-ROM, for the sake of
confidentiality; they were used only to contact the callers later in order to award the lottery
prizes.
The remaining 7 items comprise:
• 4 yes/no questions (corpus codes Z1 and Q1-Q3)
• 1 spontaneous date (D1)
• 1 spontaneous time (T1)
• 1 region name (P1)
Answering “YES” (SIM) is not the most frequent way of giving an affirmative answer. Hence, in
answer to the first question (Are you ready to begin?), many speakers said the corresponding form of “I
am” (Estou). This question was given the corpus code Z1, as only three yes/no questions were mandatory.
3.2 Read items
The spontaneous part ends with a request to read the identification number in the prompt sheet:
Diga o número de identificação da folha anexa (4-digit sequence).
After this the system prompts for each of the 32 items in the prompt sheet, which comprises (besides the
id number):
• 3 natural numbers
• 1 isolated digit
• 1 credit card number
• 1 telephone number
• 2 money amounts
• 2 dates
• 1 time
• 6 application words
• 3 spelled words
• 3 word spotting phrases
• 9 sentences
The following ordering of the read items was adopted:
• sentence 4
• 5-6-digit number
• word spotting phrase 1
• spelled word 1
• sentence 6
• time
• application word 1
• large money amount
• sentence 2
• credit card number
• 3-digit number
• sentence 8
• date 1
• application word 2
• spelled word 2
• sentence 1
• telephone number
• application word 3
• sentence 7
• date 2
• application word 4
• small money amount
• sentence 3
• application word 5
• isolated digit
• spelled word 3
• sentence 5
• 4-digit number
• sentence 9
• application word 6
Note: the numbering of the sentences corresponds to their relative length in terms of number of words.
In order to avoid fatigue, long sentences have been inserted next to short items.
3.3 Isolated and Connected Digits
Isolated digits
The 10 isolated digits have been used:
"zero" "um" "dois" "três" "quatro" "cinco" "seis" "sete" "oito" "nove"
together with the 2 feminine versions of 1 and 2: "uma" "duas", respectively.
The two variants may be useful in answers to questions regarding feminine items such as hours
("horas") or phone calls ("chamadas"), for instance:
A que horas quer despertar? (At what time do you want the wake-up service?)
Quantas chamadas vai efectuar? (How many calls are you going to make?)
They are not used when the spoken digits merely replace the telephone keypad in selecting from a menu
of options, nor when reading a number as isolated digits.
The corresponding corpus code is I1.
Identification numbers
Each prompt sheet is uniquely identified by a 4-digit number, which the caller is instructed to read as a
digit sequence, although these instructions were not always followed (i.e., many speakers preferred to
read the id number as a natural number).
A continuous sequence of 3 digits (000 to 999) was generated and a fourth digit was prepended on the
left so that each digit occurs exactly the same number of times (i.e. 400 times) in the 1000
prompt sheets. The first one was numbered 9000, the second 8001, the third 7002, and so forth.
The corresponding corpus code is C1.
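The balance of the identification numbers can be verified with a short sketch. It assumes that the leading digit simply cycles 9, 8, ..., 0 as the sheet index increases, which is what the three quoted examples suggest but which the text does not state in full; the function name is illustrative.

```python
from collections import Counter


def sheet_ids(count: int = 1000) -> list[str]:
    """Prefix each 3-digit index with a leading digit cycling 9, 8, ..., 0."""
    return [f"{(9 - i) % 10}{i:03d}" for i in range(count)]


ids = sheet_ids()
digit_counts = Counter("".join(ids))   # occurrences of each digit over all sheets
```

Under this assumption each digit occurs 300 times in the three right-hand positions and 100 times in the leading position, giving the 400 occurrences mentioned above.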
Telephone numbers
The distribution of the 1000 generated telephone numbers is approximately equal to the present
distribution of telephone numbers in Portugal, in agreement with the data provided by Portugal
Telecom:
6 digits: 400 numbers (40%)
7 digits: 600 numbers (60%)
As with credit card numbers, telephone numbers were read by most speakers as isolated digits, but
several grouping strategies for reading them as natural numbers were also adopted.
Credit card numbers
For generating credit card numbers, 16 digits have been used, grouped in chunks of 4. Example: 8654
3374 1250 6017.
Each digit was generated with a uniform distribution. No error detection code has been used. We
expected that most credit card numbers would be read as strings of 4 connected digits. However, many
speakers read each 4-digit string as a natural number (e.g. eight thousand six hundred and fifty
four, for the first group of the above example), or used non-standard modes (e.g. eighty six fifty four
or eighty six five four).
3.4 Natural Numbers
The vocabulary used for reading natural numbers is the following:
• Digits (see above)
• 11-19: onze, doze, treze, quatorze, quinze, dezasseis, dezassete, dezoito, dezanove
• multiples of 10: dez (ten), vinte, trinta, quarenta, cinquenta, sessenta, setenta, oitenta, noventa
• multiples of 100: cem (hundred), cento (hundred when not isolated), duzentos, trezentos,
quatrocentos, quinhentos, seiscentos, setecentos, oitocentos, novecentos
• 1000: mil
• the word e (and)
The corresponding corpus codes are N1-N3.
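To show how this vocabulary combines, here is a minimal sketch of a numeral-to-words conversion for 0 to 999999. It is not the conversion used in the annotation software; the placement rule for "e" after "mil" (inserted when the remainder is below 100 or an exact multiple of 100) is our assumption, chosen because it reproduces the two transliterations quoted in Section 6. Feminine forms are not handled.

```python
UNITS = ["zero", "um", "dois", "três", "quatro", "cinco", "seis", "sete", "oito", "nove"]
TEENS = ["dez", "onze", "doze", "treze", "quatorze", "quinze",
         "dezasseis", "dezassete", "dezoito", "dezanove"]
TENS = {2: "vinte", 3: "trinta", 4: "quarenta", 5: "cinquenta",
        6: "sessenta", 7: "setenta", 8: "oitenta", 9: "noventa"}
HUNDREDS = {1: "cento", 2: "duzentos", 3: "trezentos", 4: "quatrocentos",
            5: "quinhentos", 6: "seiscentos", 7: "setecentos",
            8: "oitocentos", 9: "novecentos"}


def below_1000(n: int) -> str:
    """Words for 0..999; hundreds, tens and units are joined by 'e'."""
    if n == 100:
        return "cem"                       # isolated hundred
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(HUNDREDS[hundreds])
    if rest >= 20:
        tens, units = divmod(rest, 10)
        parts.append(TENS[tens])
        if units:
            parts.append(UNITS[units])
    elif rest >= 10:
        parts.append(TEENS[rest - 10])
    elif rest > 0 or n == 0:
        parts.append(UNITS[rest])
    return " e ".join(parts)


def number_to_words(n: int) -> str:
    """Words for 0..999999 (masculine forms only)."""
    if not 0 <= n <= 999_999:
        raise ValueError("only 0..999999 is supported")
    thousands, rest = divmod(n, 1000)
    if thousands == 0:
        return below_1000(rest)
    head = "mil" if thousands == 1 else below_1000(thousands) + " mil"
    if rest == 0:
        return head
    # Assumed rule: 'e' only before a remainder < 100 or a round hundred.
    joiner = " e " if rest < 100 or rest % 100 == 0 else " "
    return head + joiner + below_1000(rest)
```

For instance, number_to_words(91900) gives "noventa e um mil e novecentos", and number_to_words(1967) gives "mil novecentos e sessenta e sete".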
Numbers with 3 digits
1000 3-digit numbers were generated. As there are only 900 different ones (excluding the leading
zeros), the numbers between 100 and 199 have been repeated.
Numbers with 4 digits
1000 4-digit numbers have been generated. The character “.” has been used to separate thousands (for
instance, 9.999). The left digit (thousands) has been generated sequentially. The remaining ones were
randomly generated. The repetition of numbers has been prevented.
Numbers with 5 and 6 digits
100 5-digit numbers and 900 6-digit numbers have been generated. The left-most digits (corresponding
to thousands, tens of thousands and hundreds of thousands) have been generated sequentially between
10,000 and 999,000. The remaining three digits cover all the numbers between 0 and 999, but in a
random order.
3.5 Money amounts
The corresponding corpus codes are M1-M2.
Small amount ( <10.000$00 )
For the digits to the left of the escudo sign ($), a random distribution has been used: 500 3-digit
numbers and 500 4-digit numbers. For the decimal part, we have alternated between 0 and 50 cents
("centavos"). The vocabulary includes these two extra words and the word informally used for
thousands of escudos (contos), besides the vocabulary used for reading natural numbers.
Large amount ( >10.000$00 )
We have used 900 multiples of 100$00 (starting at 10.000$00), 90 multiples of 10.000$00 (between
100.000$00 and 990.000$00) and 10 multiples of 1.000.000$00 (between 1.000.000$00 and
10.000.000$00).
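The three ranges can be enumerated as follows. This is a sketch only: it assumes that the 900 multiples of 100$00 are the consecutive values from 10.000$00 to 99.900$00, and the helper names are illustrative.

```python
def format_escudos(amount: int) -> str:
    """Format whole escudos with '.' as thousands separator, e.g. 91.900$00."""
    return f"{amount:,}".replace(",", ".") + "$00"


def large_amounts() -> list[str]:
    values = list(range(10_000, 100_000, 100))                 # 900 multiples of 100$00
    values += list(range(100_000, 1_000_000, 10_000))          # 90 multiples of 10.000$00
    values += list(range(1_000_000, 10_000_001, 1_000_000))    # 10 multiples of 1.000.000$00
    return [format_escudos(v) for v in values]


amounts = large_amounts()
```

Under this assumption the three ranges add up to exactly 1000 amounts, one per prompt sheet.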
3.6 Read time phrases
The database includes five types of read time phrases for which the
following expressions have been used:
a) zero minutes:
meio-dia                         (noon)
meia-noite                       (midnight)
{uma|duas|...|onze} horas        (1|2|...|11 hours)
{uma|duas|...|onze} da manhã     (1|2|...|11 in the morning)
{uma|duas|...|oito} da tarde     (1|2|...|8 in the afternoon / evening)
{sete|...|onze} da noite         (7|...|11 at night)
b) 15 minutes past:
{meio-dia|meia-noite|uma|duas|...} e um quarto (literally: ... and a quarter)
c) 30 minutes past:
{meio-dia|meia-noite|uma|duas|...} e meia (literally: ... and a half)
d) 45 minutes past:
um quarto para {o|a|as} {meio-dia|meia-noite|uma|duas|...} (a quarter to ...)
{meio-dia|meia-noite|uma|duas|...} menos um quarto (minus one quarter)
d) 35, 40, 50, 55 minutes past:
{vinte e cinco|vinte|dez|cinco} para {o|a|as} {meio-dia|meia-noite|uma|duas|...}
{meio-dia|meia-noite|uma|duas|...} menos {vinte e cinco|vinte|dez|cinco}
e) Any number of minutes past:
{meio-dia|meia-noite|uma|duas|...} e {um|dois|três|...}
The following days have been used:
ontem (yesterday)
hoje (today)
amanhã (tomorrow)
We have also included phrases indicating only minutes:
{um|dois|três|...|trinta e três} {minuto|minutos}
{trinta e cinco|quarenta|quarenta e cinco|cinquenta|cinquenta e cinco}
{minuto|minutos}
The corresponding corpus code is T2.
3.7 Read dates
The dates in the analogue format have the following form:
<day-of-the-week>, <day-of-the-month> of <month> of <year>
The weekdays are the following in Portuguese:
Sunday to Saturday: Domingo, Segunda-Feira, Terça-Feira, Quarta-Feira, Quinta-Feira, Sexta-Feira,
Sábado
The months are:
Janeiro, Fevereiro, Março, Abril, Maio, Junho, Julho, Agosto, Setembro, Outubro, Novembro,
Dezembro
Days of the month and years were prompted as digits and were usually read as natural numbers. Week
days in Portuguese (except Sundays and Saturdays) are named according to their order, Monday being
the second day of the week. It is very common in colloquial speech to omit the word "feira" when
specifying a weekday. The read dates in analogue format have been therefore generated with and
without this particular word.
The first set of dates assumed that the days of the week would be fully specified. We have therefore
generated 1000 dates between: Monday, January 1, 1996 and Tuesday, December 27, 2005, with
intervals of 3.652 days, which yields about 100 dates per year.
The second set of dates does not include the "feira" word. We have generated 1000 dates between:
Tuesday, January 2, 1996 and Wednesday, December 28, 2005, with intervals of 3.652 days. The dates
of this set never coincide with the ones of the previous set.
The corresponding corpus code is D2.
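The two date series can be reproduced approximately with the sketch below. It assumes that the fractional offset of 3.652 days per step is truncated to a whole number of days, which is our guess at the rounding used; under that assumption the end dates quoted above are obtained exactly. The function name is illustrative.

```python
from datetime import date, timedelta


def date_series(start: date, count: int = 1000) -> list[date]:
    """Dates spaced 3.652 days apart, offsets truncated to whole days."""
    # Integer arithmetic (i * 3652 // 1000) avoids floating-point rounding.
    return [start + timedelta(days=(i * 3652) // 1000) for i in range(count)]


with_feira = date_series(date(1996, 1, 1))      # first set, starts on a Monday
without_feira = date_series(date(1996, 1, 2))   # second set, shifted by one day
```

Since consecutive dates within a set are 3 or 4 days apart and the second set is shifted by a single day, no date can belong to both sets.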
3.8 Spelled words
We decided to select the 3 words to be spelled from a list of the most frequent proper names in
Portuguese. The original set included 11327 proper names. The selection criterion aimed at the most
uniform distribution possible of the 34 characters (including diacritics) used in the Portuguese
language. We have no a priori knowledge of the most frequent way of spelling letters with diacritics in
Portuguese.
The selection algorithm had 3 phases. In the first phase, a value expressing the need for each character
was set to the same initial value for all characters, computed as the number of words times the
average word length divided by the number of characters in the language.
In the second phase, a goodness value was computed for each proper name as the sum, over the
characters of the name, of the ratio between the need for the character and the number of occurrences
of that character in the wordlist.
In the third phase, the wordlist was sorted by decreasing goodness, and the first proper name was
selected and removed from the wordlist. The need value for each character was then decremented by the
number of occurrences of that character in the selected word. The procedure was repeated until the
desired number of proper names was selected (3 times 1000).
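The procedure can be sketched in a few lines of Python. This is a reading of the description above, not the original program: we assume that "number of words" means the number of names to be selected, that the character set is taken from the wordlist itself, that the goodness sum runs over every character occurrence in a name, and that occurrence counts are recomputed on the remaining wordlist at each step.

```python
from collections import Counter


def select_names(names: list[str], how_many: int) -> list[str]:
    """Greedy selection aiming at a uniform character distribution."""
    remaining = list(dict.fromkeys(names))          # drop duplicates, keep order
    if not remaining:
        return []
    alphabet = {ch for name in remaining for ch in name}
    average_length = sum(map(len, remaining)) / len(remaining)
    # Phase 1: the same initial need for every character.
    need = dict.fromkeys(alphabet, how_many * average_length / len(alphabet))

    selected = []
    while remaining and len(selected) < how_many:
        occurrences = Counter("".join(remaining))

        # Phase 2: goodness = sum of need / occurrences over the name's characters.
        def goodness(name: str) -> float:
            return sum(need[ch] / occurrences[ch] for ch in name)

        # Phase 3: take the best name and lower the need of its characters.
        best = max(remaining, key=goodness)
        remaining.remove(best)
        selected.append(best)
        for ch in best:
            need[ch] -= 1
    return selected
```

Names containing characters that are rare in the wordlist obtain a high goodness value and therefore tend to be selected early, until the need for those characters has been met.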
The words to be spelled were presented in capitals, separated by commas, as in the following example:
C, O, N, C, E, I, Ç, Ã, O
Most often, diacritics were ignored by the speaker. When not ignored, the vocabulary typically used for
specifying diacritics is:
til (tilde), cedilha (cedilla), cedilhado (with cedilla), acento (accent), agudo (as in Á), grave (as in
À), circunflexo (as in Â), com (with).
The following letters were used:
À, Á, Ã, Â, A, B, C, Ç, D, É, Ê, E, F, G, H, Í, I, J, L, M, N, Ó, Õ, Ô, O, P, Q, R, S, T, U, V, X, Z
Some of the diacritics occur very rarely for proper names.
The corresponding corpus codes are L1-L3.
3.9 Application words
A list of 60 words has been selected, which includes as a subset the translations of the 50 words
adopted by the consortium. The words have been divided into 6 sets of 10 words each. Each prompt
sheet includes one word from each of the subsets.
The list is the following:
Portuguese       English
telefonar        call
atendedor        answering machine
ligar            call
apagar           delete, erase
ajuda            help
menu             menu
para trás        back
cancelar         cancel
conferência      conference
fim              end, terminate
seguinte         next
gravar           record
repetir          repeat, again
stop             stop
parar            stop
activar          activate
reencaminhar     forward
número           number
tocar            play, replay
asterisco        star
guardar          store, save
desligar         switch off, deactivate
telefone         telephone
saudação         greeting
anúncio          announcement
início           (back to the) beginning
exterior         external
informação       information
interior         internal
mensagem         message
operador         operator
anterior         previous
programar        program
rechamar         recall
cardinal         square
para a frente    fast forward
saltar           skip
transferir       transfer
enviar           send
responder        answer
inserir          insert
alterar          change
verificar        check
código           code
sair             quit
criar            create/new
agenda           agenda
telefonista      operator
lista            directory, list
serviço          service
extensão         extension
marcar           dial
externa          external
interna          internal
continuar        continue
ouvir            listen
pausa            pause
rebobinar        rewind
chamada          call
divergência      divert
The corresponding corpus codes are A1-A6.
3.10 Application word phrases
Three sets of 100 sentences each were designed to include application words. Each caller is requested
to read one sentence from each of the sets. Given the fact that some application words are infinitive
forms, designing natural sounding phrases including them was somewhat difficult. This was aggravated
by the fact that the designers had no experience of using automatic telephone services using speech
recognition in the Portuguese language.
The corresponding corpus codes are E1-E3.
3.11 Phonetically rich sentences
The set of phonetically rich sentences includes 100 different subsets of 9 sentences each. Each of the
subsets was selected from a large database of sentences to include at least 2 examples of each phone
and, once this first criterion was satisfied, to include as many different triphones as possible.
A greedy algorithm has been specifically designed for this database selection, and we have also
designed software for checking which phones are missing in a sentence.
The corresponding corpus codes are S1-S9.
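The first selection criterion (at least 2 examples of each phone) amounts to a simple count. The sketch below is not the software mentioned above; it assumes that each sentence is available as a list of SAMPA symbols and that the phone inventory is given as a set, and the function name is illustrative.

```python
from collections import Counter


def missing_phones(transcriptions, inventory, minimum=2):
    """Return the phones of `inventory` seen fewer than `minimum` times."""
    counts = Counter(phone for sentence in transcriptions for phone in sentence)
    return sorted(p for p in inventory if counts[p] < minimum)
```

A subset of sentences satisfies the first criterion when this function returns an empty list for it.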
The original database included different types of sentences:
• Selected sentences from the EUROM.1 corpus (SAM_A project) for Portuguese.
• Short sentences from different corpora specifically designed for linguistic studies.
• Sentences designed for balancing the corpus (some palatal sounds had little coverage in the original
corpus).
• Newspaper sentences.
The newspaper sentences of the original database have been selected from the daily newspaper
PÚBLICO and adapted in order to limit the maximum number of words to 25 and to include proper
names of Portuguese origin only. The adaptation involved cutting some subordinate clauses and also
replacing some proper names with common words, in order to avoid annoying the callers because of
their possible political, religious or sports affiliation (for instance, the name of a football team
was replaced by "the team").
3.12 Prompt sheet example
Fig. 1 shows a sample prompt sheet file with the items to be read.
We have designed software for prompt sheet printing. This software is based on 2 scripts that take as
input 33 sets of 1000 items each, and yield as output 1000 PostScript files, each with 33 read items
(prompt sheet number + list of 32 items). The 2 scripts are:
all2sed - this script takes the 33 sets and produces a set of 1000 files with commands for the "sed" text
editor.
sed2ps - this script takes a LaTeX file named prompt.tex with the desired format, and the 1000 "sed"
files, and converts them into 1000 PostScript files. This conversion is done in 4 steps:
• sed editor, using the "sed" script files
• ISO-latin1 to LaTeX conversion of characters with diacritics
• LaTeX processor
• dvips
More information about these scripts can be obtained from [email protected] or [email protected].
INESC designed 1000 different prompt sheets, which were distributed by Portugal Telecom among its
employees and their relatives and friends. However, only 622 different prompt sheets were actually used.
Fig.1. Example of prompt sheet
4. Speaker recruitment
The approach adopted for speaker recruitment involved selecting speakers among the employees of
Portugal Telecom (about 20,000). The company has a wide geographical coverage, thus guaranteeing a
good representation of many regional accents.
The speaker characteristics are stored in file SPEAKER.TBL.
The balance in terms of sex distribution was the following: 453 calls by male speakers and 548 calls by
female speakers (45%-55%).
The age distribution was the following:
AGE      # CALLS    %
≤16      12         1%
17-30    345        35%
31-45    436        44%
46-60    196        20%
>60      8          1%
The first age group included one fourteen-year-old person and 11 sixteen-year-old persons. Two
speakers have not mentioned their age and two others said they were born in 1996. Hence, they have
been excluded from the above statistics.
The regional dialects of Portuguese are described in Esquisse d'une Dialectologie Portugaise (J. Leite
de Vasconcellos, Instituto Nacional de Investigação Científica, 3rd ed., 1987) and in Estudos de
Dialectologia Portuguesa (L. Lindley Cintra, Sá da Costa Editora, 1983), among other references.
There is no full agreement among the various authors. We have adopted a division of continental
dialects that corresponds closely to the administrative provinces. We have, therefore, determined
the province, or group of provinces, for each place name.
The following distribution was obtained (names of foreign countries appear as they are written in
Portuguese):
A. Continental
302    Estremadura
274    Entre-Douro-e-Minho
100    Alentejo
93     Beira-Litoral
56     Beira-Baixa
38     Beira-Alta
26     Transmontano
19     Ribatejo
13     Algarve
B. Insular
28     Açores
8      Madeira
C. Other variants spoken in former Portuguese colonies
32     Africa
1      Brasil
1      Índia
1      Macau
D. Other countries
5      França
1      Luxemburgo
1      Austrália
Most of the speakers born in Africa (Angola, Moçambique, Guiné, Cabo Verde, São Tomé e Príncipe)
have been living in the Continent for many years, so their dialectal differences are not easily
distinguishable. Most of the speakers born in “other countries” that were not former Portuguese
colonies are sons or daughters of Portuguese emigrants. Speaker number 0663 is a female Brazilian,
with very striking differences in pronunciation. This was the main reason for collecting one extra call
by a female speaker (amounting to 1001 calls).
5. Recording platform
PT's recording platform consists of a 486 DX2 PC at 66 MHz, connected to a local network and
equipped with 12 Mbytes of RAM, an 850 Mbyte HD and two DIALOGIC boards:
• DTI/212 - Network Terminating Equipment
• D/81A - 8-channel voice processing board
Configured as a terminating device, the DTI/212 board connects the D/81A board's voice channels to
the E-1 network. This allows our system to act as a standalone voice processing node working in a fully
automatic non-stop mode. The system is configured to reboot automatically after long-lasting
power supply failures (the system is connected to a UPS). The system is only interrupted for periodic
backup of speech files. Manual intervention is required for moving speech files to a remote backup hard
disk.
The available voice processing hardware limits recording to a maximum of 8 simultaneous voice calls.
Speech signals are recorded at 8 kHz in 8-bit A-law format. Files are stored according to the file
nomenclature proposed in the "SpeechDat database format specification".
The recording software is driven by script files which guide the caller through the session and specify
parameters such as:
• prompt filename;
• speech filename;
• utterance maximum recording time
The software produces logfiles with the date, time and number of items recorded in each call.
6. Annotation
The annotation package we have used was developed by IDIAP and modified by VOCALIS and
INESC. Many of our modifications concerned changing hard-wired paths, translating the vocabulary, and
adapting numerical to literal conversions to Portuguese (natural numbers, dates, times, money amounts,
etc.). Because we had no prior information about the speaker, significant changes had to be
made in different parts of the program (session number and speaker number became just one; the whole
comments file was redesigned). We had problems with the differences in "tcl" and waves versions. We
have redesigned the archive and clean-up scripts.
This annotation software produces "description files" for each spoken item (read or spontaneous), and a
comments file. We have designed software for producing orthographic labelling files in the required
SAM format from these description and comments files, taking additional information such as
recording date and time from the log files maintained by the recording software.
Each call was annotated only once, with the exception of the first 10 calls, which were used for training
annotators. The annotation task was done mainly by 3 annotators. Each item in the call was typically
heard twice, i.e., before and after checking the transliteration and relevant events.
The main purpose of the annotation was to provide an accurate word transcription of the utterance, with
sufficient detail about other sounds present in the utterance to help identify the quality of the speech for
training purposes. We have followed the rules appended to the software package instructions manual,
which are copied (with minor adaptations) below:
1. All transcriptions are to be provided in lower case (except letter spellings). This includes normally
capitalized words such as at the start of a sentence or proper names. No punctuation is to be used except
for hyphens and apostrophes (i.e. don't use periods, commas, semicolons or quotes).
2. Spelled letters are to be shown in capitals (each separated by spaces). If there is an unusual spelling
pronunciation such as /zee/ (American for 'zed'), show this as 'zee'. Otherwise use Z as usual. Similarly
/dZaI/ (Scottish pronunciation of J) should be transcribed as 'jy'.
3. Numbers are always in words, whether digit sequences or numbers
4. Abbreviations should always be represented in their full form to avoid any ambiguity. Abbreviations
such as a.m. and p.m. will be represented as letter sequences, eg four forty two A M. Abbreviations
such as UNESCO will be shown as a word (eg unesco)
5. There are 3 broad classes of sound events that need to be identified and marked. These sound events
should be marked in the transcription text corresponding to the start of the event.
filled pauses:
• We have identified 5 most common filled-pauses ([ah], [eh], [oh], [ãh] and [hum]).
• Pause ([pausa]) is used only to indicate long pauses of more than eg 1-2 seconds
speaker noises:
• noticeably loud breath intake noises are marked as [respiração]; ignore moderately quiet breath
noises
• blowing noises during speech caused by breathing loudly into the microphone are marked as [sopro]
• mouth noise eg tongue clicks, lip noises ([ruído bocal])
• throat clear ([pigarreio])
• cough([tosse])
• laugh, which is also used if the speaking voice is noticeably affected by laughing or smiling ([riso])
• for other speaker noises select the button marked ([ruídos do orador])
non-speaker noises:
• line noise; eg hum, buzz, cross-talk, crackle ([ruído de linha])
• clicks and taps; eg handset noises ([cliques])
• radio, HiFi, music, etc ([rádio])
• voices in the background; but transcribe the speakers speech if it is clear ([vozes])
• phone rings ([telefone])
• hangup noises when the speaker puts the phone down during the recording ([desligar])
• for other background noises, select the button marked ([ruídos de fundo])
other events
• truncated words ie where a word is visibly and audibly cut short (marked with a ~ immediately after
or before the word affected)
• Obviously mispronounced words which are nevertheless still intelligible (marked with a *
immediately before the word affected)
• fragments of words or segments of speech, which were started but not finished, when the speaker
corrects himself/herself are marked as [hesitação], but the audible part is not transcribed
• words or segments of speech which are unclear and cannot definitely be transcribed (marked with a
? immediately after the word)
• words or segments of speech which are unintelligible (marked as **)
• utterances with nothing said or just noises are marked with either nothing ([nada]) or just
background noise ([ruídos de fundo]).
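For training purposes it is often convenient to reduce a transliteration to its plain word sequence. The sketch below shows one way of doing this under the marking conventions listed above; it is not part of the annotation package, and the decision to drop unintelligible stretches (**) while keeping truncated, mispronounced and uncertain words is ours.

```python
import re

# Bracketed sound events such as [vozes] or [ruído de linha].
_EVENT = re.compile(r"\[[^\]]*\]")


def spoken_words(transliteration: str) -> list[str]:
    """Strip event tags and the markers ~, *, ? from a transliteration."""
    text = _EVENT.sub(" ", transliteration)
    words = []
    for token in text.split():
        if token == "**":              # unintelligible stretch: nothing to keep
            continue
        word = token.strip("~*?")      # truncation, mispronunciation, uncertainty
        if word:
            words.append(word)
    return words
```

Hyphens and apostrophes inside words are left untouched, since rule 1 allows them in transcriptions.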
Below, a sample orthography file is shown, corresponding to a read sentence:
LHD: SAM, 5.00
DBN: Speechdat(M)_Portuguese
VOL: FIXED0PT_02
SES: 2134
SHT: 5044
CMT: *** Speech file information ***
DIR: \FIXED0PT\CD02\BLOCK21\SES2134
SRC: A02134S8.PTZ
CCD: S8
BEG: 0
END: 61598
REP: INESCTEL, PORTO, PORTUGAL
RED: 12/Apr/1996
RET: 17:25
CMT: *** Speech data coding ***
SAM: 8000
SNB: 1
SBF:
SSB: 8
QNT: A-LAW
CMP: GZIP, 1.2.4
CMT: *** Speaker information ***
SCD: 2134
SEX: M
AGE: 44
ACC: Viseu
CMT: *** Recording conditions ***
REG: Viseu
LBD:
CMT: *** Label file body ***
LBR: 0, 61598, , -32256, 32256, O Presidente terminou ontem a sua visita oficial
EXT: de três dias à Islândia, tendo já regressado a Lisboa.
LBO: 0, , 61598, [vozes] o presidente terminou ontem a sua visita oficial de três
EXT: dias à islândia tendo já regressado a lisboa
ELF:
The three following extracts of label files show, respectively, the prompting (LBR) and transliteration
(LBO) fields for a spontaneous item (What is your birthdate? the twenty-fourth of april nineteen sixty
seven), a money amount, and a spelling example, where the speaker specified the diacritic (tilde):
LBD:
LBR: 0, 28383, , -32256, 32256, Qual a data do seu nascimento ?
LBO: 0, , 28383, vinte e quatro de abril de mil novecentos e sessenta e sete [sopro]
LBD:
LBR: 0, 22399, , -32256, 32256, 91.900$00
LBO: 0, , 22399, [respiração] noventa e um mil e novecentos escudos
LBD:
LBR: 0, 32959, , -32256, 32256, P, A, N, C, Ã, O
LBO: 0, , 32959, P A N C A com til O
ELF:
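Label files of this kind are straightforward to read programmatically. The following sketch parses the "KEY: value" lines into a list of pairs and extracts the transliteration from the LBO field. It assumes, on the basis of the samples above, that an EXT line continues the value of the preceding field and that the transliteration is the fourth comma-separated element of LBO; the function names are illustrative.

```python
def parse_label_file(text: str) -> list[tuple[str, str]]:
    """Return the (key, value) pairs of a SAM label file, in file order."""
    fields = []
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")     # split at the first colon only
        key, value = key.strip(), value.strip()
        if key == "EXT" and fields:
            # Assumed: EXT continues the value of the preceding field.
            previous_key, previous_value = fields[-1]
            fields[-1] = (previous_key, f"{previous_value} {value}".strip())
        else:
            fields.append((key, value))
    return fields


def transliteration(fields):
    """Return the text part of the first LBO field, or None if absent."""
    for key, value in fields:
        if key == "LBO":
            parts = value.split(",", 3)
            return parts[3].strip() if len(parts) == 4 else value
    return None
```

A list of pairs is used instead of a dictionary because some keys, such as CMT, occur several times in one file.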
8. Lexicon
The lexicon includes approximately 4100 different words found in the corpus of phonetically rich
sentences, and an extra set of approximately 500 words found in the rest of the corpus, but not in
the sentences. The SAMPA symbols adopted in the SAM_A project for European Portuguese have
been used for the (broad) phonetic transcriptions.
The symbols for vowels were the following (each symbol is followed by its IPA number):
i  301    e  302    6  324    o  307    u  308
E  303    a  304    @  322    O  306
The top five vowels can also be nasalised (with tilde “~”).
The symbols for semi-vowels are the following:
j  394    w  321
They can also be nasalised.
The symbols for consonants were the following:
p  101    t  103    k  109
b  102    d  104    g  110
f  128    s  132    S  134
v  129    z  133    Z  135
l  155    l~ 209    L  157
r  124    R  123
m  114    n  116    J  118
The transcriptions were automatically generated and manually corrected by a phonetician. Only one
pronunciation is indicated per word. It corresponds to the one spoken in the Lisbon region and most
frequently heard in the media.
The capitalized isolated letters used for spelling are also included in the lexicon, with their most
common pronunciation. Some speakers were not familiar with spelling and, instead of doing it, they
spelled the word with pauses in between syllables. These (rare) syllables are also included in the
lexicon with their default pronunciation.