text_database_design
SPEECHDAT PORTUGUESE TELEPHONE SPEECH DATA COLLECTION

1. Contact Person and Team

This database was collected by Portugal Telecom within the scope of the SpeechDat European Project. The task of designing and post-processing the database (together with this report) was subcontracted to INESC. The design of the collection platform and the speech data collection itself were the responsibility of INESCTEL.

Contact persons at INESC:
Isabel Trancoso [email protected]
Luís Oliveira [email protected]
Institution: INESC
Address: R. Alves Redol, 9, 1000 Lisbon, Portugal
Fax: +351 1 3145843
Ph.: +351 1 3100268

Contact persons at INESCTEL:
Joaquim Azevedo [email protected]
Nuno Beires [email protected]
Institution: INESCTEL
Address: R. Gonçalo Sampaio, 329, 2º Dto., 4150 Porto
Fax: +351 2 600 2726
Ph.: +351 2 607 0615

Contact person at Portugal Telecom:
Rui Chaves [email protected]
Institution: Portugal Telecom
Address: R. Entrecampos, 28, 9º, 1700 Lisbon, Portugal
Fax: +351 1 500 3271
Ph.: +351 1 500 3215

INESC's annotation team included the following people:
• Eloi Jorge
• Luís Ramos - Contact Person for this report
• Marta Gouveia

For the design of the database, INESC had some linguistic support from Drs. Céu Viana and Isabel Mascarenhas (Center of Linguistics of the University of Lisbon), and received a small set of sentences from Dr. Amália Andrade, from the same institution. We gratefully acknowledge their support, together with that of the daily newspaper PÚBLICO, from which we selected the majority of the phonetically rich sentences of our database.

2. Number and Structure of CDs

This database is contained in 3 CDs. The set of CDs contains the recordings of 1001 calls, sampled at 8 kHz in 8-bit A-law format, made via a digital line (ISDN). The file formats and headers follow the SAM recommendations (header files separated from signal files). The signal files have been compressed using gzip.
The file names are 8 characters long plus a 3-character extension, following the specifications of SpeechDat Deliverable D.1.4.1:

DDNNNNTT.CCX

where:
DD     Database identifier, e.g. A0 for the fixed telephone database
NNNN   Session number (0000..9999)
TT     Type of utterance, e.g. S1 for the first read phonetically rich sentence
CC     Country code, e.g. PT for Portugal
X      File type marker, e.g. O for Orthography, Z for compressed speech signal

Each call is contained in one directory and comprises 40 compressed speech signal files and 40 orthography files. About 20 calls are missing 1 or 2 items, and about another 200 calls have up to 5 items (most frequently only one) in which the speaker says nothing, so that the recording contains only background noise.

The directory structure is 5 levels deep:

/DB_TYPE/CD_NO/BLOCK_NO/SESSION_NO/SIGNALS

where:
DB_TYPE      Type of database, e.g. FIXED0PT
CD_NO        CD identifier, e.g. CD00
BLOCK_NO     Block number (first 2 digits of session number)
SESSION_NO   4-digit session number (not sequential), e.g. 0005
SIGNALS      Signal file directory

Thus, there is a directory for each call, and no directory has more than 100 entries.

3. Structure of each call

Each call comprises 2 parts: a first part in which the caller should provide spontaneous answers to 9 questions, and a second part in which he/she is asked to read a list of 33 items.

3.1 Spontaneous items

The spontaneous speech questions are the following:
• Está pronto a começar? (Are you ready to begin?)
• Por favor, diga o seu nome: (Please say your name)
• Diga o seu número de telefone: (Say your telephone number)
• Qual a data do seu nascimento? (What is your date of birth?)
• Qual a cidade (ou distrito) em que passou a maior parte da sua infância? (In which city (or district) did you spend most of your childhood?)
• É do sexo masculino? (Are you of the male sex?)
• Está a usar um telemóvel? (Are you using a mobile phone?)
• Está a usar um telefone sem fios? (Are you using a cordless phone?)
• Que horas são? (What time is it?)
Two of these items (name and telephone number) were not included in the CD-ROM, for the sake of confidentiality; they were only used for contacting the callers later in order to give them the lottery prizes. The remaining 7 items comprise:
• 4 yes/no questions (corpus codes Z1 and Q1-Q3)
• 1 spontaneous date (D1)
• 1 spontaneous time (T1)
• 1 region name (P1)

Answering "YES" (SIM) is not the most frequent way of giving an affirmative answer. Hence, in answer to the first question (Are you ready to begin?), many speakers said the corresponding form of "I am" (Estou). This question had the corpus code Z1, as only three yes-no questions were mandatory.

3.2 Read items

The spontaneous part ends with a request to read the identification number in the prompt sheet: Diga o número de identificação da folha anexa (Say the identification number on the attached sheet; a 4-digit sequence). After this the system prompts for each of the 32 items in the prompt sheet, which comprises (besides the id number):
• 3 natural numbers
• 1 isolated digit
• 1 credit card number
• 1 telephone number
• 2 money amounts
• 2 dates
• 1 time
• 6 application words
• 3 spelled words
• 3 word spotting phrases
• 9 sentences

The following ordering of the read items was adopted:
• sentence 4
• 5-6-digit number
• word spotting phrase 1
• spelled word 1
• sentence 6
• time
• application word 1
• large money amount
• sentence 2
• credit card number
• word spotting phrase 2
• 3-digit number
• sentence 8
• date 1
• application word 2
• spelled word 2
• sentence 1
• telephone number
• word spotting phrase 3
• application word 3
• sentence 7
• date 2
• application word 4
• small money amount
• sentence 3
• application word 5
• isolated digit
• spelled word 3
• sentence 5
• 4-digit number
• sentence 9
• application word 6

Note: the numbering of the sentences corresponds to their relative length in terms of number of words. In order to avoid fatigue, long sentences have been inserted next to short items.
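The item types listed above end up in the TT field of the file names described in section 2. The following is a minimal sketch, not part of the original tools, of how a file name such as A02134S8.PTZ (taken from the sample label file in section 6) could be split into its fields and mapped to a session directory. The function names parse_name and session_dir are hypothetical, the character classes in the pattern are assumptions, and the CDnn, BLOCKnn and SESnnnn directory names are inferred from the DIR field of that same sample.

```python
import re

# DDNNNNTT.CCX: database id, session number, utterance type, country code, file type.
# The exact character classes are an assumption; the report only gives examples.
NAME_RE = re.compile(
    r"^(?P<db>[A-Z0-9]{2})(?P<session>\d{4})(?P<item>[A-Z0-9]{2})"
    r"\.(?P<country>[A-Z]{2})(?P<kind>[A-Z])$"
)


def parse_name(filename):
    """Split a SpeechDat-style file name into its five fields."""
    match = NAME_RE.match(filename.upper())
    if match is None:
        raise ValueError(f"not a DDNNNNTT.CCX file name: {filename!r}")
    return match.groupdict()


def session_dir(db_type, cd_no, session):
    """Build the call directory in the form used by the DIR field of the label files."""
    block = session[:2]  # block number = first 2 digits of the session number
    return "\\".join(["", db_type, f"CD{cd_no:02d}", f"BLOCK{block}", f"SES{session}"])


fields = parse_name("A02134S8.PTZ")
print(fields["session"], fields["item"], fields["kind"])
print(session_dir("FIXED0PT", 2, fields["session"]))
```

Because the block number is simply the first two digits of the session number, at most 100 sessions can share a block directory, which is what keeps every directory at or below 100 entries.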
3.3 Isolated and Connected Digits

Isolated digits

The 10 isolated digits have been used: "zero" "um" "dois" "três" "quatro" "cinco" "seis" "sete" "oito" "nove", together with the 2 feminine versions of 1 and 2: "uma" and "duas", respectively. The two variants may be useful in answers to questions regarding feminine items such as hours ("horas") or phone calls ("chamadas"), for instance:
• A que horas quer despertar? (At what time do you want the wake-up service?)
• Quantas chamadas vai efectuar? (How many calls are you going to make?)

They are not used when the spoken digits simply replace the telephone keypad in the selection from a menu of options, nor when reading a number as isolated digits.

The corresponding corpus code is I1.

Identification numbers

Each prompt sheet is uniquely identified by a 4-digit number, which the caller is instructed to read as a digit sequence, although these instructions were not always followed (i.e., many readers preferred to read the id number as a natural number). A continuous sequence of 3-digit numbers (000 to 999) was generated, and a fourth digit was appended on the left so that each digit occurs exactly the same number of times (i.e. 400 times) in the 1000 prompt sheets. The first sheet was numbered 9000, the second 8001, the third 7002, and so forth.

The corresponding corpus code is C1.

Telephone numbers

The distribution of the 1000 generated telephone numbers is approximately equal to the present distribution of telephone numbers in Portugal, in agreement with the data provided by Portugal Telecom:
• 6 digits: 400 numbers (40%)
• 7 digits: 600 numbers (60%)

As with credit card numbers, telephone numbers were read by most speakers as isolated digits, but several grouping strategies for reading them as natural numbers have also been adopted.

The corresponding corpus code is C2.

Credit card numbers

For generating credit card numbers, 16 digits have been used, grouped in chunks of 4. Example: 8654 3374 1250 6017.
Each digit was generated with a uniform distribution. No error detection code has been used. We expected that most credit card numbers would be read as strings of 4 connected digits. However, many speakers read each 4-digit string as a natural number (e.g. eight thousand six hundred and fifty four, for the first quarter of the above example), or used non-standard modes (e.g. eighty six fifty four or eighty six five four).

The corresponding corpus code is C3.

3.4 Natural Numbers

The vocabulary used for reading natural numbers is the following:
• Digits (see above)
• 11-19: onze, doze, treze, quatorze, quinze, dezasseis, dezassete, dezoito, dezanove
• multiples of 10: dez (ten), vinte, trinta, quarenta, cinquenta, sessenta, setenta, oitenta, noventa
• multiples of 100: cem (hundred), cento (hundred when not isolated), duzentos, trezentos, quatrocentos, quinhentos, seiscentos, setecentos, oitocentos, novecentos
• 1000: mil
• the word e (and)

The corresponding corpus codes are N1-N3.

Numbers with 3 digits

1000 3-digit numbers were generated. As there are only 900 different ones (excluding those with leading zeros), the numbers between 100 and 199 have been repeated.

Numbers with 4 digits

1000 4-digit numbers have been generated. The character "." has been used to separate thousands (for instance, 9.999). The left digit (thousands) has been generated sequentially. The remaining ones were randomly generated. The repetition of numbers has been prevented.

Numbers with 5 and 6 digits

100 5-digit numbers and 900 6-digit numbers have been generated. The left-most digits (corresponding to thousands, tens of thousands and hundreds of thousands) have been generated sequentially between 10,000 and 999,000. The remaining three digits cover all the numbers between 0 and 999, but in a random order.

3.5 Money amounts

The corresponding corpus codes are M1-M2.
Small amount ( <10.000$00 )

For the digits to the left of the "$" (escudos) sign, a random distribution has been used: 500 3-digit numbers and 500 4-digit numbers. For the decimal part, we have alternated between 0 and 50 cents ("centavos"). Besides the vocabulary used for reading natural numbers, the vocabulary includes these two extra words (escudos and centavos) and the word informally used for thousands of escudos (contos).

Large amount ( >10.000$00 )

We have used 900 multiples of 100$00 (starting at 10.000$00), 90 multiples of 10.000$00 (between 100.000$00 and 990.000$00) and 10 multiples of 1.000.000$00 (between 1.000.000$00 and 10.000.000$00).

3.6 Read time phrases

The database includes five types of read time phrases, for which the following expressions have been used:

a) zero minutes:
meio-dia (noon)
meia-noite (midnight)
{uma|duas|...|onze} horas (1|2|...|11 hours)
{uma|duas|...|onze} da manhã (1|2|...|11 in the morning)
{uma|duas|...|oito} da tarde (1|2|...|8 in the afternoon / evening)
{sete|...|onze} da noite (7|...|11 at night)

b) 15 minutes past:
{meio-dia|meia-noite|uma|duas|...} e um quarto (literally: ... and a quarter)

c) 30 minutes past:
{meio-dia|meia-noite|uma|duas|...} e meia (literally: ... and a half)

d) 45 minutes past:
um quarto para {o|a|as} {meio-dia|meia-noite|uma|duas|...} (a quarter to ...)
{meio-dia|meia-noite|uma|duas|...} menos um quarto (minus one quarter)

35, 40, 50, 55 minutes past:
{vinte e cinco|vinte|dez|cinco} para {o|a|as} {meio-dia|meia-noite|uma|duas|...}
{meio-dia|meia-noite|uma|duas|...} menos {vinte e cinco|vinte|dez|cinco}

e) Any number of minutes past:
{meio-dia|meia-noite|uma|duas|...} e {um|dois|três|...}

The following days have been used:
ontem (yesterday)
hoje (today)
amanhã (tomorrow)

We have also included phrases indicating only minutes:
{um|dois|três|...|trinta e três} {minuto|minutos}
{trinta e cinco|quarenta|quarenta e cinco|cinquenta|cinquenta e cinco} {minuto|minutos}

The corresponding corpus code is T2.
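The expression patterns of section 3.6 can be illustrated with a small sketch that turns an hour and a minute into one of the listed phrasings. This is an illustration only, not the generator used for the prompt sheets. The function names are hypothetical, the choice of article (o, a, as) is an assumption based on the grammatical gender and number of the hour word, the singular "uma hora" is an adjustment to pattern (a), and only the "e ..." and "... para ..." variants are produced (not "menos" or "da manhã / da tarde / da noite").

```python
UNITS = ["", "um", "dois", "três", "quatro", "cinco", "seis", "sete", "oito", "nove"]
TEENS = ["dez", "onze", "doze", "treze", "quatorze", "quinze",
         "dezasseis", "dezassete", "dezoito", "dezanove"]
TENS = {20: "vinte", 30: "trinta", 40: "quarenta", 50: "cinquenta"}
# Hours take the feminine forms "uma" and "duas" (they refer to "horas").
HOURS = ["", "uma", "duas", "três", "quatro", "cinco", "seis",
         "sete", "oito", "nove", "dez", "onze"]


def minute_words(n):
    """Spell a number of minutes between 1 and 59 (masculine forms)."""
    if n < 10:
        return UNITS[n]
    if n < 20:
        return TEENS[n - 10]
    unit = n % 10
    tens = TENS[n - unit]
    return tens if unit == 0 else f"{tens} e {UNITS[unit]}"


def hour_name(hour):
    """Name of an hour on a 12-hour clock, with noon and midnight as words."""
    hour %= 24
    if hour == 0:
        return "meia-noite"
    if hour == 12:
        return "meio-dia"
    return HOURS[hour % 12]


def article(hour):
    """Article before the hour in 'para ...' phrases (assumed from gender and number)."""
    name = hour_name(hour)
    if name == "meio-dia":
        return "o"
    if name in ("meia-noite", "uma"):
        return "a"
    return "as"


def time_phrase(hour, minute):
    name = hour_name(hour)
    if minute == 0:                      # pattern (a)
        if name in ("meio-dia", "meia-noite"):
            return name
        return "uma hora" if name == "uma" else f"{name} horas"
    if minute == 15:                     # pattern (b)
        return f"{name} e um quarto"
    if minute == 30:                     # pattern (c)
        return f"{name} e meia"
    if minute in (35, 40, 45, 50, 55):   # pattern (d): minutes to the next hour
        before = "um quarto" if minute == 45 else minute_words(60 - minute)
        return f"{before} para {article(hour + 1)} {hour_name(hour + 1)}"
    return f"{name} e {minute_words(minute)}"   # pattern (e)


print(time_phrase(11, 45))
print(time_phrase(14, 35))
```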
3.7 Read dates

The dates in the analogue format have the following form:

<day-of-the-week>, <day-of-the-month> of <month> of <year>

The weekdays in Portuguese, from Sunday to Saturday, are: Domingo, Segunda-Feira, Terça-Feira, Quarta-Feira, Quinta-Feira, Sexta-Feira, Sábado.

The months are: Janeiro, Fevereiro, Março, Abril, Maio, Junho, Julho, Agosto, Setembro, Outubro, Novembro, Dezembro.

Days of the month and years were prompted as digits and were usually read as natural numbers.

Weekdays in Portuguese (except Sundays and Saturdays) are named according to their order, Monday being the second day of the week. It is very common in colloquial speech to omit the word "feira" when specifying a weekday. The read dates in analogue format have therefore been generated both with and without this word.

The first set of dates assumed that the days of the week would be fully specified. We have therefore generated 1000 dates between Monday, January 1, 1996 and Tuesday, December 27, 2005, at intervals of 3.652 days, which yields about 100 dates per year.

The second set of dates does not include the word "feira". We have generated 1000 dates between Tuesday, January 2, 1996 and Wednesday, December 28, 2005, at intervals of 3.652 days. The dates of this set never coincide with those of the previous set.

The corresponding corpus code is D2.

3.8 Spelled words

We decided to select the 3 words to be spelled from a list of the most frequent proper names in Portuguese. The original set included 11327 proper names. The selection criterion aimed at the most uniform distribution possible of the 34 characters (including diacritics) used in the Portuguese language. We have no a priori knowledge of the most frequent way of spelling letters with diacritics in Portuguese.
The selection algorithm had 3 phases:
• First, a value expressing the need for each character was initialised to the same value for all characters: the number of words times the average word length, divided by the number of characters in the language.
• Next, a goodness value was computed for each proper name as the sum, over the characters of the name, of the ratio between the need for that character and its number of occurrences in the wordlist.
• Finally, the wordlist was sorted by decreasing goodness, the first proper name was selected and removed from the wordlist, and the need value of each character was decremented by its number of occurrences in the selected name.

The procedure was repeated until the desired number of proper names was selected (3 times 1000).

The words to be spelled were presented in capitals, separated by commas, as in the following example:

C, O, N, C, E, I, Ç, Ã, O

Most often, diacritics were ignored by the speaker. When not ignored, the vocabulary typically used for specifying diacritics is: til (tilde), cedilha (cedilla), cedilhado (with cedilla), acento (accent), agudo (as in Á), grave (as in À), circunflexo (as in Â), com (with).

The following letters were used:

À, Á, Ã, Â, A, B, C, Ç, D, É, Ê, E, F, G, H, Í, I, J, L, M, N, Ó, Õ, Ô, O, P, Q, R, S, T, U, V, X, Z

Some of the diacritics occur very rarely in proper names.

The corresponding corpus codes are L1-L3.

3.9 Application words

A list of 60 words has been selected, which includes as a subset the translations of the 50 words adopted by the consortium. The words have been divided into 6 sets of 10 words each. Each prompt sheet includes one word from each of the sets.
The 60 application words, with their English glosses, are the following:

telefonar        call
atendedor        answering machine
ligar            call
apagar           delete, erase
ajuda            help
menu             menu
para trás        back
cancelar         cancel
conferência      conference
fim              end, terminate
seguinte         next
gravar           record
repetir          repeat, again
stop             stop
parar            stop
activar          activate
reencaminhar     forward
número           number
tocar            play, replay
asterisco        star
guardar          store, save
desligar         switch off, deactivate
telefone         telephone
saudação         greeting
anúncio          announcement
início           (back to the) beginning
exterior         external
informação       information
interior         internal
mensagem         message
operador         operator
anterior         previous
programar        program
rechamar         recall
cardinal         square
para a frente    fast forward
saltar           skip
transferir       transfer
enviar           send
responder        answer
inserir          insert
alterar          change
verificar        check
código           code
sair             quit
criar            create/new
agenda           agenda
telefonista      operator
lista            directory, list
serviço          service
extensão         extension
marcar           dial
externa          external
interna          internal
continuar        continue
ouvir            listen
pausa            pause
rebobinar        rewind
chamada          call
divergência      divert

The corresponding corpus codes are A1-A6.

3.10 Application word phrases

Three sets of 100 sentences each were designed to include application words. Each caller is requested to read one sentence from each of the sets. Given that some application words are infinitive forms, designing natural-sounding phrases that include them was somewhat difficult. This was aggravated by the fact that the designers had no experience of using automatic telephone services based on speech recognition in the Portuguese language.

The corresponding corpus codes are E1-E3.

3.11 Phonetically rich sentences

The set of phonetically rich sentences includes 100 different subsets of 9 sentences each. Each of the subsets was selected from a large database of sentences to include at least 2 examples of each phone and, once this first criterion has been satisfied, to include as many different triphones as possible.
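To make the two-level criterion concrete, here is a minimal sketch of one possible greedy pass: at each step it prefers the sentence that fills most of the outstanding phone deficit, and breaks ties by the number of new triphones. It is an assumption about how such a selection could work, not a description of the software actually used; the names select_subset and triphones and the toy phone strings are hypothetical.

```python
from collections import Counter


def triphones(phones):
    """Set of consecutive phone triples in one sentence."""
    return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}


def select_subset(sentences, inventory, size=9, min_count=2):
    """Greedily pick `size` sentences from {sentence_id: [phone, ...]}.

    Primary criterion: reach `min_count` examples of every phone in `inventory`.
    Secondary criterion: add as many previously unseen triphones as possible.
    Returns the chosen ids and the phones still below `min_count`.
    """
    chosen = []
    counts = Counter()      # phone occurrences in the sentences chosen so far
    seen = set()            # triphones covered so far
    remaining = dict(sentences)

    def score(item):
        _, phones = item
        local = Counter(phones)
        filled = sum(min(local[p], max(0, min_count - counts[p])) for p in inventory)
        return (filled, len(triphones(phones) - seen))

    while remaining and len(chosen) < size:
        best_id, best_phones = max(remaining.items(), key=score)
        chosen.append(best_id)
        counts.update(best_phones)
        seen |= triphones(best_phones)
        del remaining[best_id]

    missing = [p for p in inventory if counts[p] < min_count]
    return chosen, missing


# Toy data: three "sentences" over a three-phone inventory.
toy = {"s1": list("abab"), "s2": list("cc"), "s3": list("abc")}
print(select_subset(toy, ["a", "b", "c"], size=2))
```

The second return value plays the role of a missing-phone check: an empty list means the primary criterion was met within the requested subset size.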
A greedy algorithm has been specifically designed for this database selection, and we have also designed software for checking which phones are missing in a sentence.

The corresponding corpus codes are S1-S9.

The original database included different types of sentences:
• Selected sentences from the EUROM.1 corpus (SAM_A project) for Portuguese.
• Short sentences from different corpora specifically designed for linguistic studies.
• Sentences designed for balancing the corpus (some palatal sounds had little coverage in the original corpus).
• Newspaper sentences.

The newspaper sentences of the original database have been selected from the daily newspaper PÚBLICO and adapted in order to limit the maximum number of words to 25 and to include proper names of Portuguese origin only. The adaptation involved cutting some subordinate clauses and also replacing some proper names with common words, in order to avoid annoying the callers because of their possible political, religious or sports affiliation (for instance, the name of a football team is replaced by "the team").

3.12 Prompt sheet example

Fig. 1 shows a sample prompt sheet file with the items to be read. We have designed software for prompt sheet printing. This software is based on 2 scripts that take as input 33 sets of 1000 items each, and yield as output 1000 PostScript files, each with 33 read items (prompt sheet number + list of 32 items). The 2 scripts are:

all2sed - this script takes the 33 sets and produces a set of 1000 files with commands for the "sed" text editor.

sed2ps - this script takes a LaTeX file named prompt.tex with the desired format, and the 1000 "sed" files, and converts them into 1000 PostScript files. This conversion is done in 4 steps:
• sed editor, using the "sed" script files
• ISO-latin1 to LaTeX conversion of characters with diacritics
• LaTeX processor
• dvips

More information about these scripts can be obtained from [email protected] or [email protected].
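The all2sed step can be pictured with the following sketch, which builds one "sed" command file per prompt sheet from the item sets. It is not the original script: the placeholder syntax @ITEMnn@ and the function names are hypothetical, since the report does not show the contents of prompt.tex or of the generated command files.

```python
def sed_escape(text):
    """Escape characters that are special in the replacement part of s/.../.../."""
    for ch in ("\\", "/", "&"):   # backslash first, so added backslashes are not re-escaped
        text = text.replace(ch, "\\" + ch)
    return text


def sed_script(items):
    """One substitution command per read item of a single prompt sheet."""
    lines = [
        f"s/@ITEM{i:02d}@/{sed_escape(item)}/g"   # hypothetical placeholder format
        for i, item in enumerate(items, start=1)
    ]
    return "\n".join(lines) + "\n"


def all_scripts(item_sets):
    """item_sets: one list per item type, all of equal length (one entry per sheet)."""
    n_sheets = len(item_sets[0])
    return [sed_script([items[k] for items in item_sets]) for k in range(n_sheets)]


# Two item types, two sheets (the real input would be 33 sets of 1000 items).
scripts = all_scripts([["9000", "8001"], ["91.900$00", "1.250$50"]])
print(scripts[0], end="")
```

Each resulting script would then be applied to the template with a command of the form sed -f, before the diacritics conversion, LaTeX and dvips steps listed above.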
INESC designed 1000 different prompt sheets, which were distributed by Portugal Telecom to its employees and their relatives and friends. However, only 622 different prompt sheets were actually used.

Fig. 1. Example of prompt sheet

4. Speaker recruitment

The approach adopted for speaker recruitment involved selecting speakers among the employees of Portugal Telecom (about 20,000). The company has a wide geographical coverage, thus guaranteeing a good representation of many regional accents. The speaker characteristics are stored in the file SPEAKER.TBL.

The sex distribution was the following: 453 calls by male speakers and 548 calls by female speakers (45%-55%).

The age distribution was the following:

AGE        ≤16    17-30    31-45    46-60    >60
# CALLS     12      345      436      196      8
%           1%      35%      44%      20%     1%

The first age group included one fourteen-year-old and 11 sixteen-year-olds. Two speakers did not mention their age and two others said they were born in 1996. Hence, they have been excluded from the above statistics.

The regional dialects of Portuguese are described in Esquisse d'une Dialectologie Portugaise (J. Leite de Vasconcellos, Instituto Nacional de Investigação Científica, 3rd ed., 1987) and in Estudos de Dialectologia Portuguesa (L. Lindley Cintra, Sá da Costa Editora, 1983), among other references. There is no full agreement among the various authors. We have adopted a division of continental dialects which closely follows the administrative provinces. We have, therefore, determined the province, or group of provinces, for each place name. The following distribution was obtained (names of foreign countries appear as they are written in Portuguese):

A. Continental
302 Estremadura
274 Entre-Douro-e-Minho
100 Alentejo
93 Beira-Litoral
56 Beira-Baixa
38 Beira-Alta
26 Transmontano
19 Ribatejo
13 Algarve

B. Insular
28 Açores
8 Madeira

C. Other variants spoken in former Portuguese colonies
32 Africa
1 Brasil
1 Índia
1 Macau

D.
Other countries
5 França
1 Luxemburgo
1 Austrália

Most of the speakers born in Africa (Angola, Moçambique, Guiné, Cabo Verde, São Tomé e Príncipe) have been living in continental Portugal for many years, so their dialectal differences are not easily distinguishable. Most of the speakers born in "other countries" which were not former Portuguese colonies are sons or daughters of Portuguese emigrants. Speaker number 0663 is a female Brazilian, with very striking differences in pronunciation. This was the main reason for collecting one extra call by a female speaker (amounting to 1001 calls).

5. Recording platform

PT's recording platform consists of a 486 DX2 PC at 66 MHz, connected to a local network and equipped with 12 Mbytes of RAM, an 850 Mbyte HD and two DIALOGIC boards:
• DTI/212 - Network Terminating Equipment
• D/81A - 8-channel voice processing board

Configured as a terminating device, the DTI/212 board connects the D/81A board's voice channels to the E-1 network. This allows our system to act as a standalone voice processing node working in a fully automatic non-stop mode. The system is configured to bootstrap automatically in case of long-lasting power supply failures (the system is connected to a UPS). The system is only interrupted for periodic backup of speech files. Manual intervention is required for moving speech files to a remote backup hard disk.

The available voice processing hardware limits recording to a maximum of 8 simultaneous voice calls. Speech signals are recorded at 8 kHz in 8-bit A-law format. Files are stored according to the file nomenclature proposed in the "SpeechDat database format specification". The recording software is driven by script files which guide the caller through the session and specify parameters such as:
• prompt filename;
• speech filename;
• utterance maximum recording time.

The software produces logfiles with the date, time and number of items recorded in each call.

6.
Annotation

The annotation package we have used was developed by IDIAP and modified by VOCALIS and INESC. Many of our modifications concerned changing hard-wired paths, translating the vocabulary, and adapting numerical-to-literal conversions to Portuguese (natural numbers, dates, times, money amounts, etc.). Because we had no prior information about the speaker, significant changes had to be made in different parts of the program (session number and speaker number became just one; the whole comments file was redesigned). We had problems with the differences in "tcl" and waves versions. We have redesigned the archive and clean-up scripts.

This annotation software produces "description files" for each spoken item (read or spontaneous), and a comments file. We have designed software for producing orthographic labelling files in the required SAM format from these description and comments files; this software takes additional information, such as recording date and time, from the log files maintained by the recording software.

Each call was annotated only once, with the exception of the first 10 calls, which were used for training annotators. The annotation task was done mainly by 3 annotators. Each item in the call was typically heard twice, i.e., before and after checking the transliteration and relevant events.

The main purpose of the annotation was to provide an accurate word transcription of the utterance, with sufficient detail about other sounds present in the utterance to help identify the quality of the speech for training purposes. We have followed the rules appended to the software package instructions manual, which are copied (with minor adaptations) below:

1. All transcriptions are to be provided in lower case (except letter spellings). This includes normally capitalized words, such as those at the start of a sentence or proper names. No punctuation is to be used except for hyphens and apostrophes (i.e. don't use periods, commas, semicolons or quotes).

2.
Spelled letters are to be shown in capitals (each separated by spaces). If there is an unusual spelling pronunciation such as /zee/ (American for 'zed'), show this as 'zee'. Otherwise use Z as usual. Similarly /dZaI/ (Scottish pronunciation of J) should be transcribed as 'jy'.

3. Numbers are always in words, whether digit sequences or natural numbers.

4. Abbreviations should always be represented in their full form to avoid any ambiguity. Abbreviations such as a.m. and p.m. will be represented as letter sequences, e.g. four forty two A M. Abbreviations such as UNESCO will be shown as a word (e.g. unesco).

5. There are 3 broad classes of sound events that need to be identified and marked. These sound events should be marked in the transcription text at the position corresponding to the start of the event.

filled pauses:
• We have identified the 5 most common filled pauses ([ah], [eh], [oh], [ãh] and [hum]).
• Pause ([pausa]) is used only to indicate long pauses of more than, e.g., 1-2 seconds.

speaker noises:
• noticeably loud breath intake noises are marked as [respiração]; ignore moderately quiet breath noises
• blowing noises during speech caused by breathing loudly into the microphone are marked as [sopro]
• mouth noise, e.g. tongue clicks, lip noises ([ruído bocal])
• throat clear ([pigarreio])
• cough ([tosse])
• laugh, which is also used if the speaking voice is noticeably affected by laughing or smiling ([riso])
• for other speaker noises select the button marked ([ruídos do orador])

non-speaker noises:
• line noise, e.g. hum, buzz, cross-talk, crackle ([ruído de linha])
• clicks and taps, e.g. handset noises ([cliques])
• radio, HiFi, music, etc. ([rádio])
• voices in the background; but transcribe the speaker's speech if it is clear ([vozes])
• phone rings ([telefone])
• hangup noises when the speaker puts the phone down during the recording ([desligar])
• for other background noises, select the button marked ([ruídos de fundo])

other events:
• truncated words, i.e. where a word is visibly and audibly cut
short (marked with a ~ immediately after or before the word affected)
• obviously mispronounced words which are nevertheless still intelligible (marked with a * immediately before the word affected)
• fragments of words or segments of speech which were started but not finished, when the speaker corrects himself/herself, are marked as [hesitação], but the audible part is not transcribed
• words or segments of speech which are unclear and cannot definitely be transcribed (marked with a ? immediately after the word)
• words or segments of speech which are unintelligible (marked as **)
• utterances with nothing said or just noises are marked with either nothing ([nada]) or just background noise ([ruídos de fundo]).

Below, a sample orthography file is shown, corresponding to a read sentence:

LHD: SAM, 5.00
DBN: Speechdat(M)_Portuguese
VOL: FIXED0PT_02
SES: 2134
SHT: 5044
CMT: *** Speech file information ***
DIR: \FIXED0PT\CD02\BLOCK21\SES2134
SRC: A02134S8.PTZ
CCD: S8
BEG: 0
END: 61598
REP: INESCTEL, PORTO, PORTUGAL
RED: 12/Apr/1996
RET: 17:25
CMT: *** Speech data coding ***
SAM: 8000
SNB: 1
SBF:
SSB: 8
QNT: A-LAW
CMP: GZIP, 1.2.4
CMT: *** Speaker information ***
SCD: 2134
SEX: M
AGE: 44
ACC: Viseu
CMT: *** Recording conditions ***
REG: Viseu
LBD:
CMT: *** Label file body ***
LBR: 0, 61598, , -32256, 32256, O Presidente terminou ontem a sua visita oficial
EXT: de três dias à Islândia, tendo já regressado a Lisboa.
LBO: 0, , 61598, [vozes] o presidente terminou ontem a sua visita oficial de três
EXT: dias à islândia tendo já regressado a lisboa
ELF:

The three following extracts of label files show the prompting (LBR) and transliteration (LBO) fields for, respectively, a spontaneous item (What is your date of birth? the twenty-fourth of april nineteen sixty seven), a money amount, and a spelling example where the speaker specified the diacritic (tilde):

LBD:
CMT: *** Label file body ***
LBR: 0, 28383, , -32256, 32256, Qual a data do seu nascimento ?
LBO: 0, , 28383, vinte e quatro de abril de mil novecentos e sessenta e sete [sopro]

LBD:
CMT: *** Label file body ***
LBR: 0, 22399, , -32256, 32256, 91.900$00
LBO: 0, , 22399, [respiração] noventa e um mil e novecentos escudos

LBD:
CMT: *** Label file body ***
LBR: 0, 32959, , -32256, 32256, P, A, N, C, Ã, O
LBO: 0, , 32959, P A N C A com til O
ELF:

8. Lexicon

The lexicon includes approximately 4100 different words found in the corpus of phonetically rich sentences, and an extra set of approximately 500 words found in the rest of the corpus, but not in the sentences.

The SAMPA symbols adopted in the SAM_A project for European Portuguese have been used for the (broad) phonetic transcriptions.

The symbols for vowels were the following (each symbol is followed by its IPA number):

i 301    e 302    6 324    o 307    u 308
E 303    a 304    @ 322    O 306

The five vowels in the top row can also be nasalised (with tilde "~").

The symbols for semi-vowels are the following:

j 394    w 321

They can also be nasalised.

The symbols for consonants were the following:

p 101    t 103    k 109
b 102    d 104    g 110
f 128    s 132    S 134
v 129    z 133    Z 135
l 155    l~ 209   L 157
r 124    R 123
m 114    n 116    J 118

The transcriptions were automatically generated and manually corrected by a phonetician. Only one pronunciation is indicated per word. It corresponds to the one spoken in the Lisbon region and most frequently heard in the media.

The capitalized isolated letters used for spelling are also included in the lexicon, with their most common pronunciation. Some speakers were not familiar with spelling and, instead of spelling the word letter by letter, pronounced it syllable by syllable with pauses in between. These (rare) syllables are also included in the lexicon with their default pronunciation.
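Since the lexicon is built from the words that occur in the corpus, it may help to see how the word list could be pulled out of the LBO field of a label file such as the ones shown in section 6. The sketch below is an assumption about one way to do this, not the software used for the database; the function names are hypothetical. It joins EXT continuation lines, drops bracketed event markers such as [vozes], and strips the ~, * and ? annotation symbols (so the ** marker for unintelligible speech disappears altogether).

```python
import re

EVENT = re.compile(r"\[[^\]]*\]")   # bracketed events, e.g. [vozes] or [ruído bocal]


def lbo_text(label_lines):
    """Return the transliteration of the LBO field, including its EXT continuations."""
    parts = []
    in_lbo = False
    for line in label_lines:
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "LBO":
            # The value looks like " 0, , 61598, text": keep what follows the third comma.
            parts.append(value.split(",", 3)[3].strip())
            in_lbo = True
        elif key == "EXT" and in_lbo:
            parts.append(value.strip())
        else:
            in_lbo = False
    return " ".join(parts)


def words(transliteration):
    """Word tokens of a transliteration, without event markers or annotation symbols."""
    tokens = []
    for token in EVENT.sub(" ", transliteration).split():
        token = token.strip("~*?")
        if token:
            tokens.append(token)
    return tokens


sample = [
    "LBR: 0, 61598, , -32256, 32256, O Presidente terminou ontem a sua visita oficial",
    "EXT: de três dias à Islândia, tendo já regressado a Lisboa.",
    "LBO: 0, , 61598, [vozes] o presidente terminou ontem a sua visita oficial de três",
    "EXT: dias à islândia tendo já regressado a lisboa",
    "ELF:",
]
print(words(lbo_text(sample)))
```

Collecting these tokens over all label files and removing duplicates would give the set of entries that need a transcription; spelled letters stay in capitals, as required by the annotation rules.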