MODELLING PRONUNCIATION VARIATION: SOME PRELIMINARY RESULTS
Mirjam Wester, Judith Kessens, Catia Cucchiarini, Helmer Strik

In: H. Strik, N. Oostdijk, C. Cucchiarini and P.A. Coppen (eds.) Proceedings of the Department of Language

[...] of those cases 46.0% and 48.7% were alternative variants. In 10.2% (original) and 10.6% (new) of the total number of words one of the multiple variants was chosen. From these data we can infer that, on average, one of the alternative variants is chosen in about 45% of the possible cases, and in 8-10% of the total number of words. However, most variants differ from the canonical form in only one phone. A comparison of the two transcriptions of the training corpus (i.e. the canonical forms versus the transcriptions obtained with forced recognition) reveals that they differ in 6,594 of the total 318,774 phones (2.1%). This seems to be one of the reasons why the effects on recognition performance are far from dramatic.

Adding variants to the test lexicon increases confusability, which could be another reason why the recognizer's performance did not improve much. In the tests in which the multiple pronunciation lexicon was used, 48% of all variants in the test lexicon (1341 entries) never occurred in the test corpus: 19% of all entries in the lexicon were alternative variants which were never chosen; in 5% of the cases the canonical form of a word was never chosen but an alternative variant was chosen instead; and 24% of the entries in the lexicon were words which never occurred in the test corpus, so that neither the canonical form nor an alternative variant of those words was ever chosen. This is partly due to overcoverage of the rules, but also to the fact that many canonical forms in the test lexicon have been added for application-specific purposes.
There are, for example, quite a number of station names and time indicators which do not occur in the test corpus but which must be contained in the test lexicon because they are considered to be of utmost importance for the application. In other words, they may not have occurred yet but they could very well occur in the future, and since the CSR is part of a system for a public transport information service, it must be able to recognize all station names and time indicators, as they are crucial for the success of an enquiry.

In order to gain more insight into these data, we compared the four versions of the CSR. First we determined for each version of the CSR which BS contained an error. Subsequently, for four of the six logical combinations of the CSR (those in which only one factor changes while the other is kept constant, i.e. SS-SM, MS-MM, SS-MS and SM-MM) the BS containing errors were compared. The results of these comparisons are shown in Table 3.

Table 3. Comparisons of the performance of the four versions of the CSR.

CSR 1             SS      MS      SS      SM
CSR 2             SM      MM      MS      MM
same errors       1630    1592    1089    1066
other errors      364     400     836     844
improvements      54      81      123     123
deteriorations    39      42      148     124
net result        +15     +39     -25     -1

From Table 3 it appears that a considerable number of utterances contain a recognition error in both CSRs, either the same (row 3) or a different one (row 4). Furthermore, there are cases in which a better solution is chosen (improvements, row 5). However, since in an almost equal number of cases a worse solution is chosen (deteriorations, row 6), the two effects balance each other out and the net result (row 7) is small. This neutralization effect explains why no considerable changes in the error rates were observed in Table 1. It is well known that including alternative pronunciation variants leads to a trade-off between improving performance (by covering part of the variability in speech) and deteriorating it (by increasing the confusability between the entries in the lexicon).
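The per-utterance comparison behind Table 3 can be sketched as follows. This is only a minimal illustration, not the procedure actually used: it assumes that an "improvement" is an utterance that is wrong in CSR 1 and correct in CSR 2, that a "deterioration" is the reverse, and that recognition results are available as plain word strings per utterance. The function name `compare_systems` and the toy data are hypothetical.

```python
from collections import Counter


def compare_systems(reference, hyp_1, hyp_2):
    """Classify every utterance by how the outputs of two recognizers relate.

    All three arguments are dicts mapping an utterance id to a word string
    (assumed data layout, chosen for this sketch only).
    """
    counts = Counter()
    for utt_id, ref in reference.items():
        ok_1 = hyp_1[utt_id] == ref
        ok_2 = hyp_2[utt_id] == ref
        if ok_1 and ok_2:
            counts["both correct"] += 1
        elif ok_1:                       # correct in CSR 1, wrong in CSR 2
            counts["deteriorations"] += 1
        elif ok_2:                       # wrong in CSR 1, correct in CSR 2
            counts["improvements"] += 1
        elif hyp_1[utt_id] == hyp_2[utt_id]:
            counts["same errors"] += 1   # both wrong, identical output
        else:
            counts["other errors"] += 1  # both wrong, different output
    counts["net result"] = counts["improvements"] - counts["deteriorations"]
    return counts


# Toy data (made up), one utterance per category.
reference = {"u1": "a b", "u2": "c d", "u3": "e f", "u4": "g h", "u5": "i j"}
hyp_1 = {"u1": "a b", "u2": "c x", "u3": "e f", "u4": "g x", "u5": "i x"}
hyp_2 = {"u1": "a b", "u2": "c d", "u3": "e x", "u4": "g x", "u5": "i y"}
result = compare_systems(reference, hyp_1, hyp_2)
print(dict(result))
```

Under this reading, the number of erroneous utterances for one system is the sum of same errors, other errors and the cases that change in its disfavour, which is consistent across the columns of Table 3 (for example, SS has 1630 + 364 + 54 = 2048 errors in the SS-SM column and 1089 + 836 + 123 = 2048 in the SS-MS column).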
Based on the fact that only 2.1% of the phones differ between the two transcriptions of the training corpus and on the results shown in Table 1, it could be concluded that the use of multiple pronunciations during training has few consequences for the recognition process, for instance because the acoustic models hardly change. However, comparison of columns 4 and 5 with columns 2 and 3 in Table 3 reveals that varying the phone models produces more changes than varying the test lexicon. A comment on this may be in order. Using multiple variants for testing simply means that the CSR can choose from among a greater number of possibilities for each word. Put differently, the variations in the system occur at the word level and concern only a limited number of words. When multiple variants are used for training, on the other hand, they produce different acoustic models. In other words, in this case the variations occur at the phone level. Since all words in the corpus are made up of phones, the effects of variation modelling during training are likely to be more pervasive.

Further inspection of Table 3 also reveals that, in spite of the greater number of changes in columns 4 and 5, the net result there is negative, while in columns 2 and 3 it is positive. In other words, the fewer changes in columns 2 and 3 add up to better recognition results, while the net result of the larger number of changes in columns 4 and 5 is a deterioration.

A final remark concerns the number of utterances in which there is room for improvement. It appears that 4,038 of the 6,276 utterances are recognized correctly in all four systems. Since 1,066 utterances contain OOV words, they can never be recognized correctly. Therefore there is only room for improvement in the remaining 1,172 utterances. With this in mind, no dramatic changes in recognition performance can be expected.
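The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers given in the text; the variable names are chosen for illustration, and the subtraction assumes that the utterances recognized correctly in all four systems and the utterances containing OOV words do not overlap.

```python
# Utterance counts quoted in the text.
total_utterances = 6276
correct_in_all_four = 4038
with_oov_words = 1066

# Utterances in which a change of system can still make a difference.
room_for_improvement = total_utterances - correct_in_all_four - with_oov_words
share = room_for_improvement / total_utterances

# Phone counts quoted in the text (canonical vs. forced-recognition transcription).
phones_changed = 6594
total_phones = 318774
phone_change_rate = phones_changed / total_phones

print(room_for_improvement)          # 1172
print(f"{share:.1%}")                # 18.7%
print(f"{phone_change_rate:.1%}")    # 2.1%
```

So fewer than one in five utterances can be affected at all, and only about one phone in fifty is changed by variation modelling.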
4. Discussion and conclusions

In the previous section we examined the results of an experiment aimed at determining the contribution of pronunciation variation modelling to improving the performance of our CSR. One of the things we have learned from this experiment is that forced recognition, as implemented in this method, is a useful instrument for identifying possible errors in the transliterations and in the lexicons, and for spotting the utterances that, for some reason, present insurmountable problems to automatic speech recognition. Studying these sentences in further detail is certainly worthwhile. Furthermore, in 90% of the cases this forced recognition procedure selects the correct pronunciation variant.

As far as the main goal of this experiment is concerned, i.e. establishing whether the applied method is suitable for improving the performance of our CSR, we can conclude that there is no reason to assume that this is not the case. As a matter of fact, the observed changes are in line with those reported by other researchers. The only problem seems to be that in our research the variations are very small. In this respect it may be instructive to consider the following facts.

First, the statistics concerning the material may have played an important part in limiting the effect of pronunciation modelling on recognition performance. It should be borne in mind that an alternative variant was chosen in only 8-10% of the cases. Moreover, in most of the cases the alternative transcriptions differed from the canonical form in only one phone. As a consequence, no more than 2.1% of the phones were changed as a result of variation modelling. Furthermore, in only 1,172 sentences was there room for improvement. Finally, another factor that should not be overlooked concerns the phones involved in the rules under study.
Since the four rules concern phones that are very frequent in Dutch and in the material under study (in the training corpus /n/, /t/ and /?/ are the three most common phones), there are so many occurrences of these phones that the impact of variation modelling is likely to be limited. If we consider all these aspects, it is not surprising that recognition performance hardly improved.

Moreover, it is important to point out that our research is at an early stage and that a number of things that we intend to do have not been done yet. First, in this experiment we have confined ourselves to within-word variation, whereas modelling variation above the word level may be even more important (Cremelie and Martens, 1995). Second, since only four rules were investigated, only a small part of the variation in the material could be covered. However, it is our intention to expand the set of phonological rules so as to maximize coverage.

Another factor that might be responsible for the limited impact of pronunciation modelling on recognition performance, and that we have not yet controlled for, is overcoverage, that is, the fact that the selected rules generate a great number of variants (19% of the total lexicon) that are not present in the corpus. This was to be expected because no pruning of variants whatsoever was carried out. The reason for this is that in this phase of our research we did not want to exclude variants that might turn out to be useful at a later stage. Since we opted for overcoverage, this should be taken into account when analysing the results. In the future we intend to examine pronunciation variants more critically before including them in the lexicon. More attention will be paid to the variants that are indeed present in the corpus. In addition, the frequency with which they occur will also be investigated, so that a probability count can be attached to each variant.
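One way the pruning and probability estimation mentioned above could look is sketched here. It is an illustration under stated assumptions, not a method described in this paper: it assumes that forced recognition yields one (word, chosen variant) pair per word token, estimates probabilities as relative frequencies, and drops variants seen fewer than `min_count` times. The function name, the `min_count` parameter and the example pronunciations are all hypothetical.

```python
from collections import Counter, defaultdict


def estimate_variant_probabilities(chosen_variants, min_count=1):
    """Build a pruned lexicon with relative-frequency variant probabilities.

    chosen_variants: iterable of (word, variant) pairs, one per word token,
    e.g. as selected by forced recognition (assumed input format).
    Variants that never occur are absent from the input and thus dropped.
    """
    counts = defaultdict(Counter)
    for word, variant in chosen_variants:
        counts[word][variant] += 1

    lexicon = {}
    for word, variant_counts in counts.items():
        # Keep only variants observed at least min_count times.
        kept = {v: c for v, c in variant_counts.items() if c >= min_count}
        total = sum(kept.values())
        if total:
            lexicon[word] = {v: c / total for v, c in kept.items()}
    return lexicon


# Made-up observations: one word with a canonical form and one shorter variant.
observations = [("lopen", "l o: p @ n")] + [("lopen", "l o: p @")] * 3

unpruned = estimate_variant_probabilities(observations)
pruned = estimate_variant_probabilities(observations, min_count=2)
print(unpruned)
print(pruned)
```

Note that with a threshold above 1 the canonical form itself can be pruned, as in the second call; whether that is desirable is a design decision this sketch leaves open.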
In the light of these considerations it is therefore legitimate to conclude that the results of this experiment are promising, in spite of the limited increase in recognition performance.

Acknowledgements

This work was funded by the Netherlands Organisation for Scientific Research (NWO) as part of the NWO Priority Programme Language and Speech Technology. The research of Dr. H. Strik has been made possible by a fellowship of the Royal Netherlands Academy of Arts and Sciences.

References

Booij, G.E. (1995), The phonology of Dutch. Oxford: Clarendon Press.

Cohen, M.H. (1989), Phonological structures for speech recognition. PhD dissertation, University of California, Berkeley.

Cremelie, N. and J.P. Martens (1995), On the use of pronunciation rules for improved word recognition, Proceedings EUROSPEECH '95, Madrid, 1747-1750.

Lamel, L.F. and G. Adda (1996), On designing pronunciation lexicons for large vocabulary, continuous speech recognition, Proceedings ICSLP '96, Philadelphia, 6-9.

Shriberg, E., E. Wade and P. Price (1992), Human-machine problem solving using spoken language systems (SLS): factors affecting performance and user satisfaction, Proceedings Speech and Natural Language Workshop, Harriman, New York, 49-54.

Strik, H., A. Russel, H. van den Heuvel, C. Cucchiarini and L. Boves (1996), Localizing an automatic inquiry system for public transport information, Proceedings International Conference on Spoken Language Processing (ICSLP) '96, Philadelphia, 853-856.

Strik, H., A. Russel, H. van den Heuvel, C. Cucchiarini and L. Boves (1997), A spoken dialogue system for the Dutch public transport information service, to appear in International Journal of Speech Technology.