· ij ˇˆ ˘ ˘ˇˆ˙˝ ˇˆ˙˝˛˚˜ ˚!"#$%˚!"#$% ˙˝˛˚˜&’˝(˝ ˛ ˚!"#˙˝˛˚˜˚!"#˙˝˛˚˜ ˚ˇ˛ˆ
˘ˇˆ - JST · ˘ˇˆ˙˝˛˚˜ !"#$%&’()*+,-./012"$ˆ3 ˜ 456789%:2; 0ˆ?@; ˆa- ˜ b cd ef...
Transcript of ˘ˇˆ - JST · ˘ˇˆ˙˝˛˚˜ !"#$%&’()*+,-./012"$ˆ3 ˜ 456789%:2; 0ˆ?@; ˆa- ˜ b cd ef...
���������� ����
������������
������ !"#$��%
&'��(
��� )*"+,
���������� �����������
��������� �!�"#���� �!��"
��������
���������� ����� ������� � ���� ����������������� !"#$%&'()
*+,-./0#123456789:$%&'()!";<#=>#?=�@ABCDEFG
=HIJ#!";<KL9MN@AEOPQRSTNUV:WXYZ[\N]R^_`\$
a'bc(,d%#efN9ghijklHmnNopqr@DstuvYklH
wx\N9yz{|}~��#@A*+KL#�#�j:� !"Y����\��:
�No�@A��#� �Cj���l�"Y����0N�Q����lHW9��N
9���Q@AD"#��N���� ��:�z:r@������"#�@ PQR�D¡
�H¢£¤¥¦!"§ |()¨(-#©ªP«¬N�:� �C#®¯°(,d%D«±
�lP²N:"#�@Dst�����!";<³oQ_´µKL¶#·fN`¸�12
D]�=H
12#¹º9:$%&'()N���� !"#»B�¼!½@A:¾¿@A:"À#Áv
½DÂ�ÃPG�: GÄ#r@³� ���ÅN§·�l!"�Æ#¥ÇÈÉD]RHÊ
¼N:Ë�¼DÌÍ#o�¹º³ÎÏ#� ]Ð�ƶÑÒ&%Ó�lÔPN�:yÕ#�
@A*+KLDÖ�P�lH
�×�"@A#��ØoQ_N�����YÅ#�RNÙÚl�
�4���#_ÛØÅ#�RN� ��DÜpl�
�Ý�µÏ#ÞßØ��àá#âãäPo�#åæä
PQRç�#è¨éNêQ�:!"o��Û#+ëìP~éí¯î/ì#12D]Q:
�ï�¢£¤!"|()¨(-#ðñò©ªò«±
�ó�ôñ\õöN�l÷ø{()ì
�ç�o�¹ºò÷øo�@A#ÑÒ&%Ó
�ù�� *+,-./ò$a'bc(,d%úû#üfì
#ù�#~34(ýDþ�;Ú��12D]�=H��9�ï�� j¡�H
Ô#KL9!"|()�¼�"Y����0#��µÏ#ÂP�l� �Cò ��zòr
@�j#@ADÈÉòf�lH!";<KL#;:hi�oQ_j�"#�@0DÜplH
G=Y��:WYÅ#�RN"D��� DÜp:Å#�RN!"@A�¼� #���0D�
��l�:PQRW# G_#12�¼����¹ºD���lH
�12# �jkl������� !"#$%&'()*+,-./0#¹�9:$a'b
c(,d%úûKLD9�>:WP��#!"×%)(�6(-#k_:WN�GQ@A�
�#Ä�DÞßG:yz{|}~��NêQ���\���Yj�=HʼN:Ô#<��P
G��¼�= �!"|()¨(-9� o�õ:� �ìõ:ü!"+õê�#!"$
õN§G�%&\N¬D'�Q12()DEOj�=H
�*#�mnNopq0r@DstuvYklÔPN�Q�:+,N9!";<#Õ#-
.Ò3PG�"N�@D-�lÔP:��./01D"N������lÔPY �PG�2
p¼�=HG�G:�ü#"Y����3495¦#ÔP�Y¼�r@��0D9l�N6p
lH
���
����� � ����� ������������������� !"#$%&'�(
)*+,-./012"$�3�� 456789%:2;�0�<=>?@;��A-�� B
CDEF� 1'GHIIJKL��>#:KL67MNOPQ)RS<=TUJV
�H�
�
§ � !"$(÷-#«±N���:ʼN� !"78#��j$(9ìÊ��Ql
@A9o�@A#��¼::Ë�P;�N:<=\N>%®(®é�!"@AY?v��
�D��ÔPD+�G=HÔ#>%®(®é@AY¡�@#9� �#r@��NPÅ�¼::
A�³B«�Å#"\��P²N:C�ÄP#WXDE³§ FG�¼H��lIGÊ�Å
#@AD��\NsJjQlHÔ#@A9o�\!"78#��0j9�K:LMNU\�
OjÜp¼��QlHPè¨é#�"#��jÜp¼�l@A0#Q|éì³�þ�DRS
�lÔPY:�12345678# \P��=H
Ë#=>:n�l�"Nr@0#��¼::��\$a'bc(,d%Q|é#T�12P�
�=HË#U�:?=�!";<Q|é³VW×%)�6(-KL#XY:¢£¤§ Z!
"$(÷-�¼#� n[\]#~éí¯î/#^_N@��=HIJ#� n[9!½j
k�=Y:�`:+� «MxDn[PG:$(÷-�¼"#��DRS�l=>:��³�
zD\]abPG�n[\]D]R_CjklH
�`:óc¬#;<deP���QlY:Ë#ô;ìYfg#hiP�lHo�\@ADÜ
pl#9IJÛ �jk��#^_�lk���jk:@Ç@ADÜpl#9§ Z°%5%
�Chakai�DÏNef�lÔPP�lH
�345678#!"$(÷-#m�nÔGU�D«¬�lP:opZ� ���q��rst��� �u
@A�P kpZ� �sqq��t @Ç@A�#óc¬N«¸luvYklPà�=HopZ� 9m�
nÔG@A#����:��@A�jv«���D��j�lY:kpZ� #;:¾¿@
���������� ���������
���
Aò"À@AY�QP�����: ����D��j��Q;Y�QH��:@Ç@A
P9wmNxyÊ���Qn!�¶:Ýz:p(:{Q|�#� !N}�l@AHkpZ�
9:m�nÔG=� n[#~�«�¼Q#�P�lH
Ô#�R�@Ç@AD���l=>:!";<#VW#?=�×%)�6(-#T�Yj�:
��VWj9�K:��VW#ÞßN�l!";<ÄÛDEFG�QlHÔ#×%)�6(-
9��#��¼:: ����B«òA��Å�P²N�ÄDE�WXDEò§ FG���
Å�#VWN��� \� #äÀò¾¿�ÆY�}é)ìÊ�:FG³��Nhi��
n[Y|()¨(-�¼\]Ê�lHË#� ��9;��.�-8BC���=�jk��
@: \§ ��DÜplKLjklH
¬�N�l��D�lP:k�����9��\Nóc¬#78D;�NÜ��lH��:
�_9� #":�_9$%&'()#@AHó���#KLDef�lP:Ë#��¼�
#"�¸G��¸¼��QY:�`#KL#efN���"@AP$%&'()@AN«¸�
�78D�¸lÔPYj�lH� !"#;@;�N:"Yóc¬#@ADÜ��lP�p
¼�lHIJ#!"*+KL#efN�lPo�\@AG��¸¼����=Y:�`:�
��345678#KLN���!"78Y��l�_#@A@*+j�l�RN��=Ho
�@AP²N@Ç@A@*+Ö�P��=H
$a'bc(,d%úû#STD9�>:WX�¸Yef�lKL#��¼::4�Ò8³�
-)Ñ(c~�(�-N@efj�lKLjk�RH=Ppq;<KL9:�(���Å#e
fj9@A#EOP²N§ \Z$a'bc(,d%@Ö�N�lHËG�Ë�9:�� j
��Q@ADÜ�j�l�RN�:oQ_´µKL9 ¡Ä#��@Qb)�lKLP�
�RH
�123456789¢�#�£Xj��K#W#��N���:yz{|}~��¶#K
LEFP²N:WX#�"N�l$a'bc(,d%0#+�#�j95¦jklU�D�=G
=�@G��QY:o�¤õò!"$õ«¥PG�?=�'_DEFG=12jk�=Hfg
@Ô#12Y¦�:%§¨NêQ���K#12�Y�Expressive Speech Processing0D.(
ÑPG�©ª#¯(9NõJjQlH
�����
������������ �����������������������
�� !"#$%&'(")*+,-�.�/0123�45��6 78"9:;
;�<'%�=>0? ���@AB����CDEFGH��IJKLM�N��
O�����P����N�QB�RST?UV>WX�YZ[F\]>@^_�\]
`aEFIJKLM�N��b
GcdEefgh�ij0kl�mnE;IoXp��ef��&q���
rjstu�vwZ�xZ���u�y��z{|ef}~ghP0?�ef��
�����������b����������@���ef&q�����Z���y�ef��
���ef����9���@�����&q���CD��c� ¡vw��¢
£¤¥¦§�¨©Xf|0ªl�«nX¬��@�!®$¯°±²"³´&q��b
���������� ��������������� �� �
���
�µ¶·¸Z�Z�x�ef'"%¹K)vw�º»@�¼½¾¿&q���À
ÁZ�CD���y����Â������Ã8��ÄÅ#"�@�ÆÇ����&q
���bÈÉÊbË��ÌÍÎÏ��Ð����ÑÒXmnÓ�ÆÇÔÕÖ×@p�GHE0�
�@� Q����ØÙX�g�ÚÛ@ÜY±)9ݶÞ��b�CD��c�
¡vw��¢£¤¥¦§��ßà���ákÓ@âã�ef!Ã)S��äh���;�
��åÝäh�efghæefçèé«'"%¹K)�äh@� Q�b
?UêàE0��ëì������<�(±²"@}~_�díSî0kl��Z
ïðñòefó�!Ã)�S��ôe�}~0? ��f���@î·_���ef
çè�¬�� ��;���åÝ�ñòó������ ��õ��º»@ABQ�
.�ö÷���øù�����XúûE0�ü�R�ý��W�þU�6��E���
�F��ÚÛ��^Ó@Â��[Q�ñòF��E��1���ó���ef@
`p��ü�ü��Ùøù�Q0Ã8�'ݱ¹M@� Q���5��efgh�i
j�-@d�_?U�.��-���QB����efgh�{|k�XúûE0�ó
������õ�����ghvw���X0}lQ��0��� �R!�efWX
p�f@��3_����F3�þ Q3"Y��@#$�vw���XF Q��þ
U�ó�ef%,&'"<<�F!®$¯°±²"@ 0���^(�N��6�
¼)����!ÓEF��@ã_.X�N��efgh0X ���QFf�*�á@
î·p+�pQ�b
�.�������þU¬'I��efgh�ijXp���ef�,��;��
�åÝ�+�@���f�-� ¨©���0kl���_��ef�����X
G0��QFR.ákÓ@/�Wó��efghá��0>012_���õ����
�ó�ef���!®$¯°±²"³´���¼½¾¿���ÆÇ��
�����X�üù�¨©��¬���������ÆÇ�������
X±)9Ýßà��g��±)9ݶÞ�����p���}3@45���.X
XG0�� 6@¾7pQ�b
.��p�GHE08ùü�[Q��}3X���9ER¼½��WXG0�Y�E
R�ý��W�#:0������9Ä�N��.��ý��0?Uó��;ü�
����6X<[5�V�1=�����6�!��� �}��/5�¼)F���
��#:@�Y�>���?4@Aøù'BCBX;���åÝ@�g�p�DE
�efghXøFU"Yõ��efghvw%Fþü�[Q�ó��efghtG F
há��A������0?ùF�R¼)�0>W0?�Ró��!®$¯°±²"
³´IM%'W�+�%�[��ý��@��_�'"%¹K)XF�HI%J
$�[Q�b
b
b
b
b
b
���������� ���������
���
������
���� �������
z{|ef}~ghP0?�ef���Expressive Speech����@Üë
�1��������
��ÖKLM@£N§befghvw� 6XF�efgh±)9ݪ?Oef���f
|PQ�R¹MBK;�+��£Î§b2�Fef��@/�ef%S)�S��
ô�TT�£�§bghef�UV�0Znp�WX_��b
���������� ���������������������
�����!"*+#§«P�l��N9:!"o�N�¬#@Ajkl:��ò�z�
Å#÷øo�:r@ò �ä�Å#o�@AY�p¼�:Ë�¼DRS�l®�8¯6~Yu
vP�lHËÔj:Ô�¼#�ÆD@�=!"D;<�l®�8¯6~PG�:.�-8�¼#!"
;<°���± ���tpt�p� ����²,-./P:�MPG�³��´#�!"Dµ�l!"�ÆD��
!"N¥ÇÙ��l:"ÀÙ�®�8¯6~#T�D]�=H
���,-./PG�:y¶Àj·¸�RSDÖ�P�l!";<,-./PG�:½�\]
Z!";<_Cjkl�jk��P«±;<ÄÛ���ko¹j�_CDþ�;Ú�=!";<,
-./Dº<G=H�jk��P���ko¹j�_CDô;�lÔPN�:¥¦!"N»Q¥¦äD
@�;<!"N�=��Y¼:!"�ÆD·¸NRS�lÔPYÖ�P�lH
Ù�!"#¥¦ä`�#=>NDFW(Dynamic Frequency Warping¼Ç\½»�¾Ù�²
D¿V�lÔPj:yÀz��¥¦ä#yQ"ÀÙ�YÖ��®�8¯6~PG�ü�G=H�
=:Ô#"ÀÙ�KL9«Á\�ÆYÙ�§«P�l=>:¢Â\�¾¿ä#¥ÇÙ�Ä
ÛPG�:;+� �¼�l¾¿$(÷-DfQ�:VWÊ�=ÃÄ÷)(%PÅ@»QÃÄ÷)
(%DÙ�ÆP�l!"#$(÷-�¼\]G:Ù�ÇP�l!"#$(÷-�¼Ë�N§·
�l÷)(%DfQ�¾¿äD¥ÇÙ��lÄÛDXYG=H
�� � !"#��$%&'(��)*+,*�-./01/23�
fg#!"@A*+KLNêQ�:;<!"NÅ#�R�c¬#��D�=�l#Yu
vjkl�DXYG:È� �N�lÉ[#³��´!"�¨N:��#|()¨(-Dð
ñ:©y:ÊËD]�=H
�4 � 56789:;<��)*+,*=�
Z��òÌ�®×¯%Íé �N�lZ̳��´!"H"ÀÙ�ü!NuvP�l|(
)ÎPG�Z��:Ì�!"P@ �ÏN�Ä«Ð�Ñ«#|()DÒÓ�Ô:�©yG=H
�44 � >?@A!BCDE�F5G8��)*+,*=�
!"@A*+NêQ�9½»��äP"ÕÖÇ�°ÃIJP9NUjklP×ÞG�!"#
RSD]RÔPYÖ�jklY:�ØÙN9ÃÄ#ÙÇ9ÚÛ[Ü#ÙÇDÝR=>½»�
�äP@DÞ�lH�=:ßQ!"³àQ!"NêQ�9!¾#R��Ú#Îáj�!Ù
ì#��NâYklHË#=>:yQ"³ãQ":ßQ"³àQ"Dy¶ÀN;<�l=>
N9¥¦� #yQ"³ßQ"�ÅD©yG�!"�ÆD«±G=:Ë�D!"|()¨
(-PG����,-./Nef�lÔPYuvjklHËÔj:äc¬#¢Â\Nµ�l¾¿°É
[#�":yQ!":ãQ!":ßQ!":àQ!":yKßQ!":yKàQ!":ãKß
���������� ��������������� �� �
���
Q!":ãKàQ!"²D@�!½®ø%-!"|()D©yG=HË�å�#|()9ÑæÑ�
ç�¼�:æÔ#Óä�è()#!"D©yG=H
�444 � H%&IJKLMNOE��)*+,*=�
!"o��¬#��#è�\�@#PG�r@��Yé´¼�lH;<!"NêQ�ê
�#r@��Yëìj��q¬íjklHËÔj:"ÀÙ�ÄÛN���ê�#³��´!
"Nr@Dëì�lü!:���#!"|()¨(-PG�#¬íä#ü!DÖ�P�l=>
N:îï� #!"|()PP@N�/0�.#0�ðG�0Dñ>��"Ê�=!"|()D
Ë�å���Xòz:æÔ#Óä�è()#.�-8³��´N�©yG=H
�4P � QR�;S;��)*+,*=�
.�-8³��´Nê¸l!"#�ÆP:¥ó� Nê¸l§ !"Pj9: �Y§
�NêQ�¯øÒ7-G�Ql�ôõG�Ql�#�ö:�=÷¼�N G�Ql�=Å=ÅGK
G�Ql�#�ö|:�ª��ÆD��P�p¼�lHIG�³�Q;<!"#ü�#=
>N9Ô�¼#�Æ#«±YuvjklP�p¼�lHËÔj:æÔ#Óä�è()N�l¥ó
� N�l§ !"D~ø�X©yG=H
�ê:Ô�¼#|()¨(-D���,-./Nþ�ñt=>N9:!½GùD¥Çø¨¯%
Ó�lKLYuvjklHËÔj!"#¥Ç~ø×%{%8ÀzD`��l=>N:!"´µ
KLjfQ¼�l �h·#KLYefÖ�jk�=H
�T � �����UVWX�
T�G="ÀÙ�®�8¯6~N�:Ù�ÆP�l �#!":Ù�ÇP�l �#!"
Ë�å��«#!"|()DúË�lÔPjê�# �# �äÙ�YÖ�P��=H�=:
®×¯%Íé �#Z��òÌ�!"DfQ= �äÙ�ü!�¼:Ù�£ûõöNZ��
!"|()DfQ:Ë#Ù�£ûDÌ�!"NhfG=;N@ �äÙ�YÖ�jklÔ
PY¡Ê�=H
=�G:É[#³��´!"�¼r@Dñ>=!"¶#¥ÇÙ�ü!j9:
r�üp�� �t�ýrþ�DÂúPG=�Sü!j9Pr@!"#Ë�N»�¸¼��Q=@##
MSü!j9r@��#Ù�#í�9¬�j9���=H!"#r@��Nê¸l?v�
v½jkl¾¿äD¥ÇÙ��lKL#uväY¡�Ê�=H
�=¾¿\N���!"|()¨(-D:�jk��N¾¿RS#=>#g*+PG�
���ko¹j�DfQ=���,-./#!"|()¨(-PG�fQlÔPj:�ª�¾¿#;<
!"Dy¶ÀNH<�lÄÛDEFG=HË#&N:ÉW�l¾¿N·�=½�\]D�D
�f�lÔPj:y¶À�;<!"YH<j�lÔPD:yQ":ãQ":ßQ":àQ"#
;<!"NêQ�Í�ü!D]Q¡G=H
YZ�[�
�����!"*+,-./NuvP�l®�8¯6~PG�:½�\]Z!";<_C
�jk��Ny¶À�!"«±���ko¹j�Dþ�ñJ����°���tpt�p� ����²,-./P:
¹��°¹sý���s����tý������ü²ê�#�Ã�°���sr�� Ã��ý�����s� ���¼Ç\½»�¾
�²NÂ�K"ÀÙ�KLDT�G=H
�=:12NfQl!"|()PG�:°�² ®×¯%Íé �!"|()¨(-:°��² ¾¿\
���������� ���������
���
N�?ìG=!½®ø%-|()¨(-:°���²r@DÔ>=!"|()¨(-:°��²¥ó�
§ !"|()¨(-D©y:ÊËG=H
ËG�:"ÀÙ�®�8¯6~N�:!"#���ko¹j�«±N����¼�=½»��ä
°r�üp�� �t�ýr÷ø{()²#Ù�N�:°;+o�jõöPÙ�D]R;#��¼::õö
|()#o�PÙ��l!"#o�PYµ�l;N@² �ä#Ù�Y¬íäN]Ú�l
ÔPY�´Ê�=H�=:Ô�Dr@��#Ù�Nhf�l;#¬íäD� G=Y:«Á
\�Æjkl½»��ä#Ù�#�j9Í�\N¬��r@Ù�!"9�¼�::§�
\�Æjkl¾¿\�Æ#Ù�YuvjklÔPY¡�Ê�=H
�=:���¾¿�[NßQ!"�¼[NàQ!":�=:[NyQ!"�¼[
NãQ!"�j�D��;<!"#y¶ÀìD:���#|()¨(-D��ì�lÔPP:É
W!"#¾¿N·���jk��#½�\]D�DÙ��lÔPN���ü�G=H
�����������������
;<!"Nê¸l��#�¢N�Q�9:Ô#��%§¨jD"Y[Ny��=HÔ�
9ñ��FG#`�N�$(÷-¨(-!";<*+KLY�»�@#P��=ÔPP:!
"§ ,-./³���úûY�ü\�@#P�;<!"N§G�@�'(Ñ%×%)�6
×-PG�yz�¶ÀY���l�RN��=ÔPN�lH�`:r@³�z�Å#��D;<
!"NìplÄÛPG�:�Óé(3#<�jkl:É[#!"N§G�"ÀÙ�D��_
Û:j��°j������s��������ü¼��Ñé$�Q|é²DfQ�;<�l_Û:½�\];
<_CN���c¬#ø¨éDëìG=!"|()¨(-DÊË�l_ÛYklHË�å�
#ÄÛN�áYklY:�12j��="ÀÙ�KL9VWP�l@#Y!"|()jkl
=>:�ª�!"o��¬#�ÆYefÖ�jklPQR�áD��H�=Ù�N&G�!
¾@ADu:G@uvPG�Q=>:!¾*+Nn��l;<!"#�9��j�lH
�����!";<D �G=:�Óé(3#"ÀÙ�KL:��� °���tpt�p� ����²É
W#yä�ì#129: ��)-��:®¯~�¯(��Dü��l=>#�'(Ñ%×%)
�6×-PG�5¦£!Ê�l@#jklH�12#12<�Yüf,-./¶"#Ê��q
¬fä9yK:$J#@A��Nê¸lIG�³�Q@A�%#wx&PG�@�ßYkl
@#jklH
�Óé(3#<��PG�:
ò"ÀÙ�®�8¯6~
òPc#�����!"|()¨(-:ê�#:Ë#©yP|()¨(-ÊË#=>#>¯
ݯ
Dé´lÔPYj�lH�N!"|()¨(-9|()�fND�l'~D(=�)*+
Ö�jk:;12«¥#�,#=>N��j�lH
���������� ��������������� �� �
���
���� �������
!"�Û+ëò!"��#�.í¯ì
(1) �� !"#���
O���ëì�dí��Y�ef���Eçè@�ùø0p�efEZ[+\]
p����^_EF¬�@¾7_�.X0N Q�pøp�̀ a0N���b0?U�
ëì�dí@?UcÙ�pQ+�0de_�.XXF Q�fg0�O���Ö
%�îhE}~@ciQef���j�¬�"�@kQ���LMX_�%�.ü
��������¼½¾¿��X+�F�+��NU���ÚÛ}Ü@�l0
_�m^%N Q�fn0���9Ä@}Ü_�xF@lo�[Fø QQB���
�ïð@pZ�[q�̂ _EF¬�"�?U+gy��j0�Ù@r�Q}~@
d�st�@8Fø Q�fu0���+vw_x0�Y�ef�^_E¬�"��
yz��IJKLM�:h_�0�{|Fdí�N�X�Yé«0:p�N$��ü
@}~_�X���E�þXþU�F�+�0F �pþY��Ó%Z[�X��sü
Q�b
`����0?U�?UcÙ�süQdíXp��Y�ef0p�p�Jùü���
9E^(�}�X����@�ùø0_�.X@d�pQ�..����9E^(X��
RN�WR$XWF��¹:8 R�ü�W0Jùü�?YFe��[��pF
��N��.üù��j0�dpQ��`a��b0?��b
� 9�)M��[�X�Y¼½�DE�����%ójXp��Q�j����
�0NU�����1=@\�_�.X�����������}3��@A%
���[��b
� ef��0ç���j�NUF%ù�����EF¼½ �Þ0����efg
hvw�����.üþ��X��º»sü�.X%Fø Q%�Y�ef��
��0Jùü�+��NU�ñòFefgh0��ø�F�^(�N��b
b .�dí0óp��̀ a� ��9Ä01_���@�F��23�hI@���
.X%�[Q�b
È¡ ��¢+v0£Ye��[��p���b
ÈÈ¡ e��[��p�}�X��b
ÈÈÈ¡ ¹:8*Þ�Y�^¤b
È¥¡ ¹:8*Þ�mVÓb
¥¡ ¹:8*Þ�¾�E^¤b
¥È¡ ¹:8*Þ%<[50¦$�§Ib
b
� �������������������
b ñò����p�p��g¨+vpQ��@©ªp� U«_.X%N��QX$��
¬5���0���þ_®�?YF���N��DE.ü���¯°�(8X\$
ùü�[Q�+pQ�F�(8�N�Fù�efghvwX�±�1=+F��pø
p�.ü%±ùø�á²EF+��N�Fù��YX+�$F��F³Fù�.�´�
���������� ���������
���
¢+v%<[50óp����@<3¾$@T$�µ¶@¦$�·�Ó%N�øù�N
���.���¸F(8X\$ùü���«p�R5�¹º0���þ_W�X
�W�?YF¢+v�»¼@ó�!Ã)�½¾ó��øù¿°p�©ªy�R5W�
y}��e¨ÀÁ�@ÆÇpQ���HI�)½¡Â¡Ã0ã_?Y0�©ªy�e¨�¢
+v�+��«p�+�[��sü�%����g��¢+v��Y%�¼0Z[�.
X%Äø Q�.�.X����¢+v��©ª0£Ye��[��p%�Q�F�e
¨E^¤�©ª0?�ÅÆ ÇÈÉ�����-Fùq��p5�N�´���á²0
+XÊ����·�Ó@ãËp����X30���ÌÍ��ÎEÏÐ��@XÑ�D
0.�?YFá²@Þ��.X���p5���¯°X<[5����Â�ÒT@T$
�µ¶@¦$�V�HÓ�ñòF'"%8L±²"�Ö�0@Ap���+�X\$ù
ü��b
b
[Figure: Mean normalized duration at T1 and T2 for Repetitions (N = 38) and Repairs (N = 12)]
� W�X��Y�Z[\]^_`_ababcde�Kfg3]^_`hbie-�j�klgm�R]n�eopqrst%uv
�w��xHt ��y )uv�w��z{opqJV�I�1'G0HnX )klg|L}~�R
opqrstHklgm�R)�Z[\�B-��-q��������f�H�
\\] ��^_`aD�bcdef�
�-j'ÉÊ�=!#.�qG#/�\���9:� 0T1�N)¼�Q�RN2Ú
�lHËÔj:N $(÷-�ÁæÁ3 �4ø5��D§«N:!#.�qG#6&D78\N
«±G=H�Á�æ�æ9:X9:n[�9:j;i¼�=� n[�§j#�#[Ü�+�� ò
ÇÛò<�òe=�PË�¼#�§j#Q(ø#[Ü�+Q(ø�òÇÛò<�òe=�>PN:
.�qGYH�l�;D¶?Ï�¥U�òë@��N«¸�¡G=@#jklHAB��ÆP
G�:Cn+��¼�l� n[j#n+Q(ø�#.�qG�ê@N�j(0��D(0�Å
#E¦?ò3 �µ�:Fn+��¼�l� n[j#e=Q(ø#.�qG�ê@NMi
Ô?G�:H� n[e=j#n+Q(ø�#.�qG�ê@NI?�Y�QÔPY'�J
�lHPKN:CPF9!#.�qGP3 v�P#DÞD¡��l@#jk:3 !"#
¥¦�;<NêQ�3 v�D�K�lÔP#?väD¡G�QlH
���������� ��������������� �� �
���
�� W�X�XY������3��,]�eH��]�cda_da��ci����d�abcd��ci�e���������L�
�]�bd��_��� dbabh����¡_�bh����¢bdh���e���L�?":��]�bd��_� dbabh��¡_�bh��¢bdh�e�H�
���3£��3>Gf���3-�f��£V�H�
�
\\\] �gG*hi�j;kl�
!#.�qGN: GÄ#� LÉPC�Ä#� +�#úËDÊplMNDìp:×%
)ø7,d%DO÷N�l��YklP�l�¼:;�#ÔPY�k#0�p(P0�Å#QÚPl
�}ø(N@opl#j9�Q�Hü&:3 !"�¼�}ø(DQRG�u\S«#�DT
ª�lP:C�Ä#� +�#UVYW¢G:�>�C��¼Q@#N�lÔPYAXÊ��
QlH¥¦�!";<,-./YC�ÄNP��C�³�Q� PQRÔP@ �PG�s>l
�¼:�}ø(#�R��u\v½#hi��fÛ@Y«N�K�Z�jklHËÔj:PK
N�}ø(#�fP3 v�P#DÞD[l=>:3 Gù#\Ê��g#�]#á^#ò
z�PGù_gj#�}ø(#c¬ò`zP#DED«±G=HË#U�:�Á�æ�ÁN¡��RN:
\Q3 Gù#_gaÅ�}ø(#HnbYyQc`Y´>¼�=HÔ#c`9PKN�°(
80�°(0jABjk:�~>0�®>0j9'¼����=HÔ#ÔP9:�}ø(#c¬P3
v�P#�DD¡�@#jk:�-P;�N:³93 !"#¥¦�;<NêQ�3
v�D�K�lÔP#?väD¡G�QlH
�
[Figure: Rates of the fillers "eto", "e", "ano", "sono", "ma" at positions Bn0–Bn3]
�W�X�WY�¤�¥¦§�%¨dy£©ª«��¨dW£©ª§f0�¬®¯:"%°"±²°"²³´²µ´²
M"0��,]�eH§f¤�¥¦J)�g|-¶°"±·¶°"·£��3>Gf��
\m] �gG*hi�nop�
�-jJ�´=�°(80�°(0�~>0�®>0�Ñ(09@�P@`z#yQ�}ø(jklH
G�G:Ô�¼#RV#Å�D�KfQ:Å�Dk�fQ�Q�9:��#ÎWâYklHÎ
ä\�!";<,-./j9:Ô#�R�ÎWä@�K�luvYk�RHÔ#�R�ÎWäD
«±�l=>N:N $(÷-�dÑ3 ��# �>PN�e Ñ c¬#�}ø(#�f`zD
�ÉG:§·«±N���Ë�¼#+ÜDf>=H� Á�æ�ø9g �±æÕƾN@P�K+Üj
kl�Á ÕÆ�N�lh/��b9 äÄ�di�H� Á�æ�ø �¼�°(0P�~>09��\��Ú�_
D�lÔP:��ÚV:�°(0D�f�l �9�~>0#�fYj�K:"§N�~>0D�f
�l �9�°(0#�fYj�QÔPYÚ�lH§·«±N�l-$~N@P�Q=7ø-)(
«±#U�:ù�# �7ø-)(YÈÉÊ�=H«+[Ü�¼:Ô�¼9Ë�å�:�°(80
k:�°(0k:�~>0k:�Ñ(0kN§·�lP�p¼�lH�°(80�°(lk9Óä �D
�Ks�:�Ñ(0k9Òä �D�KstHÔ#�R�ÎWä@¥¦�!";<,-./Nm
ì�l#j9�Q�H
� W�X�¸Y� ®¯:"¹Fº»-���E¼½%�¾� X ¿ÀÁÂ�0HòÄÅÆ)¬�Ç1'GHÈ
+)�E¼½$!³-É�Ê�Ë:$Ì"¼½ÍÎJV�H¶°"±·Ï�¶°"·Ï�¶³´·Ï�¶M
"·Ï�Ç-ÐÑ��H�
�
m] �gG*hi�qr@kl�
klc#�}ø(Y\Q3 Gùj�fÊ�lÔPDÇN¡G=Y:ÔÔj9:�}ø(�f
#v�D���N[l=>N:� §j#¹�\v�DXYG=HÇ]12j:� G�R
P�l�YÈn�aÅ:oÛj�}ø(YfQ¼�³�QPQRÍ'YklHN $(÷-�d�
3 �D§«N:X9:n[#oÛN�}ø(Ykl;P�Q;PjË#n[#ÈnÊY
Å#�RNµ�l�D�Z=HÈnÊ9Ë#� Ns��lQ(ø�ò��ò�Á�N���ñ
�=HË#U�:�Á�æ�ÑN¡��RN:Q:�#ÂúDfQ=;j@�}ø(#Ç]�l�
#aRYÈnjklÔPYÚ��=H
[Figure: Mean numbers of morae, words, and phrases per IPU, with fillers (17.9 / 9.2 / 4.0) and without fillers (14.6 / 7.4 / 3.8)]
� W�X�ÒY�ÓÔ®¯:"ÕÖ]�ba×��ba×c�ae-�������q�rst%?":ز�LزÙ
ÚØ0HfÛ��Ü-�f�É�®¯:"£V�Ý/£w�q£qfH�
�
ʼN:Á�MÁò<=Áòp?Á�#oÛj#�}ø(#HnbDÁ#�Ê����#D�P
G�34Ò8G=PÔ�:�Á�æ�q#�RNar�Bj»sj�lÔPYÚ��=°�t �4ä²H��:
�Èn����Y�Q�� aÅoÛj�}ø(YfQ¼�lbYyQH³9:�}ø(#
�f9: GÄN� LÉ#úËDÊplMNDìp�QlÖ�äYyQHj9:�}ø(#
�f9C�ÄNP��@í�\�#��R�H
�
[Figure: Filler ratio (%) plotted against number of words; regression line y = 27.82 + 0.52x (r = .79)]
� W�X�ÞY�Úq�%�LØ0ß�ÚÔJ®¯:"��,Hgà)áâgàHLØãfw�Ýä
ÓÔJ®¯:"£Ffå��,£æfH�
m\] �gG*hist_uAvwxy[�
�}ø(#�fYC�ÄNìplí�DXu�l=>N:"+õ\ü!D]��=H�}ø
(#�fY GÄ#v¼�#� _w#"#jk:C�ÄYË#í�D�xG�Ql�¼:
�}ø(#�fDyº�lÔPjC�Ä#� +�Nv¼�#ÙìYH�l9:jklH�-
#U��:�}ø(9Èn�� #oÛjfQ¼�³�QH@GÔ#c`NC�ÄYB�Q
�Q�:� +�NefG�Ql�¼:�}ø(#æ`N���g¦�l� #ÈnÊDkl
òzz{j�l9:jklHËÔj:|¡�#}Ûj#e*NI��~��#�BD\]�
lhi�j:e*#ÈnÊPÇ]�l�}ø(#¬�DyºG�:�!�#\]#"·�
XD� G=H� Á�æ�4 �:Èn�e*DuvP�l;N9:�}ø(YÇ]�l;#a
RY���e*#;�@¬�N"·YßQHn��e*j9Ô#�R�â9'¼��QH
Ô#ÔP�¼:C�Ä9:Ç]�l�}ø(#æ`N���Èn�GY¦KÔPDklòzz{
G�QlPQRÔPY¡�Ê�lH=�G:�(îN@;�#í�Yk:Ô#í�Dìp�Ql
#9:�}ø(#§xPQR�9tG��}ø(ò�(îYºlX����#�@G��QH
[Figure: Mean reaction time (ms) for complex vs. simple materials under fluent, with-filler, and with-pause conditions]
� W�X�çY��èéêëEì�rst]íîeHïÑåð-�ñ'£uv�òó]���_dae²®¯:"ôõ
]�ba×��b��_ie²ö"÷ôõ]�ba×�`h�î_e�BHøùú�à)òó�û£üý��B�þù)���
�BHüý�òó1���G��B+�ñ'k-®¯:">ö"÷£ôõG����ÇëE£�
���H�
�
YZ�[�zdJ�
��#12<�9��#�RN�P>¼�lH
Cb 3 !"j9:�}ø(³!#.�qGPQ�=�u\v½Y�KfQ¼�lH
Fb Ô�¼9:¹�\v�#��¼::3 v�N\K��Ê��QlH
Hb Ô�¼9: GÄ#� LÉPC�Ä#� +�#úËDÊplMNDìp:WX;�
#¥¦�×%)ø7,d%#ü�N��G�QlH
�b 3 !"#;<³C�³�Q�C�ÄNMNDìpl�!";<#=>N9:Ô�¼#
�u\v½#hi��fYuvjklH
�
(2) ���������� ��
�}ø(³!#.�qG�Å9��ä°���qüý����²P��Ê�:»�%§¨j� Yy
���QlHG�G:Ë#aPJÅ9Â�\�e*NPÅ��=@#�:!"´µ,-./j#
�QN�TDk�=@#jk:!";<,-./j/�\NefG�RPQRST�¼#@#
9�QH�12#<�D��N!";<,-./N·fj�lÚ¸j9�QY:Ë#�ßP
Ö�äN�Q�9Y«N¡�=@#P2RHPKN:IJ#!";<,-./�t��tpt�p� ����
���t�r�Y�n�#!"ì0PQR�#��¸j,TG��=#N§G�:Õ�è#!";<
9��P#¥¦�×%)ø7,d%#�j#!";<D ��ÔPN��RHËÔj9:3 v
�D�KG=!";<³C�³�Q�C�ÄD�KG=�!";<YgÖ�#KL\hiP
�ljk�RHË#�R�_`äN§G�:�Óé(3#<�9ʼ�l�,#=>#Â�D
ìp�QlPQplHn�#�ù�¼3 ³C�Ä#æ`D�EPG=�ùN��É�ÔPN
���:!";<#·f«¥9��\N�YlP£!j�lH
�
���� ���������
¨©Xf|0ªl�«n�¬��b b
This research group is concerned with the modeling of prosodic and voice-quality
information that modifies the interpretation of an utterance or expresses speaker-
specific states or relationships.
������������
Speech communication has an important linguistic component, but it is also
characterized by the expression of paralinguistic and extra-linguistic information.
Traditionally, linguistic research in general, and speech technology research in particular, has been restricted to the study of text-based information, related primarily to
the expression of propositional content, and has considered the paralinguistic and
extra-linguistic content to be ‘not part of the message’. This content remains a relatively unknown and much under-studied component of the linguistic code.
In the present research, we are aiming to produce basic speech technology for general
use in an Advanced Media Society, and consider the expression of personal attitudes and
relationships to be as much a part of the message as is the propositional content that can
be equivalently transmitted in the form of text. That is, we are not so much concerned
with text-based linguistic information, as with the additional affective information that
distinguishes a spoken utterance from its written counterpart.
This research group is concerned with the modeling of prosodic and voice-quality
information that modifies the interpretation of an utterance or expresses speaker-specific
states or relationships. We started out with the intention of adding ‘emotion’ to
computer speech, but soon realised, as a result of our research, that emotion plays only a
small part in everyday spoken interactions and that the expression of ‘affect’ (which
includes more general indications of e.g., personality, mood, politeness, and discourse
intention) is far more common.
Figure 1 illustrates a speech utterance (in this case, taken from read speech) that has
been labelled for prosody in the traditional manner using the ToBI conventions. It
marks the accent peaks and phrase-boundaries alongside an orthographic transcription
but shows almost nothing (apart from an indication of the syntactic structure and
phrasing) about how it has been said.
In the case of read speech, the speaker is concerned primarily with expressing the
content of the text, and often has no direct relationship with the listener and no personal
commitment to the content of the utterance. However, this is not the case with
conversational speech, where the speaker and listener are usually in direct contact, and
the speaker is personally motivated and has defined relationships with the listener.
Whereas speaking-style has only stylistic relevance to read speech, in conversational
interactions the manner of speaking (i.e., the how) is as important as (and often more
important than) the content of an utterance (i.e., the what). Non-verbal ‘grunts’ and
laughs are common in conversational speech, and they reveal much about the speaker
while contributing little to the flow of propositional information.
Figure 3.3.1. A Japanese speech utterance labelled for prosody. The top row shows
the fundamental frequency (pitch), the second row the speech waveform (power) and
the third row the associated ToBI labels which describe the accents and phrasing. The
fourth row shows the text of the utterance, split into words, and the bottom row shows
the ToBI break-indices that mark the prosodic linking between the words.
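As a rough illustration of the annotation layers just described, the sketch below (hypothetical Python with invented field names, not the project's actual file format) bundles the five tiers of Figure 3.3.1 into one record:

from dataclasses import dataclass
from typing import List

@dataclass
class ToBIUtterance:
    """One utterance annotated in the five tiers shown in Figure 3.3.1."""
    f0_hz: List[float]        # fundamental frequency contour (pitch)
    power: List[float]        # speech waveform energy (power)
    tones: List[str]          # ToBI tone labels, e.g. "H*", "L+H*", "L-L%"
    words: List[str]          # orthographic transcription, split into words
    break_indices: List[int]  # ToBI break indices marking prosodic linking between words

    def major_boundaries(self):
        """Positions of major prosodic breaks (break index 3 or 4)."""
        return [i for i, bi in enumerate(self.break_indices) if bi >= 3]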
Although we first attempted to label our speech corpus for emotion (using the Feeltrace software illustrated in Figure 2), this proved to be a difficult exercise, as most of the utterances varied little in their emotional expression. The differences were better described in terms of the 3 dimensions of voice, speech, and speaker (see table left), using features to describe not just the emotion, but also the quality of the voice, the intentions of the speaker as determined from the single utterance under observation, and the (possibly different) intentions of the speaker as determined from the long-term context of the discourse. The 3-way labeling indicated where the surface-level features differed from the deeper ones (e.g., when pretending, acting, recalling, or quoting) for subsequent analysis (see Figure 3).
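A minimal sketch of the resulting three-way label record, with illustrative attribute values (the project's actual label inventory is not reproduced here):

from dataclasses import dataclass

@dataclass
class StyleLabel:
    """Three-way speaking-style label for one utterance."""
    voice: str    # quality of the voice, e.g. "breathy", "pressed"
    speech: str   # speaker intention as judged from this utterance alone
    speaker: str  # intention as judged from the long-term discourse context

    def surface_differs(self):
        """True when the single-utterance reading disagrees with the deeper,
        discourse-level one, e.g. when pretending, acting, recalling, or quoting."""
        return self.speech != self.speaker

# Example: a cheerful-sounding utterance produced while quoting someone else.
label = StyleLabel(voice="bright", speech="cheerful", speaker="quoting")
assert label.surface_differs()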
In addition to speaking-style labeling, we also performed discourse-act labeling of a
portion of the utterances. Since none of the discourse-act label-sets that we
encountered in a search of the literature was general enough for our conversational
speech data, we formed a combination set and labeled the utterances accordingly.
Weekly meetings were held to agree on a consistent set of labels that adequately
described the speaker’s intentions while being both general and concise. Phonemic
and syntactic/semantic annotations were performed using the public-domain software
Julius and Chasen. These labels constituted the perceptual component for a
subsequent statistical training against acoustically-derived features.
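These perceptually agreed labels then form one side of the statistical training against acoustically-derived features; a minimal sketch of how such a training table might be paired up (the feature set, values, and use of scikit-learn are assumptions for illustration, not the project's actual pipeline):

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per utterance: acoustically-derived features on one side,
# a perceptual (discourse-act) label on the other.  All values are invented.
acoustic_features = np.array([
    # mean F0 (Hz), F0 range (Hz), duration (s), mean power (dB)
    [182.0, 95.0, 0.42, 62.0],
    [121.0, 30.0, 0.85, 55.0],
])
discourse_acts = ["backchannel", "statement"]  # labels agreed at the weekly meetings

# Any supervised learner can now map acoustics to the perceptual labels.
classifier = LogisticRegression().fit(acoustic_features, discourse_acts)
print(classifier.predict([[175.0, 80.0, 0.40, 60.0]]))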
Figure 3.3.2. Feeltrace labelling of a speech utterance. The four quadrants indicate
the primary emotional spaces, and the listener indicates the emotional colouring of an
utterance by (a) marking a point within the circle, and (b) writing a descriptor on the
line. The 2 dimensions of valency and activation are widely used for this feature.
Figure 3.3.3. An excerpt from the speaking-style labels, showing voice, speech, and
speaker attributes for the sample word ‘��� ’ . Pitch patterns are marked in
addition to the paralinguistic features of the utterance for later training.
Figure 3.3.4. Principal Component Analysis shows the relation between perceptual and
prosodic features for (a) improvement of the label set, and (b) mapping between
paralinguistic and acoustic parameters for recognition and synthesis.
As the example in the figure below illustrates, the same word spoken by the same
speaker can have several different meanings according to when and how it is used, and
to whom. By use of statistical modeling (such as that illustrated in Figure 4), we have
learnt the mappings between the acoustic and paralinguistic features for a small number
of highly ambiguous words in order to facilitate the automatic labeling of the
conversational speech corpora, and to provide ways of accessing appropriate segments
from the corpora for use in concatenative conversational speech synthesis (see below,
on joint work with Group 7).
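A minimal sketch of the kind of joint analysis shown in Figure 3.3.4, assuming invented perceptual ratings and prosodic measurements, and using scikit-learn's PCA purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

# Joint table for tokens of one highly ambiguous word: perceptual ratings
# alongside prosodic measurements (all values invented for this example).
# columns: friendliness, certainty, mean F0 (Hz), F0 slope (Hz/s), duration (s)
tokens = np.array([
    [0.9, 0.2, 240.0, -12.0, 0.35],
    [0.1, 0.8, 150.0,   3.0, 0.60],
    [0.8, 0.3, 230.0, -10.0, 0.33],
    [0.2, 0.7, 160.0,   5.0, 0.58],
])

pca = PCA(n_components=2)
scores = pca.fit_transform(tokens)
# The component loadings show which perceptual and prosodic variables move
# together, which is the relation visualised in Figure 3.3.4.
print(pca.components_)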
We had hoped to automate the paralinguistic feature-labelling of the main ESP speech corpora, but for a large number of utterances there is a strong dependency between text and acoustics, and this work has not yet been fully automated. However, this insight led to a new categorisation of the corpus transcriptions into those representing utterances for which the text alone is sufficient (I-type), and those for which prosodic information is essential to their understanding (A-type).
This ‘I/A’ categorical split later formed the foundation for our new ‘conversational
speech synthesis’ interface technology, to distinguish content from affect-bearing fillers.
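As a toy illustration only (the filler list and threshold are invented, not the project's actual criterion), the I/A split can be thought of along these lines:

# Utterances whose transcription is dominated by non-lexical fillers and
# backchannels are treated as affect-bearing (A-type); the rest are treated
# as information-bearing (I-type).
FILLERS = {"eto", "ano", "un", "hai", "ee", "ah"}

def ia_type(words, threshold=0.5):
    """Return 'A-type' or 'I-type' for a tokenised utterance."""
    filler_ratio = sum(w.lower() in FILLERS for w in words) / max(len(words), 1)
    return "A-type" if filler_ratio >= threshold else "I-type"

print(ia_type(["un", "un", "hai"]))                        # -> A-type
print(ia_type(["the", "train", "leaves", "at", "nine"]))   # -> I-type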
Figure 3.3.5. Main findings from the prosodic analysis: NAQ shows the voice-quality setting, F0 the pitch range. It has long been known that voice pitch varies according to the listener, but we show here for the first time that vocal settings also vary in the same way.
The second main finding of our research is illustrated in Figure 5. This shows that the voice is controlled according to the context of the discourse, and that we adjust not just the prosody of our speaking styles (F0, timing, and loudness) but also the phonation style (voice-quality settings) as well. This may appear common sense, but the fact has not yet been integrated in any speech technology or prosodic analysis. It is due to the unconstrained nature of the recordings in our corpus that we have been able to show for the first time that phonation style is (perhaps consciously) controlled in conversational speech. Similar findings were obtained for politeness levels and discourse act, indicating that (a) these features can be recognized alongside the text of an utterance for a more intelligent ‘understanding’ of human speech by machine, and (b) that this level of control must be introduced into speech synthesis if it is to carry the same information as a human voice.
Accordingly, we now distinguish the ‘affect-carrying’ (A-type) utterances from the ‘information-bearing’ (I-type) utterances, and make use of prosodic information (including voice-quality) to further label the A-type utterances in our corpus.
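A minimal sketch of the comparison underlying Figure 3.3.5, with invented NAQ and F0 values standing in for the corpus measurements:

import statistics
from collections import defaultdict

# Group utterances by interlocutor and compare voice-quality (NAQ) and pitch
# (F0) settings across the groups.  The records below are invented; in the
# project these values come from the analysis of the ESP corpus.
utterances = [
    {"listener": "family",   "naq": 0.14, "f0": 210.0},
    {"listener": "family",   "naq": 0.15, "f0": 220.0},
    {"listener": "stranger", "naq": 0.10, "f0": 245.0},
    {"listener": "stranger", "naq": 0.09, "f0": 250.0},
]

by_listener = defaultdict(list)
for u in utterances:
    by_listener[u["listener"]].append(u)

for listener, group in by_listener.items():
    mean_naq = statistics.mean(u["naq"] for u in group)
    mean_f0 = statistics.mean(u["f0"] for u in group)
    print(f"{listener}: mean NAQ = {mean_naq:.2f}, mean F0 = {mean_f0:.0f} Hz")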
Figure 3.3.6. Categorising speech according to discourse event, speaker-state, and speaker-listener relationship. Discourse events (E) are considered directional and can function to give or to get A-type or I-type information according to the speaker-state (S) and listener-relation (O) framework. This is used for input to the synthesis engine.
Finally, by linking these findings, we proposed a new input framework for conversational speech synthesis. The equation ‘U=E|(S,O)’ can be read as ‘utterance (or ‘speaking-style’) equals discourse event (E), given the self/other relationship contexts’, where S indicates ‘self’ (e.g., in a good mood, interested in the conversation, etc.) and O indicates ‘other’ (e.g., speaking to a friend in a friendly environment, etc.). Only by the slow process of manually labelling several hundreds of utterances for speaking-style and discourse-act information did this framework become apparent, though in hindsight it appears to be simple common sense. The implication of this finding for speech technology is that a new affective (paralinguistic) level of information processing must be included if a computer is to be made aware of (or to mimic) the information that is present in human conversational speech, whereas current speech technology is only sensitive to a text-based linguistic level of information.
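A minimal sketch of the U=E|(S,O) input framework as a data structure, with illustrative field values:

from dataclasses import dataclass

@dataclass
class SynthesisInput:
    """U = E | (S, O): a requested utterance is a discourse event E interpreted
    in the context of speaker-state S and listener-relation O.  The example
    values are illustrative only."""
    event: str           # E: e.g. "greeting", "backchannel", "thanks"
    self_state: str      # S: e.g. "cheerful", "tired", "interested"
    other_relation: str  # O: e.g. "close friend", "customer", "child"

request = SynthesisInput(event="greeting",
                         self_state="cheerful",
                         other_relation="close friend")
print(request)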
Figure 3.3.7.� The CHAKAI notebook-computer interface for the AESOP conversational
speech synthesiser. Users click buttons to select Self, Other, and Event features of the
utterance, which can then be modified by the sliders on the left of the screen if required.
Final selection is by the activated smiley-faces (lit in orange) at the top, which indicate
the available speech variants for this utterance in the corpus.�
Although in this research project we are not concerned with processing text-based
linguistic information, preferring instead to focus on the paralinguistic information that
distinguishes a spoken utterance from its written text, we have been forced to categorise
many of the A-type utterances into similarity classes for our discourse-level labelling
(e.g., “Hello”, “Hi!”, and “Good morning” are all ‘greetings’ and so can be treated
equivalently, as alternatives, to be selected according to the S/O criteria). This event-
based clustering has resulted in a prototype interface for a conversational-speech
synthesiser using iconic representation of discourse acts instead of typed text as input.
A-type conversational utterances are selected in the new ‘Chakai’ interface (Figure 7)
from a matrix of possible act (or Event) type icons that appear when a given
combination of Self and Other options is selected. Three clicks (and four optional
slider-settings) produce an utterance (retrieved from the database) with the appropriate
speaking-style characteristics for a given situation. Field-tests have shown that with a
talkative partner, this interface can be used to sustain a lively conversation, but since it
can only produce back-channel or affect-revealing utterances, it must be combined with
an I-type synthesis interface for real-world use as a future communication aid.
Figure 8 shows a sample of the underlying structure of the A-type synthesis selection.
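A toy sketch of the selection step this implies (the table, keys and file names are invented for illustration; the actual Chakai interface retrieves pre-recorded segments from the ESP corpus):

# Three choices (Self, Other, Event) index into a table of recorded A-type
# utterances, and one matching waveform segment is returned for playback.
corpus = {
    ("cheerful", "friend", "greeting"):    ["wav/greet_0132.wav", "wav/greet_0457.wav"],
    ("tired",    "friend", "greeting"):    ["wav/greet_0078.wav"],
    ("cheerful", "friend", "backchannel"): ["wav/un_0012.wav", "wav/un_0345.wav"],
}

def select_utterance(self_state, other, event):
    """Return one waveform file for the chosen Self/Other/Event combination."""
    candidates = corpus.get((self_state, other, event), [])
    if not candidates:
        raise KeyError("no recorded variant for this Self/Other/Event combination")
    return candidates[0]  # a fuller version might rank candidates by the slider settings

print(select_utterance("cheerful", "friend", "greeting"))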
Figure 3.3.8. Sample of the linguistic (or discourse-act) structure behind the Chakai
synthesis interface. Equivalent utterances are grouped and appropriate waveform
segments are then extracted from the corpus according to the selection criteria.
� ����������� ��
�6�{|YZ�}~�����
The importance of non-verbal speech processing will grow considerably in the coming
years as the limits of current speech recognition approaches become apparent.
Machines that cannot understand what a person is saying may be able to process how
that same thing is being said, and as international travelers well know, it is often
possible to understand much about a human interactive situation with little or even no
knowledge of the language being spoken. One example of the application of this
non-verbal speech processing technology is the SCOPE “Robot’s Ears” project at ATR:
���#�w\@AÉ712T���Rz�����²j9:WXDQb)�l�%�KLP
G�#12Y1���QlH��:����t���s�� #�(/�(5�¼H
��12hij9± WPW#§ ×%)ø7,d%DQb)(G Ë#�0#��D³�Jl
KLDXYT��l¡o�\��Ds>: � ��ò ��z\@A�ÅDàá�lKL
jk $a'bc(,d%@A*+~éí¯î/³Â�KL#T�D \P�l¡ ��z#
129MN ×%.¯56%8é(/FGD¹�G zË\�|()©ªò«±ò*+~éí¯
î/#XYD§«P�l¡Õ�z9 348)×3#T�P¢£ Ť�z9 |()¨(-P
*+~éí¯î/#¥TD ��¡wx\N9 ¦_Çd¢õý(/9 �8/~Ò3j ~
§@Aò!"@A#ÈÉP«± ¨©¢õý(/9 �>P#~§P!"@A�¼#o�
\òo�\$a'bc(,d%@A#«± k�� #ªÒ81ý(/9!"@A#��¼ �
���������� ��������������� �� �
���
��ò�z#@A#«± {|}~1ý(/9 � #�0#��Nê¸l ��«P i
�,#RS#Q|é ê�#� #�0#��Nê¸l GÄòC�Ä#+�zò�zòr
@\��#��#Q|é#¹�D¬�l¡Ë�¼#©PG� ªÒ81Y®�\°(56%
8KL³,-./#ô;ì �� �Y³>l0~éí¯î/#12T�D]R¡Ë�¼#U
�94�Ò89�> �-)Ñ(c~(�ÅN·fj��R¡0
��12è��:î< �q�zÐ�d�z�
¯°j9:����������� P¬s12345678Y±��¼�²G=H
“HUMAINE (Human-Machine Interaction Network on Emotion) is a Network of Excellence in the EU's Sixth Framework Programme, in the IST (Information Society Technologies) Thematic Priority IST-2002-2.3.1.6 Multimodal Interfaces.
The HUMAINE Network (Contract no. 507422) started on 1st January 2004, and is set out to run for four years. 27 partners from 11 countries participate in the network. HUMAINE aims to lay the foundations for European development of systems that can register, model and/or influence human emotional and emotion-related states and processes - 'emotion-oriented systems'. Such systems may be central to future interfaces, but their conceptual underpinnings are not sufficiently advanced to be sure of their real potential or the best way to develop them. One of the reasons is that relevant knowledge is dispersed across many disciplines.” (http://emotion-research.net/)
�
����[���������
For the provision of information, such as in station-announcements or newsreading,
current speech synthesis technology is probably adequate. However, it is
unacceptable (at best, uncomfortable) for use in conversational situations where
two-way speech-interaction is required. The immediate application of this technology
is therefore in speech synthesis for interpersonal applications as a communication-aid,
or in humanoid robots and similar interactive devices that interact with (but do not
necessarily inform) human beings.
Of considerable importance, too, is the impact of these findings in Academic Circles;
until recently, ‘prosody’ was considered as including pitch information first of all, then
timing information, and thirdly loudness. Voice Quality was not generally considered
to be a relevant prosodic parameter. However, our findings show that it needs to be
included as the fourth prosodic parameter, and thereby open up a new field of research
which will have repercussions in technological applications as well.
�
�������������w��x��y[�
If machines can become sensitive to the feelings of people, as revealed by changes in
the way that they speak, then we can imagine a much softer, friendlier environment for
the future. Instead of people having to adapt to machines, the machines will be able to
adapt to the people. It is trivial to substitute an attention-getting speech-utterance for a
beep, but it will also be necessary for machines to sense when to use which; and an
essential ingredient for this is an awareness of and an ability to process affect as
expressed through differences in use of the human voice.
������� ��� �� �
���������� �
���������������������������
�� !"�#$% �&'()*+,-.�/0
12!"3 45!67�
89:;�
<=>?�� ����@0A!"�� �B"�#$% CD<
=>?3 45!67�8EF;�
GHI�� ���<= �JK��LM��NO�0PQ!�RS
TSUVW0X#GHIYZ3 45!67�8U;�
[\]^_ �� �
[\]^_ `a b cd!"ef&'()$gh,
�� ij�YZ$k�X# 3I`Pl �B
67�
8Em�
n.o>?�� �
��%pqf#��$n.o>? �q>rs
truvtrwxt,y$z{%pYZ3 4
5!67�
8|m�
�} �� �
~����0 ���^��� �q��>?� ��#��
,$%��#�+}��\_[���\�� �.o�>?� ��
#��,$%CD3 45!67�
8E;�
�_^���}��^� �� �
����)���X���)�I����$���0
��$�S2$�� �D��I�3 45
!67�
8�m�
��� � $¡�r¢£S,y$ ���X� $¤�r¥
�$YZ3 45!67�8:FF�
�¦§¨�� ��¦§$©��ª«¬$j0��"$YZ
3 45!67�8®m�
�¦¯°�� �±I���²�$³´rµ¶012!��¦$·�
0�l#¯°3 45!67�8:m�
�
� ������� ������ ������������ ���� ���
��!� "#$%&�'()*+,-�����
!"×%)(�6(-KL#12
������������
�Óé(3j9:.�-8!";<#?ÄÛjkl$(÷-¨(-!";<_CN³ G:
Ë#ÄÛNfQ¼�l!"$(÷-ðñN$´DT¼�ÔPN�WX¼GQ!";<,-.
/DüµG:Ë�Dþ�ñJ�$a'bc(,d%úû,-./Düµ�l��D¶�=HÔÔ
j9:°�² r@��YÖ��!";<#12:°æ²¨·¸¹º»¼��Å# �ÔPY½º
�W#=>#$a'bc(,d%úû,-./#üµ:°Á² Z[� D¾¿G:÷øo�@AD
��j�l$a'bc(,d%úû,-./ðñ#=>#� «±:À¢ÏG�AX�lHÁ:
�Á#°�²ê�#°æ²9:�AXm#ÂN¡�ë�:ÃÄÅÆ°æÄÄø²D�"N�P>=H
�
�
���������� ��������������� �� �
���
H%$�s�f#�����YZ
�12912bÒ7ò�Ç%¨éý(/¯(ÈPÉ·ý(/#{%®(Y�ää4��²;j
�>�Q=345678j:æÄÄÄ�N������345678YÊ]Ê�=g:�345678#
+FPG��>¼�=H��N*ZlËr@�./ð�#r@!"|()¨(-#¹�PËr
@!"ê�#;<!"D� G=ÌJü!:Ër@N§·�l!";<,-./#¬º�
j9345678�²��#<�jklH
�� ��@#H%����u�����
r@��9$a'bc(,d%NêQ�?v���D¶��QlHr@DÜplÔPYj�l
!";<,-./9 �ÔP³r@��Y½º��Í�#��¼::Î[�NP��@¬í
�$a'bc(,d%Ä�P<�lH�Áj9EFG=r@��YÖ��!";<ÄÛN�
Q�*ZlH
�12j9r@!";<,-./Dü��l=>N:÷ø{()ì#uvY�K:Ï�:!
"#¥¦äDÐ�j�l�Ñ�%&�BÉ7Â�KL12á�k���jT�Ê�=$(÷-¨
(-!";<,-./�jk��DfQ:��G=Qr@#!�\�ÆD�KsJ�¢£¤�
!"|()¨(-Dº<G:Ë#��¼E¦#=>#!"½�D\]G:Ë#!"»BD_
EE¦�lPQRÄÛDEFG=HÒ=�KQ¸qWX#!"Ë#@#D���78*+D+
i]Ú:N��G=Qr@#�ÆD;<!"N"#j�lH�Á�ø��NEF�lr@!";
<ÄÛ#��D¡�H�_C#eT9��G=Qr@#!"|()¨(-Dº<j��q
Å#�R�r@;<!"j@;<j�lTNklH
Á:�AXmj9:r@!"D �#r@Y"#Ê�=¹":r@;<!"D �#r
@Y"#Ê�=;<!"P�lHËG�:r@!";<P9:!";<,-./DfQ�r@
; name start(s) dur(s) zdur f0(Hz) zf0 voice
# 0.00 0.80 0.713 114.370 -0.335 0.008
g 1.24 0.07 0.145 95.999 -1.351 1.000
o 1.31 0.09 0.075 105.358 -0.817 1.000
k 1.40 0.03 -1.054 92.892 -1.845 1.000
i 1.43 0.03 -0.802 91.945 -1.298 1.000
g 1.46 0.05 -0.685 109.285 -0.475 1.000
e 1.51 0.09 0.244 134.526 ...
N 1.60 0.08 -0.207 ...
;<!"Dº<�lÔPDÊ�H�=:!"|()¨(-P9!"»B|():ê�#:�X
@A³¾¿@A�Å:�jk��j#!";<Nuv�+Þ#!"|()DÓÔG=|()¨
(-#ÔPDQQ:!"$(÷-P9:!"|()¨(-#½P�l©yÊ�=!"»BDÊ
�HËG�:�AXmjnN$(÷-Pm��=@#9!"$(÷-DÊG:.�-8$(÷-
9�w�:N.�-8$(÷-P�e�lH
�
�� H%���* �¡��
H%�¢{�
�12j9:�ª�r@#�j@:Å#Â�r@#12N@s���ê:!"NêQ�
@� \!�\�ƳÌJ\Õ«NöQYAXÊ��Ql�.#0�/0�ðG�0D§«P
G=H�12j9:Ô�¼#Â�r@9&%�×%8�[N�Þ#r@D��@#j9�K:
Ë�¼#kHr@@7sG�QlP�plHÔ�D�Cì�l�¼q:�Á�ø�æ#�RN�lH
£¤d�>5G8�
�12#!"$(÷-#£¤N�Q�9:�jk��j#!";<N�ú\NfQ¼��Q
lk��ÑæÑ�D ÖPG=HÕÁj*Zl+óN�:¥¦äD?×�l=>!½®ø%-D6
�N�KG���=Y:U�\NÑæÑ��5#®ø%-Y�¼�=HPr@#£¤P!¾®ø
%-D�Á�ø�� N¡�H
����������������
���
�����
�����
��
��
(�)�
���� ��� �� �� ������
����
�� 12 495 39171 349 57
�� 15 461 40928 377 51 (��
!"# 10 426 31840 345 48
�� 12 495 39171 388 43
�� 15 461 38360 402 43 ($�
!"# 9 343 27302 383 31
ATR525 -- 525 31053 403 42
�
¥��* �
IJ#!"$(÷-9:Ë#f<N����Þ)-7DðÞG=�j!½®ø%-D�K�
l_ÛY���=HG�G:�12j9)-79ðÞ�::)(ØÒ8r@D��G�QlG³
�çDj�l�¸�Ks�:³�ÄYj�l�¸¥¦NË#r@DÙ�¦¸¼�l§xPQR
ÔPDº<# �NÚ´=Hwx\N9:.#:/:ðG�#Ër@Y�K�Ú��Ql�
çD�Í�#¥ÛÉÜ��Å�¼©ªG:ÝB�#Þß#@PNr@�DëàG�r@>P
NQ>4(ÓBC#.�-8$(÷-Dº<G=H©ªÊ�=�ç#r@D¢õH4æÔN\R
Ë]jàÏG�@¼�=PÔ�:4Ä��¨9$(÷-ðñ�#r@«¬P+áG=H
01�
~�¯%-â³ã�DHâPG�Q�QÒÓPïÔ#�"D��äj©yG�q�jå± �q��t
jk��Ù�G=H©y¶5�9©yT1�N³�ÄY)(ØÒ8r@DÙKÔPYj�l�R
� ijn3D�l�Å)(ØÒ8r@#æ.N"Y¸=H©y�@³�ÄNçèY'¼�
=�9�áG:¯øÒ7-G�@¼�=g:�=r@#æ.D]�=H
H%¦§UV�
º<G=r@!"|()¨(-#!"D���r@àÏü!D]�=H�]#��Dj
�K�l=>NËr@.�-8$(÷-#M�ç�¼ø%È/N��:�\]G§·�l!"
D0HG:¢õéHæäÔNË�å�ÑÄ�:�C���\RË]jr@DàÏÊ�=HU�
9.#dÄi:/dqi:ðG�äÁijk�=Hr@!">PNê¦�b�ÁÁ�Ái�Dëé×�PG�
À-«+Nê¸lîì#â#XÞD]�=U�:ëé×�9¬�íú�ijîïÊ�=HI
��:r@9¬�NàÏÊ�=Pàáj�lH
�¨@©ª�
Ò"Ó"P@NÂ�½»�9ðG�:.#:/#ðjyK�:«ñ«±D]�=U�:
¬�íúÑijPr@9¬�Nµ��=H÷ò(:� ßzN�Q�9Ò"Ó"XN+áG
=c`9'¼����=HÒ"Ó"#!�\�ÆD�Á�ø�æN¡�H
��������������������
(Hz) (ms)
����%&'( ��)*��+,��
�� 255.8±52.1 67.3±29.5
�� 249.1±49.3 65.3±30.1 (��)
!"# 235.7±34.5 74.6±32.8
�� 174.8±38.9 60.7±26.8
�� 161.1±36.1 58.6±25.3 ($-)
!"# 124.9±20.8 56.2±20.8
�
���������� �����
./"*01������234567�8�9:��;<"=>?"@01A/��2./
"=BC2DE*FGHIJKL*A/MNOP<QRSTUKL*MNOP2VW�X
"@Y�ZS[\Q]2D^@01_`@��abc@d0cSefgBC2DE*F&
hijkLl�BC�mn<opSq^=rsgF�
«¬�E��®¯d°±�E®¯�²³�
�12j9Â�½»�Pó¦�X�DÂúNn[\]D]�=ê�#�çD�ÄÛj
;<G:z{Ê�=;<n[Ã=Pü&N\]Ê�=n[=D� G=U�:�Á�ø�Á#Â�
½»�Ã=Y¡�É:z{ôN»Qn[Y\]Ê��QlÔPYÚ��=H
� � � � � � � � � � � � � � � � � � � �
�����
�������
���
���
���
���
���
������
��������������������
���������� ��
!"�����#$
�� ������F0�� �� ��������F0�
������
� !"#$%&'�
� �����������������()*+�
��������,-./01�2��3456789'�
���������� ��������������� �� �
���
´µWX¶·� H%¦§UV�
¢õéH�dÔN�_Cj;<G=r@;<!"�%3éDE¡G:\RË]jàÏÊ�
=H�fG=�ç9õöN�jk��j;<Ê�=�ç#��¼Ñ�Dø%È/ÈÉG:�12
jº<G=.#:/:ðG�#Pr@|()¨(-DfQ�:;ñï¢#;<!"�%3é
Dº<G=Hr@Y÷GKàÏÊ�=�;9:Ò"9:.#Ñæi:/Ñ�i:ðG�4øi:Ó"9:
.#Ñ�i:/qÄi:ðG�dæijk�=Hü!U�Dæ�Á�Ñ#r@!"$(÷-#r@àÏ¢
£P;�ÄÛ#XÞD]�=PÔ�:ÒÓ"P@Ár@!"NêQ�:¬�íú�ijr@9¬
�NàÏÊ�=HÓ"#àÞU�D�Á�ø�øN¡�H
�
´µWX¸·� r¹º»¼UV�
¢õéH�æÔD§«Nøù,-./P#� PQRBCj�çÞ�z¢£D]�=H�
,-./9�Ñ��/$#$a'bc(,d%°×9#��oýÒ3NefÊ��Ql!";<,-
./P;ú�#�����û�Ü�Ñ�¯$(#üýþæ�ÄjklH!"�%3éPG��ÑN¡��²
��Dst�ç:æ�²nYÞ�Ê�NKQÔ?#þ�;Ú�Dst�ç:Á�r@�Dst
�çD:�ÄÛP� ,-./j!"NÙ�G=@#DE¡G�9�Á>P#÷�D@���
çÞ�zD�ÉG=H�çÞ�z9�_CYäæ��ij� ,-./Yd��äijk:EFÄÛ
jº<G=r@;<!"#_YyQÞ�zY�¼�=H
´µWX½·� ¾¿UV�
EFÄÛjº<G=r@;<!"#�rzD¢£�l=>Nr@#Ô@�=�çDE¡
G�Ñ��MS¢£D]�=HÔ#ÌJü!@øù,-./P#� PQRBCj]�=H°¹�²
¢õH��øÁÔ�:ê�#:°¹æ² §« (�(�¸��ä���ì�k���¼�ÑÔ�#óÓé
(3D�!�PG�ÌJü!D]�=HQ:�#Óé(3NêQ�@:¢Tîì9�_C#
_Yøù,-./�yQ°�¹���,-./ØÔPD�´G=H
�À H%�������WÁ�
º<G=r@!"$(÷-DEFÄÛ#|()¨(-PG:ê�#�çDr@#Ô@�=
;<!"NÙ��lr@!";<,-./��sts��DüµG=H�,-./N9Ër@#�
����������:;�����<=01�
Nîï!"@\]3,d%PG�àp=H;�ÒÓ �Nk��ÑæÑ�Dâ�#³�_j³J
j@¼��îï!"|()¨(-Dº<G=HüµNÇU��Ër@!"Nîï!"Dàp
=�c¬#;<!"jàÏü!D]Q:üfìNú�#�Qè¨éj¬�NàÏj�lÔ
PD�´G=H
�,-./#~�è×~¯8D�Á�ø�ÑN¡�Hüµ9×%)3¯)o������j]�=H�
,-./9�������û��jǺG:Ò"òÓ"#\]P!"#c¬#\]D]�=g.�-
8VWyºD]RÓø�}�é (�×%)�6(-�¹ o²Dº<G=HÔÔjVWÊ�=@AY
!";<SN�Ê��;<!"YÉWÊ�lH��ÚV: (�(9Ò"òÓ"#Q:��
P!"#c¬D\]G=gN.�-8DVW�lÔPj\]G=r@;<!"DÉW�lÔP
Yj�lH�=VWgN �:r@DÙ��lÔP@Ö�jklH"Px#¥óD��=W@�
ÍN;Ú�=�IVWµÜDef�lÔPj�,-./D�fj�lH�=: �Pr@@A
DëàG=.�-8DÐæ�l��@µËG==>: (�(9vzj@ÐæG= �r@@
Aë�.�-8D;<�lÔPYj�lH
�
�
� �
�
����������� �����
thiu*vw2x<S@yz{-|}~;��5�� �������2u=@&���
S�g���������������3�������5��2�."*F&� ij¡¢����
£�¤�����¥¦S�§2¨^*F&hijY�¥¦©lQªGi2«¬fgF�
����������
k��³¸5-84�}(�����Å#¨·¸º»N��Ê�lWª9:¨·��Y��Ê
�� Ç��Y�Ú��QK=>N:�ªN�x����]:×W:��:o�:����Å
Yã�G�³Y�9M���P�lHËG�:��¸N@��Yê��PB�iTGW$�
�%Dµ³G�¸�q�N lHG�G:B�iTD]RP"DÉ��K�l;Y�QH"
DÉ�ÔP:!#�@DÙplÔP:�ÖÄÖD�RÔP:Ë#M�Yj��K�lPQRÔP
�����>������:;?�@ABCDEDFG�
���������
�� ��
�����
�� ��
�����
�����
��� �
!"#�$�
���������� ��������������� �� �
���
9:o�@A�¸j�K:¥"#��:�z:
r@�ÅD��NÜplÄ�D�RÔPN�lH
º»¼�=VNP��$a'bc(,d%Yj
��K�lÔP#_YB�iTË#@#�@
�¼K�#�GQÔPjklH�129:Ô#�
R�º»¼�YB�iTD]R�N"D©y
�lÔPN���:"D���@¥"j#$a'
bc(,d%DÖ�N�l@#jk:�12
#<�9Ë#�R�WªNP��$AP�l
P�pl
�� ���* �¡��
ÂYZ�ÃÄ<�
�12j9k��¼�:%}�+&#'WD��¼��W#"D©yG=H%}&9()
*`+#qÑ,#ÒäjklH4��Nk��P-áÊ�:�`9.x��Y�]G�M/0#
HjklH��Á1NB�iTD]�=Y:©y5�:¥���9½º�Y�I��µÜ
Dµ³GZ[� 9]pl��Nk�=H$%&'()N�Q�9�GK:�`j@÷®$%#
e2äN�Q��N34ÇD]��QlH5Q�ÔPN�`9��67D����"j
�lYQ:�9"DÉ��K�lÖ�ä@klH%}&N9�8��¼"D���@¥«#"
j"#É�=$a'bc(,d%DG=QPQR9�Yk:&#!"|()¨(-Dº<�l
#P��=HÁ:�W#&ÔPÎW@A9�W#\Q9�N��W#Þ�#@PN¥�
G�QlH
¥��* �-.�
�12#M=l \9§« (�(�W#"D���!"|()¨(-Dº<�lÔP
NklY:�12j9Ë�Në¸àp�îï!"#.�-8$(÷-º<NWTDÜQ=H
�12#§«�9xYg¥ój�"@Î[��½º���Nkl=>:j�l)³tÎ
9j�Q_Y��GQHG�G�Y¼:¥ó� DÐuj�l!¾®ø%-P§«��W#
G_#�Æ�¾¿:�!:"À�Å�D0��lN9klòz#ÎPÀYuvjklHËÔj:
�12j9:M�#À!½Þ:D+���st��Ò8P%}&���:'W��# G_#
�ÆDj�l�¸0��l|()PG�:'W�Y³�;��Ql�mD³�´<)PG�
ÊfG=H�=:�jk��j9!½n[jn[\]Y]Ú�lY:;<G=Qn�³��Y$
(÷-§Ns���Ql;N9:Ë#n�klQ9��DÞ¦xPG�\]Ê�lÖ�ä
YyQHË#=>:�f`z#yQZ[Hf�³��D.�-8$(÷-PG�àp=H�
�ÚV:Õ#Ëc¬#.�-8$(÷-Dº<G=H
°s² Å=)#!¾®ø%-��Ò8�k��ÑæÑ�#RV#�æä��
°�² ¥¦�¾¿:�!P"À#@PP�l=>#'W��WYmQ=Ý>�Áød��
°�² Z[H:ê�#:Z[Húûå?f�³��#�Ò8�ä���:øÑän��
HI���������J�K��4LM�NOP'�
� � � � � � QHIR4S�TU�V�WXYZ
���������� ���������
���
�N:r@!";<N§·�l=>
N:gÀÁjAXG=r@!"$(÷-
#�çÎD��øNQ@G=.#:/:ð
G�#.�-8$(÷-Dº<G=HË#
&:'W�Yr@AVG³�Q�RN�
W#m���G#�ç@àp=HËG�:
!¾®ø%-D�L�l=>Nk��ÑæÑ
��¼ÑÄ�aÅP.�-8$(÷-NB
àG=H
��)*+,*�¡��
©y9C°DL$¤¢õ#'WD��;¢õ#®¯~�¯(��äj]Q:§«�#Î
E��DÅ�ÇNhF9GDJ�Y¼ÀZ�¸�]�=�HIó�H�12j9:ê��#
;<Yj��:�W#� �ÆD0�j�lPQR �#@PNîï!"N�Q��#Ëc
¬#.�-8$(÷-D©yG=Y:Î[�YÞ¦G�³�´�@b�X��lÎ#=>:+
S#�j@efÖ��;<!"DH<j�l�DXu�lÔPNG=H��ÚV:©yG=!
¾®ø%-�Ò8nN:'W�¥�YmQ=Ý>nN:æcþ�;Ú�:Mcþ�;Ú�#ø
c¬#|()¨(-Dº<G:Ë�å�#|()¨(-DfQ��jk��jgÀÁ#�çÞ
�z¢£jfQ=q�ç#;<G:;<U�#«±D]RPP@N�!�æÄÔ�¼<l¢£
ü!D]�=H
� �J�K!¾®ø%-��Ò8#�:
� �JæK'W��WYmQ=Ý>:
� �JÁK�J�P�Jæ#þ�;Ú�:
� �JøK�JÁNZ[Hê�#úûå?f�ò���Ò8DBàG=@#H
gÀÁ#r@;<!"¢£P;�_Ûjn[\]U�#«±D]�=U�:ñÎ\N9
�JæYz{n[Ã=P\]n[Ã=#LMY=Ê��=Y:ÌJü!#U�:�çÞ�z
9�J�K qq�äi:�JæK qÑ�øi:�JÁK äæ�4i:�JøKd4�øijk:�rz¢£j9Ñ��¢£j:
�J�Kæ�4:�JæKæ�ø:�JÁK Á�æ:�JøKÁ�æjk�=HI��:ÌJü!U��¼9�JÁP�JøY
�hG�QlPoplHʼN�JÁP�JøN�Q�9Z[f�ò���Ò8�#æÑ�D;<G:
ÅV¼#|()¨(-D���;<G=�#_Y��QÞ¦S«D|()¨(-�¼\]
j�l�D«±G=HË#U��Jø9Å��æ!½jîìÁ�Ñ:�JÁ9Å�q!½jîìæ�æjk
�=H��#U�D��p=�j:�12j9ÌJü!U�D�ÇG�:�JÁP�Jø9ê�
�#.�-8!";<N§·j�:��#�j@üµ \N9�JøYhG�QlPQRUëD
�=H
�ÅÆÇÈ*�É8ÊË��ÌÍÎÏÎÐÑÒ�\À�Ó¡�
�jk��P�Jø:r@!"|()¨(-Dàp�$a'bc(,d%úû,-./
HI������[\]�
� � � � QHIR4S�TU�V�WXYZ�
���������� ��������������� �� �
���
��sts��pko�DüµG=H�,-./9x#g¥ó�¼�¥�Yef�lÔPD�ÞG�:¼
�#Z[�³r@��Å�K�R@#D¯-8�¡NG:�)%yºj\]j�l�RN~�D
ðñG=H�,-./#~�è×~¯8D�Á�ø�q¡�H�,-./9�o�§·N@�õYÖ
�jk:��Tj9Ì�Q(9¶#i«pYj�lH�`�,-./9'W�YefG�
Y¼¢£D]��ê:^NT#~9®×-D�¸�Y¼yQ¢£D��QlH
�
��
�
�
�
�Ì ÔÕ�;&Ö×DØ GÙÚ%Û&$�Ü_x�ÅÆÇÈ*�É8ÊË��-.�EJ
��;b�
Ô��j#12j$J �ÔPYj��K�lWª#¥"Ë#@#jê�!";<,-.
/D¹��lÔPYÖ��ÔPYÚ��=H�ÁjAXG=$a'bc(,d%úû,-./#
üµj9xYg¥ó� (�(YklòzmnNVWj�l�RNyº~�D$´G=Y:
��+O#^NY���lHË#=>N9,-./#wm¹ºD (�(Y�R�u³�è
(îD¾¿�l¹ºNG:yº~�j9�Ç\N\]j�l�RNüµ�lÔPYuvjklH
ËÔj�12j9:Ë#ÔPD �NÚ´:Ë#úË��PG�:Î[�:�¼#N:§«
(�(#Z[� «±D]�=�HIÁ�ø�Á�H�u³�è(îD.�-8¨(-#�j� �
l#j9�K:� Ê�=P�#÷øo�@A@«±G=H
������������ ���
����������
�����^�� �_`ab�?cdef?�@ABCDEDFGghij�
HI ���������P�klmn�
QHIR4S�TU�V�WXYZ�
���������� ���������
���
«±9�::Î[�#Z[#� ]ÐDSOG:M�#� D�®(�l÷øo�@AD
"#G=3 ø¨éPË#�O¹ºDXYGEFG=HEFG=3 ø¨éPË#�O¹º
D�Á�ø�P¡�HË#g:§« (�(#Z[#� ]ÐDSOG:P� NEFG=3
ø¨éDëìG=HËG�:©ªG=�u³�è(îD�f`zN�«¬G:yº÷ªéN
ê¸l\]_ÛN�Q�XYG=H
«±#�j:Z[� #�j�K'�¸¼�l��Q0N�Q�:� ��D«±G=H«
±N9Óä �+Ô#� � #��¼�k(0:�RJ0:�9Q0PQR�Q:;ñïùï�
DfQ=H«±#U�:�Q#� ��N9:�RÞ0:�"S\0:�)(%Ð�0#ËcN¢
Ïj�=HË�¼P!�\�Æ�¾¿:ê�#H"À�P#DÞäN�Q�«±G=H�Á�ø�T
#�Y¡��RN:�RÞ0#��D���Q9�"S\0P�)(%Ð�0�@&ÒýYyK¢
��"j�"Ê��Q=H�=:g�Àc9�¦�XP"À#��PÊ�lUVWN�Q�9
�RÞ0#�Q�@yQ�ôD¡G:� ��P!�\�Æ#DÞäD¡�ÔPYj�=H
�����o����pqrst�
��� ������� ���
���
����
��
��
���
���� !"�#$%
&�
'()*+*
,�
-./012
3*45
67 89�:�;<=�>
?@ ABC@D
EF
GHIJKL
M*3*NO
PQRST
UVVW XY
Z[\]^"
_�`a bcd�&efgh
W)i
jkl m�ncop
qrC@D
���������� ��������������� �� �
���
���������� !"#$�%
�12N�:IJ#!";<µÜj9ºG��=ê�#�ç�¼r@��DsJ�!
"D;<�lÔPYj�=H�=:"Y�Ú�l��N¼��W#"D©yG�êKÔPN��
� �ÔPYj��K���@�ÄÛDfQ�¥"jQ�@#}�j$a'bc(,d%Yj�
lÔPYÚ��=H"Y�Ú�lÖ�ä#kl»B9�12#'W�Y����Qlk��D
stË#��K#¨·¸º»YklY:Ë��¸j9�K:ÚÛX³YZ¼�Í�Å@é´
¼�:Ô��¼#y[ì��NêQ�9:b(î@y�l@#P2Ú�lHËG�:¼�=V#
Å@\Qv�Y�¥«¼GQ$a'bc(,d%Yj�lÔP0jk:�12#<�9�ÊNÔ
#v�D(=�ÔPYj�:��¶#��z9yQH
fg#�12#,T'ñ�jklY:�345678#<�jkl�"#�@0DʼN"
#G:�y¶À�;<!"DÉWj�l�R���l^_D]��Q�=QH�=: (�(
NÖ£N,-./DEOj�l�R�XYPG�:(\Rz#�j�,-./D�$a'bc(
,d%úû�%0PG�´ÞG�@¼RÔP�ÅN]WG�Q�=QH�NÔ#´ÞN�Q�9:
è^_H�N`Q;Ú�=PÔ��¥"D©yG�Ql0PQRTj�"YÉ�Q0PQRÂúN
;áG�QPG�:´Þ9�¼��Q��RPQR}Ûj#�aYk�=Y:§« (�(Y
ef�l�Tj9�"9gÖ��½ºPQR��N*KG�:bcP@´ÞG�d�=QH
�Óé(3#fg#?=�l,TPG�9:�345678#<�D!"´µN@H�G
=$a'bc(,d%úû�%#12T�D]��Q�=QP�p�QlHwx\N9efg
h�Å#:� 9j�lYxYg¥ó�W#:Z[#� D´µG:��³�z:r@�Å
#÷øo�@A@Üp¼�l�R�!"´µDi}G:�ìj¶�Å�#�#�%Dyºj
�l,-./#¹�DXYG�QlH
�
[Figure: Distributions of duration (sec), F0 (Hz), NAQ (relative units), and power (dB) for "ah"/"un" tokens used as Affirmative, Reflective, and Turn-holding responses]
� ����u�vw�xqy���()*+�z{�
�
� �
&'(% )*+,-./�%
� +�ò� ��ò÷øo�ÑÒ&%Ó
�0�%��1�2345�67%
��¹ºÓé(3#�J# �9:��èZ�� �D�"N: GÄ#!"P�z:�
@VDU#�¸l0PQRe*\jklH�Y:Ô#e*ºâDwx\N�>lNk=��:õ
`+k�òĦ��#`iYæ`�lÔPY:345678#l£��j´µÊ�=H{%®(
X#mëDÉ��:Ô�¼#`i9Q:�@�J=JN9noj��Q:G�Gp�\�`i
jk:Ô�¼#`iD�×òq×G�e*D�>lÔP9:e*#ÀzP¬íäDár\�
Bjã�Ê�lÔPY�¼�N��=H�345678#�j��¹ºÓé(3YY«���
D�=G:IN������ ;<!"#T�N��Yl�R�e*D,T�l=>N:Ô
�¼#`iD|s×�::tG�Ô�¼#`iD�tG:Ë#�u#7DÊ�lÔP@:��¹
ºÓé(3# �PG���N[Ü�¸¼�lÔPN��=��Á�Ñ�ï�H
õ`+k�#`iP9:õ`#ô;ND�l`ijklH�ü#� !"N�Q�9!
"õ³:vwxµÜDfQ=�!Ç�y§PQ�=!"¤õ\�12Y�jk:�_:
GÄ#�z³�@VN�Q�9�fë:� «±:"+õPQ�=:$a'bc(,d%ë\�
12Y�"\jklHÔ�¼ó�#129:�vDØÙN�p��ÇÊ�:vDzÞ\N{�
��RP�l�0�vY�|() PG�´>¼�l�0�ÅRQRÔPD¡�q�¬í�p| D¡G
=ÔPN�l#�0PQ�=Â�\��p_Y¢�K6Qö��ê:ô;9x}j9�QHÊ
¼N:Ô�¼ó�#12DnNô;G=�¸j9:¢���~YH���G�RH GÄ#��
l!"N@: GÄYË#� Nñ>l�z:�@VN@\KDÚl9:#�Û\@A9:!
"¤õ\�12j@:$a'bc(,d%ë\�12j@:>K½�\��QG��¸�Q�¼
jklH�ü#e*ºâD�>�Y¼:!"¤õ\�12P$a'bc(,d%ë\�12D
s���:�Û\�@ADYM�Bj�Q�l78\��þ�D�p:¥¼#e*�þ�
D^NG�QKÔPY:��¹ºÓé(3# �PG�ðÞÊ�=H
���������� ��������������� �� �
���
� ��>�|}y~���t���LM�)�������}�����������_`ab�?cd���
�����LM�������)��:�����������W����r� ?��¡¢£��
��¤¥¦��q�����§¨��©Vª«¬���®¯LM�)°;��±W²³´µ�¶�
Ħ��#`iP9:|()©ªPe*NÝR:3ø×®,(³�§�ND�lÛ¿�#
`ijk:Ô�@:�345678D�]�l�j¢���ÍP��=H
Ô�¼#`iD�u�l=>N:æÄÄÁ�z�j9��w���12�0:æÄÄø�zN9+S¬
�j#��X�!"12�0DÏ1T�G:{%®(Xje*ºâ#����DAXG;Q:
mëDêÔ��=H
Ë#g#?,T�¼H��= �|Ø¥¦� Nê¸l!"P�z:�@V#§·D�Ù
NSO�l�j:��óT# �Y?GKH��=H
gïTH¥¦� Nê¸l!"P�z:�@V#§·De*�l�j:� GÄ9�z:�@
VD!"jC�ÄNÜpl0PQR[µ\�$a'bc(,d%�CYü9YM�@#j�QÔ
PYà�G=H�*#78\�e*�þ�#XYPÞÇÊ�lBj:IJ#�C#gËD�
R$a'bc(,d%�CDÊ�lÔPY:?=� �PG�ðÞÊ�=HÔ�9:� !�Np
�G=¯~é�-&(ý~78ë#¹�PoQ�plÔP@j�lH
góT9:gïTPÞÇ�lBj:� �Çø7)De*�lÔPjklH¥¦� Nê¸l
!�P� �zDSO�l�j: GÄPC�Ä#��\[���[:äÏ:�4:#��
(ÑéÊ�Å�P9ÏN:� �Çø7)#�KYuvgÖ�jklÔPYÚ��=Y:� �
Çø7)9Ô��jaPJÅSOÊ��Q�Q#j:Â�\�e*DêÔ�RuvYklH��
Á�Ñ�æ�
���������� ���������
���
���>�·}¸q�¹�xqº»r¼�}½q�R§¨�©Vª«��´¾©W¿À�Á�®¯lÂ)��
_`ab�?cd�ÃRÄÅ�V�´R�Æ�q��¾©�Ǹ)-v�ÈÉWÊÀË�xqº»
r¼��Èɦ²³ÌÍδµ�¶�
12¹�ü�#=>N��¹ºÓé(3Y¶�=��«¶Ø�!"P�z:�@VYÅ#�
RN§·G�Ql�0PQR`i:oQ�pl�¼�!�P� �z#§·DÅ@�KÊ+Ê�=
BjP¼plN9:Ë@Ë@!�PG�Å#�R�@#D´ÞG:� �zPG�Å#�R�@#D
´Þ�Z��0PQR`iDBfG=HË#õòj:!"¤õ\�12P$a'bc(,d%ë
\�12Dô;G:o�@A@YM�BjJñ>l�R�78\�e*�þ�DXYG:
¥¦�� |()#©ªNP@�RÛ¿�#`i#�uD9��=HʼN:|��¶#·
fÖ�äDXYG=H
�����89:;��%
��¹ºÓé(39:¥¦� Nê¸l!"P�z:�@V#§·DSO�l�j:MN
��¢T#Í'D�=HÔ�9:!"¤õò$a'bc(,d%ëòo�12DJñt78\
��þ�DXY�lPQRhiN§�l:���j#�ajklH
gïT9:¥¦� Nê¸l!"P�z:�@V#§·De*�l�j:� GÄ9�z:�
@VD!"jC�ÄNÜpl0PQR[µ\�$a'bc(,d%�CYü9YM�@#j9�
QPQRÔPjklHwx\N9��#PêjklØ°�² �N"ÀPDÚlQK��#!�9:
Ë@Ë@� GÄ9�z:�@VD!"jC�ÄNÜpl0PQR�CN����QHË�¼9
�C�Ķ#Ü���0j9�K�C�Ä#�jQ�êÔ�����lx!]Ç0jklH °��²
� GÄ9�z³�@VD!"jC�ÄNÜpl0PQR \ë\�$a'bc(,d%�C9
�ü#�¥��¥¦� � #e*NhÊ�QÔPY�ªklH=Ppq:n�D�!�N
���plPQR�«9:��R: GÄY���l@#j9�QY:U�PG�9���p_
N·��Ê����� �zY�@G�Ê�lH
gó#Í'9:¥¦� Nê¸l!"P�z:�@VDSO�l�j:�Û\�STYÔ�
���������� ��������������� �� �
���
�j�p¼��Q=��N?v�PQRÔPjklHÔ��j:¥¦� 9�Û#�¨0Nkl@
#P´µÊ���=Y:ü&N9�Û9¥¦� PiMÊ�=PÔ�Næ`�l@#j9�
QH×%8ª(,d%³"À�Å:Ê����!�Y:=PpqMÁ���Á�PQR�Û\F
GN·����DÙplPQRÔPYÚ��=H
gç#Í'9:� �Çø7)PQR:Ô��j� Ê��Q�Qv�#�'jklH¥¦�
Nê¸l!�P� �zDSO�l�j: GÄPC�Ä#��\[���[:äÏ:�
4:#��(ÑéÊ�Å�P9ÏN:� �Çø7)#�KYuvgÖ�jklÔPYÚ��
=H
gù#Í'9: GÄ#�z:�@VND�l@#jklH GÄ#�z:�@V9âã\�
@#j9�K:o��ìN���öQYklH=PpqZ��j9:�J\N9�G�PU#
�Q�Ql��"ÀY:r"D��&N@SOÊ�l�Å:�ì�¬#��-8ø.5(D
?×G=ÎÏ�fëDvüÊ�luvYklH�.0�/0�00�10#ù«)j GÄ#"@D
P¼pl�pD:!"$a'bc(,d%#�þ�Nhf�lÔPN9`iYklPoÚ�lD
��Q��Á�Ñ�Á�H
�Ï�>��}ÐÑÒÓ��Ô�Õr@Ö�«×Ø��Ù=�Ú��²³Û¶q��§¨�©VªRÜÝ
)�V�´R�Æ���ÐÑWÞßË஦µ�¶��Àák4�´R�4â)WRã�ä�0åæ®
Ë®�ç©ä�è¦���«���WVéêë �¶�
g¢#Í'9:SO#_ÛëN@DÚl@#jklHü!ä\�FGj�¼�=!"P:¥
¦�� !"P9:Ô��j�ÞÊ��Q=��Nµ�lÔPYÚ��=Hwx\N9:×%8
ª(,d%P~7�%8#DEjklHÔ��j:Z��j9×%8ª(,d%P~7�%89��
NU#DENk:�#~7�%8D×%8ª(,d%Y����ÔP9�QPQR#YÉ�jk
�=Y:Ô#É�9ü!ä\�FGj9�K5�9�l@##:¥¦�� !"N9u:G@
���������� ���������
���
5�9�¼�QÔPYÚ��=H GÄ#B�V#\ÊN���9:×%8ª(,d%Y~7�%
8D���G:ʼN;N���9:�N~7�%8Y×%8ª(,d%D����ÔP@k�
lÔPYÚ��=��Á�Ñ�ø�H
�Ï�ì�í}���[îï�&WÞ�¸q���*+}k4�´R�dÕð�?cd�ñ¼òdÕRvó
ôõ�z{Wµç���ñ¼òdÕ«�dÕð�?cd¦�©ö�÷�R�®�®¯øùR�úûü)�
îï´ýþ ���WR�ËR��¦����¸q��WRµËR�þ�®÷�¦µ�¶�
��¢T#Í'#�N:��¹ºÓé(3j9:¥¦�|()D©ª�l&N?¢��
ÍP�l3ø×®,(³�§��Å#Û¿`iD�u�l=>N:ý0�P#½��úË#
�jÈ�#Û¿mCDº<G:Ë�¼DfQ�×%��(Ñ%8Pm�nÔG�P'~D�ÚG
=H�=:Þ�°æÄÄø�²�ÅDÉ��:Ô�¼#<�D|�N·f�lÖ�äDXYG=H
���������� �����
��¹ºÓé(3#12N¬s�l12PG�9:°�²¢�l��¼#:Üô\�%�õ\
� GÔPq12:°��²�Ä��: Ý¡¢¼#:��õ\òW¬õ\�� «±:°���²£¤¥
+:¢¥¦:§¨¼#:~{¯�³��ÈD�"P�l:� #�#�Û12YklH°�²9!"
¤õò�Ûò-&(ý~78ëN�=Yl@#:°��²9-&(ý~78ëò$a'bc(,d%ëN�
=Yl@#:°���²9-&(ý~78ëP�ÛN�=Yl@#jklH��¹ºÓé(39:Ô�¼
#©12#<�DPÔ���:!"¤õò�Ûò-&(ý~78ëò$a'bc(,d%ë�j
D'��:78\��þ��!"�Û0D¹�G�RPG�QlH
fg9:Ô�¼#©12P#U#��DʼN\>:²;12DÉG��!"�Û0D�,Ê
�l�¸j�K:��ç�¶#·fDü��lzÞjklH
gï9:|��¶#·fjklH%�|�³Z��|�NêQ���:uväYªq�
�Y¼:����ü�G�Q�Q�!"o�0|�#â«ò+,ì#=>N:�#|¬Ye
fj�l�!"o�0#Â�<)Dº=QH�345678D|�N·f�lÖ�äN�Q
�9:XY#U�:�Y«Ö�jk:��uv0PQRUëD��Ql@#jklH
���������� ��������������� �� �
���
gó9:$õ�¶#·fjklHÅÔj iYÙÚ�=�: GÄPC�Ä#WXDE9Å
#�R�@#��Å:� #BD³t4�Ò8#T�N:Ô#12<�D�G=QH
gç9:�Í�úûjklHV®��Å:Ê����6@j"YÉ��QWNèÚ��:�@
VDÔ>�GDZl��#T�9:Ô#345678Y5l�¼�µÊ��Q=@#�Y:fg
@.�¦�Ô#�R�úûP#ETD9�:��\Niü�b(îD¢iNG�P�=QH
&'<%=>?@AB-./�%
?@CD�EFG�=>HIJK%
�0���89:;��%
Face-to-face communication carries both verbal and non-verbal information; the verbal part of the communication integrates linguistic and non-linguistic information, while the non-verbal part, built from the face, the body and the voice, integrates socially learned and automatic signs.
Since language is built precisely to express ideas, cognitions and affects, it can be expected that only a small number of events in speech are related to automatic expressions of affect (real emotions as defined in psychology, more or less inhibited depending on the culture) and that the vast majority of affective events in speech are built by language. Thus, from a "quantitative" view, expressive speech is a flow of continuously succeeding and superposed affective speech that only rarely integrates emotional voice in parallel. But even if modelling expressive speech first requires modelling the "linguistic/phonetic" expression of affects, the rare emotional expression events must be studied closely: the fact that the speaker could not inhibit them means that they reflect an especially intense and crucial affective change in the speaker's state.
The ICP participation in Crest was mainly aimed at designing dense multi-modal data, tools and methods in order to study how emotional expressions in the voice and affective expressions in speech are carried together in the acoustic flow, and how the different cognitive levels of affect can be modelled in speech, in its phonetic, prosodic and linguistic structure. The ICP study focused on the verbal acoustic material of expressive communication, but the corpora were carefully designed to represent face and voice together in a very precise timing relation, and to cover both verbal and non-verbal parts of interactions (especially "feeling of knowing" data).
The first step was to show that emotions simulated by actors are not confused by listeners with authentic emotions. This implies that expressive speech absolutely must be collected from authentic production. Nick Campbell's team developed in Crest an "extensive", systematic method to collect complete speech productions. As a complement to this approach, the ICP team adopted an "intensive" method, that is, to collect data that can be predicted as the expected affective reactions of the subject to an elaborate, hidden, controlled situation. The relevant points of such an approach are:
- to compare different subjects in the same affective and cognitive contexts;
- to separate emotional productions from other everyday affective productions;
- to study the "rare" emotional events;
- to obtain very high quality signals;
- to collect speech and face signals, augmented with articulatory signals (in order to verify the acoustic signal processing) and physiological signals;
- to "trap" actors into authentic productions before they act the same situation, in order to show, over a large emotional variation, that acted speech is discriminated by listeners from authentic speech.
1. A cognitive architecture for the different kinds of affective processing in the communication competences of a human agent
Affect species recorded with E-Wiz
Affects are expressed in speech through different cognitive levels, which surface as distinct kinds of affects (moods/emotions; intentions/attitudes; feelings) (Aubergé et al., 2003):
1.1 The automatic affective processing
This is the direct expression of variations in the speaker's emotional state, independently of the communication purpose. Our hypothesis is that this kind of expression, commonly described as speech expression, is involuntarily controlled by the speaker. Its time scale is not anchored in the space of linguistic events but in the space of the events that cause the emotions; these are external to the communication context (they can be related to it by feedback loops, but are nevertheless considered external in our view).
1.2 The voluntary affective processing
- Attitudes, i.e. the direct expression of the speaker's intentions, voluntarily given by the speaker in addition to the communication purpose and directly encoded as prosodic forms (Aubergé et al., 1997).
- Indirect expression of affects, or expressiveness, implemented as strategies for the instantiation of linguistic structures. It operates as a meta-control of the linguistic functions of prosody (choice of segmentation size, emphasis, focalization, etc.).
The expression stream is generated in parallel to the linguistic and meta-linguistic
stream. These two parallel time scales are however integrated in the same speech
(prosodic) material. This point is surely decisive in particular to discriminate the
communicative vs. para-communicative streams (corresponding for example to the push
and pull effects of the Scherer (2001) model).
Figure 1: The place of affective processing in the cognitive architecture of a communicant agent
The communication system, driven by communication goals, uses a set of functions
which are valued globally across the system. The system is a set of modules in an interactive organisation based on co-operation between the modules, typically in a multi-agent architecture: the specific constraints and degrees of freedom of each module can consequently be respected. The coherence of several modules, when they encode the same function values, is achieved by a rendezvous between the structures of the different agents for the same function value.
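A minimal sketch, with invented class and function names, of the rendezvous idea described above:

# Several modules encode the same communicative function; a "rendezvous"
# collects each module's proposed value for that function so the proposals
# can be reconciled into one coherent setting.
class Module:
    def __init__(self, name):
        self.name = name

    def propose(self, function):
        # Each real module would apply its own constraints and degrees of
        # freedom here; this stub simply returns a dummy value.
        return {"module": self.name, "function": function, "value": 0.5}

def rendezvous(modules, function):
    """Gather every module's value for one function so coherence can be enforced."""
    return [m.propose(function) for m in modules]

proposals = rendezvous([Module("prosody"), Module("lexicon"), Module("gesture")],
                       "emphasis")
print(proposals)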
2. E-Wiz: the « trick and talk » platform
The corpus collection is a key point of the experimental methodologies that are
currently used in expressive speech technologies. We have proposed a way to build
authentic corpora for the emotional level of affective speech. In preceding reports, we
have recalled the strengths and weaknesses of in vivo vs. in vitro methods for
expressive speech, and explained why it is necessary to record authentic corpora of
spontaneous speech. Since the ICP goal, inside Crest, is, in particular, to finely measure
the acoustic variations of speech expressions, we needed some methods to catch such
speech in laboratory conditions for “perfect” signal recording.
2.1. Comparable multi-speaker affective productions
Considering the three bootstrapped levels of affect expression, it was particularly important to collect the direct emotional expressions while freezing the variability of attitudes and expressiveness (ceteris paribus), and to collect the direct attitudinal expressions while freezing the expressiveness.
The ‘Wizard of Oz’ paradigm, widely used for the evaluation of multimodal interfaces, consists in a human partner, called the ‘wizard’, imitating the behavior of a complex person-machine interface. The subject believes that he is communicating with a computer, whereas the apparent behavior of the application is remote-controlled by the wizard. For the collection of an emotional speech corpus, the wizard perturbs the application’s normal behavior in order to induce emotional states in the subjects. Moreover, the paradigm makes it possible to control the phonetic and linguistic content through the use of a command language that constrains the subjects’ vocal expression.
The key point in developing such scenarios is to define applications that strongly motivate the subjects: their involvement is in fact a decisive factor in their reactions, positive or negative, to the perturbations.
E-Wiz is written in Java language with a client-server architecture (Aubergé et al.,
2003). It enables the user to design induction scenarios, without any particular
computer-science knowledge. The common frame of such scenarios is to simulate the
behavior of a human-machine communication system using voice recognition in order
to collect direct emotional expressions in speech. Indeed, the hidden wizard is given the
possibility to remote control the application, according to the so-called ‘vocal
commands’ produced by the speaker. The platform is subdivided into three separate
applications, including an editor dedicated to the design of scenarios. This editor
application aims at generating configuration scripts describing the whole behavior of the
client-server applications for a given scenario. Then, a server program running jointly
with a client program directly uses those scripts for the actual corpus recording.
Scenarios designed with this software can handle several types of multimedia data, such as texts, images or sounds. Images and texts can be moved by the wizard to produce a kind of slideshow on the client side. In order to facilitate the placement of objects on pages with the editor, particular effort has been put into providing a user-friendly interface. For instance, editing and word-processing functionalities have been implemented to enable an intuitive use of the application. Moreover, the task of
the wizard may be lightened by making the behavior of some objects automatic. For instance, sounds to be played may be linked to the opening of particular slides, and object moves may be processed on the client side so as to seem machine-produced. In addition, automatic countdowns, whose behavior when specific values are reached can be predefined, may also be integrated into the slides.
2.2. The Sound Teacher application
E-Wiz scenarios developed for the collection of emotional speech are all based on the same basic principle: subjects have to interact with the computer using a command language. The use of a strictly restricted lexicon enables us to collect different emotional expressions on the same words, in order to facilitate the acoustic analysis. The first scenario, Top Logic, based on logical IQ tests and presented in the preceding report, did not motivate the subjects enough and yielded a poor corpus.
Sound Teacher (see fig. 2) is presented to the subject as a piece of software enabling him to improve his phonetic learning of languages. The subjects were chosen so as to be strongly motivated by this task. The software is supposedly grounded in the neuropsychological findings of perception-action theory. It is based on the teaching of 4 vocal tract parameters (opening, front/back, lip rounding, centralization). The subjects are trained to recognize the parameter values when hearing vowels, and to produce them. The scenario is organized in four steps, from less to more difficult from the pretext-task point of view, and from positive to negative feedback for the Wizard of Oz task.
Figure 2: E-Wiz situation with Sound Teacher
The first step checks the subjects' skills in the production and perception of French vowels (the subjects being French). Artificially positive feedback is given to the subject, noticeably higher than a supposed average score of the other subjects. Then, the subject must
learn vowels close to the French vowel system. The feedback is given as higher than the five best performances of the preceding subjects. He is informed that his high score enables him to move on to a phase of generalization to complex vowels. There, the feedback suddenly becomes negative: the subject is given a score much lower than the average. He is warned that these results are abnormal and that his skills for vowels of the French phonological system have to be checked again, since the Sound Teacher software may have perturbed his competence. The last step is thus similar to the first one, but the audio stimuli have been modified so as to perceptively strongly decrease the vocalic contrasts, so that the subject cannot perform the task. He is given the lowest scores of the preceding subjects. Some comments are regularly asked of the subjects, taking a beta-version of the software as a pretext. See table 1 for a summary of the scenario.
Table 1: Sound Teacher scenario
Each recording session lasted around 50 minutes. For each session, the speech data consist of the command words 'next page' (in French) repeated 50 to 60 times, and of five monosyllabic words (chosen to avoid timing and long-term prosodic effects), distributed over the phonological space and repeated 11 to 50 times.
17 subjects have been recorded. Some of them are professional actors, tricked into spontaneous expressions like the others. For them, an extra protocol was used: immediately after having been trapped by Sound Teacher, those subjects were asked to reproduce, using actors' methods, the expressions of the emotional states they had encountered during the experiment. This task was performed both on the utterances used in the spontaneous part and on semantically neutral sentences.
The emotions expressed by the 17 subjects are close to what was expected in the pre-planned scenario: concentration, satisfaction, joy, relief, stress, anger, discouragement, boredom, anguish. It has to be noted that highly coherent groups of
reactions appear among subjects, surely linked to their psychological profiles. A first emotional labeling is done by the subject himself after the experiment: he is given a VHS video tape, as well as a pre-filled grid, with the task of describing the different emotional states he felt along the experiment. This labeling is being validated by perceptive tests, as is the labeling of the acted productions.
2.3. The experimental protocol
Subjects were recorded on DAT tape in a soundproof room, with an AKG C1000S microphone, for high-quality speech recording. Some reference measurements are kept in order to validate the nature, the intensity and the temporal location of the expressions of emotional variation:
� visual signal, that is mainly movements of the face and the upper part of the
subject’s body ;
� bio-physiological signals (heart rate, galvanic skin response, respiration,
temperature, electromyography recorded with the Pro-Comp equipment) ;
� the articulatory signals related to voice quality (for now only electroglottographic
signal, recorded thanks to the experimental platform EVA2).
These signals can be analysed in parallel with the perception measurements. They constitute the main indices of "emotional timing", used to determine the instants at which the prosodic movements qualifying the emotion expressions must be measured. Figure 3 describes the experimental protocol.
Figure 3: the experimental protocol
3. Prosody modelling
3.1. Gradual cues vs. contours characterization
Bänziger et al. (2003) return to the problem of the "emotion signature in intonational patterns". They recall that this idea, proposed earlier, has been discussed and tested in parallel with co-variation models implying gradual parameter variations, independently for F0 values and voice quality values. The main point to be decided is whether the affective vs. the other cognitive information carried by the prosodic signal is extracted/implemented following different morphological mechanisms. The notion of pattern, particularly the intonational pattern in which the emotion signature can possibly be implemented, depends on the adopted theoretical approach.
Our central hypothesis is that the perceptive separation between affective and linguistic treatments comes at the end of the prosodic treatment, and not just after the "parameter extraction", that is, before the morphological (phonological) treatment. In this view, the identification of affective vs. linguistic information is precisely derived from the prosodic morphology; the prosodic analysis can decide on the nature of the encoded function. Following this hypothesis implies that (1) a cognitively relevant model of prosody is the key to identifying the kind of processing (emotion vs. high-level cognition) through which the information is treated after the prosodic extraction, and (2) this model must be built following morphological laws that are basically the same for all linguistic and non-linguistic functions encoded in the prosodic signal.
3.2. Acoustic analysis
Word and phoneme labelling of the spontaneous and acted corpus was performed
thanks to the Praat software by a single expert. Additionally, Praat scripts were
developed to extract stimuli together with corresponding labels.
F0 contours were calculated on vowels only, located from an expert phonetic labeling. Values were extracted by means of EdiProso, a prosodic editor developed at ICP and running in a Matlab environment. The F0 extraction algorithm counts, after signal filtering, the number of times the signal crosses a predefined threshold downwards, set to 10% of the amplitude for this study. Smoothed F0 contours, averaged on 32 ms frames shifted by 10 ms, were calculated from the algorithm output. Flattened contours, plotted on ten points to enable comparisons of vowels independently of duration, were also extracted.
Vowel duration values were calculated from the phonetic labeling. Those values were converted from time units to a percentage of variation around the mean (intrinsic) duration of the same vowel in the corpus, thus enabling cross-vowel comparisons.
Attack and final frequency values were also extracted and used to calculate the declination line. In order to avoid the calculation errors that frequently occur at signal limits, the attack and final locations were shifted by 10% inside the vowel prior to the extraction of values.
emotional label.
Table 2 presents the general characteristics of the contours. It is to be noted that the neutral contour for the acted emotions and the "nothing" contour for the authentic emotions confirm the hypothesis of minimal intonation (reduced segmentation/hierarchisation, focalization): the attacks of both are at the same level as the speaker's basic vocalic F0 (which is the intonation reference in our intonation model; i.e. we define here, as an anchor point of contours, the F0 level as the difference between the attack and the mean F0), the shape of the contour is flat, and the declination line corresponds to the "normal" articulatory effort on such monosyllables.
Emotion | Valence | Arousal | F0 level (semitones) | F0 decl (semitones) | F0 dyn (semitones) | norm dur (%)
A anxiety N B 10 -1 1 -15,9
A deception N S 1 1 1,5 85,6
A disgust N B 3 0 1 142,0
A fear N B -4 6 6 14,5
A hot anger P B 15 3 3 29,2
A joy P B 11 0 1,5 16,2
A pos conc P S 10 -2 3 18,6
A pos surp N B -2 8 8 30,2
A weariness N S 8 1 1 -2,9
A sadness N B 10 -3 3 0,4
A satisfaction P S 21 -3 7 77,7
A worried N S 0 11 11 17,9
A neutral – – 0 0 0,5 1,2
anxiety/fear N B 2 7 7 -6,6
confidence P S 3 -5 6 23,4
joy/surprise P B -1 5 5 -12,6
weariness N S -3 2 2 -14,3
neg conc N S 2 3 3 -20,6
nothing – – 0 -2 2 -14,1
pos conc P S 1 -4 6 -1,3
Joy P B 1 5,5 5,5 -5,5
dec/surp P B -1,5 7,5 7,5 -26,6
anxiety N S 1 7,5 7,5 -7,5
Table 2: Characteristic values of contours. F0 level is the difference in semitones between the attack and the mean speaker F0; norm duration is the percentage deviation from the vowel's mean (intrinsic) duration. "A" before an emotion means acted emotion. N is negative and P positive valence; B is big and S small arousal, as evaluated by the speaker himself.
The general dynamics of the acted contours is lower (3,7 semitones) than the general dynamics of the authentic contours (5,2 semitones). The general F0 level of the acted contours is higher (6,4 semitones) than that of the authentic ones, which are on average near 0 (but with significant variation). The duration of vowels (minimal rhythm) is strongly higher for acted speech (32% vs. –8,6%).
Figure 4: Acted satisfaction apart, the contours of acted joy, anxiety, sadness and positive concentration are close in form to the neutral contour. (F0 contours in semitones; curves: acted joy, acted positive concentration, acted anxiety, acted sadness, acted satisfaction, acted neutral.)
Figure 5: Acted disgust, weariness and deception have no specific prominence, but do not follow the neutral basic declination line. (F0 contours in semitones; curves: acted disgust, acted weariness, acted deception, acted neutral.)
Figure 6: Acted fear, surprise and worried show a similar rise with a final prominence. (F0 contours in semitones; curves: acted fear, acted worried, acted positive surprise, acted neutral.)
Figure 7: Authentic deception/surprise, joy/surprise, anxiety/fear, anxiety and joy on one hand, and negative concentration and weariness on the other hand, share similar shape cues. (F0 contours in semitones; curves: weariness, deception/surprise, joy/surprise, anxiety/fear, anxiety, nothing, joy, negative concentration.)
Figure 8: Authentic confidence and positive concentration have a similar shape, with a prominence in the first third of the vowel. (F0 contours in semitones; curves: confidence, positive concentration, nothing.)
The differences in shape and in gradient cues (summarised in table 2) should be interpreted as significant for expressing some cues of the emotional/mental states as labeled by the speaker himself.
In parallel, we symbolized the kinds of contours very roughly into 9 classical classes of contours (/ ; /¯¯ ; /¯ ; /� ; _/ ; ¯¯ ; ¯� ; �/ ; �_ ; �). In parallel, we quantitatively measured other parameters: for F0, the mean, standard deviation, range, percentiles, min/max and jitter; for the other source parameters, the NAQ as well as 11 spectral parameters. The only clear effect emerging from the ANOVA calculations is the effect of the contour /¥ on NAQ, jitter and spectral slope. But the choice of symbolic classes of contours is, first, not univocal to define from the dynamics of the shape and, second, may well not be these classical symbols: the symbolism must include some cues which can be observed in the preceding figures (4 to 8). In particular, the relevance of the place and threshold of the glissando, psycho-acoustically validated but irrelevant for linguistic prosody, could be evaluated for emotional prosody values, in particular when the timing is not linked to linguistic units.
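As an illustration of this kind of calculation, here is a minimal sketch (invented values; the exact jitter definition and contour classes used in the study are not reproduced here) of a one-way ANOVA testing the effect of the symbolic contour class on NAQ:

    import numpy as np
    from scipy.stats import f_oneway

    def local_jitter(periods):
        """Mean absolute difference between consecutive periods, relative to the
        mean period (one common local jitter definition)."""
        periods = np.asarray(periods, dtype=float)
        return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

    def contour_class_effect(values_by_class):
        """One-way ANOVA of an acoustic parameter (e.g. NAQ) grouped by the
        symbolic contour class assigned to each stimulus."""
        groups = [np.asarray(v, dtype=float) for v in values_by_class.values()]
        return f_oneway(*groups)

    # Example with made-up NAQ values for three contour classes
    naq_by_class = {"rise": [0.12, 0.14, 0.13, 0.12],
                    "fall": [0.11, 0.10, 0.12, 0.11],
                    "rise-fall": [0.22, 0.25, 0.21, 0.24]}
    print(contour_class_effect(naq_by_class))
    print(local_jitter([0.0050, 0.0052, 0.0049, 0.0051]))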
3.3. Analysis of the NAQ prosodic vs. phonemic variation
Two speakers were selected on the basis of their clear emotional and comparable productions. After the segmentation of the interesting stimuli from the raw corpus, the phonetic labeling was performed by an expert. Numerous productions, by those two speakers, of words supposed to be monosyllabic revealed the presence of an unexpected final schwa, making those words disyllabic. Schwas were therefore also included in the analyses, as well as the other vowels.
Acoustic analyses, implemented as Matlab routines, were carried out for every stimulus of the corpus. Fundamental frequency and intensity were estimated thanks to algorithms developed at ICP, and were used to calculate numerous distribution parameters: mean, standard deviation, jitter, shimmer, range and percentiles, as well as modeled F0 contours. Moreover, spectral analyses were implemented to calculate the spectral slope, the Hammarberg index and the average long-term voiced spectrum over 9 frequency bands. Finally, the durations of phonemes and syllables were calculated from the phonetic labeling.
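As one example of these spectral measures, the Hammarberg index is usually defined as the difference between the spectral energy maxima of the 0-2 kHz and 2-5 kHz bands; a minimal sketch of the standard definition (not necessarily the exact implementation used here):

    import numpy as np

    def hammarberg_index(frame, sr):
        """Difference (dB) between the spectral maximum in the 0-2 kHz band and
        the maximum in the 2-5 kHz band, computed on one windowed frame."""
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        db = 20.0 * np.log10(spectrum + 1e-12)
        low_max = db[(freqs >= 0.0) & (freqs < 2000.0)].max()
        high_max = db[(freqs >= 2000.0) & (freqs < 5000.0)].max()
        return low_max - high_max

    # Example on a 32 ms frame of a synthetic voiced-like signal at 16 kHz
    sr = 16000
    t = np.arange(int(0.032 * sr)) / sr
    frame = np.sin(2 * np.pi * 150.0 * t) + 0.2 * np.sin(2 * np.pi * 2500.0 * t)
    print(hammarberg_index(frame, sr))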
Amplitude-based parameters have been suggested to provide a more robust method
than time-based parameters for analyzing voice quality. The most widely used among
them is the Normalized Amplitude Quotient proposed by Alku et al (2003). NAQ can be
considered as a normalization of the "declination time", defined as

    NAQ = (UP / EE) × F0

where UP is the peak-to-peak amplitude of the glottal flow, −EE is the value of the negative peak of the glottal flow derivative and F0 is the fundamental frequency.
Automatic calculation of the Normalized Amplitude Quotient was performed thanks to the algorithm developed by Parham Mokhtari in the Nick Campbell CREST ESP research group. This algorithm calculates NAQ from the speech signal on automatically detected syllabic reliability centers. This enables a fully automated extraction of NAQ values, thus providing a measurement of voice quality on unlabelled spontaneous speech.
Gobl and Ní Chasaide (2003) have proposed to extend amplitude-based parameters to
the estimation of time-based parameters. Therefore, the open phase of the glottal pulse
can be estimated by

    T1A = (π / 2) × (UP / EI) + UP / EE

where EI is the value of the positive peak of the glottal flow derivative. π·UP/(2·EI) is considered as an estimation of the glottal flow opening-phase duration, and UP/EE corresponds to the closing-phase duration. Therefore, OQ is estimated by T1A × F0. The same algorithm was also used to implement the
calculation of Open Quotient from amplitude domain OQA. Moreover, the estimation of
F0 performed by the algorithm at every detected reliability center was extracted in order
to be compared to other estimations of pitch.
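A minimal sketch of these amplitude-domain voice-quality measures, following the formulas above (the UP, EE and EI values in the example are purely illustrative):

    import math

    def naq(up, ee, f0):
        """Normalized Amplitude Quotient (Alku et al., 2003): the closing-phase
        estimate UP/EE normalised by the fundamental period, i.e. multiplied by F0."""
        return (up / ee) * f0

    def open_phase_t1a(up, ee, ei):
        """Amplitude-based estimate of the glottal open phase (Gobl & Ni Chasaide,
        2003): opening phase ~ pi*UP/(2*EI) plus closing phase ~ UP/EE."""
        return (math.pi / 2.0) * (up / ei) + up / ee

    def open_quotient(up, ee, ei, f0):
        """Open quotient estimated from the amplitude domain: OQ ~ T1A * F0."""
        return open_phase_t1a(up, ee, ei) * f0

    # Illustrative values (arbitrary flow units): a modal-voice order of magnitude
    up, ee, ei, f0 = 0.5, 400.0, 600.0, 120.0
    print(round(naq(up, ee, f0), 3))              # 0.15
    print(round(open_quotient(up, ee, ei, f0), 3))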
Electroglottography is a measurement of impedance and gives information on the area of vocal fold contact. F0EGG can be reliably estimated from the EGG signal. Henrich (2002) proposes an autocorrelation method between the EGG signal and its derivative for the estimation of the duration of the glottal pulse open phase T1EGG and of the EGG Open Quotient (OQEGG).
When calculated from unlabeled continuous speech, NAQ is available only on reliability centers, i.e. vocoids as defined by Mokhtari [12]. Therefore, the locations of these reliability centers were also extracted and matched against the expert phonetic labeling of the corpus to ensure that the detected segments are actual vocoids. Table 3 presents the repartition of the reliability centers according to the phonemic labels. 68% of them are found in vowels, and 15% in sonorants. Apart from vowels, the nasal consonant [n] is often detected as a reliability center, and will hence be taken into account in further analyses.
i | e | a | o | u | schwa | nasal o | n | others
9.4 | 11.6 | 14.7 | 7.3 | 8.8 | 3.0 | 13.2 | 8.3 | 23.7
Table 3: Repartition (%) of the reliability centers according to phonemic labels.
Table 4 shows the mean values and confidence intervals of NAQ for each phoneme. NAQ ranges from 0.07 to 0.32, which is to be compared with the Alku et al. (2003) results obtained from five male speakers: pressed (0.08-0.11), modal (0.11-0.17) and breathy (0.23-0.35). Mean values of NAQ seem to be higher for higher oral vowels; however, this tendency is not significant. The schwa shows a higher NAQ. This trend is due to a clearly bimodal repartition of NAQ values. Speaker 1 adds a schwa at word endings with a high F0 and a high NAQ (0.28), which corresponds to a breathy voice. Speaker 2 produces schwas with a modal voice: NAQ values are about 0.12, as for [e].
The choice of producing or not a final schwa seems to reveal a speaker-strategy related
to speech-act expressive values. The nasal vowel [o] shows NAQ values similar to high
vowel ones. The nasal consonant [n] has NAQ values about 0.19, which can be
interpreted as a breathy voice. All differences are significant except between [n] and [e].
However, it seems unrealistic that the phoneme [n] in “Jean” is always produced with a
breathy voice, whereas the vowel [o] is not. This might be due to its final position, but
high NAQ values are also measured when [n] is followed by a schwa. A possible
explanation is that nasality produces mainly low frequencies, thus attenuating higher
frequencies and increasing the spectral slope. Both nasality and breathiness acoustically
correspond to an increase in the spectral slope induced by supra-laryngeal settings for
nasality and laryngeal settings for breathiness.
Table 4: Mean values and confidence intervals (p < 0.01) of NAQ for each phoneme
4. Perceptive validation of the corpus
4.1. The experimental protocol
In order to validate the emotional expressions collected through such a paradigm, a perceptive validation has to be carried out. It first has to validate the acted emotions: the "big six", and the emotions reported by the speaker himself. The results of this test give a first map of what listeners can efficiently perceive, and of which kinds of emotions cannot be differentiated. Then, the spontaneous data can be evaluated on a pre-tuned set of emotional categories. This section presents the results of the first step of the evaluation: the analysis of the acted emotional expressions.
Subjects
26 subjects participated in this experiment: 4 males and 22 females, aged from 19 to 45 years (25 years on average).
Figure 9: screenshot of the perception test answer page, showing the 14 emotional scales
The sentences proposed to the listeners were extracted from the recordings of one actor of the corpus described above. There are two reasons for that: first, the acoustic analyses made on the corpus are highly speaker-dependent; second, the set of spontaneous emotions reported by this speaker is quite broad.
Then, two expert listeners rated all his productions, in order to select only the best-acted performances and to restrict the corpus for the listening test. To rate each stimulus, the judges listened to each stimulus in a random order, first in audio only and then in the audio-video version, and gave each one a grade from 1 (very bad) to 4 (very good). Only the stimuli with a grade of 3 or 4 were kept for the test. Then, a subset of these stimuli was extracted according to the following criteria:
- Stimuli were selected so as to propose a systematic variation of their length. For each emotion, one stimulus is proposed with each of the following lengths: 1, 3, 5 and 7 syllables. This was done in order to test whether the length influences the perception of emotional expressions.
- One stimulus (the "page suivante" sentence) was selected to represent all the emotional variations, in order to test all emotional expressions on exactly the same linguistic structure.
- The 14 acted emotional expressions were tested: both the "big six" and the 8 reported at the end of the spontaneous phase, i.e.: amusement, anger, anxiety, deception, disgust, expectancy, fear, happiness, neutral, resignation, sadness, satisfaction, surprise, worried.
This gives 70 different stimuli, presented to listeners both in an audio and in an audio-visual modality (resulting in 140 different stimuli).
The perception test was carried out in a quiet room, using a computer to play the
stimuli and to record the answers. Subjects listened to the stimuli via headphones, at a
comfortable hearing level.
They first heard the audio-only stimuli, mixed in a random order (controlled in order to avoid the successive presentation of the same sentence) and different for each listener. Then, they perceived the audio-video stimuli, in a different random order. They always heard the audio-only stimuli before the audio-video ones, because the audio-video stimuli are used only as validation stimuli, to check whether the audio expressions match the facial ones.
When the listener heard a stimulus, he had to rate the perceived intensity of the emotional expression for each of the fourteen labels proposed, on a scale from 0 (the emotion was not perceived) to 10 (the emotion is very intense). In order to give his answer, he had to use a set of 14 sliders corresponding to the emotions (cf. fig. 9). Stimuli could only be heard once, and listeners were told to give their answers as spontaneously as they could.
4.2. Results
The results of this perception test were compared among the 26 listeners, in order to ensure the coherence of their answers. The correlation between their answers for each stimulus was calculated for all pairs of listeners: all are significantly correlated, with p<.05. Once the inter-listener coherence was checked, the overall dispersion matrices for the audio-only and audiovisual conditions were calculated (cf. fig. 10 & 11).
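The coherence check and the dispersion matrices can be computed along the following lines (a minimal sketch with invented ratings; variable names are hypothetical):

    import numpy as np
    from itertools import combinations
    from scipy.stats import pearsonr

    def interlistener_correlations(ratings):
        """Pearson correlation between the flattened answer vectors (stimuli x
        labels) of every pair of listeners, to check answer coherence."""
        return {(a, b): pearsonr(ratings[a].ravel(), ratings[b].ravel())
                for a, b in combinations(sorted(ratings), 2)}

    def dispersion_matrix(ratings, input_emotions):
        """Mean perceived intensity of each answer label per input emotion,
        averaged over listeners and over the stimuli of that input emotion."""
        mean_over_listeners = np.mean([ratings[l] for l in ratings], axis=0)
        rows = sorted(set(input_emotions))
        matrix = np.array([mean_over_listeners[
            [i for i, e in enumerate(input_emotions) if e == r]].mean(axis=0)
            for r in rows])
        return matrix, rows

    # Example: 2 listeners, 6 stimuli, 14 answer labels
    rng = np.random.default_rng(0)
    ratings = {"L01": rng.uniform(0, 10, (6, 14)), "L02": rng.uniform(0, 10, (6, 14))}
    input_emotions = ["anger", "anger", "joy", "joy", "neutral", "neutral"]
    print(interlistener_correlations(ratings))
    matrix, rows = dispersion_matrix(ratings, input_emotions)
    print(rows, matrix.shape)                      # 3 input emotions x 14 labels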
4.3. General analysis
These two dispersion matrices reflect the first (and expected) result of this perception test: results in the audiovisual condition are always equal to or better than those in the audio-only condition, but the two conditions are quite coherent. A first analysis of the
differences between the two conditions shows that:
- Disgust seems difficult to recognize in the audio-only condition, whereas the audio-visual condition is extremely efficient. These findings are completely in line with the conclusions of Scherer (2003) and Juslin & Laukka (2003) about disgust. However, we noticed an important difference in the listeners' ability to acoustically perceive disgust: about one half of them rated acoustic disgust as efficiently as audio-visual disgust, whereas the other half did not perceive acoustic disgust.
- Listeners did not use some categories: "expectancy" is recognized as "neutral" and "deception" is recognized as "resignation". For these two emotions, the face does not give more efficient information.
- Listeners in the audio-only condition mainly use the neutral category when they don't "understand" the emotional expression. This happens for emotions with a low activation, such as "expectancy", "happiness" (the happiness played by this speaker is a very low-activation one), and "resignation".
- "Anxiety", "worried" and "fear" are mixed together, and listeners could hardly make differences between them, in both conditions. There are also some confusions between "amusement", "happiness" and "satisfaction", but not systematically: "amusement" is discriminated in the audiovisual condition, whereas "satisfaction" is mixed between "happiness" and "satisfaction". As already said, "happiness" is reported as "neutral" in the audio-only condition, but it is distributed between "amusement", "happiness" and "satisfaction" in the audiovisual one.
The best recognized acoustic emotional expressions are "amusement" (even if it is mixed with happiness), "anxiety" (mixed with "worried" and "fear"), "anger", "neutral", "satisfaction" (mixed with happiness) and "surprise".
Figure 10: Dispersion matrix of the audio-only condition. The rows show input emotions, and the columns the mean answer of listeners. The intensity of the grayscale filling each square reflects the perceived intensity of each emotion.
Figure 11: Dispersion matrix of the audiovisual condition. The rows show input emotions, and the columns the mean answer of listeners. The intensity of the grayscale filling each square reflects the perceived intensity of each emotion.
In order to analyse the results of this experiment more precisely, we group together the results of the different emotions that were not distinguished by listeners, in order to extract the cognitively pertinent classes of vocal expression of emotion. The results concerning the influence of the stimulus length on the emotional expressions are then presented.
Figure 12: Dispersion matrix for the 8 new categories obtained after grouping together the non-pertinent ones. The rows show input emotions, and the columns the mean answer of listeners. The intensity of the grayscale filling each square reflects the perceived intensity of each emotion.
In order to have a more precise view of the relevant emotional expressions produced by this speaker, we have grouped 9 emotional labels into 3 new and more general labels, and exchanged the answers given in two categories ("Deception" and "Resignation"):
- "Fear", "Anxiety" and "Worried" are grouped together into a general "Fear" category.
- "Amusement", "Happiness" and "Satisfaction" are regrouped into the "Joy" category.
- "Neutral", "Expectancy" and "Deception" are grouped into a global "Neutral" category.
This results in a new dispersion matrix, with 8 emotional categories (cf. figure 12). The
perceptive distinction for these categories is quite good. Thus, this set of labels, and the grouping made to obtain it, is very important in the perspective of evaluating the spontaneous data.
Influence of the stimulus length on the perception of emotion
The last factor to be analysed is the effect of the stimulus length on the perception of emotional expression. To obtain this information, we grouped the answers obtained for all stimuli of a given length, resulting in 4 length groups (1, 3, 5 and 7 syllables). For each length group, the average intensity given to each of the 14 emotional labels was calculated, in order to test whether the listeners' answers differ from one length to another (cf. figure 10).
The correlations between the results of the four stimulus lengths were calculated, and all are significant (p<0.05), indicating that, for each emotional label, the length of the stimuli does not change the answers.
4.4. Summary of perceptive analysis
These results are conceived as a first sorting of the collected data, as the expression of emotion raises a lot of very basic questions, such as (1) the ability of humans to act an emotion, or to perceive the difference between acted and spontaneous speech; (2) the cognitive pertinence of each emotion label, in one language, in several languages, or even across different cultures; or (3) the relation between an emotion and its expression in speech (e.g. what is the intelligibility of the acoustic contours for each emotional function).
This experiment deals with the second question, by pointing out label groupings and by rating the relative efficiency of labels and acted productions. It could also bring some information to question 3, by comparing the acoustic analysis and the listeners' answers. Moreover, these first results underline that the length of the stimuli does not change the ability of listeners to rate the emotional expressions.
5. The Japanese attitudes
The attitudes are directly encoded into prosodic contours and cover a very large spectrum of affects. They can be related as much to the "Belief, Desire and Intention" premises of the theory of interaction dialog as to the linguistic or pragmatic features defined as intentions. Attitudes are not completely learned in the development of the child before 7 years (Clément, 97); some expressions of attitudes seem universal (such as the surprise value), some are completely different from one language to another, and some attitude values (that is, the social concepts represented) are specific to some languages. We studied how common Japanese attitudes can be perceived and interpreted by French listeners who are naïve in Japanese. The studied attitudes have been chosen as relevant in the literature (Schoshi, 04):
- doubt
- evidence
- surprise
- authority
- irritation
- admiration
- arrogant / impoliteness
- serious/sincerity
- politeness/kyoshuku
- politeness
- declaration
- question
In order to study the possible interference of lexical stress with the French perception, the corpus was built with lexical stress distributed over utterances of varying length, as shown in figure 14.
Figure 14: The linguistic corpus structure
No. | Stress position / syllable count | Phrase
1 | 1 / 1 | Me
2 | 1 / 2 | Nara
3 | 1 / 5 (3+2) | Narade neru
4 | 1 / 7 (4+3) | Nagoyade nomimas
5 | 2 / 7 (4+3) | Narashide nomimas
6 | 3 / 7 (4+3) | Matsuride nomimas
7 | 0 / 7 (4+3) | Naniwade nomimas
The corpus was recorded in a quiet room by a Japanese teacher.
5.1. Validation of the corpus on Japanese listeners
15 Japanese listeners took a perception test organised as a closed forced choice. Table 5 shows that each attitude was identified with a score significantly above chance (chance = 100/12, i.e. about 8.3%).
Attitude | χ² (df = 11, p < .001)
Admiration | 111.3
Arrogant/impoliteness | 579.0
Authority | 293.1
Declaration | 478.0
Doubt | 353.0
Evidence | 249.4
Surprise | 392.9
Irritation | 838.2
Kyoshuku | 127.3
Politeness | 453.7
Question | 657.9
Sincerity/serious | 176.5
Table 5: Japanese listeners. Results of the χ² test on the mean answers per attitude (all lengths and stress positions), compared to chance.
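Such χ² values can be reproduced from the raw answer counts of each input attitude with a test against the uniform chance distribution over the 12 response categories; a minimal sketch (the counts in the example are invented):

    import numpy as np
    from scipy.stats import chisquare

    def chi2_against_chance(answer_counts):
        """Chi-square test of the answer distribution for one input attitude
        against uniform chance over the 12 response categories (df = 11)."""
        counts = np.asarray(answer_counts, dtype=float)
        return chisquare(counts)                  # expected defaults to uniform

    # Example: 105 answers largely concentrated on the correct category
    stat, p = chi2_against_chance([90, 2, 1, 0, 3, 0, 2, 1, 0, 4, 1, 1])
    print(round(stat, 1), p < 0.001)              # large statistic, p < .001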
Figure 15: Identification scores for Japanese listeners, per attitude (declaration, simple question, evidence, irritation, arrogance, authority, doubt/incredulity, surprise, admiration, politeness, sincerity/serious, kyoshuku), on a 0-90% scale.
Percentages of answers, per input attitude (columns) and answered attitude (rows)
Input: AD AR AU DC DO EV SU IR KYO PO QS SIN
AD 21,9% 0,0% 0,0% 0,0% 1,9% 0,0% 1,0% 0,0% 1,9% 4,8% 0,0% 3,8%
AR 1,9% 72,4% 5,7% 10,5% 5,7% 21,0% 0,0% 2,9% 1,9% 0,0% 1,0% 0,0%
AU 0,0% 9,5% 51,4% 2,9% 1,0% 9,5% 1,0% 11,4% 15,2% 0,0% 1,9% 1,0%
DC 4,8% 5,7% 10,5% 65,7% 0,0% 12,4% 0,0% 0,0% 2,9% 8,6% 1,9% 9,5%
DO 0,0% 0,0% 0,0% 0,0% 56,2% 0,0% 14,3% 0,0% 0,0% 1,0% 3,8% 0,0%
EV 8,6% 5,7% 17,1% 7,6% 0,0% 45,7% 5,7% 0,0% 10,5% 1,9% 1,0% 3,8%
SU 9,5% 0,0% 0,0% 0,0% 14,3% 4,8% 59,0% 0,0% 1,0% 1,9% 1,0% 1,9%
IR 1,0% 5,7% 4,8% 1,0% 13,3% 2,9% 4,8% 85,7% 12,4% 0,0% 1,0% 1,0%
KYO 11,4% 0,0% 0,0% 0,0% 0,0% 0,0% 0,0% 0,0% 24,8% 9,5% 1,0% 27,6%
PO 26,7% 1,0% 4,8% 11,4% 0,0% 1,0% 0,0% 0,0% 2,9% 64,8% 8,6% 17,1%
QS 0,0% 0,0% 0,0% 0,0% 7,6% 2,9% 14,3% 0,0% 0,0% 1,0% 77,1% 1,9%
SIN 14,3% 0,0% 5,7% 1,0% 0,0% 0,0% 0,0% 0,0% 26,7% 6,7% 1,9% 32,4%
Figure 16: Confusion matrix, Japanese listeners. Rows give the answered attitudes, columns the input attitudes. AD (admiration), AR (arrogant/impoliteness), AU (authority), DC (declaration), DO (doubt), EV (evidence), SU (surprise), IR (irritation), KYO (kyoshuku), PO (politeness), QS (question) and SIN (sincerity/serious).
- Very well recognised attitudes: irritation, question, arrogance, declaration, politeness, surprise and doubt.
- Well recognised attitudes: authority, evidence, sincerity.
- Less well recognised attitudes: kyoshuku and admiration.
5.2. Perception of Japanese by French subjects
15 French listeners took the same perception test. Table 6 shows the scores compared to chance: the French subjects could not discriminate all of the attitudes.
Attitude | χ² (df = 11, p < .001)
Admiration | 120.9
Arrogant/impoliteness | 117.3
Authority | 556.4
Declaration | 291.9
Doubt | 120.3
Evidence | 112.3
Surprise | 362.3
Irritation | 467.7
Kyoshuku | 273.6
Politeness | 135.8
Question | 149.3
Sincerity/serious | 59.0
Table 6: French listeners. Results of the χ² test on the mean answers per attitude (all lengths and stress positions), compared to chance.
Figure 17: Identification scores of French vs. Japanese listeners, per attitude (in %).
Results show that most attitudes (authority, irritation, surprise, declaration, admiration, politeness, question and evidence) were identified above chance. However, the global score of the French listeners was lower than that of the Japanese listeners (35% for French vs. 55% for Japanese).
Percentages of answers, per input attitude (columns) and answered attitude (rows); chance level: 8,3% per cell
Input: AD AR AU DC DO EV SU IR KYO PO QS SIN
AD 32,4% 0,0% 0,0% 0,0% 9,5% 0,0% 1,0% 1,9% 0,0% 3,8% 1,0% 1,9%
AR 1,0% 19,0% 14,3% 0,0% 9,5% 11,4% 0,0% 9,5% 28,6% 0,0% 0,0% 5,7%
AU 0,0% 23,8% 70,5% 2,9% 0,0% 7,6% 0,0% 9,5% 20,0% 0,0% 0,0% 8,6%
DC 0,0% 21,9% 7,6% 50,5% 1,9% 21,9% 1,9% 0,0% 1,0% 15,2% 21,9% 14,3%
DO 6,7% 2,9% 1,0% 1,0% 25,7% 2,9% 21,9% 4,8% 4,8% 1,0% 3,8% 7,6%
EV 10,5% 16,2% 1,9% 19,0% 1,9% 28,6% 0,0% 1,9% 2,9% 7,6% 15,2% 5,7%
SU 16,2% 0,0% 0,0% 2,9% 26,7% 2,9% 52,4% 1,9% 0,0% 5,7% 1,0% 1,9%
IR 0,0% 7,6% 3,8% 0,0% 0,0% 0,0% 0,0% 65,7% 41,9% 0,0% 0,0% 1,0%
KYO 7,6% 1,9% 1,0% 1,9% 7,6% 3,8% 0,0% 0,0% 0,0% 12,4% 3,8% 17,1%
PO 11,4% 3,8% 0,0% 5,7% 1,0% 4,8% 1,0% 1,9% 0,0% 32,4% 12,4% 19,0%
QS 2,9% 0,0% 0,0% 2,9% 12,4% 2,9% 21,9% 0,0% 0,0% 2,9% 32,4% 1,9%
SIN 11,4% 2,9% 0,0% 13,3% 3,8% 13,3% 0,0% 2,9% 1,0% 19,0% 8,6% 15,2%
Figure 18: Confusion matrix, French listeners. Rows give the answered attitudes, columns the input attitudes (same abbreviations as in figure 16).
- Very well recognised: authority and irritation. Moreover, these attitudes show no confusion with the others.
- Well recognised: surprise, declaration, admiration, politeness, question and evidence.
- Poorly recognised attitudes: arrogance, doubt and sincerity.
- Not recognised: kyoshuku.
It is not surprising that the three politeness-related attitudes were especially difficult for the French listeners, since these social concepts are not conventionally encoded in French (sincerity 15%, kyoshuku 0%).
The negative attitudes authority and irritation are well interpreted by the French; authority is even better recognised by the French (70%) than by the Japanese (51%). On the contrary, arrogance/impoliteness is poorly recognised by the French (12%) and very well by the Japanese (72%). The analysis of the prosodic contours shows that this is not due to a difference in the prosodic encoding of arrogance, but surely to the social distance between the interpretations of arrogance in the two cultures.
�
���������� �����
���������� ��������������� ��� !��"#$%&'()
*�+�,-.�/0123�43"�56789:;<=1>�?@��?ABCDE�F#GHI+�
�:� ��,-.�/0123�43:����JK�L�M:9�1NOPQRAST
UVWXYZ�L[\��]^:_`��Da�FbY��cMdef:gh�Tijkl
=LDmghDE�noLpEqjN�r�sUmtu�OPvwNPxyzUVW{��
|}~N�rcMde:���m�bY���b]^Wgh}���Lqj��*1��671
9@UV��]^��f:s��1�]^:�OLZ��:��Da�F�
� ���56789����L�T:��D� ��F��:��'#��������_
`]^���+���:� $%(#$ ¡¢£¢¤¢� ¥¦� §¨� %©ªª¤ £«¨¢£© � (¨¬§¦+�k*�®1¯:� �%°
#�¬£ £¢±� %©§§¦²¦� °¤³§£ +?:´���Da�F��'D�µ:¶T:·¸WST���� ¹
º»¼�)½e¾\»/¿�ÀÁ»��eÃ:ÄhLZ��:ÅÆÇ�WÈÉ�F� ��
]^:Ê.6Ë��LTr�ÌÂ���W´��L�É@A��ÁÂ�ÄhN�59)*�
:���_��L��'DÍ�F$%(D��OPQR:ÎÏ�Ð���TijM:ÑÒ�OP
QR@Ó:ÔÕ?ÎÏUV[\N.¼�ÖWÍ�F�%°D��M×LØ�ÔÕ?Ù:�Ú[
\]^Nk�Û�ÜÀ:��WÍ�F�
�:��Lgh��cM¼�)�ÝÞ�ß:��àPcMDÙá:âãâ�Nä
Ö�åæWç}�´�}èPxvwNyzUVWéê��¼�)?@�F�¼�)ë�¿�
���D´��LÔìÖ}�íî:n:cM�1��D´���ïð?@�F��ñòó�
ôõ»]^�O@ÓLmghDE�m:Da�F�
�%°��ö÷����x��øLM×��Lù��úûb:ü2¿ý�9Da�A���
àPOP¼�)Nþ�Ël��/01Lù��M:�����Wà�?}�:��LTr�
�i�qj�Dr@rF�
$%(��ö÷����x��ÎÏ:.¼�Ö:úûbü2¿ý�9Da�A���n:
ÎÏLTr��i���}�r@rF��'��ö÷����x�OPQRUV[\Nǹ
�.¼�Ö:ü2¿ý�9Daj���:_`��?´L��è@Ähf:���:þ19�
Ë�/01:ÇdÖWw�F�
�� ]^�OPcM:c��åæL÷���M×UV�ÎÏUV�OP�zUV@Ó
:��».¼�ÖLØ���Px�y?àPvwUVW��}��1�]^?cMdef:
ÄhW��}��è@k�Û�ÜÀN[\.�Ë��W�O��F�:]^LØ���OPQ
R:�*���8¿NcM{��R?�vw»yz»sU@Ó:s�UV:XYZ�@��
1�Wǹ�L¹!}�"#?de??mL�$z=¼%k&':UV[\]^:bT?}��
Øj�1/�%(@cMUVLØ�)Ë��1*1)�*7�¿W����F�
�
���������������
����� !"#���
Speech technology research depends to a very large extent on the quality of the data
upon which it is based. Nowadays, very little research makes explicit use of heuristic
knowledge or intuition, yet almost all research that is directed for speech technology
uses data that is artificially limited. That is, engineers and scientists base their
research on speech recordings that have been specifically prepared for technology
research, and are, without exception, constrained in the types of speech that they
illustrate. They are usually recorded in clean conditions, usually using professional
speakers to produce clear examples (though sometimes noise is added, after recording,
by use of specialised ‘noise databases’). These databases are not representative of the
speech of ordinary people.
Even the large telephone-speech and ‘spontaneous’-speech databases that are currently
being used for speech recognition research are constrained to contain clear examples of
the target speech (e.g., proper names, numbers, command sequences, etc), or are
well-rehearsed beforehand (e.g., oral conference presentations), though the recent
developments in ‘Call-Home’ data collection do make use of unconstrained
conversational data (for English). By lacking a ‘real’ context, and by not having a
‘participating listener’ present, these databases fail to represent the ordinary speech of
the ordinary person, and as such provide unrepresentative data for speech technology in
an Advanced Media Society.
Our goal in producing a 1000-hour conversational speech corpus for the JST/CREST ESP
project was to minimise these constraints so that the normal everyday speech of
non-professional speakers could be analysed for their paralinguistic characteristics.
After extensive testing, we decided to use Minidisk recorders for their unintrusive
portability, rather than the higher-quality (but heavier) DAT Walkman. Tests proved
that the speech recordings, though
ATRAC compressed, yielded data that
could be processed using standard
speech-analysis techniques.
Volunteer (paid) subjects wore a small
head-mounted studio-quality
microphone for extended periods
throughout the day to record their
everyday spoken interactions (see for
example the photographs on the right).
Others attended special premises
where they could speak to each other
over a period of months without
face-to-face contact, using telephone links, while recording locally to DAT. We are
indebted to the generosity of these volunteers who provided us with so much speech
without restraint or embarrassment. Overcoming Labov’s famous Observer’s Paradox
in this way was our first important achievement: the recording of ‘natural’ interactive
speech.
Figure 3.7.1. Transcribing the corpus. Utterance breaks were determined using a
‘yen-per-line’ principle to maximize the number of lines while at the same time
producing minimal meaningful utterance units.
Public-domain ‘Transcriber’ software (figure 1) was used by a large number of
volunteers to produce a written representation of the recordings. Conventions were
agreed so that the resulting text would be readable to both humans and machines,
resolving ambiguities resulting from multiple possible kanji-kana mappings, and
annotating non-verbal speech noises as well as marking non-speech sounds. This was
the most time-consuming and expensive aspect of our research. Again, we are
thankful to the many hard-working volunteers who had to listen to every word (often
several times) to provide time-stamped access to the conversational speech utterances.
The difference between fluent interactive speech
and conventional text is very great, and it
required considerable effort to remain accurate
to the acoustics while also being readable as text.
This work produced a corpus from a database of
recordings.
Previous work at ATR resulted in the CHATR
system of waveform concatenation for
high-quality speech synthesis, and given a large
speech corpus, it is thereby possible to
reproduce the voice and speaking styles of any
speaker using this method. Our next task was
to adapt the technology to work with a conversational speech corpus of extremely large
size (in one case, including almost 5 years' worth of daily conversation from one
dedicated volunteer!). However, this required the development of many new techniques
for input and unit-selection.
Figure 3.7.2. Producing conversational speech by using a small speech synthesizer
from the same speaker as a bootstrap device for selection from a large speech corpus.
Candidate phrases are filtered using voice-quality and prosodic characteristics.
By having the speaker also read a relatively small (one-hour) phonetically-balanced text
to produce a CHATR synthesis database, we are able to synthesise any utterance from
text input and to use that acoustic signal as a target for searching the conversational
speech database (figure 2). For common A-type utterances, several candidates are
usually found, and these are filtered by their acoustic features to select the one having
the most desirable paralinguistic features to match the intended characteristics of the
synthesised utterance. I-type utterances are not commonly repeated and require a
standard CHATR synthesis interface for their generation. This too has been
implemented for the same speaker, though it still remains to integrate the two in a
smooth imperceptible manner.
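The filtering of candidate phrases can be pictured with the following minimal sketch (the feature set and the plain Euclidean distance are illustrative choices, not the actual selection algorithm; file names are invented):

    import numpy as np

    FEATURE_KEYS = ("f0_mean_semitones", "speaking_rate", "aq")

    def select_candidate(target_features, candidates):
        """Among corpus tokens sharing the target transcription, pick the one
        whose prosodic / voice-quality features are closest to the target."""
        target = np.array([target_features[k] for k in FEATURE_KEYS])
        best, best_dist = None, np.inf
        for cand in candidates:
            vec = np.array([cand["features"][k] for k in FEATURE_KEYS])
            dist = float(np.linalg.norm(vec - target))
            if dist < best_dist:
                best, best_dist = cand, dist
        return best

    # Example: two corpus tokens of the same A-type utterance
    target = {"f0_mean_semitones": 2.0, "speaking_rate": 5.5, "aq": 0.12}
    candidates = [
        {"wav": "token_0153.wav", "features": {"f0_mean_semitones": 1.8, "speaking_rate": 5.4, "aq": 0.13}},
        {"wav": "token_0920.wav", "features": {"f0_mean_semitones": 6.5, "speaking_rate": 7.0, "aq": 0.25}},
    ]
    print(select_candidate(target, candidates)["wav"])   # token_0153.wav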
Figures 3 & 4 illustrate one of the
acoustic features used in this
filtering process: ‘AQ’ is a measure
of voice quality that correlates with
paralinguistic differences in the
meaning or intended use of an
utterance. Until now, AQ was only
measured from clear sustained
vowels, but the software that we
have developed renders it accessible
even from fluent conversational
speech. The combination
of phonatory and other prosodic
features such as speaking-rate, pitch-range, and loudness, limits the list of potential
candidates that have an identical text-transcription to just the few that have the
appropriate intended meaning for an utterance to be used in conversational speech.
Figure 3.7.3. Different phonation types. Pressed voice often sounds tense, while
breathy voice can sound much more relaxed. These characteristics are present in the
speech waveform but can be very difficult to detect automatically. Our software
overcomes this problem to produce a measure of voice quality in running speech
automatically.
Figure 3.7.4. NAQ: Normalised Amplitude Quotient, proposed by Alku and adapted by
us for use with fluent conversational speech, distinguishes between different modes of
phonation. It can indicate speaker-state and listener-relationship information.
Figure 3.7.5. Software to detect reliable centres (quasi-syllables) in a fluent
conversational speech signal. By performing a robust formant analysis, we are able to
estimate the vocal-tract parameters and to generate a speech-source measure that
indicates the amount of vocal-tension in the speech.
By estimating the vocal-tract characteristics from the speech signal, we are able to
produce a parametric representation of the speech and speaking-style features that
enables us to tag the speech utterances in a way that can be matched with the subjective
paralinguistic impressions of our human labelers (figure 5). These features can be
used bi-directionally, both for labelling (i.e., recognition) and for synthesis (by
unit-concatenation of syllable or phrased-sized chunks of natural speech).
This work is still experimental but we are encouraged by our results and are actively
pursuing this method of unit-selection for large corpus-based conversational speech
synthesis. Figure 6 shows results for a Japanese female speaker. The inset below the
figure gives an indication of the unit size.
Figure 3.7.6. Speech-to-Speech synthesis, showing results from unit-selection based
on acoustic targets. The top signal shows a natural-speech waveform (and related
acoustic parameters). Below is an equivalent utterance generated by the new method.
Whereas CHATR showed that concatenative speech synthesis can almost perfectly
replicate the voice and given speaking style of a known human speaker, the present
method exceeds that performance by using not phone-sized segments for concatenation,
but in the case of A-type utterances, whole phrases. The ‘synthesis’ sounds completely
natural because it is concatenating large chunks of speech at natural pauses in the
speech, and no longer has to model phrasal prosody by rule, but can concentrate instead
on discourse appropriateness by selection from among multiple candidates.
Having improved the acoustic
quality of the speech synthesis, the
remaining task was to enable input
so that the paralinguistic features of
an utterance could be specified. In
a concept-to-speech system, these
can be generated as part of the input
and passed as markup along with the
text. However, for use in a
conversational speech synthesiser, it
is necessary to have a real-time interactive input interface.
Keyboard input (or an equivalent
assistive input device) is still required for I-type utterances, but for the A-type
utterances, which make up approximately half the number of transcriptions, the
interface designed by the Prosody Research Team (Group 3) was implemented and
tested using both portable telephones and notebook computers.
Figure 3.7.7. The tap-to-talk speech synthesis interface offers a layered menu of
A-type utterances that can be quickly accessed by the jog-dial to provide rapid
conversational speech utterances. Text input is not implemented, but would make use
of the existing key pad. Unfortunately, although the software works in real-time, the
network delay tested using a 3-G FOMA handset introduces an unacceptable lag in
speech output (see http://feast.his.atr.jp/i for the downloadable software interface).
Figure 7 shows one implementation of
our AESOP FOMA keitai speech-synthesis
interface. The vertical bar can be
adjusted to set the ‘Self’ parameter, and
the horizontal bar the ‘Other’ parameter
(currently at 5 levels each). The
discourse ‘Events’ are selected by choosing among the icons in the screen centre until an utterance has been fully
specified. The phone transmits the
parameterised feature-settings, and a
server sends back the appropriate speech
waveform for replay over the telephone’s
loudspeaker or earphone.
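The exchange between the handset and the synthesis server can be sketched as follows (the endpoint URL, field names and JSON encoding are hypothetical illustrations, not the actual AESOP protocol):

    import json
    from urllib import request

    def request_utterance(server_url, self_level, other_level, event):
        """Send the parameterised feature-settings chosen on the handset and
        receive the selected speech waveform for replay."""
        payload = json.dumps({"self": self_level,     # 'Self' parameter, 1-5
                              "other": other_level,   # 'Other' parameter, 1-5
                              "event": event}).encode("utf-8")
        req = request.Request(server_url, data=payload,
                              headers={"Content-Type": "application/json"})
        with request.urlopen(req) as response:
            return response.read()                    # raw waveform bytes

    # Example call (requires a running server at this hypothetical address):
    # wav_bytes = request_utterance("http://example.org/aesop/synthesise", 3, 4, "greeting")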
The notebook computer interface (see figure 7 in section 3.3) takes advantage of a larger screen size to offer a much wider range of icons for faster utterance selection, but employs the same underlying speech unit selection technology. Text input is provided on the notebook, using phrasal units when available and falling back to CHATR-style unit selection otherwise.
These novel interfaces allow fast retrieval of conversational chunks from the corpus and enable the user (whether human, computer, or robot) to take part in a conversation in real time, and to express human non-verbal speech sounds fluently and effectively.
�������� ��� � � ���
������� �� � � � �� � �� ��������������������
!�"#�$%&'()#�$*+'��
������� �� � ,-� �� � ,. �
)#�$%'/&0��12��34564
7849:;</����������
!��
��=���� � -> �� � �� �
")�$?+'@9:ABCD*:EFGH
/IJ�HKLMN2%OP�12���=
���
�$AB/QRDST��
��=����� �, �� � -�
URQVR�$")?%'(WVRQVR
�$")?%'DXYZY�[:�$")
?%'�;<�MN2\H&P�12���
=���
]^_`_�a��
bcd�=����� , ��� �
)#]^_`_*'�a�bcd�=��
���
efghij�$�
�������� ,� �� � ,k� �
)#%'/efghij�$�a��U/�
�����
efghij�$�
lmno���� � ��� �
")?*'/efghij�$�a��Ul
mno����apqrs/tefghij
�[R�$�a��Ulmno����
uvwxlmno��� � ��� �� �"#�$\'()#�$%*'�a� yz{�
&O+|4}~/lmno����
�����2�
uvwxlmno���� �. �� �
��p��������m�@�ap����C�/
����D���Y2uvwxlmno�
��"#�$%'()#�$+'��
y�� �$�l��� � � �����, |�
��� �R
y���$"#%'�a�(yz{&*&|4��a
���|4�R4����4URe�hw|4
��p����m������_�w� !�l
���
����� ¡�2�
�¢e�hw���� �� �� �
£¤¥/�¦��ST����§�/d�(
¨©��(ª©��(«©��(¬©��(
¨�«©��(¨�¬©��(ª�«©�
�(ª�¬©���F�¢e�hw���
)#�$*'��
������������ ������������������� ������� �!
���������� �����
������������ ������������������� !!"# �� $%&�'$ !!"()* +,
-./��01234356789:�;<=�> ��� !!"# ?�@ABCDE��FGHCIJK
�@LM567�M�NOCPQ�RS���TJK�@/7UV+W@X8MYZ[\/7
�
����������
� �������
+� IBM& A,-‘expressive speech synthesiser’WO{}�r�Fè�}���D���ü
2¿�.�/( "?}��sU:~:{�Da�F/0� IBM:1�À2�6�
(http://www.research.ibm.com/tts):
�Most speech synthesis has a neutral, one-size-fits-all expression, regardless of what it's saying.
The new IBM expressive speech synthesizer has a range of expressions, so you can tune the
speech to fit its content. Here are some examples."
���L��IEEE : Journal �Transactions on Speech & Audio Processing"�Special
Issue on Expressive Speech Synthesis :å4W�Dùj����5{xAGuest Editor?}
�:67W8��r�F#http://www.ewh.ieee.org/soc/sps/tap/sp_issue/ess.html+
9:D��EU:FP6� ;!�56789� ECESS,#European Centre of Excellence for Speech
Synthesis.+cMde:COE:<=>?:@AA�Dr�F/0�ECESS:1�À2�6
(http://www.ecess.org/) �
�Currently the main market segments of voice driven interfaces are: Network-based Servers,
Mobile Terminals, and Consumer Devices. Network-based Servers represent the largest market
segment, dominated by interactive voice response (IVR) systems. This market is predicted to
increase from 1440 Million € in the year 2001 to 2030 Million € in the year 2006. The
number of voice driven mobile phones, the largest market segment in mobile terminals, where
new services based on speech technology are most visible, will increase from 104 Million
phones in 2004 to 252 million phones in 2005. Comparing the various speech technologies with
respect to revenue ASR is dominant. Speech output is mostly realised by recorded prompts due
to limited speech quality of the available TTS systems. In future application adequate TTS
technology will gain comparable revenue as ASR."
1�À2�6Aç�Ø�L�cM]^Wgh}è�� ¿LØ��B-z:CD�EF�FGH
I�5WBC}�r�Fiè�cMde]^LØ�CD��JB�cM"#?�Q@gDWB
CDE�?KL�r�F
�
�������������
ìM��?}���ND�äOÌÂP:åQRS��#T+#UV; WXôY3�ÎÏL
Z[}ècM\ßUV[\:$zÖ"+?�<�ß���#]^ôY+:�Corpus of Spoken
Japanese"A_?��F`x?mLcM¼�ë�¿:C4�56789�a�ècM�a�
èOPQR:~Wbr�Ó=�m�cdeDa��f�:n:f�:P}Y"¼�)Wti
@rF�gD�+�:cM?ºhW?mLC4�� DARPA:�Lifelog"�56789A� i
�r�A*®8�jk*l1¼%1�mnLØ��i�<=>A��r@rF
�Known as LifeLog, the project has been put out for contractor bids by the Defense
Advanced Research Projects Agency, or DARPA, the agency that helped build the Internet and
that is now developing the next generation of anti-terrorism tools. ….. Each LifeLog user could
"decide when to turn the sensors on or off and who would share the data," she added. "The
goal ... is to 'see what I see,' rather than to 'see me.' "" Washington Post, June 2003.
( http://www.darpa.mil/ipto/Programs/lifelog/ )�
/0��ÎÏLZ[}ècM\ßUV[\:$zÖ"�56789:1�À2�6�3
�&':UVÖ:�oLpr�cM\ßq�r]^:$zÖAsÈDa�FtB:cM\
ßUV[\:��D��cM:ÎÏ�Auvi�wZLa�èA�ÎÏA\ßUV�x
y�vw�yz�sU?r�èý®»z\ßUV:{|L;<@��WÃè}�r��?Wé
ê��?�}~@QRÖL<�}èÎÏ:�d�@��A�<Da�F�:Ø�@����
�RS��ÎÏ:.¼�Ö��Q�ÔÕ�þ�ý¿½e?r�è_`W�É�??mL�ÎÏ
:���cMde�"#:��Z>Ww�Fiè�ÎÏ:JK�Äh?}�cMàP/¿
�Àa�r�����:��W�É�F����ÎÏ��:Oo??mL�cM\ßUV[
\]^:���@�oL��T�m:Da�F"
/0��CSJ"�56789:1�À2�6�3
����ßP}\�þ�ý¿�����ß:�OcMW;�LaTÉ���:��hUVW�
÷}èP}\���h:¼�)ë�¿Daj�2004-�:�eóL�×�?mLúû�;:
�OcM��h¼�)ë�¿L@�m:?BCi�ri�F�:þ�ý¿��UVÁ�;Â
:]^��ôYW��·¸x?}���<Í��n�<�ß���?�<Í��n���
d���A��}�E�r�äÌPÌÂ]^���������d�� z��¡n�P
}\�:\ß�»ý®\ß�¢�:£ÕL_¤��P}\�ÁÂ�:¢¥"�56789
(1999-2003):b�?}�¢¥i�E�ri�F�:�56789:[¦���@P}\�#�
OcM+WÁÂ�L[\��èÉ:_§]^W�¨���?Lù�ri�A����ßP
}\�þ�ý¿��Ù:èÉL�<©ª«@¼�)ë�¿?}�¬¤���ùj�Ù:¢
¥½��;?}��<�ß���AÔ®}�ri�F"
>¯�56789:cM¼�)�°±@��ïðDa�AÓ=�m�c�@àP²³LZ
��Pièm:D�@��M×:��@>�ü�/01Wtum:D�@rF�A
JST/CREST ESP¼�)?:;E@²´�Da�FPxvw»OP͵N²³�¶@ÓLØ�
�no�·v#LM:¸QWÍ�F��nL?��¹�@�?�A��iD:cMUV[
\]^D���:Ø�@UV�i�è�ti�r@rF
+�: TalkBank �56789?CHILDES mìM:��WÍ��r�3
TalkBank is an interdisciplinary research project funded by a 5-year grant from the National
Science Foundation (BCS-998009, KDI, SBE) to Carnegie Mellon University and the
University of Pennsylvania. Additional support comes from NSF ITR Grant 0324883 : “The
goal of TalkBank is to foster fundamental research in the study of human and animal
communication. It will construct sample databases within each of the subfields studying
communication. It will use these databases to advance the development of standards and tools
for creating, sharing, searching, and commenting upon primary materials via networked
computers.” ( http://talkbank.org/ )
The CHILDES system (Child Language Data Exchange System) provides tools for studying
conversational interactions. These tools include a database of transcripts, programs for computer
analysis of transcripts, methods for linguistic coding,and systems for linking transcripts to
digitized audio and video. ( http://childes.psy.cmu.edu )
�N��cM¼�)ë�¿¾\��Aº÷}�E�r�F+�: LDC (see
http:www.ldc.upenn.edu/)� 9:: ELRA (http://www.elra.info/) ?´L���D���»
GSK \ß¼½¾'� (http://www.gsk.or.jp/)� A¼�)¾\LTr���Q@��WÃè��
?W[�?}�¿À:eÃÁ:¾\:ÂÃ:ª��Aa�F
������������
����56789D�O}ècM[\]^����Lqj��*1��6719@UV��
]^��f:_§"#�1�]^:ghAª�Da�F/0�r�T:ÄW_?�F
9:D�2004-ÅÆØj�FÇÈÉz: FP6 ;!���56789� �CHIL - Computers In
the Human Interaction Loop " A � É � � r � F / 0 CHIL : 1 � À 2 � 6
(http://chil.server.de)�3
�CHIL - Computers in the Human Interaction Loop - is an Integrated Project (IP 506909) under
the European Commission’s Sixth Framework Programme. It is jointly coordinated by
Karlsruhe University and the Fraunhofer Institute. The project was launched on January, 1st
2004 and has a duration of 36 months. In total the project costs amount to more than 24 million
EUR. The vision of the CHIL project is to develop and explore a fundamental shift in the way
we use computers today. We aim to realize computer services that are delivered to humans in an
implicit, indirect and unobtrusive way. We wish to free people to interact with people. Therefore
we re-position machines to be in the background, discretely observing the humans and - like
electronic butlers - attempting to anticipate and serve their needs. Computers in the Human
Interaction Loop (CHIL) aims to introduce computers into a loop of humans interacting with
humans, rather than condemning a human to operate in a loop of computers. This will give
humans more time to do what they really like: communicate and work productively with other
humans."
Research on “Intelligent Spaces” is also active at MIT and elsewhere. From the homepage of the MIT project (http://www.ai.mit.edu/projects/aire/):
�Agent-based Intelligent Reactive Environments
A Research Group at the MIT Computer Science and Artificial Intelligence Laboratory
aire is dedicated to examining how to design pervasive computing systems and applications for
people. To study this, aire designs and constructs Intelligent Environments (IEs), which are
spaces augmented with basic perceptual sensing, speech recognition, and distributed agent logic.
aire forms a core component of MIT's pervasive computing project, Project Oxygen"
Similar research on recording and processing natural meetings is also in progress in the United States, for example at ICSI:
�The Meeting Recorder Project (http://www.icsi.berkeley.edu/Speech/mr/mtgrcdr.html)
Despite recent advances in speech recognition technology, successful recognition is limited to
co-operative speakers using close-talking microphones. There are, however, many other
situations in which speech recognition would be useful - for instance to provide transcripts of
meetings or other archive audio. Speech researchers at ICSI, UW, SRI, and IBM are very
interested in new application domains of this kind, and we have begun to work with recorded
meeting data. The first stage in investigating speech recognition for meetings is to collect some
data. At ICSI, we have equipped a meeting room with a multichannel, studio-quality recording
system and have begun to collect pilot recordings of meetings, primarily between speech group
members. At the time of writing (2001 February), we have collected 40 hours of 16 channel
pilot data, and ten hours has been hand-transcribed. See this information on Meeting Recorder
data collection including both the mechanics of the meeting recorder setup at ICSI and some
initial forays into processing the recordings. The data were then transcribed, using a set of
transcription conventions designed for speed and accuracy of data input and encoding. "
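As a rough illustration of a first-pass analysis that such multichannel meeting recordings invite, the sketch below estimates which close-talking channel is most active in each half-second frame. The file name meeting.wav, the 16-bit interleaved PCM assumption, and the frame length are illustrative assumptions only; this is not a description of the ICSI processing pipeline.

import wave
import numpy as np

def dominant_channel_per_frame(path, frame_sec=0.5):
    # Read an interleaved multichannel PCM file and return, for each frame,
    # the index of the channel with the highest RMS energy.
    with wave.open(path, "rb") as w:
        n_ch, rate = w.getnchannels(), w.getframerate()
        pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)  # assumes 16-bit samples
    samples = pcm.reshape(-1, n_ch).astype(np.float64)        # (time, channel)
    frame_len = int(frame_sec * rate)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len, n_ch)
    rms = np.sqrt((frames ** 2).mean(axis=1))                 # (frame, channel) energies
    return rms.argmax(axis=1)                                 # loudest channel per frame

if __name__ == "__main__":
    print(dominant_channel_per_frame("meeting.wav")[:20])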
������� !"�
��5{x:Îr�/0:Ø�Da�3
“I hope that ATR will be able to find funding to continue the support of this research so that
it can be implemented both in speech synthesis systems and in ambient intelligent environments.
While its applications to future speech synthesis are obvious, I also foresee useful applications
in sensor equipment for monitoring the affective states of people in intelligent spaces under the
ubiquitous-computing framework”.
“In addition to being used as a communication aid for the speaking-impaired, it is likely that
the speech synthesis component of this research will find application first of all in humanoid
robots, enabling them to communicate with humans in a more acceptable manner, softening the
interaction by increased use of non-verbal speech”.
ÌÂ]^N&'f:é���ÏÐÑÃLTr�
In the age of ‘ubiquitous computing’ and ‘ambient intelligence’ people will increasingly be
confronted with automated devices and services that are equipped with interactive speech
interfaces. Although current speech synthesis and recognition are well-tuned for linguistic
processing, they are not yet capable of processing paralinguistic information, such as
‘tone-of-voice’, and are concerned only with ‘what has been said’, rather than including (as
humans do without thinking) information about ‘how it was said’. Machines need to be made
aware of and sensitive to these differences in speaking-style and to the clues about the speaker’s
state and feelings that are present in human speech. The technology produced as a result of this
research will become an important component in this evolution towards an Advanced Media
Society.
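The “how it was said” information referred to above can be approximated with very simple acoustic descriptors. The sketch below computes per-frame RMS energy and a crude autocorrelation-based pitch estimate from a mono 16-bit WAV file; the file name utterance.wav, the frame length, and the pitch search range are assumptions for illustration, and this is not the feature set used in the project.

import wave
import numpy as np

def f0_and_energy(frame, rate, fmin=70.0, fmax=400.0):
    # Remove DC, measure RMS energy, then pick the strongest autocorrelation lag
    # inside the expected pitch range (the F0 value is meaningless for unvoiced frames).
    frame = frame - frame.mean()
    energy = float(np.sqrt((frame ** 2).mean()))
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(rate / fmax), int(rate / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return rate / lag, energy

with wave.open("utterance.wav", "rb") as w:                   # assumed mono, 16-bit PCM
    rate = w.getframerate()
    x = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16).astype(np.float64)

frame_len = int(0.04 * rate)                                  # 40 ms analysis frames
stats = [f0_and_energy(x[i:i + frame_len], rate)
         for i in range(0, len(x) - frame_len, frame_len)]
f0s, energies = np.array(stats).T
print("mean F0 (Hz):", round(f0s.mean(), 1), "mean RMS energy:", round(energies.mean(), 1))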
Furthermore:
ATR has been invited to join ECESS (http://www.ecess.org) as a non-European partner (the only
other such partner is the Pattern Recognition Laboratory of the Chinese Academy of Sciences in
Beijing) and we expect to see the implementation of our JST/CREST-funded technology in their
speech synthesizer. Since this project is being coordinated by Siemens and Nokia, we
foresee strong marketing potential and immediate applications in the European Community for
this emerging technology.
��������
���$%� � �
�
�
���������
�
� � �� ���������� �
� � � � ���������������
� �
� ��� !����
�
� � "#$%&'()��*+,� �
�
*+-./�
012�3456���
�
78��9:;<�����
�������=>� �
� �?@AB����
�
� � CD�����"#�E��� �
� � � � ���F���.G����
� �
� � �������� �
� � �ICPGrenoble� � �
� �����������
� �
� � HIJ0K�LM5NO���� �
� ������
� � � � �� !"#$%&'���
� �
� � P������� �
� ()��
� � � � *+,�-.���
� �
� �LQRSTU����
�
"#$%&'()��*+,� �
� � � � � � � � � � � � � � � /01"2345�67���
�
�
�
���&'(�)� �
*����+,�����
®'� ¯°� ±²� ³´µ�¶·¸¹� º»�¼�
½¾¿4vÀhÁj�VÂ�çÄ�
ÅÆÇȶ·¯�Éʶ·Ë ÌÍÎÏ�
�Ð� 0� -ÑÒ�
��-0� ,�
ÓÔÕ�Ö×Øh�
ÙÚz ¶·¯�¶·ÛÜ� Ý��±Þ�
�Ð� 0� .ÑÒ�
��-0� ,�
ßàáâ�VÂ�çÄ�
ÅÆÇȶ·¯�¶·ãäË ¶·å�$ãæ�
�Ð� 0� -ÑÒ�
��-0� ,�
ç[èé�VÂ�çÄ�
ÅÆÇȶ·¯�¶·ãäË êÑÒ`ëì�
�Ð��0� �ÑÒ�
���0� ,�
íîïð�VÂ�çÄ�
ÅÆÇȶ·¯�¶·ãäË ñØò�óhò�
�Ð��0� �ÑÒ�
��-0� ,�
-./01�����
®'� ¯°� ±²� ³´µ�¶·¸¹� º»�¼�
ôõö÷�øïùúû�ÇÈ
��ü���ýþ� ���Ð/�ï�
�Ð� Ñ� .ÑÒ�
��-� ,�
�����øïùúû�ÇÈ
��ü���ä<� =������
�Ð� Ñ� ÑÒ�
��-� ,�
���øïùúû�ÇÈ
��ü���ä<� ��ê_`Á_w�
�Ð�,Ñ� �ÑÒ
��-� ,�
àÔ�Å�øïùúû�ÇÈ
��ü��� B�� �$���êj¡�
�Ð� Ñ� >ÑÒ�
���� ,�
����øïùúû�ÇÈ
��ü��� B�� �������
�Ð� Ñ� >ÑÒ
���� ,�
��â�7�øïùúû�ÇÈ
��ü��� B�� �$���êj¡�
�Ð�,Ñ� �ÑÒ
���� ,�
�����øïùúû�ÇÈ
��ü����B�� ��/�� !4�Ð�
�Ð��Ñ� �ÑÒ�
���� ,�
�õè"�øïùúû�ÇÈ
��ü����B�� �$���êj¡�
�Ð��Ñ� �ÑÒ�
���� ,�
#Ó6$�øïùúû�ÇÈ
��ü����B��
���Ð%/&'(Ð�
�Ð��Ñ� �ÑÒ�
���� ,�
)õ*+�øïùúû�ÇÈ
��ü����B�� =� !4,-�
�Ð��Ñ� �ÑÒ�
���� ,�
W./0�øïùúû�ÇÈ
��ü����B�� fh`12fw/�ï�
�Ð��Ñ� �ÑÒ
��-� ,�
3Ô45�øïùúû�ÇÈ
��ü����B�� ���Ð��(Ð�
�Ð��Ñ� �ÑÒ�
���� ,�
î67�øïùúû�ÇÈ
��ü����B�� ���Ð��89�
�Ð��Ñ� �ÑÒ�
���� ,�
:;7� <=��� B�� Î>� !4,-��Ð� Ñ� >ÑÒ�
���� ,�
?@A7�øïùúû�ÇÈ
��ü����B�� �� !�
�Ð� Ñ� ÑÒ
��-� ,�
BÔCD�øïùúû�ÇÈ
��ü����B�� ���Ðê_`Á_w
�Ð�,Ñ� �ÑÒ�
���� ,�
EFGH7�øïùúû�ÇÈ
��ü����B�� ���êj¡�
�Ð�,Ñ� �ÑÒ�
���� ,�
IïJKL�øïùúû�ÇÈ
��ü����B�� bM�Ághò�
�Ð�,Ñ� �ÑÒ�
���� ,�
íîïð�øïùúû�ÇÈ
��ü����B�� �����
�Ð�,Ñ��ÑÒ
��-� ,�
NOPK�øïùúû�ÇÈ
��ü��� B�� =�,-�
�Ð��Ñ� ,ÑÒ
��-� ,�
Q[���øïùúû�ÇÈ
��ü����B�� =�,-�
�Ð��Ñ� -ÑÒ
���� ,�
RSKT�øïùúû�ÇÈ
��ü����B�� ���Ð�
�Ð��Ñ� ÑÒ
���� ,�
�UV7�øïùúû�ÇÈ
��ü����B�� d�&'(Ð�
�Ð��Ñ� ÑÒ
���� ,�
WXE$�øïùúû�ÇÈ
��ü����B�� ��/����4�Ð�
�Ð��Ñ� ÑÒ
���� ,�
ÔWYZ�øïùúû�ÇÈ
��ü����B�� ���Ághò�
�Ð��Ñ� ÑÒ
���� ,�
EF[ð�øïùúû�ÇÈ
��ü����B�� �������
�Ð��Ñ� ÑÒ
���� ,�
�\]ì�øïùúû�ÇÈ
��ü����B�� ���$#^_�
�Ð��Ñ� ÑÒ
��-� ,�
`C(�øïùúû�ÇÈ
��ü����B�� �a^_�
�Ð��Ñ� ÑÒ
���� ,�
�bc�øïùúû�ÇÈ
��ü����B�� ê_`Á_wdÐ�
�Ð��Ñ� ÑÒ
��-� ,�
eõf �øïùúû�ÇÈ
��ü����B�� ê_`Á_wdÐ�
�Ð��Ñ� ÑÒ
��-� ,�
WgÊ �øïùúû�ÇÈ
��ü����B�� ê_`Á_wdÐ�
�Ð��Ñ� -ÑÒ
��-� ,�
23456�����
®'� ¯°� ±²� ³´µ�¶·¸¹� º»�¼�
W�Eð� hà��� ýþ� iR��jkl��Ð� 0� £ÑÒ�
��-� ,�
mnoð� hà��� äýþ� =�d�����Ð� 0� £ÑÒ�
��-� ,�
p q� hà��� ýþ� ����jkl��Ð� 0� £ÑÒ�
��-� ,�
rÔ�q� hà��� B��� ê_`st4 !��Ð�,0� ÑÒ
���0� ��
uvw� hà��� xV:ýy iR��jkl��Ð�,0���ÑÒ�
��-� ,�
SÓzK� hà��� äýþ� iR��jkl��Ð�,0� ÑÒ�
��-� ,�
}�{é� hà��� äýþ� ê_`st4 !��Ð�,0� ÑÒ�
��-� ,�
Q[|â7� hà��� t�}~y =�d�����Ð�,0� ÑÒ�
��-� ,�
��D��� hà��� äýþ� ����jkl��Ð�,0� ÑÒ�
��-� ,�
W��7� hà��� B��� �� !��Ð�,0� ÑÒ�
��-� ,�
���7� hà��� B��� �� !��Ð�,0� ÑÒ�
��-� ,�
�Ôz7� hà��� B��� �� !��Ð�,0� ÑÒ�
��-� ,�
����� hà��� �B��� ��jkl����Ð� 0��ÑÒ�
��-� ,�
W��7� hà��� �B��� �� !��Ð��0���ÑÒ�
��-� ,�
ç�f�� hà��� �B��� �� !��Ð�,0� ÑÒ
���0� ,�
#��� hà��� B��� �� !��Ð��0� ÑÒ
���0� ,�
(��7� hà��� B��� �� !��Ð��0� �ÑÒ�
��-� ,�
���� hà��� �B��� �� !��Ð��0� �ÑÒ�
��-� ,�
�Ô�7� hà��� �B��� �� !��Ð��0��ÑÒ�
��-� ,�
���� hà��� B��� �� !��Ð��0��ÑÒ�
��-� ,�
�6â3å� hà��� t�}²Ë ê_`�!ãä��Ð��0��ÑÒ�
��-� ,�
�
���������� �
®'� ¯°� ±²� ³´µ�¶·¸¹� º»�¼�
Veronique Auberge ICP Grenoble ýþ� iR/��#��Ð� 0� -ÑÒ�
��-� ,�
Albert Rilliard ICP Grenoble ¶·Ë� �����_`��Ð� 0� >ÑÒ�
��-� ,�
Anna Tcherkassof ICP Grenoble B��� ���>��Ð�,0� ÑÒ�
��-� ,�
Amandine Fouard ICP Grenoble B��� ê_` !��Ð�,0� ÑÒ�
��-� ,�
Cecile Brichet ICP Grenoble B��� ê_` !��Ð�,0� ÑÒ�
��-� ,�
Marie Caithard ICP Grenoble B��� ñØò�×dÐ��Ð�,0� ÑÒ�
��-� ,�
Aude Noiray ICP Grenoble B��� ê_` !��Ð�,0� ÑÒ�
��-� ,�
Ludovic Lemaitre ICP Grenoble B��� ê_`st4 !��Ð� 0� >ÑÒ�
��-� ,�
Sylvie Mozziconacci ICP Grenoble B��� �ì 7����Ð� 0� ÑÒ
���0� ,�
Daniel Hirst Aix-en-Provence ýþ� ��iR�¡¢ì��Ð� 0� .ÑÒ�
��-� ,�
£�m¤ Edinburgh�� B��� d�w`fjj�a��Ð� 0� -ÑÒ
���0� ,�
7������������� �
®'� ¯°� ±²� ³´µ�¶·¸¹� º»�¼�
¥O§�� �¦§¨��� ýþ� fh`12_w��Ð� 0� .ÑÒ�
��-� ,�
©Ôuâ� �ª§¨���«{¬�z�
¶·Ë�ê_`�>4st�
�Ð� 0� .ÑÒ�
��-� ,�
®¯7� �¦§¨��� ¶·ãäË �Ághò��Ð� 0� ÑÒ�
���0� ,�
�
������� �
®'� ¯°� ±²� ³´µ�¶·¸¹� º»�¼�
:è°� ±¯��� äýþ� =�/|"²��Ð� 0��ÑÒ�
��-� ,�
³´âµ7� ±¯��� ¶·ãäË� d�¶·/¶·��Ð�,0� �ÑÒ�
��-� ,�
:;7� ±¯��� ¶·ãäË� Î>� !¸,-��Ð��0� ,ÑÒ�
��-� ,�
N�¹�7� `<��� B��� ê_` !��Ð��0� >ÑÒ�
���0��� �
ºõµ{� `<��� �»� ê_` !��Ð��0� >ÑÒ�
���0��� �
���� �� !
®'� ¯°� ±²� ³´µ�¶·¸¹� º»�¼�
¼½¾¿½À�ÁÂÿĽ¾�VÂ�çÄ�
ÅÆÇȶ·¯
«{¬�z�
¶·Ë���ÄÅ¢ì�
�Ð�,0� �ÑÒ�
���0� ��
ÆOfÇ�VÂ�çÄ�
ÅÆÇȶ·¯
«{¬�z�
¶·Ë�d�`fóhò�
�Ð�,0� >ÑÒ�
���0� ,�
Ó�ÈjØw�VÂ�çÄ�
ÅÆÇȶ·¯
«{¬�z�
¶·Ë����
�Ð��0� ÑÒ�
���0� �
z½É��¿½ÉÊ�VÂ�çÄ�
ÅÆÇȶ·¯Ë̶·Ë ���Ð�
�Ð��0� ÑÒ�
���0� -�
ÍÂÎÏÐÑÎ� ��VÂ�çÄ�
ÅÆÇȶ·¯
«{¬�z�
¶·Ë���ÄÅ¢ì�
�Ð�,0� �ÑÒ�
��,0� ��
��«¿ÑÎÏ�Ò½ÎÏ�VÂ�çÄ�
ÅÆÇȶ·¯
«{¬�z�
¶·Ë��� !4,-�
�Ð� 0��ÑÒ
��,0���
ÚÎϽ�ÚÓ¾½Ã¿À� �Ô��� B��� ����/ÕÖ��Ð� 0� .ÑÒ�
���0� ,�
�Ð��0� ÑÒ�
�Ð��0� �Ñ�×½¾ÄÀÑÄ�
¼ÊÄØÎÏÙ¾�óÚhÛh��� Ë̶·Ë ���Ð�
�Ð��0� ÑÒ�
���0� ��
ÜÉϽ�ÝÂÑÓν� óÚhÛh��� Ë̶·Ë ���Ð��Ð��0� ÑÒ�
���0� ��
Þ�ßà� �Ô��� B��� áâjd�����Ð� 0� .ÑÒ�
��,0� >�
�N§Ä�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð��0� �ÑÒ�
���0� ,�
ã�äâ�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð��0� �ÑÒ�
��-� ,�
å[�câ�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð�,0� .ÑÒ�
��-� ,�
�Ocâ�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð�,0� �ÑÒ�
���0� ��
¨Ô�æ�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð��0� �ÑÒ�
���0���
Wç7�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË sè�
�Ð�,0��ÑÒ�
��-� ,�
NOéjm�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð��0� �ÑÒ�
���0� ,�
@�ê�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð�,0� .ÑÒ�
��-� ,�
V æµ7�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð��0� �ÑÒ�
���0� >�
ë�ìbíî7�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË ê_`�!�
�Ð��0� ÑÒ�
���0� -�
�
�
����� �����
� ���8�9�:;�<�'=>?�@� �
VWX� YZ� [,�\]
^_`a�
�
�� 0>�
��ң��Speech & Emotion Belfast 95 Satellite of Eurospeech
��,0�
��Ñ ��ESP Groups Meeting
VÂ�çÄ
ÅÆÇȶ·¯ 25 Research Planning
���0�
.���CREST Meeting
øïùúû�ÇÈ
��ü�� 30 Research results
���0>�
�����
ATR-CREST
Workshop I,II
����
@U¥(±¯C 65 Satellite of LP2002
�Ð��0��Ñ
.�Ò>��
ESP/CREST
Group Meeting
VÂ�çÄ
ÅÆÇȶ·¯ 15
¶·ïðñò¡ó
k�ô_
���0 �
��Ò ��
1st
CREST ESP
International
Workshop
h� 85 Reporting Results
���0.�
-�Ò >��Voqual Workshop Geneva 75 Satellite of Eurospeech
���0>�
����
Eurospeech � Special
Session Geneva 80
Public discussion of results
and related research
���0�
���-��1st
JST Symposium JjÈêõJö÷Þ
@`<C 218 Poster presentation
���0�
��Ñ ���
CREST ESP
Symposium
ICP Grenoble,
France 80 French Team Workshop
���0,�
,�Ò ���SP2004 conference øïøùú�å 250
�mB���¶·�
jûA�
���0�
�Ñ����JST Symposium û�ßüý 178 Oral presentation
���0���
��Ò>��
ICSLP Special
Session Jeju Island, Korea 110
Speech & Affect
Session of ICSLP-04
��-0�
,Ñ�þ�
Crest ESP Final
Workshop
VÂ�çÄ
ÅÆÇȶ·¯ 50?
Depending on funds being
available
�
���ABCD��E@�
�� ������ �� �� ��� ����
Á¾�z½É��¿½ÉÊ�
@\yz{� ÇÈËC��yÙy�/���Ð/¶·�
VÂ�çÄÅÆ
Çȶ·¯�
���0�
ÑÒ-Ñ�
ݾk�¬Éٽξ½�yÉÓ½ÎÂ�
¼¾ÂÊÙ��¾�
�Î��«½À�ν���¾½ØÉ�
�_¿¾ñ/º»(k�ô_VÂ�çÄÅÆ
Çȶ·¯�×��0.Ñ�
ݾk�����
¼¾ÂÊÙ��¾�
�Î��«¿�½Ï � ���y�
�_¿¾ñ/º»(k�ô_VÂ�çÄÅÆ
Çȶ·¯�×��0.Ñ�
ݾ�«¿ÉÎ��¿¿� �
{Ù�Ù½¾�¿Ù¾�
yz�z ���y�
�_¿¾ñ/º»(k�ô_VÂ�çÄÅÆ
Çȶ·¯�×��0.Ñ�
ݾk��оÎ�;½Î�ľÂÀ�
ݾÙ�ľ� �
� �«zz��ÂÊ��z× ���Ù�ÙÎ� �
��z�«{¬�z��_¿¾ñ/�
º»(k�ô_�
VÂ�çÄÅ
ÆÇȶ·¯�×��0 Ñ�
ݾk��½ÎÙÄ�«½¿Î�
{Ù�Ù½¾�¿Ù¾�
ÁÂľÂɽ��½Ó¾½Ä¾Ù��
��z�«{¬�z��_¿¾ñ/�
º»(k�ô_�
VÂ�çÄÅ
ÆÇȶ·¯�×��0 Ñ�
¼¾ÂÊk�yÎÄÂÎÙ�yÑ�¿ÉÎ�
�Î�Ù¾�ÄÙ��Ù�ÍÙÎÙ�Ù�
��z�«{¬�z��_¿¾ñ/�
º»(k�ô_�
VÂ�çÄÅ
ÆÇȶ·¯�×��0,Ñ�
ݾk�yÉÓ¿Ù�«¿½�½�Ù�
{Ù�Ù½¾�¿Ù¾�
z¾ÎÄ��«ÂÉÉÙÏÙ �Ú¾ÙɽÎ��
��z�«{¬�z��_¿¾ñ/�
º»(k�ô_�
VÂ�çÄÅ
ÆÇȶ·¯�
��0 �
×��0,ÑÒ�Ñ�
Á�kÜÉϽ�ÝÂÑÓν�
{Ù�Ù½¾�¿Ù¾�
ÁÑÙÎ�¿ÙÎ��Îk�ÍÙ¾À½Î��
��z�«{¬�z��_¿¾ñ/�
º»(k�ô_��
���Ðfh`12_w/¶·
VÂ�çÄÅ
ÆÇȶ·¯�
��0 �
��0��
ݾk�×½¾ÄÀÑÄ�¼ÊÄØÎÏÙ¾�
{Ù�Ù½¾�¿Ù¾�
ÁÑÙÎ�¿ÙÎ��Îk�ÍÙ¾À½Î��
��z�«{¬�z��_¿¾ñ/�
º»(k�ô_��
���ÐÇÈ/¶·�
VÂ�çÄÅ
ÆÇȶ·¯�
×��0 ÑÒ�Ñ�
×��0,ÑÒ�Ñ�
�
�
�
���������� �
���������� ����������������������� !"�
���#$� ��%&'()�*+�,-.�/0123��4456�!78��,9
:;<=>&'9 9 ?@AB;C�DEFCG9
:H<IJ&'9
KLM�IJNO9 ?@APFC�DEQRCG9
STUV�&'9 9 ?@AHFC�DEQBCG9
W�XU&'YZ[\9]^_`9abcd_e?HRRHf�HRRBf�HRRFfG9
:B<ghij?@AkC�DElijmnCG9
:F<opqrs9 9
Kopqrt9;nC?DE�uvG9
Swx9 9 2y9
W��z9 9 {_|`b}_9~^__��t9�\��9HRRFe9�����`�9HRRBe����9HRRB�9
9 9 9 ��V�@������?HRRHf��G9
9 9 9 �X�Y~��c�_t{�`_}9:����_9[�<e9HRRF�9
9 9 9 LM�mN�Y~�_�_`e9{[ae9HRRFfe;R�9
9 9 9 Z[\9 ]^_`9abcd_tLM&'�HRRFf�;;�9
9 9 9 ��N�Y�Z�~[e9HRRFfeP�,9
9 9 9 s9
�
�������� �� ���������������
������� � ! "#$% &#'1. Mikiko Mashimo, Tomoki Toda, Hiromichi Kawanami, Kiyohiro Shikano, Nick Campbell,
"Cross-language Voice Conversion Evaluation Using Bilingual Database", IPSJ Journal, Vol.43, No.7,
pp.2177-2185 (2002-7)
2. Kazuki Adachi, Tomoki Toda, Hiromichi Kawanami, Hiroshi Saruwatari, and Kiyohiro Shikano,
"Designing target cost function based on prosody of speech database," IEICE Trans. Inf. and Syst.,
2005.
�
3. ���������� �����������“���������� !"#$"%&
'()*+,-�./01&23,”456789:;<=>�Vol.J87-D-II�No.2, pp.447-455
(2004-2)
�
�(�)*��
+,-)*./ � !"(#$% (#'1. Hiromichi Kawanami, Tsuyoshi Masuda, Tomoki Toda, Kiyohiro Shikano, "Designing speech
database with prosodic variety for expressive TTS system," Proceedings of International Conference
on Language Resources and Evaluation (LREC2002), pp.2039-2042 (2002-5)
2. Mikiko Mashimo, Tomoki Toda, Kawanami Hiromichi, Hideki Kashioka, Kiyohiro Shikano, Nick
Campbell, "EVALUATION OF CROSS-LANGUAGE VOICE CONVERSION USING
BILINGUAL AND NON-BILINGUAL DATABASES", Proceedings of 7th International
Conference on Spoken Language Processing (ICSLP2002), Denver, pp.293-296 (2002-9)
3. ?@AB5, CDEF, Nick Campbell, “=G&HI�J0*+,-&K�,” 67LM:;N
OP;*Q<=R (2001-9)
4. ����, ��, �� �, ���, ����, “CHATRST� U&STRAIGHTVWX
���YZ,” [\�]:;^_NOP;`a<=R, 1-2-20, pp.245-246 (2001-10)
5. bcde, fghi, Nick Campbell, “+jk6lm�./0�]�no,” [\�]:;^_
NOP;`a<=R, 2-2-3, pp.261-262 (2001-10)
6. ?@AB5, CDEF, Nick Campbell, “*+,-&K��J0*+pqr&st,” [\�]
:;^_NOP;`a<=R, 2-2-7, pp.269-270 (2001-10)
7. ����, ��, �� �, ���, ����, “*+,-&uv0!"#$"%VWX�
� wxyz&{|,” 456789:;}~��7�, Vol.101, No.603, SP2001-122, pp.61-68
(2002-1)
8. �� �, ����, “�������� !"#$"%&��)1&���no&st,” [\
�]:;�_NOP;`a<=R, 2-10-12, pp.287-288 (2002-3)
9. ����, ��, �� �, ���, ����, “������v!"#$"%&'()2
3,” [\�]:;�_NOP;`a<=R, 2-10-13, pp.289-290 (2002-3)
10. ���, �������$�, “������V�����"��H��&{|,” [\�]:;�_
NOP;`a<=R, pp.387-388 (2002-3)
11. ?@AB5, CDEF, �������$�, “�����./0uv0+,)*+%#��q&*
+pqrYZ� X¡,” [\�]:;�_NOP;`a<=R, 2-10-17, pp.297-298 (2002-3)
12. h¢£F5, ��, �� �, ����, �������$�, “[¤¥�q�./0 ¦K
§)¨�©q&b]�ª«0{| ,” [\�]:;�_NOP;`a<=R , 1-10-16,
pp.261-262 (2002-3)
13. ¬®¯, ��, �� �, ���, ����, "°±²!"#$"%VWX�*+³±²
wx&{|," 67LM:;��7�, 2002-SLP-42-5 (2002-7)
14. P�´z, h¢£F5, µ¶·, �� �, ���, ����, “¤��]�!��J0¤�:
¸j&*�23yz&{|,” [\�]:;`a<=R, 1-6-1 pp.209-210 (2002-9)
15. ¹º�», ��, �� �, ���, ����, “GMM�¼½ ¦K§VWX�k6Q
¾&YZ)1&23,” [\�]:;`a<=R, 1-10-24, pp.277-278 (2002-9)
16. ¬®¯, ��, �� �, ���, ����, “°±²VWX�¿"À%$"%ÁÂ�J0
*+³±²wxyz,” [\�]:;`a<=R, 2-10-15, pp.315-316 (2002-9)
17. ¹º�», ��, �� �, ���, ����, "GMM�¼½ ¦K§VWX�k6�
wx," 456789:;}~��7�, SP2002-171, pp.11-16 (2003-1)
18. ¹º�», ��, �� �, ���, ����, “GMM�¼½ ¦K§yzVWX��
&k6YZ,” [\�]:;`a<=R, 1-6-23, pp.267-268 (2003-3)
19. P�´z, h¢£F5, µ¶·, �� �, ���, ����, “¤�:¸j&*�23yz�
./0+jÃÄ}~&ÃW,” [\�]:;`a<=R, 3-6-18, pp.363-364 (2003-3)
20. cGÅ, CDEF, �������$�, “*+,-V������ÆÇpqrYZyz&{|,”
[\�]:;`a<=R, 1-6-2, pp.225-226 (2003-3)
21. È�ÉÊ, CDEF, �������$�, “�ËvrÌ&� �HVWX��HÍ�,” [\�]
:;`a<=R, 1-6-14, pp.249-250 (2003-3)
22. ���, CDEF, �������$�, “�ÎÏÇ�� wx�./0 F0&ÐÑV����ÏÇ
¿%�&{|,” [\�]:;`a<=R, 1-6-15, pp.251-252 (2003-3)
23. )õ*+, ����, ½¾¿4vÀhÁj, “d���K�/��iR�Á j�a� ¥j���
!,” �[����~�²|t, 2-6-8, pp.313-314 (2003-3)
24. EF[ð�½¾¿ vÀhÁj�����, "�89� speech-to-speech ���Ð/2�/�^
Ü�����a�����<"," �7�¡§Ä��Çȶ·¡ó, SP2003-82 (2003-8)
25. RSKT, àÔ�Å, ��, �³�, ôõö÷, "�¶T����F��ê_`Á_w��
s�2��^ !�=µ�µ7�"#," �[����~�²|t, pp.221-222 (2003-9)
26. ÔWY$������½¾¿4vÀhÁj, "=�������tR%��¡/Î>��êj¡/
&',”�[����20030(), pp.233-234 (2003-9)
27. ���, ����, ½¾¿4vÀhÁj, ”*�+�jF0/,*�-¹�2�89����Ð��
������/�ï/&',” �¡¢ì��¶·¡ó, 2003-SLP-50-9 (2004-2)
28. RSKT, àÔ�Å, ��, �³�, ôõö÷, "����� T��ê_`Á_w��s
�2�Ð��/¨.a¡," �[����~�²|t, 1-7-5, pp.221-222 (2004-3)
29. WXE$, àÔ�Å, ��, �³�, ôõö÷, "GMM�Å/��a^_����01Å2
34�5"," �[����~�²|t, 1-7-26, pp.263-264 (2004-3)
30. ÔWY$, ����, ½¾¿4vÀhÁj, "d�67������/tR%��¡/ !�ap
X/89," �[����~�²|t, 1-7-10, pp.231-232 (2004-3)
31. EF[ð, ����, ½¾¿4vÀhÁj, "�89�Speech-to-Speech ���Ð�os�2�
a^_��/�a:¡�;," �[����~�²|t, 1-7-25, pp.261-262 (2004-3)
32. RSKT, àÔ�Å, ��, �³�, ôõö÷, "����� T��ê_`Á_w��s
�2��,-�/.a:¡ª<<"/&'," �7�¡§Ä��Çȶ·¡ó, SP2003-199,
pp.37-42 (2004-3)
33. WXE$, àÔ�Å, ��, �³�, ôõö÷, "GMM�Å/��a^_%/01Å2�
5/=s," �7�¡§Ä��Çȶ·¡ó, SP2003-200, pp.43-48 (2004-3)
34. �\]ì, ��, �³�, ôõö÷, “F0�`_h/ª>?@Ð �������ñîA
#/µ7�âB,” �[����~�²|t, 3-2-20, pp.355-356 (2004-9)
35. �bc, ����, ½¾¿4vÀhÁj, “=���/C©��C©D�F©E/ !,” �[��
��~�²|t, 2-2-6, pp.283-284 (2004-9)
36. eõf , ����, ½¾¿4vÀhÁj, “12fwF_¿�s©2bGd��������#�
¡/ !,” �[����~�²|t, 2-2-7, pp.285-286 (2004-9)
�
-=�F�G)� � HI � �JKLM� NJO�1. Tomoki Toda, Hiroshi Saruwatari, Kiyohiro Shikano, ''Voice Conversion Algorithm Based on
Gaussian Mixture Model with Dynamic Frequency Warping of STRAIGHT Spectrum,” Proceedings
of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2001),
SPEECH-P8, pp.841-844 (2001-5)
2. Mikiko Mashimo, Tomoki Toda, Kiyohiro Shikano and Nick Campbell, “Evaluation of
Cross-Language Voice Conversion Based on GMM and STRAIGHT,” Proceedings of 7th
European Conference on Speech Communication and Technology (EUROSPEECH2001),
pp.361-364 (2001-9)
3. T.Toda, H.Saruwatari, K.Shikano, ''High Quality Voice Conversion Based on Gaussian Mixture
Model with Dynamic Frequency Warping'', Proceedings of 7th European Conference on Speech
Communication and Technology (EUROSPEECH2001), pp.349-352 (2001-9)
4. Hiromichi Kawanami, Tsuyoshi Masuda, Tomoki Toda, Kiyohiro Shikano, "DESIGNING
JAPANESE SPEECH DATABASE COVERING WIDE RANGE IN PROSODY", Proceedings of 7th
International Conference on Spoken Language Processing (ICSLP2002), pp.2425-2428 (2002-9)
5. Kei Fujii, Hideki Kashioka, Nick Campbell, "Target Cost of F0 Based on Polynomial Regression in
Concatenative Speech Synthesis", Proceedings of 15th International Congress of Phonetic Sciences
(2003-8)
6. T. Shiraishi, T. Toda, H. Kawanami, H. Saruwatari, K. Shikano, "Simple Designing Methods of
Corpus-Based Visual Speech Synthesis," Proceedings of 8th European Conference on Speech
Communication and Technology (Eurospeech2003), pp.2241-2244 (2003-9)
7. H. Kawanami, Y. Iwami, T. Toda, H. Saruwatari, K. Shikano, "GMM-based Voice Conversion
Applied to Emotional Speech Synthesis," Proceedings of 8th European Conference on Speech
Communication and Technology (Eurospeech2003), pp.IV-2401-2404 (2003-9)
8. Kazuki Adachi, Tomoki Toda, Hiromichi Kawanami, Hiroshi Saruwatari, Kiyohiro Shikano,
"Perceptual Evaluation of Quality Deterioration Owing to Prosody Modification," Proceedings of the
4th International Conference on Language Resources and Evaluation (LREC2004), pp.2159-2162
(2004-5)
9. ��â�7, àÔ�Å, ��, ôõö÷, Nick Campbell, “H�EI J�êj�Å/��
a^_"/�UiR�%/=s,” �[����()ÌV��~�²|t, 1-P-17, pp.389-390
(2001-10)
10. #Ó6$, àÔ�Å, ��, �³�, ôõö÷, “µ7�a���¢m§�Å/��_�wÁ
_wd�K&'�Ð,” �[����~�²|t, 2-Q-13, pp.399-400 (2003-3)
11. W./0, ����, ôõö÷, ½¾¿4vÀhÁj, “LMÍ6:N�ÕÖ�a�O����,”
�[����~�²|t, 3-Q-12, pp.175-176 (2003-3)
12. ���, ����, ½¾¿4vÀhÁj, “�89����Ð����F0/PQ�8R�289�
wx,” �[����~�²|t, 2-Q-6, pp.323-324 (2003-9)
13. ���������½¾¿4vÀhÁj, "�89����Ð/��������iR�¡jF0
/,*�Å/��ï," �[����~�²|t, 2-P-16, pp.361-362 (2004-3)
14. W./0, ����, ½¾¿4vÀhÁj, ôõö÷, "O����@NAM ��C����Shh
òD"/�;," �[����~�²|t, 3-Q-1, pp.145-146 (2004-3)
15. �UV7, àÔ�Å, ��, �³�, ôõö÷, "�&'�_�w�s©2d�K&'�Ð
������T/&'," �[����~�²|t, 2-P-12, pp.353-354 (2004-3)
PQ./����� KRSTUVUW�
���XVG)� � HI � YJKLM� YJO�
���
�
���Z[G)�
*A�KZ[\]� HI � �JKLM� �JO�1. Yasuharu Den (Chiba Univ.). Are word repetitions really intended by the speaker? Proceedings of the
ISCA tutorial and research workshop on Disfluency in Spontaneous Speech (pp. 25-28). Edinburgh,
UK. Aug, 2001.
2. Michiko Watanabe (Univ. of Tokyo/JST). The usage of fillers at discourse segment boundaries in
Japanese lecture-style monologues. Proceedings of ISCA tutorial and research workshop on
Disfluency in Spontaneous Speech (pp. 89-92). Edinburgh, UK. Aug, 2001.
3. ³´âµ7(`<�/JST). ~§/U�Sò�hxVW����1õ�_/Ö�PX�F©E.Y15
P�[����ÌV��Z[t (pp. 85-90). 200109Ñ.
4. ³´âµ7(`<�/JST). U�Sò�hxVW����1õ�_4\]^/������F©E.
�[����()¶·d��~�²|tI (pp. 277-278). 2001010Ñ.
5. ³´âµ7(`<�/JST). _`~�����1õ�_�sa1�`h/ ¥. Y16P�[���
�ÌV��Z[t (pp. 145-150). 200209Ñ.
6. Yasuharu Den (Chiba Univ.). Some strategies in prolonging speech segments in spontaneous
Japanese. Proceedings of the ISCA research workshop on Disfluency in Spontaneous Speech (pp.
87-90). Goteborg, Sweden. Sep, 2003.
7. ³´âµ7(`<�/Macquarie�/JST)4:è°(±¯�)4bcde(`<�)4fQÄ�(`<�).
gVW/¤¥j1õ�_/Ö�a1.Y18P�[����ÌV��Z[t (pp.65-70). 200409
Ñ.
8. ³´âµ7(`<�/Macquarie�/JST)4:è°(±¯�)4bcde(`<�)4fQÄ�(`<�).
h9i/jké�,µ�lQ</Zm�1õ�_Dnoµí�. �[����20040()¶·
d��~�²|tI (pp. 463-464). 200409Ñ.
9. ³´âµ7(`<�/Macquarie�/JST)4:è°(±¯�)4bcde(`<�)4fQÄ�(`<�).
gVW/p_q41õ�_DtQR�$/lQr��noµí�. �[����20050v)¶·
d��. 200503Ñ.
�
-=�F�G)� � HI � �JKLM� ^JO�1. Michiko Watanabe (Univ. of Tokyo/JST). The function of filled pauses as discourse segment
boundary markers in Japanese monologues. Proceedings of ISCA workshop on Temporal Integration
in the Perception of Speech (p. 51). Aix-en-Provence, France. Apr, 2002.
2. Michiko Watanabe (Univ. of Tokyo/JST). Fillers as indicators of discourse segment boundaries in
Japanese monologues. Proceedings of Speech Prosody 2002 (pp. 691-694). Aix-en-Provence, France.
Apr, 2002.
3. Michiko Watanabe (JST) & Yasuharu Den (Chiba Univ.). When and why do speakers prolong their
speech segments? Proceedings of the 1st JST/CREST International Workshop on Expressive Speech
(pp. 71-74). Kobe, Japan. Feb, 2003.
4. Michiko Watanabe (Macquarie Univ./Univ. of Tokyo/JST). The constituent complexity and types of
fillers in Japanese. Proceedings of the 15th International Congress of Phonetic Sciences (pp.
2473-2476). Barcelona, Spain. Aug, 2003.
5. Michiko Watanabe (Macquarie Univ./Univ. of Tokyo/JST), Yasuharu Den (Chiba Univ.), Keikichi
Hirose (Univ. of Tokyo), & Nobuaki Minematsu (Univ. of Tokyo). Clause types and filled pauses in
Japanese spontaneous monologues. Proceedings of 8th International Conference on Spoken
Language Processing (pp. 2981-2984). Jeju Island, Korea. Oct, 2004.
�
�
�
�
�
GQ_`����KIabcdefghi��j�klm�� �
���XVG)� � HI � �JKLM� YJO�1. vÀhÁj ½¾¿, d���/��s��|"¤�/tu, pp 161-182, in ��j|"III, �v�
�Ö�, 2002.
2. Carlos Toshinori Ishi, "Analysis of autocorrelation-based parameters in creaky voice", Acoustical
Science and Technology, pp. 299-302, 2004
3. Nick Campbell, Donna Erickson, "What do people hear? A study of the perception of non-verbal
affective information in conversational speech", ��¶·Y8wY1Å, pp9-28
��
���Z[G)�
*A�KZ[\]� HI �nJKLM�nJO�1. Li Chiung Yang., "Prosody and Topic Structuring in Spoken Dialogue", In Proceedings of 6th ICSLP
2000, Volume 1, pp. 126-129, 2000.
2. Mihoko Teshigawara, Emi Zuiki Murano, "Articulatory Correlates of Voice Qualities of Good Guys
and Bad Guys in Japanese Anime: an MRI study", INTERSPEECH2004-ICSLP, pp1249-1252
3. Mihoko Teshigawara, Random Splicing: A Method of Investigating the Effects of Voice Quality on
Impression Formation, Speech Prosody 2004
4. Mozziconacci, S., : "The Expression of Emotion Considered in the Framework of an intonation
Model", ISCA (International Speech Communication and Assosiation) ITRW on Speech and
Emotion, pp.45-52, 2000.
5. Yang, L., Campbell, N., Linking Form to Meaning: The Expression of Emotion and Recognition of
Emotions Through Prosody, 4th ISCA Tutorial and Research Workshop on Speech Synthesis, in
CD-Rom proceedings, 2001.
6. Yang, L., Prosody as Expression of Emotion, in CD-Rom proceedings, ORAGE 2001.
7. Yang, L., Visualizing Spoken Discourse: Prosodic Form and Discourse Functions of Interruptions,
2nd SIGdial Workshop on Discourse and Dialogue, 2001.
8. Nick Campbell, "Speech & Expression; the Value of a Longitudinal Corpus", 4th International
Conference on Language Resources and Evaluation, pp183-186
9. Carlos Toshinori Ishi, "A New Acoustic Measure for Aspiration Noise Detection", 8th International
Conference on Spoken Language Processing, pp941-944
10. Ishi, C. T., Mokhtari, P. and Campbell, N.: "Perceptually-related acoustic-prosodic features of phrase
finals in spontaneous speech", in Proceedings of the 8th European Conference on Speech
Communication and Technology (Eurospeech'03), Geneva, Switzerland, pp.405-408. (2003).
11. Ishi, C.T., Campbell, N., Analysis of acoustic-prosodic features of spontaneous Expressive Speech,
First International Phonetics & Phonology, UNICAMP Campinas, Brazil, 2002, 9
12. Ishi, C.T., Hirose, K., Minematsu, N., "Using Perceptually-related f0-and Power-based Parameters to
identify Accent types of Accentual Phrases", Speech Prosody2002, 2002.4
13. Campbell N., "Labelling natural conversational speech data" pp273-274 ASJ2002, 2002.9
14. Campbell, N. and Mokhtari, P.: "Voice quality: the 4th prosodic dimension", in Proceedings of the
15th International Congress of Phonetic Sciences (ICPhS'03), Barcelona, Spain, pp.2417-2420.
(2003).
15. Campbell, N., "Towards a grammar of spoken language: incorporating paralinguistic information",
ICSLP-2002, Denver, Colorado.
16. Campbell, N., Analysis of emotional speech - what constitutes a representative corpus?, II Seminario
Paranese de Processamento de Sinais, 2001, UFPA, Brazil.
17. Campbell, N., Systems for Speech Synthesis, II Seminario Paranese de Processamento de Sinais,
2001. UFPA, Brazil
18. Campbell, W.N., Marumoto, T., "Automatic labelling of voice-quality in speech databases for
synthesis", In Proceedings of 6th ICSLP 2000, pp. 468-471, 2000.
19. Nick Campbell, "Getting to the Heart of the Matter; Speech is more than just the Expression of Text
or Language", 4th International Conference on Language Resources and Evaluation, pp - ( *
Keynote Speech)
20. Nick Campbell, "Perception of Affect in Speech-towards an Automatic Processing of Paralinguistic
information in Spoken Conversation", 8th International Conference on Spoken Language Processing,
pp881-884
21. Nick Campbell, ACCOUNTING FOR VOICE-QUALITY VARIATION, Speech Prosody 2004
22. Nick Campbell, Communicating affect in our speech-analysis of a large acoustic database, Seminaire
de l’AFCP/l3 Journee Parole Expressive
23. Nick Campbell, Listening between the lines; a study of paralinguistic information carried by
tone-of-voice, International Symposium on Total Aspects of Languages (TAL2004)
24. Nick Campbell, Modelling affect in speech communication, The 1st Chinese Conference on
Affective Computing and Intelligent Interaction, Dec 2003
25. Nick Campbell, The role of speaker-listener relationships in determining speech prosody, 6th NWCL
International Conference PROSODY AND PRAGMATICS
26. Erickson, D., Mokhtari, P., Menezes, C. and Fujino, A.: "Voice quality and other acoustic changes in
sad speech (grief)", in Proceedings of the IEICE/ASJ/IEEE Interdisciplinary Workshop on Speech
Dynamics by Ear, Eye, Mouth and Machine, Kyoto, Japan, pp.43-48. (2003).
27. Erickson, D., Ohashi S., Makita Y., Kajimoto N., Mokhtari P., "Perception of naturally-spoken
expressive speech by American English and Japanese listeners" pp31-36 JST/CREST Workshop2003
28. Mokhtari, P., Iida, A., Campbell, N., Some articulatory correlates of emotion variability in speech: a
preliminary study on spoken Japanese vowels, ICSP-2001, pp.431-436, 2001.
29. Mokhtari, P., Pfitzinger, H. R. and Ishi, C. T.: "Principal components of glottal waveforms: towards
parameterisation and manipulation of laryngeal voice-quality", in Proceedings of the ISCA Tutorial
and Research Workshop on "Voice Quality: Functions, Analysis and Synthesis" (Voqual'03), Geneva,
Switzerland, pp.133-138. (2003).
30. ÆOfÇ, ½¾¿ vÀhÁj, d�«1�8R�2d�`fóhò !, 3-10-10, pp347-348, �
[����20020()¶·d��, 2002.9
31. ÆOfÇ, ½¾¿4vÀhÁj, "��<d��Å2j�2d�xy�z/ ! -�{©�j��K��
W��-", �[����20030(), pp.275-276(200309Ñ)
32. ÆOfÇ, ½¾¿4vÀhÁj, =�������¥|���`_h/d�uhp, �[���
� 20040 v)¶·d��~�²|t(229-230), 200403Ñ
33. ÆOfÇ, ½¾¿4vÀhÁj, }ÔK~, ��������d�`fóhò !, �¡¢ì��
¶·¡ó(2002-SLP-40-19), pp.109-114, 2002.
34. ÆOfÇ, ½¾¿4vÀhÁj, ¥|���`_hbMÕÖ<"(Ö�Zm)
35. ÆOfÇ, ½¾¿vÀhÁj, JST/CRESTd�¶·ñØ�2¿x/��=�ê_`Á_w, 3C5-11,
:�µ7��2002, 2002.5
36. íîïð, "�89����Ð/2�/���Áj�¡�a�ä8����<", " øïùú
û�ÇÈ��ü���B²| (2002- �3)
37. Mokhtari, P., Perceptual validation of a voice quality parameter AQ automatically measured in
acoustic islands of reliability, �[����20020v)¶·d��~�²|tpp.401-402, 2002.
38. íîïð, ����, Nick Campbell, "�$�������������, " �[����()
ÌV��~�²|t, pp.261-262 (2001-10)
39. ë�ìbíî7(Ó�ÈjØw��(½¾¿� vÀhÁj("Intraspeaker Voice-Quality Variability
with Interlocutor", �[����20040()¶·d��(pp279-280
40. Ó�ÈjØw�� "Creakyd�/�����/ ! ", �[���� 20030() ,
pp.235-236(200309Ñ)
41. Ó�ÈjØw��, ���Y/bM&Ö���������_`/���, ��¶·�2004/12/10
42. Ó�ÈjØw��, �Q���/��Y�,\µ������_`/&'�, �[����2004
0()¶·d��,
43. Ó�ÈjØw��, ½¾¿ vÀhÁj, i/67�±Þ, �[����20040v)¶·d�
�
44. Ó�ÈjØw��, ½¾¿ vÀhÁj, ���¤Td���/�����/ !, 1-10-23,
pp275-276, �[����20020()¶·d��, 2002.9
45. Ó�ÈjØw��, ����������i/��4����/ !, pp311-312�[���
�2003v)¶·d��
46. Ó�ÈjØw��(���Y/bM&Ö���������_`/���(�7�¡§Ä��2004
08Ñ(Ä�Ç¡Vol.104 No.253(pp19-23
47. Ó�ÈjØw��(�Q���/��Y�,\µ������_`/&'�(�[����2004
0()¶·d��(pp295-296
48. ë�ìbíî7(Ó�ÈjØw��(½¾¿� vÀhÁj("Intraspeaker Voice-Quality Variability
with Interlocutor", �[����20040()¶·d��(pp279-280
49. Campbell, N.: "Voice characteristics of spontaneous speech", �[���� 20030() ,
pp.231-232 (200309Ñ)
50. ½¾¿ vÀhÁj, :��DØ�¾xj��jQ, �KT����¤, Y4PK1�_�×s:/�, Ø
�¾x/� 2003 Sep
51. ½¾¿4vÀhÁj, "Predicting the prosody of speech for synthesis", Y14P�Ã,*��,N�
»\���, 2002.11
52. �bc(���T(½¾¿� vÀhÁj(�=���/C©��C©D�F©E/ !�(�[�
���20040()¶·d��(pp283-284
�
-=�F�G)� � HI � �JKLM� oJO�1. Mokhtari, P., Pfitzinger, H. R., Ishi, C. T. and Campbell, N. (2004). "Laryngeal voice quality
conversion by glottal waveshape PCA", in Proceedings of the Spring-2003 Meeting of the Acoustical
Society of Japan, Atsugi, Japan, Paper 2-P-6
2. Iida, A., Campbell, N. "Developing an AAC Device with Natural Speech Output - The First Stage:
Deciding on Discourse Labels, " CREST International Workshop on Expressive Speech Processing,
Kobe, Japan, pp. 103-106.
3. Li-chiung Yang. 2001. "Prosody as Expression of Emotion". Proceedings of ORAGE, 2001,
Aix-en-Provence, France, June 2001
4. Li-chiung Yang., "The Expression and Recognition of Emotions Through Prosody". Proceedings of
ICSLP2000. Volume 1, pp. 74-77, 2000.
5. Carlos T.C., Towards Automatic Detection of Creaky Voice, Workshop on Speech & Sound
Processing in Relation to Auditory Representation, 2003,Aug
6. Ishi C.T., Campbell N., "Acoustic-prosodic analysis of phrase finals in Expressive Speech" pp85-88,
JST/CREST Workshop2003
7. Ó�ÈjØw��, Using Perceptually-related F0 and Power-based Parameters to identify Accent
types of Accentual Phrases� Speech Prosody 2002, pp.407-410., 2002.
8. Campbell, W.N., "Expressive Speech Processing: The JST CREST ESP Project", 1st International
Symposium User-System Interaction, Institute for Perception Research (IPO) Holland, 2000.
�
pqrst��:'uv����� Kw�xyTUTUz�
���XVG)� � HI � �JKLM� �JO�1. Iida, A., Higuchi, F., Campbell, N., Yasumura, M., “A corpus-based speech synthesis system with
emotion,” Speech Communication, Vol. 40/1-2 pp. 161-187, 2003.
2. Iida, A. and Campbell, N., “Speech database design for a concatenative text-to-speech synthesis
system for nonspeaking individuals, ” International Journal of Speech Technology, Vol. 6, Issue 4,
pp.379-392, 2003.
3. ©Ôuâ, £��KL, ��|:, ½¾¿4vÀhÁj, ¥O§�, �=���/2�/����
�Ðwu×/�dj"#�, �Ú_Fhfh`12_w��²|�,� Vol.2, No.2, pp. 169-176,
2000.
4. Iida, A., “A Study on Corpus-based Speech Synthesis with Emotion,” �ª§¨����ü��4�
êõJ¶·û�Ö²|, 2002.
5. ©Ôuâ, �����j�:#�3��2���Ð<"j�óÚ½�_h��%/¦s�,
Sophia Linguistica, No. 50, pp.179-195, 2004.
�
�
�
�
�
��������
���� �� ���� ������ ����1. Iida, A., Iga, S., Higuchi, F., Campbell, N., Yasumura, M., “Designing and Developing a
Conversation Assistive System with Speech Synthesis and Emotional Speech Corpora,” In
Proceedings of ISCA (International Speech Communication and Association) ITRW on Speech and
Emotion, Belfast, U.K., 2000, 9, pp. 167-172.
2. Iida, A., Mokhtari, P., Campbell, N., “Acoustic correlates of monosyllabic utterances of Japanese in
different speaking styles,” In Proceedings of 15th ICPhS, Barcelona, Spain, pp. 2861-2864, 2003.8.7.
�
�������� ���� ������ ����1. Iida, A., Campbell, N., “A database design for a concatenative speech synthesis system for the
disabled,” In Proceedings of ISCA 4th International Workshop on Speech Synthesis, Perthshire, U.K.,
2001,8.31, pp. 188-194.
2. Iida, A., Sakurada, Y., Campbell, N., Yasumura, M., “Communication aid for non-vocal people using
corpus-based concatenative speech synthesis,” In Proceedings of Eurospeech 2001, Aalborg,
Denmark, 2001, 9, 7, pp. 2409-2412.
3. Iida, A., Campbell, N., “Developing an AAC Device with Natural Speech Output - The First Stage:
Deciding on Discourse Labels,” The First CREST International Workshop on Expressive Speech
Processing, Kobe, Japan, 2003, 2, 22, pp. 103-106.
�
������ !� �
��� �
• 2000,11,10 ������� �� ���������IT�����6��
• 2001,1,1 ������ !���"#$%&'(�)*+��39���
• 2001,1,1 ,-(��� ./0 IT120� �C 3��
�
�"#�
��34�
�
$%&'�
5678(�
• 2001,1,27 RKB 9�:;� 8(<=>?@ABCDE�%FGHIJKLM�
��NOPQRALS���STUV�
WXYZ�
�
• 2003,7,9 �[\]�/�O3Q^�P_�`a>bcCde@fghi�j
k�lmnopq�r@sBC�tC
• 2003,7,26 �u^vwxyz{|}~4P�������`a>bcCde@
fg�����lmpq����y8�����25����p+
• 2004,7,24 ���v����������d�5��jk���������
13�������q���+� ¡¢�£¤�lmpq
�
()*+,-�.� �/012�34526�
¥¦�§¨©Cª�«���¬q�`a>bcCde@������Q®q¯�°�¯
±�²�³´µ¶E·�OPP_����¸/¹�º�»¼\0½«¾½¿À/³³0�
Q®¾ÁÂy/ÃP/�ÄŽ�¸¹/ÆÇ_ÈQ0ÉÊ˽ÌÍPÎ�ÊϽ�ÐÑ
ÒÓ�ÔÕÖ×K� �«��½ØÈVÙÚÛÜ�ÔÕÖ×K� ���ÝCÞ�ßàV
ãäåÝæ! çÓèÕ�é¥êëìÖ×ØÙì,íî�ç!Nï�n^YãäåÝæ�ð
ñò! Ö×ØÙìóô´Ù�ç!N���õö��Y÷ø�ùÈ&!4úûó�Å5Ë
��rÂ4ü�ý�þ��ý�Å5Sb¿ �º¿�b¶��T`����¿�°Ë�
�
���XVG)� � HI �oJKLM� �JO�1. mnoð4�Ôz74W�Eð4Q[|â7� 20010�Ñ\�� �Èu�gjñØ�êõ�, ��[�µ
û��Y18P��d�²|t , pp. 52-53.
2. u vw� 20010\Ñ30� �¡<��R�¢£�,�ù¤���åRi�sY¥¦ÌV�åR�
��È�§ , ö¨��Ö�Ý, pp. 204-207.
3. Q[|â7� 2002011Ñ%�� �\Í�©g/ñØ�êõ���`J�J�[RýªVÂhp�«
ײ|t@nwC �WV¬3xVR�ü�pp.124-131.
4. Q[|â74mnoð� 20020%Ñ22�� ��[R/��������Qm�j(�[R�5$/ì
�1�, Department of Japanese Studies, The Chinese University of Hong Kong and Society of
Japanese Language Education (ed.), Quality Japanese Studies and Japanese Language Education in
Kanji-Using Areas in the New Century, Himawari Publishing Company, pp. 455-461.
5. W�Eð� 2002011Ñ%�� �WVR/M^���®�¤���ѯiR �Y31w�Y12Å�
pp.74-79.
6. W�Eð4uvw4�Ôz7� 20020+Ñ31�� ����`_hjR/kl��Chinese Language
and Culture, No.3, pp.1-34.
7. u vw 2002010Ñ15� �WVR/��sd�°1j�â/^±�, ��Ð140�Ã,*�
�,N�»\���~�²|t , S, 49.
8. mnoð(²)� 2002011ÑG�� ���K�j�X��/iR� . `<: ³F´µ¶.
9. mnoð� 2003010Ñ10�� �Í�jµ�·�óÚ½Èuõ¸4wx�u�_�, �¹|º·�»jý
¼/¶·· , Y48w, Y12Å, `<: º½Ý, pp. 54-64.
10. �Ôz74uvw4W�Eð 20030+Ñ31� ��±^\Í�©����|"j�����[R
|"��(²), ��[R|" , Y+w, Y%Å, `<: �v��Ö��pp.100-116.
11. Q[|â7� 20040+Ñ31�� �Ö¾FxI��1Y�¿nD�âfhxÀ_hs2Á�89��
��ÔÂÃ)7���[R¶·Sh`_¡ó , Y12Å, �ÔÂÃ)7���[R¶·Sh`_,
pp. 41-53.
12. Sadanobu, Toshiyuki 20040ÄÑ30� "A natural history of Japanese pressed voice," ���¶
· , Y\w, Y%Å, �[����, pp. 29-44.
13. mnoð� 20040&Ñ30�� ��[R/�Qm·2Á�89·�, ��|"¶·�(²), �|"j�
�IV , `<: �v���, pp. 35-52.
14. u� vw� 20040&Ñ30�� �M^ÅR|����WVR×_Æä^ a /��jkl¸67�, �
�|"��(²), �|"j��Ç , `<: �v��Ö�, pp. 53-76.
15. mnoð� 2004010Ñ25�� ����óÚ½�_hýª/ÈÉ#jÊË�, ��[Rýª , Y
123Å, �[Rýª��, pp. 1-16.
16. u� vw� 2004011Ñ��� �¡<��Qn�g¢£�Ì��� !�, �[WVR��(²),
�WVR� , Y251Å, pp. 1-13.
17. ����� 2005����� �� ��� , ����, �34�, ���, ��: �����, pp.
30-37.
�
��������
���� �� ���� ������ ����
� � !"#�� $%%$�&�$%� '()�� *� +,!+��--.���/�.012345�
-6 �
�
$ � ����� $%%7�����%� 8�9:;�<�=>? *� �<@A.01+5+�@AB6 �
�
�������� � ���������� ����
1. ����CD8EFG� 2000�11�26� 8��'(HIJK9:; LM8�-NO�P
QR , The Fifth International Symposium on Japanese Studies and Japanese Language Education
01ST!U�-(!+)6
2. ����� 2001�10�27� .V�!�WX XY , Z[��-.�26\�.]^_`ab
cdefLgh�U 01ij�-6
3. !"#�� 2002�8�4� !+��kl , mn@A]^_`abckl�mn 01opqr
+��-6
4. ����� 2002�s�15� tuvw�xyz , !+{\|}~����, !�8U
�C8���@A+,��^��01��r+�-�(!+)6
5. Sadanobu, Toshiyuki 2003. 2. 22. "Expressive speech and grammar: with special reference to
pressed voice," The 1st JST/CREST International Workshop on Expressive Speech Processing, Japan
Science and Technology Corporation, pp. 55-60.01op�-6
6. !"�GC����� 2003���22� 8�C!+�mVHIJK�����CL�����
��� , The 1st JST/CREST International Workshop on Expressive Speech Processing, Japan
Science and Technology Corporation, pp. 79-84.01op�-6
7. ���C!"#�C���G� 2003�2�22� !+����� ¡¢��£�yz�¤¥¦
��§; , The 1st JST/CREST International Workshop on Expressive Speech Processing, Japan
Science and Technology Corporation, pp. 49-54.01op�-6
8. Sadanobu, Toshiyuki 2004. 8. 22. "Voice quality and grammar: with special reference to Japanese
pressed voice," The 6th symposium of Nordic Association for Japanese and Korean Studies
(Goteborg, Sweden)
9. ����� 2004�10�30� .V�<L�V¨©�_¥ Hª«h , Z[��-.�29\�.
]^_`abc�V¨©�_¥L.V�< 01�¬r+��-6
10. ���GC!"�G� 2004�10�30� ¨©�®¯K¨©�,®¯°«¨©� , Z[��-.�
29\�.]^_`abc�V¨©�_¥L.V�< 01�¬r+��-6
11. ���� 2004�10�30� r+±-NOH�V¨©�_¥²³´µ�¯K¶:· , Z[��-
.�29\�.]^_`abc�V¨©�_¥L.V�< 01�¬r+��-6
12. !"�G� 2004�11�7� 8�C!+�=>�VHIJK¸¹º»��� , 8!+�-
.�54\¼+�.01�¬�-6.
�
�
��01V������ K������������H�O�
� ���XVG)� � HI � oJKLM� �JO�1. Rilliard A & Aubergé V (2003), Prosody evaluation as a diagnostic process: Subjective vs. objective
measurement. International Journal of Speech Technology, Kluwer Academic Publishers.
2. Aubergé V. (2002), Prosodie et émotion, 2e Assises nationales du GDR I3 (Information Interaction
Intelligence), Cépaduès-Editions, 263-274
3. Aubergé V., Cathiard M. (2003), Can we hear the prosody of smile? Numéro special Emotional
Speech, 40, Speech Communication Review.
4. Audibert N, Aubergé V, Rilliard A., Rossato S. (2003), Capturing the emotional prosody in live but in
lab, Prosody & Pragmatics.
5. Morlec, Y., G. Bailly, & V. Aubergé (2001) Generating prosodic attitudes in French: data, model and
evaluation. Speech Communication, 33(4): p. 357--371.
���Z[G)�
*A�KZ[\]� HI ��JKLM� �JO�1. Aubergé V. (2003), “Expressive Speech in France”, 1st JST/CREST Int WS on Expressive Speech
Processing, Kobe, 10-19.
2. Aubergé V. (2003), Expressions, attitudes et expressivité: une architecture cognitive distribuée pour
les voies parlées des émotions. Interfaces Prosodiques 2003, Nantes, 319.
3. Aubergé V. (2003), Integration of emotional, pragmatic and meta-linguistic affective information in a
superpositional functional Gestalt model of prosody, Prosody & Pragmatics, Preston.
4. Aubergé V. (2002), A Gestalt morphology of prosody directed by functions : the example of a step by
step model developed at ICP, Proc of 1st Int Conf on Speech Prosody 2002, Aix-en-Provence, France,
151-155
5. Aubergé V (2001), Le sourire parlé, Actes du Colloque Emotions, Interactions et Développements,
121-125.
6. Aubergé, V. (2000) Modélisation de la prosodie par formes globales : amont ou aval de la phonologie
tonale in Journées d'Etudes sur la Parole. Aussois - France. p. 281-284.
7. Aubergé, V. and L. Lemaître. (2000) The prosody of smile. in ISCA Workshop on Speech and
Emotion. Newcastle - Ireland. p. 122-126.
8. Aubergé V., Audibert N., Rilliard A., (2004) Acoustic Morphology of Expressive Speech: What about
Contours, Int Conf on Speech Prosody, 91-95, Nara.
9. Aubergé V., Audibert N., Rilliard A., (2004) E-Wiz: a trapper protocol for hunting the expressive
speech corpora in Labs, 179-182, LREC , Lisbon.
10. Aubergé V, Audibert N, Rilliard A. (2003) Why and how to control the authentic emotional speech
corpora, Proc of Eurospeech, Genève, 185-188.
11. Aubergé, V & Rilliard, A. (2000), Prosody evaluation: quality measurement or diagnostic?,
Workshop COST 258, Stockholm, février.
12. Brichet C. & Aubergé V. (2004) Domaine de la fonction de focus dans la perception prosodique, JEP,
Fès.
13. Brichet C. & Aubergé V. (2002) La prosodie de la focalisation en français : faits perceptifs et
morphogénétiques, XXIVèmes Journées d'Étude sur la Parole, Nancy, 24-27 juin 2002
14. Brichet, C. & V. Aubergé. (2001) La focalisation en français : morphologie de la prosodie. in Actes
des Journées Prosodie. Grenoble - France
15. Rilliard A. & Aubergé V. (2004) Evaluating an authentic AV expressive speech corpus, 175-178,
LREC.
16. Rilliard A. & Aubergé V. (2002) Towards a linguistic validation of a prosodic generation model,
Proceedings of the first International Conference on Speech Prosody, 607-610.
17. Rilliard, A. and V. Aubergé. (2001) Mesure de l'intelligibilité de la démarcation prosodique, Actes des
Journées Prosodie, Grenoble, 483-487.
18. Rilliard, A. & Aubergé, V., (2000), Perception and Analysis of a Reiterant Speech Paradigm: a
Functional Diagnostic of Synthetic Prosody, Proc of 2nd International Conference on Linguistic
Resources and Evaluation, Athènes, Grèce, pp.661-664.
19. Rilliard, A. & Aubergé, V., (2001), Prosody evaluation as a diagnostic process: subjective vs.
objective measurements, 4th ISCA Workshop on Speech Synthesis, Atholl, Scotland.
20. Rilliard, A. and V. Aubergé. (2000) Perception and Analysis of a Reiterant Speech Paradigm: a
Functional Diagnostic of Synthetic Prosody. in International Conference on Language Ressources
and Evaluation. Athens - Greece. p. 661-663.
�
-=�F�G)� � HI � YJKLM� nJO�1. Aubergé V (2001), Prosodie et fonctions: libertés morphologiques et contraintes fonctionnelles, Actes
des Journée Prosodie, 35-39
2. Aubergé V & Lemaître L. (2000), Audio-visual expression of amusement : some over-added
information, Workshop COST 258, Aix, septembre.
3. Audibert N, Aubergé V, Rilliard A (2004), EWiz : contrôle d’émotions authentiques, JEP, 49-52,
Fès.
4. Audibert N, Aubergé Rossato S. (2004), Paramétrisation de la qualité de voix : EGG vs. filtrage
inverse, JEP, Fès.
5. Rossato S., Audibert N. & V. Aubergé. (2004) Emotional Voice Measurement : A Comparison of
6. Articulatory-EGG and Acoustic-Amplitude Parameters, Int Conf on Speech Prosody, 53-57, Nara.
�
����������� KIabcdefghi��j�klm�� �
���XVG)� � HI � �JKLM� �JO��k� ½¾¿� vÀhÁj(���iR�¡��FYÍ/�����_`_�(|"j��Ç��|"¶·
�²(�v��Ö�(�� �Î,��
�
k� ½¾¿� vÀhÁj(���|"/ÏY¤�/�Ð�F©E�(|"j��Ç��|"¶·�²(�
v���(�� - Π->�
�
���Z[G)�
*A�KZ[\]� HI � nJKLM��JO�1. Campbell, N., "Building a corpus of natural speech - and tools for the processing of expressive
speech - the JST CREST ESP Project", �_¿¾ñ���i¯/û�j���, `<��, 2001
2. Campbell, N., "Databases for Concatenative Speech Synthesis", Univ. of Munich, 2000
3. Campbell, N., "Databases of Emotional Speech", in Proc ISCA (International Speech
Communication and Association) ITRW on Speech and Emotion, pp. 34-38, 2000.
4. Campbell, N., “Future Directions for Speech Synthesis-a personal view”, IEEE speech Coding
Workshop, Oct 2002 (* Keynote Speech)
5. Campbell, N., "Integrating Different Prosodic Systems in Speech Synthesis", Prosody, 2000: Speech
recognition and synthesis workshop, 2000
6. Campbell, N., "Recording Techniques for Capturing Natural Every-Day Speech" LREC2002, May
2002
7. Campbell, N., "What Type of Inputs will we need for Expressive Speech Synthesis?", IEEE2002
Speech Synthesis workshop, Santa Barbara, 2002.9
8. Campbell, N., "tap2talk: an Interactive Interface for Large Speech Corpora" pp223-224 ASJ2003.3
9. Campbell, N.: "Towards Synthesising Expressive Speech; Designing and Collecting Expressive
Speech Data", in Proceedings of the 8th European Conference on Speech Communication and
Technology (Eurospeech'03), Geneva, Switzerland, pp.1637-1640. (2003).
10. Gerard Bailly, Nick Campbell, Bernd Mobius, ISCA Special Session: Hot Topics in Speech Synthesis,
8th European Conf. On Speech Communication and Tech.(Eurospeech2003)
11. Nick Campbell, "Advances in Conversational Speech Synthesis", Advances in Speech Technology
2004(11th International Workshop),
12. Nick Campbell, "Extra-Semantic Protocols; Input Requirements for the Synthesis of Dialogue
Speech", Affective Dialogue Systems, pp221-228
13. Nick Campbell, "Speech & Expression; the Value of a Longitudinal Corpus", 4th International
Conference on Language Resources and Evaluation, pp183-186
14. Nick Campbell, Specifying Affect and Emotion for Expressive Speech Synthesis,
CICLing-2004(Fifth International Conference on Intelligent Text Processing and Computational
Linguistics) (* Keynote Speech)
15. Nick Campbell, "Synthesis Units for Conversational Speech - Using Phrasal Segments -", �[
����20040()¶·d��(pp337-338
16. Nick Campbell, User Interface for an Expressive Speech Synthesiser, �[���� 20040 v)
¶·d��~�²|t(pp253-254), 200403Ñ
17. ½¾¿� vÀhÁj(��_�wÁ_w���ÐÇÈ/MX[ 4Ñ]―�I_���_�w�a��
��Ð―�(�7�¡§Ä���Vol.87 No.6 pp497-500
18. ½¾¿4vÀhÁj, Collecting Really Spontaneous Speech, ����-¹�2��iR�¡¢ì/
¨1¡�Y2PÌÍ�§, pp.155-158, 2002.
19. ½¾¿4vÀhÁj, DAT vs. Minidisc Is MD recording quality good enough for prosodic analysis?,
�[����20020v)¶·d��~�²|t, pp.405-406, 2002.
20. Mokhtari, P. and Campbell, N.: "Quasi-syllabic and quasi-articulatory-gestural units for
concatenative speech synthesis", in Proceedings of the 15th International Congress of Phonetic
Sciences (ICPhS'03), Barcelona, Spain, pp.2337-2340. (2003).
21. Mokhtari, P., Campbell, N., "Automatic Detection of Acoustic Centres of Reliability for Tagging
Paralinguistic Information in Expressive Speech", LREC2002 2002.5
22. Mokhtari, P., Campbell, N., "Automatic Characterization of Quasi-Syllabic Units for Speech
Synthesis based on Acoustic Parameter Trajectories: a proposal and first results", 1-10-5, pp 233-234,
ASJ2002, 2002.9
23. Mokhtari, P., Campbell, N., "Some Properties of the Glottal AQ Parameter Automatically Measured
in Expressive Speech", LP2002
�
-=�F�G)� � HI � �JKLM� �JO�1. Carlos Toshinori Ishi, "A New Acoustic Measure for Aspiration Noise Detection", 8th International
Conference on Spoken Language Processing, pp941-944
2. Mokhtari P., "A proposal for acoustic-articulatory gestural units in concatenative speech synthesis"
pp253-254, ASJ2003.3
3. Campbell, N., "Recording and Storing of Speech Data" LREC2002, Satellite Workshop, May 2002
4. Mokhtari P. "Automatic processing of expressive speech: physiologically-motivated but robust
analysis" pp97-102, JST/CREST Workshop2003
�
2���G)�
� � � Long-Term Research
Since 1989, the Advanced Telecommunication Research Institute near Kyoto has conducted
some of the world's most significant, long-term research in human-machine communications.
Now the Institute is being restructured, and more than a decade of quiet research is coming to
fruition ... pp.12-23, Sept 2001.
bcd�effgh�
Computers get emotional
Kentucky.com, KY - Dec 9, 2004
... Nick Campbell, a speech synthesis researcher at the Advanced
Telecommunications Research Institute in Kyoto, Japan, says it
helps to understand how the speech ...
Synthesizing human emotions
Baltimore Sun (subscription), MD - Nov 29, 2004
... Nick Campbell, a speech synthesis researcher at the Advanced
Telecommunications Research Institute in Kyoto, Japan, says it
first helps to understand how the ...
Computers get emotional
Lexington Herald Leader, KY - Dec 9, 2004
... Nick Campbell, a speech synthesis researcher at the Advanced
Telecommunications Research Institute in Kyoto, Japan, says it
helps to understand how the speech ...
No laughing matter
The Scotsman, UK - Dec 9, 2004
... Nick Campbell, a speech synthesis researcher at the Advanced
Telecommunications Research Institute in Kyoto, Japan, says it
first helps to understand how the ...
No laughing matter
Electric New Paper, Singapore - Dec 8, 2004
... Mr Nick Campbell, a speech synthesis researcher at the
Advanced Telecommunications Research Institute in Kyoto, Japan,
says it first helps to understand how ...
�������HI � oJKLM� ������ �JO�
* I �d�$s¼½¾¿½À�ÁÂÿĽ¾ �Ò�Ã�«½À�ÓÙÉÉ�
'Ós���/���¨©ÄÔ#LÕµ» �Ömµ�2�/×z�apX/2
�/ñØò�×(��ÄÅ/���¨©ÄÔ#LÕµ» �Ömµ�2�/
×z�apñØò�×(T�p�`|�gØÕÖ×z�apñØò�×�
Ö�ÙÅs �� Ú���,>��
��s �� k��k���
�
d�$s½¾¿� vÀhÁj�
'Ós���Ð×znp�hÛÚ_`ñØò�×�
Ö�ÙÅs ��,Ú���� ��
��s ��,k�,k� �
�
d�$s�_Ü×��¿`g(Ó�ÈjØw��(� �
Ü_x×¾x1õÝõhi_(½¾¿� vÀhÁj�
'Ós�a�êj(ÐD"(�a^_D"(Þp�XY�/2�/�hÛÚ_`ñØ
ò�×(´ßñØò�×�àè�2àèáÍ(np´ßñØò�×�a�ñØò
�×éY2�hÛÚ_`�
Ö�ÙÅs ��,Ú�,� �>�
��s ��,k� k ��
�
d�$sÆOfÇ(½¾¿� vÀhÁj�
'Ós����¡ !×znpX/â¢ì×z�
Ö�ÙÅs ���Ú�����-�
��s ���k�,k�.�
�
d�$sÓ�ÈjØw��(½¾¿� vÀhÁj�
'Ós��ê_`/��YãäbM&Ö×z�ap��YãäbM&ÖñØò�×�
Ö�ÙÅs ���Ú �-..-�
��s ���k�>k���
-LM���� �� Ú���,>�/åæçÖ��
�Ö�ÙÅs¼«z��¼�,���>���
���s ��,k� k ��
� � � � �èmVsJ�gÈ(È]é�
� � � � �
����
���� ������@� ¡D¢�£K¤ �D���3x@�¥¦§¨K�
����©ª�
H5iJ�jklmnopqrst�uXvwxyz{�|mn}w�~�u�t�
�*+���2�w��t��r��k����w�mn���������#u�
s}wu�����u����xn����#��m�x�o��x�
����*+������r����q�� ¡��H�¢Q£¤¥¦�§¨q©
ªtG«¬�®¯�°�*+�±�(²³´w��t�ªµs�¶Uk·¸¹�n�
ºu»_�¼®½�a¾a¿k©ªÀUÁ��xn�
Â:.GÃP����ÄÅuÆx�t̂ Çuw��£���ÈÉq©ª�Fq©nkt
Ê��H5iJ�jkËxt©nx£H5iJ�jkÌÍÎn�£tÏÐÑÒÓÔÕÖ
�^Çw¼×�ØÙÚªÛq©ªÜÝwrnHÔ��q©n�
����uÆx�tÎÙuÞ[ußànáâkq��xrx�£ãäq©nkt�å
æçE�èqtÁ�uéarÈ·����TU*+wxy�ª£ê�ët��r*+�
°�±x�wxmn�ì5í�í�rHIJ0K�LM5:;<�tî���2�k
��x()(²*+�·ß���
ïð���������ñ�wòut��róP��ô����õ������kt
Ê��öÆ����÷�E£tøù�^kúûu¶Uq�n��w��tüý�TU*
+þ�wrn��x�m��tG«�q�n}w���u�sÎn�ªtG«q�rx
}w�~�uÎn�kî ¡�*+���2��?�q©ªt�æq©nw�mn�
�
��«� ��¬)EC®��¯>°9±²³´µ¶®�
���������� �
�¦£t�°w����÷�EÎn q©��kt����/�����/�÷
���/�����/r�����k��q©nw�½���ok�½�tÊ�-�ª
u��kµs�����n}wkq���G«q�t}��°�¶Ý����u�p
�k��ª�xw�ykt}��æk�VÇ�è½½nw�����½�t � !
���uÆx�t}��VÇ£"#æu$%k©nw�m����&xq©n�Ê�
èt ���'(½�£)*�+Æ*+R��q©nkt}��VÇq��,u��
�t}��'(��-Î}w.�/�0ª1p�w���n�
���������
^23�45uÆx�t:;���*+£64�45�ª^7q©n�8t*+3�
µs£H�¢QÒ�j�9�:�;}��<6�=|u>ª�p��t_µs�?õ*
+/ÃÔ�í@j�k}�AB�CuÏâ9r*+DE�F��tG�������H
Iu�$%k©���
JK½�LYr��/�¨r�MtºuAB�Nê?õ*+/�OP�OQ�Ëy}
wkq���î���2��Ò�j�R�S���¶UwòutT��mU�*+
Ô��V�WuAB����m��}�AB£�*+¼®uX��tÁ�uYZ
u[m��n�}�\7?õ*+/�kÃk�Ê�"q]�æ^_�+Æq©ëy�
`3uÆx�t"#"aÖb�\]u¶U�tî���2��*+ckT���
���u�.În}wkq���drn�E�ef�gÝu�hwijuî*+�k[
�[l�t�"#m$u���t���*+��mU�nEq������o�up
�n}wqtÁ�un7r"#æq�r�s�q���
Y;uÆx�tG«t"aKqi �Ò�j6�QF�kçË]q©n�}�£î
���2�k}��°u�tuuª¸�o��q©n�v�uw���¶w£�xë�
oktTuyx��}�*+�µ�E�µ��æz{¨��|În}w£tµ�r¶
w��n}wuÆrkn�}~æ^Ç�ÄÅ��-�t�x�£tXîæÄÅ��Lu
rn�
������������
*+��jut�xH5iJ�j��¼��<�În�ªt~�r��tø���<
�������u����x�Ê}qtÔ��5���Ã�Ç�XvP���Ò�j6
�Q�ºu���i r*+k�¹��xn�qtÊ��NO��x�
¹�*+-./t ^*+/�gݽ��t}����2��z½Î�8t��@
���íÃ��S������ír��ø������ª�x��ðt��uP��t
�yrn½��rx�qt��u*+������ÀUk~�r���� ��x�
�¡¢ ¡�*+���2����În}w£t*+-./uw��t½rª�£
¤�Â¥n�Ê�u·������ß�ti��è�¶w�ÕÖu¦CÎn�§�Â¥
n�
������� ��
}�*+�TUuP��t¨©æuö2�����2�Aªk©n�Ê�øÆ
£ºu�¹��yªt�yøÆ£G««¿]q©n��
=§¬�®æ:;&'*+±�¯ç°Ð±²³´µ¶·q£t}����2���
� !ÌÍ���(îw��t¸¹:;�º¨t»5���.�TUu�½y��
¼£t²³´µ¶�½¾J��5HIJ0K�LM5�ó[ôk¿8n�À1��*+
±�Á���S���½�Â�
�%���³�, a�a�õ^�CD�ághC��eDF���������"
·5�¡���mn_5����§¨��b�n^§�ì^��ý�|;�����
_5�¡�ŸcdAefFghC|;���«�E��z���¡�mn����_
5� %�ý����¿�C8E !Câ«F�"#��$�%û��&FD'(ì
P)ì���«�E������õ*�_5�+�ý¬,âD�¬�mn�-.þ
��ý&FD/F�����«�E���0m��1_�23�¿4567àF
�8â��9¬�:;|;ì��|;�<È�P)=>àF�^�?��:;
���|;µ¶����ì@���cdAefFghC|;�P)A9â�àF��
�|;� µ¶^�§�ì�ý�|;�P)B&B��àF�0^�������
¿ùC5^�D��^³EF�GH��&«ùIJ0^�������¿ùC5^��ì
R���Kýì�ýì{|�L��MN��&«��$�O 5��¶�'P��Ê
A9â�4Qn�RF !Câ�¡zg�8��²�SäT¸��4"b5��«�
E�����mn�U°��¶�½º,89âåbV�DWFf�F��¿��
�RI°���XY���}� Z�16�ý[18�ý!
«¿]�®æÃB*+¯ç�Ä� =>�6�*+ ICORPj@�����2�
£tóE-Å�LM534V4: Ambient Intelligenceu�p����.Gs!u�nÂÆ
:;<�LQRS�*+ôq©n�
óE-Å�LM534V4ô�*+£t��oF�Ò�j�Ò�juÇ��H5iJ
�ju0ª¸êóÂÆ»5�ô��wt��ÈxUu�nì5í�í�:;<����
*+q©n�^Ç��kÉÎÂ::;���ËÊ:;��/ËÐ:;r��óÂÆæ:
;ô�ÌU�u<�În�8�()*+q©n�øU£tÌÍu�p�t��ÍÎu�
n�æ:;��Ï�Ðß�ÅÒ�E�Ëxt�yøU£t��u�p�t�9Ç�Ô�
ÑqÒS�±��Ëy��Óu£.�rx��:;tƹª��aÔ���:;�è�
�� !q.GÎnì5í�í�rHIJ0K�LM5:;�(u�t¢<��:;�
����u]mn�}���£ÑÒÓÔÕ�.��oFâF�ÃÏÐÑÒÓÔÕÖ�z
À:;��.�TUu�p�t��.:�ºê.GÖ½r������UE�tÁ�u
£t��u0ª×s@5Rq��5�r:;&'����.�(²ÌÍ»5���u�
¶Uq�n��q©n�
�
��·� ���� ¸¹ºfg��»¼½¾´P¿�� 3¡<ÀÁ�
The research representative's comments are quoted below in his own words:
“It has been unfortunate that the researchers employed for this project by the JST have not
been entitled to the same rights and privileges as other ATR researchers working in the same
building. This has resulted in some of my best researchers opting to join ATR when the project
was coming to an end. Similarly, since my own salary has not been paid by the JST project,
this has raised some problems concerning the use of my time at ATR. I hope that these small
difficulties can be smoothed out for future projects with similar funding arrangements, for
although it has been a work of great interest to me, it should not place an undue burden on the
laboratory in which I am employed”.
óATRq� CREST�Ùc*+c£tãärk�ti¼®�Ú�Ùc*+/wdrn
Û¶¼u©ªt*+���2���j�ܽ�i¼®.�*+ÝuÞn}w�tßà
rsÁ���i¥st-./�á|kî���2�½�£Nâ��rx}wkåãw
r�����q©��Nä�rxêUq©n½���rxktüýtJSTk}�åã
uÆx�t��rå�±x�æ�xw�y�ô
“The staff of the Kyoto office of the JST have been extremely tolerant and helpful. I
would like to thank all the Kyoto staff for their continued patience with my requests and to
praise their efforts to comply with even the most difficult of them. This research would have
been impossible without their help. I would especially like to thank them for finding ways for
me to employ so many young people to help with this work, since I believe that such a
labour-intensive project is not the norm for JST-funded projects. They have made the official
arrangements of the project smooth and have greatly eased the burden of paperwork that might
otherwise have distracted from the research. They have also proved excellent at reading
English!”.
óJST�çè�§,�Qj1�uésÂê��x�¨�r�ëì±Äu}�.�rk
x¥��»írîïuP��ðñÐ�Tò�txëxëróqt}�*+kQS�Òu
�6q�n�yuô7��soÁ���Äutî:;�����2�u^æ0�Ò±½
rªµx?õQj1�ð�0ª¸ê°Ðè���ÁuP��tçè�§,�7rs��
£��q©��q©ëy�õ�q�PTu�ösT¥�soÁ��}w�té÷�Âê
În�ô
“Finally, I would like to thank the advisory committee, and the staff of the JST in Tokyo.
Their help and advice has been most encouraging, and their positive attitudes always most
refreshing. The responsibility of managing such a large research project has weighed very
heavily upon me at times, and it is a credit to their professional support and management that it
has run so smoothly. I have been quoted in the press as saying that one of the greatest
strengths of research in Japan is the breadth and depth of its funding. Without such long-term
fundamental support, we would be seeing only small incremental improvements to the
technology, instead of the paradigm shifts and new openings that come from deep basic
research”.
ó�ýutÔrí@øùcÖw JSTî�u�nú@û5QuÆx�t½ü�x*+q
©��Á�÷>��NOÃ�F£ýér��q©n�þ���yt ��G��8��
¹�tÙgm$£�x�w���x��Á�utJKw����tXî�����*+
��øÄÅ£t}��5�j�Stéx��5ÒÓ5�q©n�}�q��r¢<û@
SL��k���n�ô
�
��Â�Ã;�ÄÅÆ�
���