· ij ˇˆ ˘ ˘ˇˆ˙˝ ˇˆ˙˝˛˚˜ ˚!"#$%˚!"#$% ˙˝˛˚˜&’˝(˝ ˛ ˚!"#˙˝˛˚˜˚!"#˙˝˛˚˜ ˚ˇ˛ˆ
˘ˇˆ - JST · ˘ˇˆ˙˝˛˚˜ !"#$%&’()*+,-./012"$ˆ3 ˜ 456789%:2; 0ˆ?@; ˆa- ˜ b cd ef...
Transcript of ˘ˇˆ - JST · ˘ˇˆ˙˝˛˚˜ !"#$%&’()*+,-./012"$ˆ3 ˜ 456789%:2; 0ˆ?@; ˆa- ˜ b cd ef...
���������� ����
������������
������ !"#$��%
&'��(
��� )*"+,
���������� �����������
��������� �!�"#���� �!��"
��������
���������� ����� ������� � ���� ����������������� !"#$%&'()
*+,-./0#123456789:$%&'()!";<#=>#?=�@ABCDEFG
=HIJ#!";<KL9MN@AEOPQRSTNUV:WXYZ[\N]R^_`\$
a'bc(,d%#efN9ghijklHmnNopqr@DstuvYklH
wx\N9yz{|}~��#@A*+KL#�#�j:� !"Y����\��:
�No�@A��#� �Cj���l�"Y����0N�Q����lHW9��N
9���Q@AD"#��N���� ��:�z:r@������"#�@ PQR�D¡
�H¢£¤¥¦!"§ |()¨(-#©ªP«¬N�:� �C#®¯°(,d%D«±
�lP²N:"#�@Dst�����!";<³oQ_´µKL¶#·fN`¸�12
D]�=H
12#¹º9:$%&'()N���� !"#»B�¼!½@A:¾¿@A:"À#Áv
½DÂ�ÃPG�: GÄ#r@³� ���ÅN§·�l!"�Æ#¥ÇÈÉD]RHÊ
¼N:Ë�¼DÌÍ#o�¹º³ÎÏ#� ]Ð�ƶÑÒ&%Ó�lÔPN�:yÕ#�
@A*+KLDÖ�P�lH
�×�"@A#��ØoQ_N�����YÅ#�RNÙÚl�
�4���#_ÛØÅ#�RN� ��DÜpl�
�Ý�µÏ#ÞßØ��àá#âãäPo�#åæä
PQRç�#è¨éNêQ�:!"o��Û#+ëìP~éí¯î/ì#12D]Q:
�ï�¢£¤!"|()¨(-#ðñò©ªò«±
�ó�ôñ\õöN�l÷ø{()ì
�ç�o�¹ºò÷øo�@A#ÑÒ&%Ó
�ù�� *+,-./ò$a'bc(,d%úû#üfì
#ù�#~34(ýDþ�;Ú��12D]�=H��9�ï�� j¡�H
Ô#KL9!"|()�¼�"Y����0#��µÏ#ÂP�l� �Cò ��zòr
@�j#@ADÈÉòf�lH!";<KL#;:hi�oQ_j�"#�@0DÜplH
G=Y��:WYÅ#�RN"D��� DÜp:Å#�RN!"@A�¼� #���0D�
��l�:PQRW# G_#12�¼����¹ºD���lH
�12# �jkl������� !"#$%&'()*+,-./0#¹�9:$a'b
c(,d%úûKLD9�>:WP��#!"×%)(�6(-#k_:WN�GQ@A�
�#Ä�DÞßG:yz{|}~��NêQ���\���Yj�=HʼN:Ô#<��P
G��¼�= �!"|()¨(-9� o�õ:� �ìõ:ü!"+õê�#!"$
õN§G�%&\N¬D'�Q12()DEOj�=H
�*#�mnNopq0r@DstuvYklÔPN�Q�:+,N9!";<#Õ#-
.Ò3PG�"N�@D-�lÔP:��./01D"N������lÔPY �PG�2
p¼�=HG�G:�ü#"Y����3495¦#ÔP�Y¼�r@��0D9l�N6p
lH
���
����� � ����� ������������������� !"#$%&'�(
)*+,-./012"$�3�� 456789%:2;�0�<=>?@;��A-�� B
CDEF� 1'GHIIJKL��>#:KL67MNOPQ)RS<=TUJV
�H�
�
§ � !"$(÷-#«±N���:ʼN� !"78#��j$(9ìÊ��Ql
@A9o�@A#��¼::Ë�P;�N:<=\N>%®(®é�!"@AY?v��
�D��ÔPD+�G=HÔ#>%®(®é@AY¡�@#9� �#r@��NPÅ�¼::
A�³B«�Å#"\��P²N:C�ÄP#WXDE³§ FG�¼H��lIGÊ�Å
#@AD��\NsJjQlHÔ#@A9o�\!"78#��0j9�K:LMNU\�
OjÜp¼��QlHPè¨é#�"#��jÜp¼�l@A0#Q|éì³�þ�DRS
�lÔPY:�12345678# \P��=H
Ë#=>:n�l�"Nr@0#��¼::��\$a'bc(,d%Q|é#T�12P�
�=HË#U�:?=�!";<Q|é³VW×%)�6(-KL#XY:¢£¤§ Z!
"$(÷-�¼#� n[\]#~éí¯î/#^_N@��=HIJ#� n[9!½j
k�=Y:�`:+� «MxDn[PG:$(÷-�¼"#��DRS�l=>:��³�
zD\]abPG�n[\]D]R_CjklH
�`:óc¬#;<deP���QlY:Ë#ô;ìYfg#hiP�lHo�\@ADÜ
pl#9IJÛ �jk��#^_�lk���jk:@Ç@ADÜpl#9§ Z°%5%
�Chakai�DÏNef�lÔPP�lH
�345678#!"$(÷-#m�nÔGU�D«¬�lP:opZ� ���q��rst��� �u
@A�P kpZ� �sqq��t @Ç@A�#óc¬N«¸luvYklPà�=HopZ� 9m�
nÔG@A#����:��@A�jv«���D��j�lY:kpZ� #;:¾¿@
���������� ���������
���
Aò"À@AY�QP�����: ����D��j��Q;Y�QH��:@Ç@A
P9wmNxyÊ���Qn!�¶:Ýz:p(:{Q|�#� !N}�l@AHkpZ�
9:m�nÔG=� n[#~�«�¼Q#�P�lH
Ô#�R�@Ç@AD���l=>:!";<#VW#?=�×%)�6(-#T�Yj�:
��VWj9�K:��VW#ÞßN�l!";<ÄÛDEFG�QlHÔ#×%)�6(-
9��#��¼:: ����B«òA��Å�P²N�ÄDE�WXDEò§ FG���
Å�#VWN��� \� #äÀò¾¿�ÆY�}é)ìÊ�:FG³��Nhi��
n[Y|()¨(-�¼\]Ê�lHË#� ��9;��.�-8BC���=�jk��
@: \§ ��DÜplKLjklH
¬�N�l��D�lP:k�����9��\Nóc¬#78D;�NÜ��lH��:
�_9� #":�_9$%&'()#@AHó���#KLDef�lP:Ë#��¼�
#"�¸G��¸¼��QY:�`#KL#efN���"@AP$%&'()@AN«¸�
�78D�¸lÔPYj�lH� !"#;@;�N:"Yóc¬#@ADÜ��lP�p
¼�lHIJ#!"*+KL#efN�lPo�\@AG��¸¼����=Y:�`:�
��345678#KLN���!"78Y��l�_#@A@*+j�l�RN��=Ho
�@AP²N@Ç@A@*+Ö�P��=H
$a'bc(,d%úû#STD9�>:WX�¸Yef�lKL#��¼::4�Ò8³�
-)Ñ(c~�(�-N@efj�lKLjk�RH=Ppq;<KL9:�(���Å#e
fj9@A#EOP²N§ \Z$a'bc(,d%@Ö�N�lHËG�Ë�9:�� j
��Q@ADÜ�j�l�RN�:oQ_´µKL9 ¡Ä#��@Qb)�lKLP�
�RH
�123456789¢�#�£Xj��K#W#��N���:yz{|}~��¶#K
LEFP²N:WX#�"N�l$a'bc(,d%0#+�#�j95¦jklU�D�=G
=�@G��QY:o�¤õò!"$õ«¥PG�?=�'_DEFG=12jk�=Hfg
@Ô#12Y¦�:%§¨NêQ���K#12�Y�Expressive Speech Processing0D.(
ÑPG�©ª#¯(9NõJjQlH
�����
������������ �����������������������
�� !"#$%&'(")*+,-�.�/0123�45��6 78"9:;
;�<'%�=>0? ���@AB����CDEFGH��IJKLM�N��
O�����P����N�QB�RST?UV>WX�YZ[F\]>@^_�\]
`aEFIJKLM�N��b
GcdEefgh�ij0kl�mnE;IoXp��ef��&q���
rjstu�vwZ�xZ���u�y��z{|ef}~ghP0?�ef��
�����������b����������@���ef&q�����Z���y�ef��
���ef����9���@�����&q���CD��c� ¡vw��¢
£¤¥¦§�¨©Xf|0ªl�«nX¬��@�!®$¯°±²"³´&q��b
���������� ��������������� �� �
���
�µ¶·¸Z�Z�x�ef'"%¹K)vw�º»@�¼½¾¿&q���À
ÁZ�CD���y����Â������Ã8��ÄÅ#"�@�ÆÇ����&q
���bÈÉÊbË��ÌÍÎÏ��Ð����ÑÒXmnÓ�ÆÇÔÕÖ×@p�GHE0�
�@� Q����ØÙX�g�ÚÛ@ÜY±)9ݶÞ��b�CD��c�
¡vw��¢£¤¥¦§��ßà���ákÓ@âã�ef!Ã)S��äh���;�
��åÝäh�efghæefçèé«'"%¹K)�äh@� Q�b
?UêàE0��ëì������<�(±²"@}~_�díSî0kl��Z
ïðñòefó�!Ã)�S��ôe�}~0? ��f���@î·_���ef
çè�¬�� ��;���åÝ�ñòó������ ��õ��º»@ABQ�
.�ö÷���øù�����XúûE0�ü�R�ý��W�þU�6��E���
�F��ÚÛ��^Ó@Â��[Q�ñòF��E��1���ó���ef@
`p��ü�ü��Ùøù�Q0Ã8�'ݱ¹M@� Q���5��efgh�i
j�-@d�_?U�.��-���QB����efgh�{|k�XúûE0�ó
������õ�����ghvw���X0}lQ��0��� �R!�efWX
p�f@��3_����F3�þ Q3"Y��@#$�vw���XF Q��þ
U�ó�ef%,&'"<<�F!®$¯°±²"@ 0���^(�N��6�
¼)����!ÓEF��@ã_.X�N��efgh0X ���QFf�*�á@
î·p+�pQ�b
�.�������þU¬'I��efgh�ijXp���ef�,��;��
�åÝ�+�@���f�-� ¨©���0kl���_��ef�����X
G0��QFR.ákÓ@/�Wó��efghá��0>012_���õ����
�ó�ef���!®$¯°±²"³´���¼½¾¿���ÆÇ��
�����X�üù�¨©��¬���������ÆÇ�������
X±)9Ýßà��g��±)9ݶÞ�����p���}3@45���.X
XG0�� 6@¾7pQ�b
.��p�GHE08ùü�[Q��}3X���9ER¼½��WXG0�Y�E
R�ý��W�#:0������9Ä�N��.��ý��0?Uó��;ü�
����6X<[5�V�1=�����6�!��� �}��/5�¼)F���
��#:@�Y�>���?4@Aøù'BCBX;���åÝ@�g�p�DE
�efghXøFU"Yõ��efghvw%Fþü�[Q�ó��efghtG F
há��A������0?ùF�R¼)�0>W0?�Ró��!®$¯°±²"
³´IM%'W�+�%�[��ý��@��_�'"%¹K)XF�HI%J
$�[Q�b
b
b
b
b
b
���������� ���������
���
������
���� �������
z{|ef}~ghP0?�ef���Expressive Speech����@Üë
�1��������
��ÖKLM@£N§befghvw� 6XF�efgh±)9ݪ?Oef���f
|PQ�R¹MBK;�+��£Î§b2�Fef��@/�ef%S)�S��
ô�TT�£�§bghef�UV�0Znp�WX_��b
���������� ���������������������
�����!"*+#§«P�l��N9:!"o�N�¬#@Ajkl:��ò�z�
Å#÷øo�:r@ò �ä�Å#o�@AY�p¼�:Ë�¼DRS�l®�8¯6~Yu
vP�lHËÔj:Ô�¼#�ÆD@�=!"D;<�l®�8¯6~PG�:.�-8�¼#!"
;<°���± ���tpt�p� ����²,-./P:�MPG�³��´#�!"Dµ�l!"�ÆD��
!"N¥ÇÙ��l:"ÀÙ�®�8¯6~#T�D]�=H
���,-./PG�:y¶Àj·¸�RSDÖ�P�l!";<,-./PG�:½�\]
Z!";<_Cjkl�jk��P«±;<ÄÛ���ko¹j�_CDþ�;Ú�=!";<,
-./Dº<G=H�jk��P���ko¹j�_CDô;�lÔPN�:¥¦!"N»Q¥¦äD
@�;<!"N�=��Y¼:!"�ÆD·¸NRS�lÔPYÖ�P�lH
Ù�!"#¥¦ä`�#=>NDFW(Dynamic Frequency Warping¼Ç\½»�¾Ù�²
D¿V�lÔPj:yÀz��¥¦ä#yQ"ÀÙ�YÖ��®�8¯6~PG�ü�G=H�
=:Ô#"ÀÙ�KL9«Á\�ÆYÙ�§«P�l=>:¢Â\�¾¿ä#¥ÇÙ�Ä
ÛPG�:;+� �¼�l¾¿$(÷-DfQ�:VWÊ�=ÃÄ÷)(%PÅ@»QÃÄ÷)
(%DÙ�ÆP�l!"#$(÷-�¼\]G:Ù�ÇP�l!"#$(÷-�¼Ë�N§·
�l÷)(%DfQ�¾¿äD¥ÇÙ��lÄÛDXYG=H
�� � !"#��$%&'(��)*+,*�-./01/23�
fg#!"@A*+KLNêQ�:;<!"NÅ#�R�c¬#��D�=�l#Yu
vjkl�DXYG:È� �N�lÉ[#³��´!"�¨N:��#|()¨(-Dð
ñ:©y:ÊËD]�=H
�4 � 56789:;<��)*+,*=�
Z��òÌ�®×¯%Íé �N�lZ̳��´!"H"ÀÙ�ü!NuvP�l|(
)ÎPG�Z��:Ì�!"P@ �ÏN�Ä«Ð�Ñ«#|()DÒÓ�Ô:�©yG=H
�44 � >?@A!BCDE�F5G8��)*+,*=�
!"@A*+NêQ�9½»��äP"ÕÖÇ�°ÃIJP9NUjklP×ÞG�!"#
RSD]RÔPYÖ�jklY:�ØÙN9ÃÄ#ÙÇ9ÚÛ[Ü#ÙÇDÝR=>½»�
�äP@DÞ�lH�=:ßQ!"³àQ!"NêQ�9!¾#R��Ú#Îáj�!Ù
ì#��NâYklHË#=>:yQ"³ãQ":ßQ"³àQ"Dy¶ÀN;<�l=>
N9¥¦� #yQ"³ßQ"�ÅD©yG�!"�ÆD«±G=:Ë�D!"|()¨
(-PG����,-./Nef�lÔPYuvjklHËÔj:äc¬#¢Â\Nµ�l¾¿°É
[#�":yQ!":ãQ!":ßQ!":àQ!":yKßQ!":yKàQ!":ãKß
���������� ��������������� �� �
���
Q!":ãKàQ!"²D@�!½®ø%-!"|()D©yG=HË�å�#|()9ÑæÑ�
ç�¼�:æÔ#Óä�è()#!"D©yG=H
�444 � H%&IJKLMNOE��)*+,*=�
!"o��¬#��#è�\�@#PG�r@��Yé´¼�lH;<!"NêQ�ê
�#r@��Yëìj��q¬íjklHËÔj:"ÀÙ�ÄÛN���ê�#³��´!
"Nr@Dëì�lü!:���#!"|()¨(-PG�#¬íä#ü!DÖ�P�l=>
N:îï� #!"|()PP@N�/0�.#0�ðG�0Dñ>��"Ê�=!"|()D
Ë�å���Xòz:æÔ#Óä�è()#.�-8³��´N�©yG=H
�4P � QR�;S;��)*+,*=�
.�-8³��´Nê¸l!"#�ÆP:¥ó� Nê¸l§ !"Pj9: �Y§
�NêQ�¯øÒ7-G�Ql�ôõG�Ql�#�ö:�=÷¼�N G�Ql�=Å=ÅGK
G�Ql�#�ö|:�ª��ÆD��P�p¼�lHIG�³�Q;<!"#ü�#=
>N9Ô�¼#�Æ#«±YuvjklP�p¼�lHËÔj:æÔ#Óä�è()N�l¥ó
� N�l§ !"D~ø�X©yG=H
�ê:Ô�¼#|()¨(-D���,-./Nþ�ñt=>N9:!½GùD¥Çø¨¯%
Ó�lKLYuvjklHËÔj!"#¥Ç~ø×%{%8ÀzD`��l=>N:!"´µ
KLjfQ¼�l �h·#KLYefÖ�jk�=H
�T � �����UVWX�
T�G="ÀÙ�®�8¯6~N�:Ù�ÆP�l �#!":Ù�ÇP�l �#!"
Ë�å��«#!"|()DúË�lÔPjê�# �# �äÙ�YÖ�P��=H�=:
®×¯%Íé �#Z��òÌ�!"DfQ= �äÙ�ü!�¼:Ù�£ûõöNZ��
!"|()DfQ:Ë#Ù�£ûDÌ�!"NhfG=;N@ �äÙ�YÖ�jklÔ
PY¡Ê�=H
=�G:É[#³��´!"�¼r@Dñ>=!"¶#¥ÇÙ�ü!j9:
r�üp�� �t�ýrþ�DÂúPG=�Sü!j9Pr@!"#Ë�N»�¸¼��Q=@##
MSü!j9r@��#Ù�#í�9¬�j9���=H!"#r@��Nê¸l?v�
v½jkl¾¿äD¥ÇÙ��lKL#uväY¡�Ê�=H
�=¾¿\N���!"|()¨(-D:�jk��N¾¿RS#=>#g*+PG�
���ko¹j�DfQ=���,-./#!"|()¨(-PG�fQlÔPj:�ª�¾¿#;<
!"Dy¶ÀNH<�lÄÛDEFG=HË#&N:ÉW�l¾¿N·�=½�\]D�D
�f�lÔPj:y¶À�;<!"YH<j�lÔPD:yQ":ãQ":ßQ":àQ"#
;<!"NêQ�Í�ü!D]Q¡G=H
YZ�[�
�����!"*+,-./NuvP�l®�8¯6~PG�:½�\]Z!";<_C
�jk��Ny¶À�!"«±���ko¹j�Dþ�ñJ����°���tpt�p� ����²,-./P:
¹��°¹sý���s����tý������ü²ê�#�Ã�°���sr�� Ã��ý�����s� ���¼Ç\½»�¾
�²NÂ�K"ÀÙ�KLDT�G=H
�=:12NfQl!"|()PG�:°�² ®×¯%Íé �!"|()¨(-:°��² ¾¿\
���������� ���������
���
N�?ìG=!½®ø%-|()¨(-:°���²r@DÔ>=!"|()¨(-:°��²¥ó�
§ !"|()¨(-D©y:ÊËG=H
ËG�:"ÀÙ�®�8¯6~N�:!"#���ko¹j�«±N����¼�=½»��ä
°r�üp�� �t�ýr÷ø{()²#Ù�N�:°;+o�jõöPÙ�D]R;#��¼::õö
|()#o�PÙ��l!"#o�PYµ�l;N@² �ä#Ù�Y¬íäN]Ú�l
ÔPY�´Ê�=H�=:Ô�Dr@��#Ù�Nhf�l;#¬íäD� G=Y:«Á
\�Æjkl½»��ä#Ù�#�j9Í�\N¬��r@Ù�!"9�¼�::§�
\�Æjkl¾¿\�Æ#Ù�YuvjklÔPY¡�Ê�=H
�=:���¾¿�[NßQ!"�¼[NàQ!":�=:[NyQ!"�¼[
NãQ!"�j�D��;<!"#y¶ÀìD:���#|()¨(-D��ì�lÔPP:É
W!"#¾¿N·���jk��#½�\]D�DÙ��lÔPN���ü�G=H
�����������������
;<!"Nê¸l��#�¢N�Q�9:Ô#��%§¨jD"Y[Ny��=HÔ�
9ñ��FG#`�N�$(÷-¨(-!";<*+KLY�»�@#P��=ÔPP:!
"§ ,-./³���úûY�ü\�@#P�;<!"N§G�@�'(Ñ%×%)�6
×-PG�yz�¶ÀY���l�RN��=ÔPN�lH�`:r@³�z�Å#��D;<
!"NìplÄÛPG�:�Óé(3#<�jkl:É[#!"N§G�"ÀÙ�D��_
Û:j��°j������s��������ü¼��Ñé$�Q|é²DfQ�;<�l_Û:½�\];
<_CN���c¬#ø¨éDëìG=!"|()¨(-DÊË�l_ÛYklHË�å�
#ÄÛN�áYklY:�12j��="ÀÙ�KL9VWP�l@#Y!"|()jkl
=>:�ª�!"o��¬#�ÆYefÖ�jklPQR�áD��H�=Ù�N&G�!
¾@ADu:G@uvPG�Q=>:!¾*+Nn��l;<!"#�9��j�lH
�����!";<D �G=:�Óé(3#"ÀÙ�KL:��� °���tpt�p� ����²É
W#yä�ì#129: ��)-��:®¯~�¯(��Dü��l=>#�'(Ñ%×%)
�6×-PG�5¦£!Ê�l@#jklH�12#12<�Yüf,-./¶"#Ê��q
¬fä9yK:$J#@A��Nê¸lIG�³�Q@A�%#wx&PG�@�ßYkl
@#jklH
�Óé(3#<��PG�:
ò"ÀÙ�®�8¯6~
òPc#�����!"|()¨(-:ê�#:Ë#©yP|()¨(-ÊË#=>#>¯
ݯ
Dé´lÔPYj�lH�N!"|()¨(-9|()�fND�l'~D(=�)*+
Ö�jk:;12«¥#�,#=>N��j�lH
���������� ��������������� �� �
���
���� �������
!"�Û+ëò!"��#�.í¯ì
(1) �� !"#���
O���ëì�dí��Y�ef���Eçè@�ùø0p�efEZ[+\]
p����^_EF¬�@¾7_�.X0N Q�pøp�̀ a0N���b0?U�
ëì�dí@?UcÙ�pQ+�0de_�.XXF Q�fg0�O���Ö
%�îhE}~@ciQef���j�¬�"�@kQ���LMX_�%�.ü
��������¼½¾¿��X+�F�+��NU���ÚÛ}Ü@�l0
_�m^%N Q�fn0���9Ä@}Ü_�xF@lo�[Fø QQB���
�ïð@pZ�[q�̂ _EF¬�"�?U+gy��j0�Ù@r�Q}~@
d�st�@8Fø Q�fu0���+vw_x0�Y�ef�^_E¬�"��
yz��IJKLM�:h_�0�{|Fdí�N�X�Yé«0:p�N$��ü
@}~_�X���E�þXþU�F�+�0F �pþY��Ó%Z[�X��sü
Q�b
`����0?U�?UcÙ�süQdíXp��Y�ef0p�p�Jùü���
9E^(�}�X����@�ùø0_�.X@d�pQ�..����9E^(X��
RN�WR$XWF��¹:8 R�ü�W0Jùü�?YFe��[��pF
��N��.üù��j0�dpQ��`a��b0?��b
� 9�)M��[�X�Y¼½�DE�����%ójXp��Q�j����
�0NU�����1=@\�_�.X�����������}3��@A%
���[��b
� ef��0ç���j�NUF%ù�����EF¼½ �Þ0����efg
hvw�����.üþ��X��º»sü�.X%Fø Q%�Y�ef��
��0Jùü�+��NU�ñòFefgh0��ø�F�^(�N��b
b .�dí0óp��̀ a� ��9Ä01_���@�F��23�hI@���
.X%�[Q�b
È¡ ��¢+v0£Ye��[��p���b
ÈÈ¡ e��[��p�}�X��b
ÈÈÈ¡ ¹:8*Þ�Y�^¤b
È¥¡ ¹:8*Þ�mVÓb
¥¡ ¹:8*Þ�¾�E^¤b
¥È¡ ¹:8*Þ%<[50¦$�§Ib
b
� �������������������
b ñò����p�p��g¨+vpQ��@©ªp� U«_.X%N��QX$��
¬5���0���þ_®�?YF���N��DE.ü���¯°�(8X\$
ùü�[Q�+pQ�F�(8�N�Fù�efghvwX�±�1=+F��pø
p�.ü%±ùø�á²EF+��N�Fù��YX+�$F��F³Fù�.�´�
���������� ���������
���
¢+v%<[50óp����@<3¾$@T$�µ¶@¦$�·�Ó%N�øù�N
���.���¸F(8X\$ùü���«p�R5�¹º0���þ_W�X
�W�?YF¢+v�»¼@ó�!Ã)�½¾ó��øù¿°p�©ªy�R5W�
y}��e¨ÀÁ�@ÆÇpQ���HI�)½¡Â¡Ã0ã_?Y0�©ªy�e¨�¢
+v�+��«p�+�[��sü�%����g��¢+v��Y%�¼0Z[�.
X%Äø Q�.�.X����¢+v��©ª0£Ye��[��p%�Q�F�e
¨E^¤�©ª0?�ÅÆ ÇÈÉ�����-Fùq��p5�N�´���á²0
+XÊ����·�Ó@ãËp����X30���ÌÍ��ÎEÏÐ��@XÑ�D
0.�?YFá²@Þ��.X���p5���¯°X<[5����Â�ÒT@T$
�µ¶@¦$�V�HÓ�ñòF'"%8L±²"�Ö�0@Ap���+�X\$ù
ü��b
b
[Figure: Mean normalized duration at T1 and T2 for Repetitions (N = 38) and Repairs (N = 12)]
� W�X��Y�Z[\]^_`_ababcde�Kfg3]^_`hbie-�j�klgm�R]n�eopqrst%uv
�w��xHt ��y )uv�w��z{opqJV�I�1'G0HnX )klg|L}~�R
opqrstHklgm�R)�Z[\�B-��-q��������f�H�
\\] ��^_`aD�bcdef�
�-j'ÉÊ�=!#.�qG#/�\���9:� 0T1�N)¼�Q�RN2Ú
�lHËÔj:N $(÷-�ÁæÁ3 �4ø5��D§«N:!#.�qG#6&D78\N
«±G=H�Á�æ�æ9:X9:n[�9:j;i¼�=� n[�§j#�#[Ü�+�� ò
ÇÛò<�òe=�PË�¼#�§j#Q(ø#[Ü�+Q(ø�òÇÛò<�òe=�>PN:
.�qGYH�l�;D¶?Ï�¥U�òë@��N«¸�¡G=@#jklHAB��ÆP
G�:Cn+��¼�l� n[j#n+Q(ø�#.�qG�ê@N�j(0��D(0�Å
#E¦?ò3 �µ�:Fn+��¼�l� n[j#e=Q(ø#.�qG�ê@NMi
Ô?G�:H� n[e=j#n+Q(ø�#.�qG�ê@NI?�Y�QÔPY'�J
�lHPKN:CPF9!#.�qGP3 v�P#DÞD¡��l@#jk:3 !"#
¥¦�;<NêQ�3 v�D�K�lÔP#?väD¡G�QlH
���������� ��������������� �� �
���
�� W�X�XY������3��,]�eH��]�cda_da��ci����d�abcd��ci�e���������L�
�]�bd��_��� dbabh����¡_�bh����¢bdh���e���L�?":��]�bd��_� dbabh��¡_�bh��¢bdh�e�H�
���3£��3>Gf���3-�f��£V�H�
�
\\\] �gG*hi�j;kl�
!#.�qGN: GÄ#� LÉPC�Ä#� +�#úËDÊplMNDìp:×%
)ø7,d%DO÷N�l��YklP�l�¼:;�#ÔPY�k#0�p(P0�Å#QÚPl
�}ø(N@opl#j9�Q�Hü&:3 !"�¼�}ø(DQRG�u\S«#�DT
ª�lP:C�Ä#� +�#UVYW¢G:�>�C��¼Q@#N�lÔPYAXÊ��
QlH¥¦�!";<,-./YC�ÄNP��C�³�Q� PQRÔP@ �PG�s>l
�¼:�}ø(#�R��u\v½#hi��fÛ@Y«N�K�Z�jklHËÔj:PK
N�}ø(#�fP3 v�P#DÞD[l=>:3 Gù#\Ê��g#�]#á^#ò
z�PGù_gj#�}ø(#c¬ò`zP#DED«±G=HË#U�:�Á�æ�ÁN¡��RN:
\Q3 Gù#_gaÅ�}ø(#HnbYyQc`Y´>¼�=HÔ#c`9PKN�°(
80�°(0jABjk:�~>0�®>0j9'¼����=HÔ#ÔP9:�}ø(#c¬P3
v�P#�DD¡�@#jk:�-P;�N:³93 !"#¥¦�;<NêQ�3
v�D�K�lÔP#?väD¡G�QlH
�
[Figure: Rates of the fillers "eto", "e", "ano", "sono", "ma" at positions Bn0–Bn3]
�W�X�WY�¤�¥¦§�%¨dy£©ª«��¨dW£©ª§f0�¬®¯:"%°"±²°"²³´²µ´²
M"0��,]�eH§f¤�¥¦J)�g|-¶°"±·¶°"·£��3>Gf��
\m] �gG*hi�nop�
�-jJ�´=�°(80�°(0�~>0�®>0�Ñ(09@�P@`z#yQ�}ø(jklH
G�G:Ô�¼#RV#Å�D�KfQ:Å�Dk�fQ�Q�9:��#ÎWâYklHÎ
ä\�!";<,-./j9:Ô#�R�ÎWä@�K�luvYk�RHÔ#�R�ÎWäD
«±�l=>N:N $(÷-�dÑ3 ��# �>PN�e Ñ c¬#�}ø(#�f`zD
�ÉG:§·«±N���Ë�¼#+ÜDf>=H� Á�æ�ø9g �±æÕƾN@P�K+Üj
kl�Á ÕÆ�N�lh/��b9 äÄ�di�H� Á�æ�ø �¼�°(0P�~>09��\��Ú�_
D�lÔP:��ÚV:�°(0D�f�l �9�~>0#�fYj�K:"§N�~>0D�f
�l �9�°(0#�fYj�QÔPYÚ�lH§·«±N�l-$~N@P�Q=7ø-)(
«±#U�:ù�# �7ø-)(YÈÉÊ�=H«+[Ü�¼:Ô�¼9Ë�å�:�°(80
k:�°(0k:�~>0k:�Ñ(0kN§·�lP�p¼�lH�°(80�°(lk9Óä �D
�Ks�:�Ñ(0k9Òä �D�KstHÔ#�R�ÎWä@¥¦�!";<,-./Nm
ì�l#j9�Q�H
� W�X�¸Y� ®¯:"¹Fº»-���E¼½%�¾� X ¿ÀÁÂ�0HòÄÅÆ)¬�Ç1'GHÈ
+)�E¼½$!³-É�Ê�Ë:$Ì"¼½ÍÎJV�H¶°"±·Ï�¶°"·Ï�¶³´·Ï�¶M
"·Ï�Ç-ÐÑ��H�
�
m] �gG*hi�qr@kl�
klc#�}ø(Y\Q3 Gùj�fÊ�lÔPDÇN¡G=Y:ÔÔj9:�}ø(�f
#v�D���N[l=>N:� §j#¹�\v�DXYG=HÇ]12j:� G�R
P�l�YÈn�aÅ:oÛj�}ø(YfQ¼�³�QPQRÍ'YklHN $(÷-�d�
3 �D§«N:X9:n[#oÛN�}ø(Ykl;P�Q;PjË#n[#ÈnÊY
Å#�RNµ�l�D�Z=HÈnÊ9Ë#� Ns��lQ(ø�ò��ò�Á�N���ñ
�=HË#U�:�Á�æ�ÑN¡��RN:Q:�#ÂúDfQ=;j@�}ø(#Ç]�l�
#aRYÈnjklÔPYÚ��=H
[Figure: Mean numbers of morae, words, and phrases per IPU, with fillers (17.9 / 9.2 / 4.0) and without fillers (14.6 / 7.4 / 3.8)]
� W�X�ÒY�ÓÔ®¯:"ÕÖ]�ba×��ba×c�ae-�������q�rst%?":ز�LزÙ
ÚØ0HfÛ��Ü-�f�É�®¯:"£V�Ý/£w�q£qfH�
�
ʼN:Á�MÁò<=Áòp?Á�#oÛj#�}ø(#HnbDÁ#�Ê����#D�P
G�34Ò8G=PÔ�:�Á�æ�q#�RNar�Bj»sj�lÔPYÚ��=°�t �4ä²H��:
�Èn����Y�Q�� aÅoÛj�}ø(YfQ¼�lbYyQH³9:�}ø(#
�f9: GÄN� LÉ#úËDÊplMNDìp�QlÖ�äYyQHj9:�}ø(#
�f9C�ÄNP��@í�\�#��R�H
�
[Figure: Filler ratio (%) plotted against number of words; regression line y = 27.82 + 0.52x (r = .79)]
� W�X�ÞY�Úq�%�LØ0ß�ÚÔJ®¯:"��,Hgà)áâgàHLØãfw�Ýä
ÓÔJ®¯:"£Ffå��,£æfH�
m\] �gG*hist_uAvwxy[�
�}ø(#�fYC�ÄNìplí�DXu�l=>N:"+õ\ü!D]��=H�}ø
(#�fY GÄ#v¼�#� _w#"#jk:C�ÄYË#í�D�xG�Ql�¼:
�}ø(#�fDyº�lÔPjC�Ä#� +�Nv¼�#ÙìYH�l9:jklH�-
#U��:�}ø(9Èn�� #oÛjfQ¼�³�QH@GÔ#c`NC�ÄYB�Q
�Q�:� +�NefG�Ql�¼:�}ø(#æ`N���g¦�l� #ÈnÊDkl
òzz{j�l9:jklHËÔj:|¡�#}Ûj#e*NI��~��#�BD\]�
lhi�j:e*#ÈnÊPÇ]�l�}ø(#¬�DyºG�:�!�#\]#"·�
XD� G=H� Á�æ�4 �:Èn�e*DuvP�l;N9:�}ø(YÇ]�l;#a
RY���e*#;�@¬�N"·YßQHn��e*j9Ô#�R�â9'¼��QH
Ô#ÔP�¼:C�Ä9:Ç]�l�}ø(#æ`N���Èn�GY¦KÔPDklòzz{
G�QlPQRÔPY¡�Ê�lH=�G:�(îN@;�#í�Yk:Ô#í�Dìp�Ql
#9:�}ø(#§xPQR�9tG��}ø(ò�(îYºlX����#�@G��QH
[Figure: Mean reaction time (ms) for complex vs. simple materials under fluent, with-filler, and with-pause conditions]
� W�X�çY��èéêëEì�rst]íîeHïÑåð-�ñ'£uv�òó]���_dae²®¯:"ôõ
]�ba×��b��_ie²ö"÷ôõ]�ba×�`h�î_e�BHøùú�à)òó�û£üý��B�þù)���
�BHüý�òó1���G��B+�ñ'k-®¯:">ö"÷£ôõG����ÇëE£�
���H�
�
YZ�[�zdJ�
��#12<�9��#�RN�P>¼�lH
Cb 3 !"j9:�}ø(³!#.�qGPQ�=�u\v½Y�KfQ¼�lH
Fb Ô�¼9:¹�\v�#��¼::3 v�N\K��Ê��QlH
Hb Ô�¼9: GÄ#� LÉPC�Ä#� +�#úËDÊplMNDìp:WX;�
#¥¦�×%)ø7,d%#ü�N��G�QlH
�b 3 !"#;<³C�³�Q�C�ÄNMNDìpl�!";<#=>N9:Ô�¼#
�u\v½#hi��fYuvjklH
�
(2) ���������� ��
�}ø(³!#.�qG�Å9��ä°���qüý����²P��Ê�:»�%§¨j� Yy
���QlHG�G:Ë#aPJÅ9Â�\�e*NPÅ��=@#�:!"´µ,-./j#
�QN�TDk�=@#jk:!";<,-./j/�\NefG�RPQRST�¼#@#
9�QH�12#<�D��N!";<,-./N·fj�lÚ¸j9�QY:Ë#�ßP
Ö�äN�Q�9Y«N¡�=@#P2RHPKN:IJ#!";<,-./�t��tpt�p� ����
���t�r�Y�n�#!"ì0PQR�#��¸j,TG��=#N§G�:Õ�è#!";<
9��P#¥¦�×%)ø7,d%#�j#!";<D ��ÔPN��RHËÔj9:3 v
�D�KG=!";<³C�³�Q�C�ÄD�KG=�!";<YgÖ�#KL\hiP
�ljk�RHË#�R�_`äN§G�:�Óé(3#<�9ʼ�l�,#=>#Â�D
ìp�QlPQplHn�#�ù�¼3 ³C�Ä#æ`D�EPG=�ùN��É�ÔPN
���:!";<#·f«¥9��\N�YlP£!j�lH
�
���� ���������
¨©Xf|0ªl�«n�¬��b b
This research group is concerned with the modeling of prosodic and voice-quality
information that modifies the interpretation of an utterance or expresses speaker-
specific states or relationships.
������������
Speech communication has an important linguistic component, but it is also
characterized by the expression of paralinguistic and extra-linguistic information.
Traditionally, linguistic research in general, and speech technology research in particular, has been restricted to the study of text-based information, related primarily to
the expression of propositional content, and has considered the paralinguistic and
extra-linguistic content to be ‘not part of the message’. This content remains a relatively unknown and much under-studied component of the linguistic code.
In the present research, we are aiming to produce basic speech technology for general
use in an Advanced Media Society, and consider the expression of personal attitudes and
relationships to be as much a part of the message as is the propositional content that can
be equivalently transmitted in the form of text. That is, we are not so much concerned
with text-based linguistic information, as with the additional affective information that
distinguishes a spoken utterance from its written counterpart.
This research group is concerned with the modeling of prosodic and voice-quality
information that modifies the interpretation of an utterance or expresses speaker-specific
states or relationships. We started out with the intention of adding ‘emotion’ to
computer speech, but soon realised, as a result of our research, that emotion plays only a
small part in everyday spoken interactions and that the expression of ‘affect’ (which
includes more general indications of e.g., personality, mood, politeness, and discourse
intention) is far more common.
Figure 1 illustrates a speech utterance (in this case, taken from read speech) that has
been labelled for prosody in the traditional manner using the ToBI conventions. It
marks the accent peaks and phrase-boundaries alongside an orthographic transcription
but shows almost nothing (apart from an indication of the syntactic structure and
phrasing) about how it has been said.
In the case of read speech, the speaker is concerned primarily with expressing the
content of the text, and often has no direct relationship with the listener and no personal
commitment to the content of the utterance. However, this is not the case with
conversational speech, where the speaker and listener are usually in direct contact, and
the speaker is personally motivated and has defined relationships with the listener.
Whereas speaking-style has only stylistic relevance to read speech, in conversational
interactions the manner of speaking (i.e., the how) is as important as (and often more
important than) the content of an utterance (i.e., the what). Non-verbal ‘grunts’ and
laughs are common in conversational speech, and they reveal much about the speaker
while contributing little to the flow of propositional information.
Figure 3.3.1. A Japanese speech utterance labelled for prosody. The top row shows
the fundamental frequency (pitch), the second row the speech waveform (power) and
the third row the associated ToBI labels which describe the accents and phrasing. The
fourth row shows the text of the utterance, split into words, and the bottom row shows
the ToBI break-indices that mark the prosodic linking between the words.
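As a rough illustration of the annotation layers just described, the sketch below (hypothetical Python with invented field names, not the project's actual file format) bundles the five tiers of Figure 3.3.1 into one record:

from dataclasses import dataclass
from typing import List

@dataclass
class ToBIUtterance:
    """One utterance annotated in the five tiers shown in Figure 3.3.1."""
    f0_hz: List[float]        # fundamental frequency contour (pitch)
    power: List[float]        # speech waveform energy (power)
    tones: List[str]          # ToBI tone labels, e.g. "H*", "L+H*", "L-L%"
    words: List[str]          # orthographic transcription, split into words
    break_indices: List[int]  # ToBI break indices marking prosodic linking between words

    def major_boundaries(self):
        """Positions of major prosodic breaks (break index 3 or 4)."""
        return [i for i, bi in enumerate(self.break_indices) if bi >= 3]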
Although we first attempted to label our speech corpus for emotion (using the Feeltrace software illustrated in Figure 2), this proved to be a difficult exercise, as most of the utterances varied little in their emotional expression. The differences were better described in terms of the 3 dimensions of voice, speech, and speaker (see table left), using features to describe not just the emotion, but also the quality of the voice, the intentions of the speaker as determined from the single utterance under observation, and the (possibly different) intentions of the speaker as determined from the long-term context of the discourse. The 3-way labeling indicated where the surface-level features differed from the deeper ones (e.g., when pretending, acting, recalling, or quoting) for subsequent analysis (see Figure 3).
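A minimal sketch of the resulting three-way label record, with illustrative attribute values (the project's actual label inventory is not reproduced here):

from dataclasses import dataclass

@dataclass
class StyleLabel:
    """Three-way speaking-style label for one utterance."""
    voice: str    # quality of the voice, e.g. "breathy", "pressed"
    speech: str   # speaker intention as judged from this utterance alone
    speaker: str  # intention as judged from the long-term discourse context

    def surface_differs(self):
        """True when the single-utterance reading disagrees with the deeper,
        discourse-level one, e.g. when pretending, acting, recalling, or quoting."""
        return self.speech != self.speaker

# Example: a cheerful-sounding utterance produced while quoting someone else.
label = StyleLabel(voice="bright", speech="cheerful", speaker="quoting")
assert label.surface_differs()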
In addition to speaking-style labeling, we also performed discourse-act labeling of a
portion of the utterances. Since none of the discourse-act label-sets that we
encountered in a search of the literature was general enough for our conversational
speech data, we formed a combination set and labeled the utterances accordingly.
Weekly meetings were held to agree on a consistent set of labels that adequately
described the speaker’s intentions while being both general and concise. Phonemic
and syntactic/semantic annotations were performed using the public-domain software
Julius and Chasen. These labels constituted the perceptual component for a
subsequent statistical training against acoustically-derived features.
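These perceptually agreed labels then form one side of the statistical training against acoustically-derived features; a minimal sketch of how such a training table might be paired up (the feature set, values, and use of scikit-learn are assumptions for illustration, not the project's actual pipeline):

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per utterance: acoustically-derived features on one side,
# a perceptual (discourse-act) label on the other.  All values are invented.
acoustic_features = np.array([
    # mean F0 (Hz), F0 range (Hz), duration (s), mean power (dB)
    [182.0, 95.0, 0.42, 62.0],
    [121.0, 30.0, 0.85, 55.0],
])
discourse_acts = ["backchannel", "statement"]  # labels agreed at the weekly meetings

# Any supervised learner can now map acoustics to the perceptual labels.
classifier = LogisticRegression().fit(acoustic_features, discourse_acts)
print(classifier.predict([[175.0, 80.0, 0.40, 60.0]]))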
Figure 3.3.2. Feeltrace labelling of a speech utterance. The four quadrants indicate
the primary emotional spaces, and the listener indicates the emotional colouring of an
utterance by (a) marking a point within the circle, and (b) writing a descriptor on the
line. The 2 dimensions of valency and activation are widely used for this feature.
Figure 3.3.3. An excerpt from the speaking-style labels, showing voice, speech, and
speaker attributes for the sample word ‘��� ’ . Pitch patterns are marked in
addition to the paralinguistic features of the utterance for later training.
Figure 3.3.4. Principal Component Analysis shows the relation between perceptual and
prosodic features for (a) improvement of the label set, and (b) mapping between
paralinguistic and acoustic parameters for recognition and synthesis.
As the example in the figure below illustrates, the same word spoken by the same
speaker can have several different meanings according to when and how it is used, and
to whom. By use of statistical modeling (such as that illustrated in Figure 4), we have
learnt the mappings between the acoustic and paralinguistic features for a small number
of highly ambiguous words in order to facilitate the automatic labeling of the
conversational speech corpora, and to provide ways of accessing appropriate segments
from the corpora for use in concatenative conversational speech synthesis (see below,
on joint work with Group 7).
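A minimal sketch of the kind of joint analysis shown in Figure 3.3.4, assuming invented perceptual ratings and prosodic measurements, and using scikit-learn's PCA purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

# Joint table for tokens of one highly ambiguous word: perceptual ratings
# alongside prosodic measurements (all values invented for this example).
# columns: friendliness, certainty, mean F0 (Hz), F0 slope (Hz/s), duration (s)
tokens = np.array([
    [0.9, 0.2, 240.0, -12.0, 0.35],
    [0.1, 0.8, 150.0,   3.0, 0.60],
    [0.8, 0.3, 230.0, -10.0, 0.33],
    [0.2, 0.7, 160.0,   5.0, 0.58],
])

pca = PCA(n_components=2)
scores = pca.fit_transform(tokens)
# The component loadings show which perceptual and prosodic variables move
# together, which is the relation visualised in Figure 3.3.4.
print(pca.components_)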
We had hoped to automate the paralinguistic feature-labelling of the main ESP speech corpora, but for a large number of utterances there is a strong dependency between text and acoustics, and this work has not yet been fully automated. However, this insight led to a new categorisation of the corpus transcriptions into those representing utterances for which the text alone is sufficient (I-type), and those for which prosodic information is essential to their understanding (A-type).
This ‘I/A’ categorical split later formed the foundation for our new ‘conversational
speech synthesis’ interface technology, to distinguish content from affect-bearing fillers.
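As a toy illustration only (the filler list and threshold are invented, not the project's actual criterion), the I/A split can be thought of along these lines:

# Utterances whose transcription is dominated by non-lexical fillers and
# backchannels are treated as affect-bearing (A-type); the rest are treated
# as information-bearing (I-type).
FILLERS = {"eto", "ano", "un", "hai", "ee", "ah"}

def ia_type(words, threshold=0.5):
    """Return 'A-type' or 'I-type' for a tokenised utterance."""
    filler_ratio = sum(w.lower() in FILLERS for w in words) / max(len(words), 1)
    return "A-type" if filler_ratio >= threshold else "I-type"

print(ia_type(["un", "un", "hai"]))                        # -> A-type
print(ia_type(["the", "train", "leaves", "at", "nine"]))   # -> I-type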
Figure 3.3.5. Main findings from the prosodic analysis: NAQ shows the voice-quality setting, F0 the pitch range. It has long been known that voice pitch varies according to the listener, but we show here for the first time that vocal settings also vary in the same way.
The second main finding of our research is illustrated in Figure 5. This shows that the voice is controlled according to the context of the discourse, and that we adjust not just the prosody of our speaking styles (F0, timing, and loudness) but also the phonation style (voice-quality settings) as well. This may appear common sense, but the fact has not yet been integrated in any speech technology or prosodic analysis. It is due to the unconstrained nature of the recordings in our corpus that we have been able to show for the first time that phonation style is (perhaps consciously) controlled in conversational speech. Similar findings were obtained for politeness levels and discourse act, indicating that (a) these features can be recognized alongside the text of an utterance for a more intelligent ‘understanding’ of human speech by machine, and (b) that this level of control must be introduced into speech synthesis if it is to carry the same information as a human voice.
Accordingly, we now distinguish the ‘affect-carrying’ (A-type) utterances from the ‘information-bearing’ (I-type) utterances, and make use of prosodic information (including voice-quality) to further label the A-type utterances in our corpus.
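A minimal sketch of the comparison underlying Figure 3.3.5, with invented NAQ and F0 values standing in for the corpus measurements:

import statistics
from collections import defaultdict

# Group utterances by interlocutor and compare voice-quality (NAQ) and pitch
# (F0) settings across the groups.  The records below are invented; in the
# project these values come from the analysis of the ESP corpus.
utterances = [
    {"listener": "family",   "naq": 0.14, "f0": 210.0},
    {"listener": "family",   "naq": 0.15, "f0": 220.0},
    {"listener": "stranger", "naq": 0.10, "f0": 245.0},
    {"listener": "stranger", "naq": 0.09, "f0": 250.0},
]

by_listener = defaultdict(list)
for u in utterances:
    by_listener[u["listener"]].append(u)

for listener, group in by_listener.items():
    mean_naq = statistics.mean(u["naq"] for u in group)
    mean_f0 = statistics.mean(u["f0"] for u in group)
    print(f"{listener}: mean NAQ = {mean_naq:.2f}, mean F0 = {mean_f0:.0f} Hz")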
Figure 3.3.6. Categorising speech according to discourse event, speaker-state, and speaker-listener relationship. Discourse events (E) are considered directional and can function to give or to get A-type or I-type information according to the speaker-state (S) and listener-relation (O) framework. This is used for input to the synthesis engine.
Finally, by linking these findings, we proposed a new input framework for conversational speech synthesis. The equation ‘U=E|(S,O)’ can be read as ‘utterance (or ‘speaking-style’) equals discourse event (E), given the self/other relationship contexts’, where S indicates ‘self’ (e.g., in a good mood, interested in the conversation, etc.) and O indicates ‘other’ (e.g., speaking to a friend in a friendly environment, etc.). Only by the slow process of manually labelling several hundreds of utterances for speaking-style and discourse-act information did this framework become apparent, though in hindsight it appears to be simple common sense. The implication of this finding for speech technology is that a new affective (paralinguistic) level of information processing must be included if a computer is to be made aware of (or to mimic) the information that is present in human conversational speech, whereas current speech technology is only sensitive to a text-based linguistic level of information.
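A minimal sketch of the U=E|(S,O) input framework as a data structure, with illustrative field values:

from dataclasses import dataclass

@dataclass
class SynthesisInput:
    """U = E | (S, O): a requested utterance is a discourse event E interpreted
    in the context of speaker-state S and listener-relation O.  The example
    values are illustrative only."""
    event: str           # E: e.g. "greeting", "backchannel", "thanks"
    self_state: str      # S: e.g. "cheerful", "tired", "interested"
    other_relation: str  # O: e.g. "close friend", "customer", "child"

request = SynthesisInput(event="greeting",
                         self_state="cheerful",
                         other_relation="close friend")
print(request)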
Figure 3.3.7.� The CHAKAI notebook-computer interface for the AESOP conversational
speech synthesiser. Users click buttons to select Self, Other, and Event features of the
utterance, which can then be modified by the sliders on the left of the screen if required.
Final selection is by the activated smiley-faces (lit in orange) at the top, which indicate
the available speech variants for this utterance in the corpus.�
Although in this research project we are not concerned with processing text-based
linguistic information, preferring instead to focus on the paralinguistic information that
distinguishes a spoken utterance from its written text, we have been forced to categorise
many of the A-type utterances into similarity classes for our discourse-level labelling
(e.g., “Hello”, “Hi!”, and “Good morning” are all ‘greetings’ and so can be treated
equivalently, as alternatives, to be selected according to the S/O criteria). This event-
based clustering has resulted in a prototype interface for a conversational-speech
synthesiser using iconic representation of discourse acts instead of typed text as input.
A-type conversational utterances are selected in the new ‘Chakai’ interface (Figure 7)
from a matrix of possible act (or Event) type icons that appear when a given
combination of Self and Other options is selected. Three clicks (and four optional
slider-settings) produce an utterance (retrieved from the database) with the appropriate
speaking-style characteristics for a given situation. Field-tests have shown that with a
talkative partner, this interface can be used to sustain a lively conversation, but since it
can only produce back-channel or affect-revealing utterances, it must be combined with
an I-type synthesis interface for real-world use as a future communication aid.
Figure 8 shows a sample of the underlying structure of the A-type synthesis selection.
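A toy sketch of the selection step this implies (the table, keys and file names are invented for illustration; the actual Chakai interface retrieves pre-recorded segments from the ESP corpus):

# Three choices (Self, Other, Event) index into a table of recorded A-type
# utterances, and one matching waveform segment is returned for playback.
corpus = {
    ("cheerful", "friend", "greeting"):    ["wav/greet_0132.wav", "wav/greet_0457.wav"],
    ("tired",    "friend", "greeting"):    ["wav/greet_0078.wav"],
    ("cheerful", "friend", "backchannel"): ["wav/un_0012.wav", "wav/un_0345.wav"],
}

def select_utterance(self_state, other, event):
    """Return one waveform file for the chosen Self/Other/Event combination."""
    candidates = corpus.get((self_state, other, event), [])
    if not candidates:
        raise KeyError("no recorded variant for this Self/Other/Event combination")
    return candidates[0]  # a fuller version might rank candidates by the slider settings

print(select_utterance("cheerful", "friend", "greeting"))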
Figure 3.3.8. Sample of the linguistic (or discourse-act) structure behind the Chakai
synthesis interface. Equivalent utterances are grouped and appropriate waveform
segments are then extracted from the corpus according to the selection criteria.
� ����������� ��
�6�{|YZ�}~�����
The importance of non-verbal speech processing will grow considerably in the coming
years as the limits of current speech recognition approaches become apparent.
Machines that cannot understand what a person is saying may be able to process how
that same thing is being said, and as international travelers well know, it is often
possible to understand much about a human interactive situation with little or even no
knowledge of the language being spoken. One example of the application of this
non-verbal speech processing technology is the SCOPE “Robot’s Ears” project at ATR:
���#�w\@AÉ712T���Rz�����²j9:WXDQb)�l�%�KLP
G�#12Y1���QlH��:����t���s�� #�(/�(5�¼H
��12hij9± WPW#§ ×%)ø7,d%DQb)(G Ë#�0#��D³�Jl
KLDXYT��l¡o�\��Ds>: � ��ò ��z\@A�ÅDàá�lKL
jk $a'bc(,d%@A*+~éí¯î/³Â�KL#T�D \P�l¡ ��z#
129MN ×%.¯56%8é(/FGD¹�G zË\�|()©ªò«±ò*+~éí¯
î/#XYD§«P�l¡Õ�z9 348)×3#T�P¢£ Ť�z9 |()¨(-P
*+~éí¯î/#¥TD ��¡wx\N9 ¦_Çd¢õý(/9 �8/~Ò3j ~
§@Aò!"@A#ÈÉP«± ¨©¢õý(/9 �>P#~§P!"@A�¼#o�
\òo�\$a'bc(,d%@A#«± k�� #ªÒ81ý(/9!"@A#��¼ �
���������� ��������������� �� �
���
��ò�z#@A#«± {|}~1ý(/9 � #�0#��Nê¸l ��«P i
�,#RS#Q|é ê�#� #�0#��Nê¸l GÄòC�Ä#+�zò�zòr
@\��#��#Q|é#¹�D¬�l¡Ë�¼#©PG� ªÒ81Y®�\°(56%
8KL³,-./#ô;ì �� �Y³>l0~éí¯î/#12T�D]R¡Ë�¼#U
�94�Ò89�> �-)Ñ(c~(�ÅN·fj��R¡0
��12è��:î< �q�zÐ�d�z�
¯°j9:����������� P¬s12345678Y±��¼�²G=H
“HUMAINE (Human-Machine Interaction Network on Emotion) is a Network of Excellence in the EU's Sixth Framework Programme, in the IST (Information Society Technologies) Thematic Priority IST-2002-2.3.1.6 Multimodal Interfaces.
The HUMAINE Network (Contract no. 507422) started on 1st January 2004, and is set out to run for four years. 27 partners from 11 countries participate in the network. HUMAINE aims to lay the foundations for European development of systems that can register, model and/or influence human emotional and emotion-related states and processes - 'emotion-oriented systems'. Such systems may be central to future interfaces, but their conceptual underpinnings are not sufficiently advanced to be sure of their real potential or the best way to develop them. One of the reasons is that relevant knowledge is dispersed across many disciplines.” (http://emotion-research.net/)
�
����[���������
For the provision of information, such as in station-announcements or newsreading,
current speech synthesis technology is probably adequate. However, it is
unacceptable (at best, uncomfortable) for use in conversational situations where
two-way speech-interaction is required. The immediate application of this technology
is therefore in speech synthesis for interpersonal applications as a communication-aid,
or in humanoid robots and similar interactive devices that interact with (but do not
necessarily inform) human beings.
Of considerable importance, too, is the impact of these findings in Academic Circles;
until recently, ‘prosody’ was considered as including pitch information first of all, then
timing information, and thirdly loudness. Voice Quality was not generally considered
to be a relevant prosodic parameter. However, our findings show that it needs to be
included as the fourth prosodic parameter, and thereby open up a new field of research
which will have repercussions in technological applications as well.
�
�������������w��x��y[�
If machines can become sensitive to the feelings of people, as revealed by changes in
the way that they speak, then we can imagine a much softer, friendlier environment for
the future. Instead of people having to adapt to machines, the machines will be able to
adapt to the people. It is trivial to substitute an attention-getting speech-utterance for a
beep, but it will also be necessary for machines to sense when to use which; and an
essential ingredient for this is an awareness of and an ability to process affect as
expressed through differences in use of the human voice.
������� ��� �� �
���������� �
���������������������������
�� !"�#$% �&'()*+,-.�/0
12!"3 45!67�
89:;�
<=>?�� ����@0A!"�� �B"�#$% CD<
=>?3 45!67�8EF;�
GHI�� ���<= �JK��LM��NO�0PQ!�RS
TSUVW0X#GHIYZ3 45!67�8U;�
[\]^_ �� �
[\]^_ `a b cd!"ef&'()$gh,
�� ij�YZ$k�X# 3I`Pl �B
67�
8Em�
n.o>?�� �
��%pqf#��$n.o>? �q>rs
truvtrwxt,y$z{%pYZ3 4
5!67�
8|m�
�} �� �
~����0 ���^��� �q��>?� ��#��
,$%��#�+}��\_[���\�� �.o�>?� ��
#��,$%CD3 45!67�
8E;�
�_^���}��^� �� �
����)���X���)�I����$���0
��$�S2$�� �D��I�3 45
!67�
8�m�
��� � $¡�r¢£S,y$ ���X� $¤�r¥
�$YZ3 45!67�8:FF�
�¦§¨�� ��¦§$©��ª«¬$j0��"$YZ
3 45!67�8®m�
�¦¯°�� �±I���²�$³´rµ¶012!��¦$·�
0�l#¯°3 45!67�8:m�
�
� ������� ������ ������������ ���� ���
��!� "#$%&�'()*+,-�����
!"×%)(�6(-KL#12
������������
�Óé(3j9:.�-8!";<#?ÄÛjkl$(÷-¨(-!";<_CN³ G:
Ë#ÄÛNfQ¼�l!"$(÷-ðñN$´DT¼�ÔPN�WX¼GQ!";<,-.
/DüµG:Ë�Dþ�ñJ�$a'bc(,d%úû,-./Düµ�l��D¶�=HÔÔ
j9:°�² r@��YÖ��!";<#12:°æ²¨·¸¹º»¼��Å# �ÔPY½º
�W#=>#$a'bc(,d%úû,-./#üµ:°Á² Z[� D¾¿G:÷øo�@AD
��j�l$a'bc(,d%úû,-./ðñ#=>#� «±:À¢ÏG�AX�lHÁ:
�Á#°�²ê�#°æ²9:�AXm#ÂN¡�ë�:ÃÄÅÆ°æÄÄø²D�"N�P>=H
�
�
���������� ��������������� �� �
���
H%$�s�f#�����YZ
�12912bÒ7ò�Ç%¨éý(/¯(ÈPÉ·ý(/#{%®(Y�ää4��²;j
�>�Q=345678j:æÄÄÄ�N������345678YÊ]Ê�=g:�345678#
+FPG��>¼�=H��N*ZlËr@�./ð�#r@!"|()¨(-#¹�PËr
@!"ê�#;<!"D� G=ÌJü!:Ër@N§·�l!";<,-./#¬º�
j9345678�²��#<�jklH
�� ��@#H%����u�����
r@��9$a'bc(,d%NêQ�?v���D¶��QlHr@DÜplÔPYj�l
!";<,-./9 �ÔP³r@��Y½º��Í�#��¼::Î[�NP��@¬í
�$a'bc(,d%Ä�P<�lH�Áj9EFG=r@��YÖ��!";<ÄÛN�
Q�*ZlH
�12j9r@!";<,-./Dü��l=>N:÷ø{()ì#uvY�K:Ï�:!
"#¥¦äDÐ�j�l�Ñ�%&�BÉ7Â�KL12á�k���jT�Ê�=$(÷-¨
(-!";<,-./�jk��DfQ:��G=Qr@#!�\�ÆD�KsJ�¢£¤�
!"|()¨(-Dº<G:Ë#��¼E¦#=>#!"½�D\]G:Ë#!"»BD_
EE¦�lPQRÄÛDEFG=HÒ=�KQ¸qWX#!"Ë#@#D���78*+D+
i]Ú:N��G=Qr@#�ÆD;<!"N"#j�lH�Á�ø��NEF�lr@!";
<ÄÛ#��D¡�H�_C#eT9��G=Qr@#!"|()¨(-Dº<j��q
Å#�R�r@;<!"j@;<j�lTNklH
Á:�AXmj9:r@!"D �#r@Y"#Ê�=¹":r@;<!"D �#r
@Y"#Ê�=;<!"P�lHËG�:r@!";<P9:!";<,-./DfQ�r@
; name start(s) dur(s) zdur f0(Hz) zf0 voice
# 0.00 0.80 0.713 114.370 -0.335 0.008
g 1.24 0.07 0.145 95.999 -1.351 1.000
o 1.31 0.09 0.075 105.358 -0.817 1.000
k 1.40 0.03 -1.054 92.892 -1.845 1.000
i 1.43 0.03 -0.802 91.945 -1.298 1.000
g 1.46 0.05 -0.685 109.285 -0.475 1.000
e 1.51 0.09 0.244 134.526 ...
N 1.60 0.08 -0.207 ...
;<!"Dº<�lÔPDÊ�H�=:!"|()¨(-P9!"»B|():ê�#:�X
@A³¾¿@A�Å:�jk��j#!";<Nuv�+Þ#!"|()DÓÔG=|()¨
(-#ÔPDQQ:!"$(÷-P9:!"|()¨(-#½P�l©yÊ�=!"»BDÊ
�HËG�:�AXmjnN$(÷-Pm��=@#9!"$(÷-DÊG:.�-8$(÷-
9�w�:N.�-8$(÷-P�e�lH
�
�� H%���* �¡��
H%�¢{�
�12j9:�ª�r@#�j@:Å#Â�r@#12N@s���ê:!"NêQ�
@� \!�\�ƳÌJ\Õ«NöQYAXÊ��Ql�.#0�/0�ðG�0D§«P
G=H�12j9:Ô�¼#Â�r@9&%�×%8�[N�Þ#r@D��@#j9�K:
Ë�¼#kHr@@7sG�QlP�plHÔ�D�Cì�l�¼q:�Á�ø�æ#�RN�lH
£¤d�>5G8�
�12#!"$(÷-#£¤N�Q�9:�jk��j#!";<N�ú\NfQ¼��Q
lk��ÑæÑ�D ÖPG=HÕÁj*Zl+óN�:¥¦äD?×�l=>!½®ø%-D6
�N�KG���=Y:U�\NÑæÑ��5#®ø%-Y�¼�=HPr@#£¤P!¾®ø
%-D�Á�ø�� N¡�H
����������������
���
�����
�����
��
��
(�)�
���� ��� �� �� ������
����
�� 12 495 39171 349 57
�� 15 461 40928 377 51 (��
!"# 10 426 31840 345 48
�� 12 495 39171 388 43
�� 15 461 38360 402 43 ($�
!"# 9 343 27302 383 31
ATR525 -- 525 31053 403 42
�
¥��* �
IJ#!"$(÷-9:Ë#f<N����Þ)-7DðÞG=�j!½®ø%-D�K�
l_ÛY���=HG�G:�12j9)-79ðÞ�::)(ØÒ8r@D��G�QlG³
�çDj�l�¸�Ks�:³�ÄYj�l�¸¥¦NË#r@DÙ�¦¸¼�l§xPQR
ÔPDº<# �NÚ´=Hwx\N9:.#:/:ðG�#Ër@Y�K�Ú��Ql�
çD�Í�#¥ÛÉÜ��Å�¼©ªG:ÝB�#Þß#@PNr@�DëàG�r@>P
NQ>4(ÓBC#.�-8$(÷-Dº<G=H©ªÊ�=�ç#r@D¢õH4æÔN\R
Ë]jàÏG�@¼�=PÔ�:4Ä��¨9$(÷-ðñ�#r@«¬P+áG=H
01�
~�¯%-â³ã�DHâPG�Q�QÒÓPïÔ#�"D��äj©yG�q�jå± �q��t
jk��Ù�G=H©y¶5�9©yT1�N³�ÄY)(ØÒ8r@DÙKÔPYj�l�R
� ijn3D�l�Å)(ØÒ8r@#æ.N"Y¸=H©y�@³�ÄNçèY'¼�
=�9�áG:¯øÒ7-G�@¼�=g:�=r@#æ.D]�=H
H%¦§UV�
º<G=r@!"|()¨(-#!"D���r@àÏü!D]�=H�]#��Dj
�K�l=>NËr@.�-8$(÷-#M�ç�¼ø%È/N��:�\]G§·�l!"
D0HG:¢õéHæäÔNË�å�ÑÄ�:�C���\RË]jr@DàÏÊ�=HU�
9.#dÄi:/dqi:ðG�äÁijk�=Hr@!">PNê¦�b�ÁÁ�Ái�Dëé×�PG�
À-«+Nê¸lîì#â#XÞD]�=U�:ëé×�9¬�íú�ijîïÊ�=HI
��:r@9¬�NàÏÊ�=Pàáj�lH
�¨@©ª�
Ò"Ó"P@NÂ�½»�9ðG�:.#:/#ðjyK�:«ñ«±D]�=U�:
¬�íúÑijPr@9¬�Nµ��=H÷ò(:� ßzN�Q�9Ò"Ó"XN+áG
=c`9'¼����=HÒ"Ó"#!�\�ÆD�Á�ø�æN¡�H
��������������������
(Hz) (ms)
����%&'( ��)*��+,��
�� 255.8±52.1 67.3±29.5
�� 249.1±49.3 65.3±30.1 (��)
!"# 235.7±34.5 74.6±32.8
�� 174.8±38.9 60.7±26.8
�� 161.1±36.1 58.6±25.3 ($-)
!"# 124.9±20.8 56.2±20.8
�
���������� �����
./"*01������234567�8�9:��;<"=>?"@01A/��2./
"=BC2DE*FGHIJKL*A/MNOP<QRSTUKL*MNOP2VW�X
"@Y�ZS[\Q]2D^@01_`@��abc@d0cSefgBC2DE*F&
hijkLl�BC�mn<opSq^=rsgF�
«¬�E��®¯d°±�E®¯�²³�
�12j9Â�½»�Pó¦�X�DÂúNn[\]D]�=ê�#�çD�ÄÛj
;<G:z{Ê�=;<n[Ã=Pü&N\]Ê�=n[=D� G=U�:�Á�ø�Á#Â�
½»�Ã=Y¡�É:z{ôN»Qn[Y\]Ê��QlÔPYÚ��=H
� � � � � � � � � � � � � � � � � � � �
�����
�������
���
���
���
���
���
������
��������������������
���������� ��
!"�����#$
�� ������F0�� �� ��������F0�
������
� !"#$%&'�
� �����������������()*+�
��������,-./01�2��3456789'�
���������� ��������������� �� �
���
´µWX¶·� H%¦§UV�
¢õéH�dÔN�_Cj;<G=r@;<!"�%3éDE¡G:\RË]jàÏÊ�
=H�fG=�ç9õöN�jk��j;<Ê�=�ç#��¼Ñ�Dø%È/ÈÉG:�12
jº<G=.#:/:ðG�#Pr@|()¨(-DfQ�:;ñï¢#;<!"�%3é
Dº<G=Hr@Y÷GKàÏÊ�=�;9:Ò"9:.#Ñæi:/Ñ�i:ðG�4øi:Ó"9:
.#Ñ�i:/qÄi:ðG�dæijk�=Hü!U�Dæ�Á�Ñ#r@!"$(÷-#r@àÏ¢
£P;�ÄÛ#XÞD]�=PÔ�:ÒÓ"P@Ár@!"NêQ�:¬�íú�ijr@9¬
�NàÏÊ�=HÓ"#àÞU�D�Á�ø�øN¡�H
�
´µWX¸·� r¹º»¼UV�
¢õéH�æÔD§«Nøù,-./P#� PQRBCj�çÞ�z¢£D]�=H�
,-./9�Ñ��/$#$a'bc(,d%°×9#��oýÒ3NefÊ��Ql!";<,-
./P;ú�#�����û�Ü�Ñ�¯$(#üýþæ�ÄjklH!"�%3éPG��ÑN¡��²
��Dst�ç:æ�²nYÞ�Ê�NKQÔ?#þ�;Ú�Dst�ç:Á�r@�Dst
�çD:�ÄÛP� ,-./j!"NÙ�G=@#DE¡G�9�Á>P#÷�D@���
çÞ�zD�ÉG=H�çÞ�z9�_CYäæ��ij� ,-./Yd��äijk:EFÄÛ
jº<G=r@;<!"#_YyQÞ�zY�¼�=H
´µWX½·� ¾¿UV�
EFÄÛjº<G=r@;<!"#�rzD¢£�l=>Nr@#Ô@�=�çDE¡
G�Ñ��MS¢£D]�=HÔ#ÌJü!@øù,-./P#� PQRBCj]�=H°¹�²
¢õH��øÁÔ�:ê�#:°¹æ² §« (�(�¸��ä���ì�k���¼�ÑÔ�#óÓé
(3D�!�PG�ÌJü!D]�=HQ:�#Óé(3NêQ�@:¢Tîì9�_C#
_Yøù,-./�yQ°�¹���,-./ØÔPD�´G=H
�À H%�������WÁ�
º<G=r@!"$(÷-DEFÄÛ#|()¨(-PG:ê�#�çDr@#Ô@�=
;<!"NÙ��lr@!";<,-./��sts��DüµG=H�,-./N9Ër@#�
����������:;�����<=01�
Nîï!"@\]3,d%PG�àp=H;�ÒÓ �Nk��ÑæÑ�Dâ�#³�_j³J
j@¼��îï!"|()¨(-Dº<G=HüµNÇU��Ër@!"Nîï!"Dàp
=�c¬#;<!"jàÏü!D]Q:üfìNú�#�Qè¨éj¬�NàÏj�lÔ
PD�´G=H
�,-./#~�è×~¯8D�Á�ø�ÑN¡�Hüµ9×%)3¯)o������j]�=H�
,-./9�������û��jǺG:Ò"òÓ"#\]P!"#c¬#\]D]�=g.�-
8VWyºD]RÓø�}�é (�×%)�6(-�¹ o²Dº<G=HÔÔjVWÊ�=@AY
!";<SN�Ê��;<!"YÉWÊ�lH��ÚV: (�(9Ò"òÓ"#Q:��
P!"#c¬D\]G=gN.�-8DVW�lÔPj\]G=r@;<!"DÉW�lÔP
Yj�lH�=VWgN �:r@DÙ��lÔP@Ö�jklH"Px#¥óD��=W@�
ÍN;Ú�=�IVWµÜDef�lÔPj�,-./D�fj�lH�=: �Pr@@A
DëàG=.�-8DÐæ�l��@µËG==>: (�(9vzj@ÐæG= �r@@
Aë�.�-8D;<�lÔPYj�lH
�
�
� �
�
����������� �����
thiu*vw2x<S@yz{-|}~;��5�� �������2u=@&���
S�g���������������3�������5��2�."*F&� ij¡¢����
£�¤�����¥¦S�§2¨^*F&hijY�¥¦©lQªGi2«¬fgF�
����������
k��³¸5-84�}(�����Å#¨·¸º»N��Ê�lWª9:¨·��Y��Ê
�� Ç��Y�Ú��QK=>N:�ªN�x����]:×W:��:o�:����Å
Yã�G�³Y�9M���P�lHËG�:��¸N@��Yê��PB�iTGW$�
�%Dµ³G�¸�q�N lHG�G:B�iTD]RP"DÉ��K�l;Y�QH"
DÉ�ÔP:!#�@DÙplÔP:�ÖÄÖD�RÔP:Ë#M�Yj��K�lPQRÔP
�����>������:;?�@ABCDEDFG�
���������
�� ��
�����
�� ��
�����
�����
��� �
!"#�$�
���������� ��������������� �� �
���
9:o�@A�¸j�K:¥"#��:�z:
r@�ÅD��NÜplÄ�D�RÔPN�lH
º»¼�=VNP��$a'bc(,d%Yj
��K�lÔP#_YB�iTË#@#�@
�¼K�#�GQÔPjklH�129:Ô#�
R�º»¼�YB�iTD]R�N"D©y
�lÔPN���:"D���@¥"j#$a'
bc(,d%DÖ�N�l@#jk:�12
#<�9Ë#�R�WªNP��$AP�l
P�pl
�� ���* �¡��
ÂYZ�ÃÄ<�
�12j9k��¼�:%}�+&#'WD��¼��W#"D©yG=H%}&9()
*`+#qÑ,#ÒäjklH4��Nk��P-áÊ�:�`9.x��Y�]G�M/0#
HjklH��Á1NB�iTD]�=Y:©y5�:¥���9½º�Y�I��µÜ
Dµ³GZ[� 9]pl��Nk�=H$%&'()N�Q�9�GK:�`j@÷®$%#
e2äN�Q��N34ÇD]��QlH5Q�ÔPN�`9��67D����"j
�lYQ:�9"DÉ��K�lÖ�ä@klH%}&N9�8��¼"D���@¥«#"
j"#É�=$a'bc(,d%DG=QPQR9�Yk:&#!"|()¨(-Dº<�l
#P��=HÁ:�W#&ÔPÎW@A9�W#\Q9�N��W#Þ�#@PN¥�
G�QlH
¥��* �-.�
�12#M=l \9§« (�(�W#"D���!"|()¨(-Dº<�lÔP
NklY:�12j9Ë�Në¸àp�îï!"#.�-8$(÷-º<NWTDÜQ=H
�12#§«�9xYg¥ój�"@Î[��½º���Nkl=>:j�l)³tÎ
9j�Q_Y��GQHG�G�Y¼:¥ó� DÐuj�l!¾®ø%-P§«��W#
G_#�Æ�¾¿:�!:"À�Å�D0��lN9klòz#ÎPÀYuvjklHËÔj:
�12j9:M�#À!½Þ:D+���st��Ò8P%}&���:'W��# G_#
�ÆDj�l�¸0��l|()PG�:'W�Y³�;��Ql�mD³�´<)PG�
ÊfG=H�=:�jk��j9!½n[jn[\]Y]Ú�lY:;<G=Qn�³��Y$
(÷-§Ns���Ql;N9:Ë#n�klQ9��DÞ¦xPG�\]Ê�lÖ�ä
YyQHË#=>:�f`z#yQZ[Hf�³��D.�-8$(÷-PG�àp=H�
�ÚV:Õ#Ëc¬#.�-8$(÷-Dº<G=H
°s² Å=)#!¾®ø%-��Ò8�k��ÑæÑ�#RV#�æä��
°�² ¥¦�¾¿:�!P"À#@PP�l=>#'W��WYmQ=Ý>�Áød��
°�² Z[H:ê�#:Z[Húûå?f�³��#�Ò8�ä���:øÑän��
HI���������J�K��4LM�NOP'�
� � � � � � QHIR4S�TU�V�WXYZ
���������� ���������
���
�N:r@!";<N§·�l=>
N:gÀÁjAXG=r@!"$(÷-
#�çÎD��øNQ@G=.#:/:ð
G�#.�-8$(÷-Dº<G=HË#
&:'W�Yr@AVG³�Q�RN�
W#m���G#�ç@àp=HËG�:
!¾®ø%-D�L�l=>Nk��ÑæÑ
��¼ÑÄ�aÅP.�-8$(÷-NB
àG=H
��)*+,*�¡��
©y9C°DL$¤¢õ#'WD��;¢õ#®¯~�¯(��äj]Q:§«�#Î
E��DÅ�ÇNhF9GDJ�Y¼ÀZ�¸�]�=�HIó�H�12j9:ê��#
;<Yj��:�W#� �ÆD0�j�lPQR �#@PNîï!"N�Q��#Ëc
¬#.�-8$(÷-D©yG=Y:Î[�YÞ¦G�³�´�@b�X��lÎ#=>:+
S#�j@efÖ��;<!"DH<j�l�DXu�lÔPNG=H��ÚV:©yG=!
¾®ø%-�Ò8nN:'W�¥�YmQ=Ý>nN:æcþ�;Ú�:Mcþ�;Ú�#ø
c¬#|()¨(-Dº<G:Ë�å�#|()¨(-DfQ��jk��jgÀÁ#�çÞ
�z¢£jfQ=q�ç#;<G:;<U�#«±D]RPP@N�!�æÄÔ�¼<l¢£
ü!D]�=H
� �J�K!¾®ø%-��Ò8#�:
� �JæK'W��WYmQ=Ý>:
� �JÁK�J�P�Jæ#þ�;Ú�:
� �JøK�JÁNZ[Hê�#úûå?f�ò���Ò8DBàG=@#H
gÀÁ#r@;<!"¢£P;�_Ûjn[\]U�#«±D]�=U�:ñÎ\N9
�JæYz{n[Ã=P\]n[Ã=#LMY=Ê��=Y:ÌJü!#U�:�çÞ�z
9�J�K qq�äi:�JæK qÑ�øi:�JÁK äæ�4i:�JøKd4�øijk:�rz¢£j9Ñ��¢£j:
�J�Kæ�4:�JæKæ�ø:�JÁK Á�æ:�JøKÁ�æjk�=HI��:ÌJü!U��¼9�JÁP�JøY
�hG�QlPoplHʼN�JÁP�JøN�Q�9Z[f�ò���Ò8�#æÑ�D;<G:
ÅV¼#|()¨(-D���;<G=�#_Y��QÞ¦S«D|()¨(-�¼\]
j�l�D«±G=HË#U��Jø9Å��æ!½jîìÁ�Ñ:�JÁ9Å�q!½jîìæ�æjk
�=H��#U�D��p=�j:�12j9ÌJü!U�D�ÇG�:�JÁP�Jø9ê�
�#.�-8!";<N§·j�:��#�j@üµ \N9�JøYhG�QlPQRUëD
�=H
�ÅÆÇÈ*�É8ÊË��ÌÍÎÏÎÐÑÒ�\À�Ó¡�
�jk��P�Jø:r@!"|()¨(-Dàp�$a'bc(,d%úû,-./
HI������[\]�
� � � � QHIR4S�TU�V�WXYZ�
���������� ��������������� �� �
���
��sts��pko�DüµG=H�,-./9x#g¥ó�¼�¥�Yef�lÔPD�ÞG�:¼
�#Z[�³r@��Å�K�R@#D¯-8�¡NG:�)%yºj\]j�l�RN~�D
ðñG=H�,-./#~�è×~¯8D�Á�ø�q¡�H�,-./9�o�§·N@�õYÖ
�jk:��Tj9Ì�Q(9¶#i«pYj�lH�`�,-./9'W�YefG�
Y¼¢£D]��ê:^NT#~9®×-D�¸�Y¼yQ¢£D��QlH
�
��
�
�
�
�Ì ÔÕ�;&Ö×DØ GÙÚ%Û&$�Ü_x�ÅÆÇÈ*�É8ÊË��-.�EJ
��;b�
Ô��j#12j$J �ÔPYj��K�lWª#¥"Ë#@#jê�!";<,-.
/D¹��lÔPYÖ��ÔPYÚ��=H�ÁjAXG=$a'bc(,d%úû,-./#
üµj9xYg¥ó� (�(YklòzmnNVWj�l�RNyº~�D$´G=Y:
��+O#^NY���lHË#=>N9,-./#wm¹ºD (�(Y�R�u³�è
(îD¾¿�l¹ºNG:yº~�j9�Ç\N\]j�l�RNüµ�lÔPYuvjklH
ËÔj�12j9:Ë#ÔPD �NÚ´:Ë#úË��PG�:Î[�:�¼#N:§«
(�(#Z[� «±D]�=�HIÁ�ø�Á�H�u³�è(îD.�-8¨(-#�j� �
l#j9�K:� Ê�=P�#÷øo�@A@«±G=H
������������ ���
����������
�����^�� �_`ab�?cdef?�@ABCDEDFGghij�
HI ���������P�klmn�
QHIR4S�TU�V�WXYZ�
���������� ���������
���
«±9�::Î[�#Z[#� ]ÐDSOG:M�#� D�®(�l÷øo�@AD
"#G=3 ø¨éPË#�O¹ºDXYGEFG=HEFG=3 ø¨éPË#�O¹º
D�Á�ø�P¡�HË#g:§« (�(#Z[#� ]ÐDSOG:P� NEFG=3
ø¨éDëìG=HËG�:©ªG=�u³�è(îD�f`zN�«¬G:yº÷ªéN
ê¸l\]_ÛN�Q�XYG=H
«±#�j:Z[� #�j�K'�¸¼�l��Q0N�Q�:� ��D«±G=H«
±N9Óä �+Ô#� � #��¼�k(0:�RJ0:�9Q0PQR�Q:;ñïùï�
DfQ=H«±#U�:�Q#� ��N9:�RÞ0:�"S\0:�)(%Ð�0#ËcN¢
Ïj�=HË�¼P!�\�Æ�¾¿:ê�#H"À�P#DÞäN�Q�«±G=H�Á�ø�T
#�Y¡��RN:�RÞ0#��D���Q9�"S\0P�)(%Ð�0�@&ÒýYyK¢
��"j�"Ê��Q=H�=:g�Àc9�¦�XP"À#��PÊ�lUVWN�Q�9
�RÞ0#�Q�@yQ�ôD¡G:� ��P!�\�Æ#DÞäD¡�ÔPYj�=H
�����o����pqrst�
��� ������� ���
���
����
��
��
���
���� !"�#$%
&�
'()*+*
,�
-./012
3*45
67 89�:�;<=�>
?@ ABC@D
EF
GHIJKL
M*3*NO
PQRST
UVVW XY
Z[\]^"
_�`a bcd�&efgh
W)i
jkl m�ncop
qrC@D
���������� ��������������� �� �
���
���������� !"#$�%
�12N�:IJ#!";<µÜj9ºG��=ê�#�ç�¼r@��DsJ�!
"D;<�lÔPYj�=H�=:"Y�Ú�l��N¼��W#"D©yG�êKÔPN��
� �ÔPYj��K���@�ÄÛDfQ�¥"jQ�@#}�j$a'bc(,d%Yj�
lÔPYÚ��=H"Y�Ú�lÖ�ä#kl»B9�12#'W�Y����Qlk��D
stË#��K#¨·¸º»YklY:Ë��¸j9�K:ÚÛX³YZ¼�Í�Å@é´
¼�:Ô��¼#y[ì��NêQ�9:b(î@y�l@#P2Ú�lHËG�:¼�=V#
Å@\Qv�Y�¥«¼GQ$a'bc(,d%Yj�lÔP0jk:�12#<�9�ÊNÔ
#v�D(=�ÔPYj�:��¶#��z9yQH
fg#�12#,T'ñ�jklY:�345678#<�jkl�"#�@0DʼN"
#G:�y¶À�;<!"DÉWj�l�R���l^_D]��Q�=QH�=: (�(
NÖ£N,-./DEOj�l�R�XYPG�:(\Rz#�j�,-./D�$a'bc(
,d%úû�%0PG�´ÞG�@¼RÔP�ÅN]WG�Q�=QH�NÔ#´ÞN�Q�9:
è^_H�N`Q;Ú�=PÔ��¥"D©yG�Ql0PQRTj�"YÉ�Q0PQRÂúN
;áG�QPG�:´Þ9�¼��Q��RPQR}Ûj#�aYk�=Y:§« (�(Y
ef�l�Tj9�"9gÖ��½ºPQR��N*KG�:bcP@´ÞG�d�=QH
�Óé(3#fg#?=�l,TPG�9:�345678#<�D!"´µN@H�G
=$a'bc(,d%úû�%#12T�D]��Q�=QP�p�QlHwx\N9efg
h�Å#:� 9j�lYxYg¥ó�W#:Z[#� D´µG:��³�z:r@�Å
#÷øo�@A@Üp¼�l�R�!"´µDi}G:�ìj¶�Å�#�#�%Dyºj
�l,-./#¹�DXYG�QlH
�
[Figure: Distributions of duration (sec), F0 (Hz), NAQ (relative units), and power (dB) for "ah"/"un" tokens used as Affirmative, Reflective, and Turn-holding responses]
� ����u�vw�xqy���()*+�z{�
�
� �
&'(% )*+,-./�%
� +�ò� ��ò÷øo�ÑÒ&%Ó
�0�%��1�2345�67%
��¹ºÓé(3#�J# �9:��èZ�� �D�"N: GÄ#!"P�z:�
@VDU#�¸l0PQRe*\jklH�Y:Ô#e*ºâDwx\N�>lNk=��:õ
`+k�òĦ��#`iYæ`�lÔPY:345678#l£��j´µÊ�=H{%®(
X#mëDÉ��:Ô�¼#`i9Q:�@�J=JN9noj��Q:G�Gp�\�`i
jk:Ô�¼#`iD�×òq×G�e*D�>lÔP9:e*#ÀzP¬íäDár\�
Bjã�Ê�lÔPY�¼�N��=H�345678#�j��¹ºÓé(3YY«���
D�=G:IN������ ;<!"#T�N��Yl�R�e*D,T�l=>N:Ô
�¼#`iD|s×�::tG�Ô�¼#`iD�tG:Ë#�u#7DÊ�lÔP@:��¹
ºÓé(3# �PG���N[Ü�¸¼�lÔPN��=��Á�Ñ�ï�H
õ`+k�#`iP9:õ`#ô;ND�l`ijklH�ü#� !"N�Q�9!
"õ³:vwxµÜDfQ=�!Ç�y§PQ�=!"¤õ\�12Y�jk:�_:
GÄ#�z³�@VN�Q�9�fë:� «±:"+õPQ�=:$a'bc(,d%ë\�
12Y�"\jklHÔ�¼ó�#129:�vDØÙN�p��ÇÊ�:vDzÞ\N{�
��RP�l�0�vY�|() PG�´>¼�l�0�ÅRQRÔPD¡�q�¬í�p| D¡G
=ÔPN�l#�0PQ�=Â�\��p_Y¢�K6Qö��ê:ô;9x}j9�QHÊ
¼N:Ô�¼ó�#12DnNô;G=�¸j9:¢���~YH���G�RH GÄ#��
l!"N@: GÄYË#� Nñ>l�z:�@VN@\KDÚl9:#�Û\@A9:!
"¤õ\�12j@:$a'bc(,d%ë\�12j@:>K½�\��QG��¸�Q�¼
jklH�ü#e*ºâD�>�Y¼:!"¤õ\�12P$a'bc(,d%ë\�12D
s���:�Û\�@ADYM�Bj�Q�l78\��þ�D�p:¥¼#e*�þ�
D^NG�QKÔPY:��¹ºÓé(3# �PG�ðÞÊ�=H
���������� ��������������� �� �
���
� ��>�|}y~���t���LM�)�������}�����������_`ab�?cd���
�����LM�������)��:�����������W����r� ?��¡¢£��
��¤¥¦��q�����§¨��©Vª«¬���®¯LM�)°;��±W²³´µ�¶�
Ħ��#`iP9:|()©ªPe*NÝR:3ø×®,(³�§�ND�lÛ¿�#
`ijk:Ô�@:�345678D�]�l�j¢���ÍP��=H
Ô�¼#`iD�u�l=>N:æÄÄÁ�z�j9��w���12�0:æÄÄø�zN9+S¬
�j#��X�!"12�0DÏ1T�G:{%®(Xje*ºâ#����DAXG;Q:
mëDêÔ��=H
Ë#g#?,T�¼H��= �|Ø¥¦� Nê¸l!"P�z:�@V#§·D�Ù
NSO�l�j:��óT# �Y?GKH��=H
gïTH¥¦� Nê¸l!"P�z:�@V#§·De*�l�j:� GÄ9�z:�@
VD!"jC�ÄNÜpl0PQR[µ\�$a'bc(,d%�CYü9YM�@#j�QÔ
PYà�G=H�*#78\�e*�þ�#XYPÞÇÊ�lBj:IJ#�C#gËD�
R$a'bc(,d%�CDÊ�lÔPY:?=� �PG�ðÞÊ�=HÔ�9:� !�Np
�G=¯~é�-&(ý~78ë#¹�PoQ�plÔP@j�lH
góT9:gïTPÞÇ�lBj:� �Çø7)De*�lÔPjklH¥¦� Nê¸l
!�P� �zDSO�l�j: GÄPC�Ä#��\[���[:äÏ:�4:#��
(ÑéÊ�Å�P9ÏN:� �Çø7)#�KYuvgÖ�jklÔPYÚ��=Y:� �
Çø7)9Ô��jaPJÅSOÊ��Q�Q#j:Â�\�e*DêÔ�RuvYklH��
Á�Ñ�æ�
���������� ���������
���
���>�·}¸q�¹�xqº»r¼�}½q�R§¨�©Vª«��´¾©W¿À�Á�®¯lÂ)��
_`ab�?cd�ÃRÄÅ�V�´R�Æ�q��¾©�Ǹ)-v�ÈÉWÊÀË�xqº»
r¼��Èɦ²³ÌÍδµ�¶�
12¹�ü�#=>N��¹ºÓé(3Y¶�=��«¶Ø�!"P�z:�@VYÅ#�
RN§·G�Ql�0PQR`i:oQ�pl�¼�!�P� �z#§·DÅ@�KÊ+Ê�=
BjP¼plN9:Ë@Ë@!�PG�Å#�R�@#D´ÞG:� �zPG�Å#�R�@#D
´Þ�Z��0PQR`iDBfG=HË#õòj:!"¤õ\�12P$a'bc(,d%ë
\�12Dô;G:o�@A@YM�BjJñ>l�R�78\�e*�þ�DXYG:
¥¦�� |()#©ªNP@�RÛ¿�#`i#�uD9��=HʼN:|��¶#·
fÖ�äDXYG=H
�����89:;��%
��¹ºÓé(39:¥¦� Nê¸l!"P�z:�@V#§·DSO�l�j:MN
��¢T#Í'D�=HÔ�9:!"¤õò$a'bc(,d%ëòo�12DJñt78\
��þ�DXY�lPQRhiN§�l:���j#�ajklH
gïT9:¥¦� Nê¸l!"P�z:�@V#§·De*�l�j:� GÄ9�z:�
@VD!"jC�ÄNÜpl0PQR[µ\�$a'bc(,d%�CYü9YM�@#j9�
QPQRÔPjklHwx\N9��#PêjklØ°�² �N"ÀPDÚlQK��#!�9:
Ë@Ë@� GÄ9�z:�@VD!"jC�ÄNÜpl0PQR�CN����QHË�¼9
�C�Ķ#Ü���0j9�K�C�Ä#�jQ�êÔ�����lx!]Ç0jklH °��²
� GÄ9�z³�@VD!"jC�ÄNÜpl0PQR \ë\�$a'bc(,d%�C9
�ü#�¥��¥¦� � #e*NhÊ�QÔPY�ªklH=Ppq:n�D�!�N
���plPQR�«9:��R: GÄY���l@#j9�QY:U�PG�9���p_
N·��Ê����� �zY�@G�Ê�lH
gó#Í'9:¥¦� Nê¸l!"P�z:�@VDSO�l�j:�Û\�STYÔ�
���������� ��������������� �� �
���
�j�p¼��Q=��N?v�PQRÔPjklHÔ��j:¥¦� 9�Û#�¨0Nkl@
#P´µÊ���=Y:ü&N9�Û9¥¦� PiMÊ�=PÔ�Næ`�l@#j9�
QH×%8ª(,d%³"À�Å:Ê����!�Y:=PpqMÁ���Á�PQR�Û\F
GN·����DÙplPQRÔPYÚ��=H
gç#Í'9:� �Çø7)PQR:Ô��j� Ê��Q�Qv�#�'jklH¥¦�
Nê¸l!�P� �zDSO�l�j: GÄPC�Ä#��\[���[:äÏ:�
4:#��(ÑéÊ�Å�P9ÏN:� �Çø7)#�KYuvgÖ�jklÔPYÚ��
=H
gù#Í'9: GÄ#�z:�@VND�l@#jklH GÄ#�z:�@V9âã\�
@#j9�K:o��ìN���öQYklH=PpqZ��j9:�J\N9�G�PU#
�Q�Ql��"ÀY:r"D��&N@SOÊ�l�Å:�ì�¬#��-8ø.5(D
?×G=ÎÏ�fëDvüÊ�luvYklH�.0�/0�00�10#ù«)j GÄ#"@D
P¼pl�pD:!"$a'bc(,d%#�þ�Nhf�lÔPN9`iYklPoÚ�lD
��Q��Á�Ñ�Á�H
�Ï�>��}ÐÑÒÓ��Ô�Õr@Ö�«×Ø��Ù=�Ú��²³Û¶q��§¨�©VªRÜÝ
)�V�´R�Æ���ÐÑWÞßË஦µ�¶��Àák4�´R�4â)WRã�ä�0åæ®
Ë®�ç©ä�è¦���«���WVéêë �¶�
g¢#Í'9:SO#_ÛëN@DÚl@#jklHü!ä\�FGj�¼�=!"P:¥
¦�� !"P9:Ô��j�ÞÊ��Q=��Nµ�lÔPYÚ��=Hwx\N9:×%8
ª(,d%P~7�%8#DEjklHÔ��j:Z��j9×%8ª(,d%P~7�%89��
NU#DENk:�#~7�%8D×%8ª(,d%Y����ÔP9�QPQR#YÉ�jk
�=Y:Ô#É�9ü!ä\�FGj9�K5�9�l@##:¥¦�� !"N9u:G@
���������� ���������
���
5�9�¼�QÔPYÚ��=H GÄ#B�V#\ÊN���9:×%8ª(,d%Y~7�%
8D���G:ʼN;N���9:�N~7�%8Y×%8ª(,d%D����ÔP@k�
lÔPYÚ��=��Á�Ñ�ø�H
�Ï�ì�í}���[îï�&WÞ�¸q���*+}k4�´R�dÕð�?cd�ñ¼òdÕRvó
ôõ�z{Wµç���ñ¼òdÕ«�dÕð�?cd¦�©ö�÷�R�®�®¯øùR�úûü)�
îï´ýþ ���WR�ËR��¦����¸q��WRµËR�þ�®÷�¦µ�¶�
��¢T#Í'#�N:��¹ºÓé(3j9:¥¦�|()D©ª�l&N?¢��
ÍP�l3ø×®,(³�§��Å#Û¿`iD�u�l=>N:ý0�P#½��úË#
�jÈ�#Û¿mCDº<G:Ë�¼DfQ�×%��(Ñ%8Pm�nÔG�P'~D�ÚG
=H�=:Þ�°æÄÄø�²�ÅDÉ��:Ô�¼#<�D|�N·f�lÖ�äDXYG=H
���������� �����
��¹ºÓé(3#12N¬s�l12PG�9:°�²¢�l��¼#:Üô\�%�õ\
� GÔPq12:°��²�Ä��: Ý¡¢¼#:��õ\òW¬õ\�� «±:°���²£¤¥
+:¢¥¦:§¨¼#:~{¯�³��ÈD�"P�l:� #�#�Û12YklH°�²9!"
¤õò�Ûò-&(ý~78ëN�=Yl@#:°��²9-&(ý~78ëò$a'bc(,d%ëN�
=Yl@#:°���²9-&(ý~78ëP�ÛN�=Yl@#jklH��¹ºÓé(39:Ô�¼
#©12#<�DPÔ���:!"¤õò�Ûò-&(ý~78ëò$a'bc(,d%ë�j
D'��:78\��þ��!"�Û0D¹�G�RPG�QlH
fg9:Ô�¼#©12P#U#��DʼN\>:²;12DÉG��!"�Û0D�,Ê
�l�¸j�K:��ç�¶#·fDü��lzÞjklH
gï9:|��¶#·fjklH%�|�³Z��|�NêQ���:uväYªq�
�Y¼:����ü�G�Q�Q�!"o�0|�#â«ò+,ì#=>N:�#|¬Ye
fj�l�!"o�0#Â�<)Dº=QH�345678D|�N·f�lÖ�äN�Q
�9:XY#U�:�Y«Ö�jk:��uv0PQRUëD��Ql@#jklH
���������� ��������������� �� �
���
gó9:$õ�¶#·fjklHÅÔj iYÙÚ�=�: GÄPC�Ä#WXDE9Å
#�R�@#��Å:� #BD³t4�Ò8#T�N:Ô#12<�D�G=QH
gç9:�Í�úûjklHV®��Å:Ê����6@j"YÉ��QWNèÚ��:�@
VDÔ>�GDZl��#T�9:Ô#345678Y5l�¼�µÊ��Q=@#�Y:fg
@.�¦�Ô#�R�úûP#ETD9�:��\Niü�b(îD¢iNG�P�=QH
&'<%=>?@AB-./�%
?@CD�EFG�=>HIJK%
�0���89:;��%
Face-to-face communication carries both verbal and non-verbal information; the verbal part of the communication integrates linguistic and non-linguistic information, while the non-verbal part, built from the face, the body and the voice, integrates socially learned and automatic signs.
Since language is built precisely to express ideas, cognitions and affects, it can be expected that only a small number of events in speech are related to automatic expressions of affect (real emotions as defined in psychology, more or less inhibited depending on the culture) and that the vast majority of affective events in speech are built by language. Thus, from a "quantitative" view, expressive speech is a flow of continuously succeeding and superposed affective speech that only rarely integrates emotional voice in parallel. But even if modelling expressive speech first requires modelling the "linguistic/phonetic" expression of affects, the rare emotional expression events must be studied closely: the fact that the speaker could not inhibit them means that they reflect an especially intense and crucial affective change in the speaker's state.
The ICP participation in Crest was mainly aimed at designing dense multi-modal data, tools and methods in order to study how emotional expressions in the voice and affective expressions in speech are carried together in the acoustic flow, and how the different cognitive levels of affect can be modelled in speech, in its phonetic, prosodic and linguistic structure. The ICP study focused on the verbal acoustic material of expressive communication, but the corpora were carefully designed to represent face and voice together in a very precise timing relation, and to cover both verbal and non-verbal parts of interactions (especially "feeling of knowing" data).
The first step was to show that emotions simulated by actors are not confused by listeners with authentic emotions. This implies that expressive speech absolutely must be collected from authentic production. Nick Campbell's team developed in Crest an "extensive", systematic method to collect complete speech productions. As a complement to this approach, the ICP team adopted an "intensive" method, that is, to collect data that can be predicted as the expected affective reactions of the subject to an elaborate, hidden, controlled situation. The relevant points of such an approach are:
- to compare different subjects in the same affective and cognitive contexts;
- to separate emotional productions from other everyday affective productions;
- to study the "rare" emotional events;
- to obtain very high quality signals;
- to collect speech and face signals, augmented with articulatory signals (in order to verify the acoustic signal processing) and physiological signals;
- to "trap" actors into authentic productions before they act the same situation, in order to show, over a large emotional variation, that acted speech is discriminated by listeners from authentic speech.
1. A cognitive architecture for the different kinds of affective processing in the communication competences of a human agent
Affect species recorded with E-Wiz
Affects are expressed in speech through different cognitive levels, which surface as distinct kinds of affects (moods/emotions; intentions/attitudes; feelings) (Aubergé et al., 2003):
1.1 The automatic affective processing
This is the direct expression of variations in the speaker's emotional state, independently of the communication purpose. Our hypothesis is that this kind of expression, commonly described as speech expression, is involuntarily controlled by the speaker. Its time scale is not anchored in the space of linguistic events but in the space of the events that cause the emotions; these are external to the communication context (they can be related to it by feedback loops, but are nevertheless considered external in our view).
1.2 The voluntary affective processing
- Attitudes, i.e. the direct expression of the speaker's intentions, voluntarily given by the speaker in addition to the communication purpose and directly encoded as prosodic forms (Aubergé et al., 1997).
- Indirect expression of affects, or expressiveness, implemented as strategies for the instantiation of linguistic structures. It operates as a meta-control of the linguistic functions of prosody (choice of segmentation size, emphasis, focalization, etc.).
The expression stream is generated in parallel to the linguistic and meta-linguistic
stream. These two parallel time scales are however integrated in the same speech
(prosodic) material. This point is surely decisive in particular to discriminate the
communicative vs. para-communicative streams (corresponding for example to the push
and pull effects of the Scherer (2001) model).
Figure 1: The place of affective processing in the cognitive architecture of a communicant agent
The communication system, driven by communication goals, uses a set of functions
which are valued globally across the system. The system is a set of modules in an interactive organisation based on co-operation between the modules, typically in a multi-agent architecture: the specific constraints and degrees of freedom of each module can consequently be respected. The coherence of several modules, when they encode the same function values, is achieved by a rendezvous between the structures of the different agents for the same function value.
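A minimal sketch, with invented class and function names, of the rendezvous idea described above:

# Several modules encode the same communicative function; a "rendezvous"
# collects each module's proposed value for that function so the proposals
# can be reconciled into one coherent setting.
class Module:
    def __init__(self, name):
        self.name = name

    def propose(self, function):
        # Each real module would apply its own constraints and degrees of
        # freedom here; this stub simply returns a dummy value.
        return {"module": self.name, "function": function, "value": 0.5}

def rendezvous(modules, function):
    """Gather every module's value for one function so coherence can be enforced."""
    return [m.propose(function) for m in modules]

proposals = rendezvous([Module("prosody"), Module("lexicon"), Module("gesture")],
                       "emphasis")
print(proposals)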
2. E-Wiz: the « trick and talk » platform
The corpus collection is a key point of the experimental methodologies that are
currently used in expressive speech technologies. We have proposed a way to build
authentic corpora for the emotional level of affective speech. In preceding reports, we
have recalled the strengths and weaknesses of in vivo vs. in vitro methods for
expressive speech, and explained why it is necessary to record authentic corpora of
spontaneous speech. Since the ICP goal, inside Crest, is, in particular, to finely measure
the acoustic variations of speech expressions, we needed some methods to catch such
speech in laboratory conditions for “perfect” signal recording.
2.1. Comparable multi-speaker affective productions
Considering the three bootstrapped levels of affect expression, it was particularly important to collect the direct emotional expressions while freezing the variability of attitudes and expressiveness (ceteris paribus), and to collect the direct attitudinal expressions while freezing the expressiveness.
The ‘Wizard of Oz’ paradigm, widely used for the evaluation of multimodal interfaces, consists in a human partner, called the ‘wizard’, imitating the behavior of a complex person-machine interface. The subject believes that he is communicating with a computer, whereas the apparent behavior of the application is remote-controlled by the wizard. For the collection of an emotional speech corpus, the wizard perturbs the application’s normal behavior in order to induce emotional states in the subjects. Moreover, the paradigm makes it possible to control the phonetic and linguistic content through the use of a command language that constrains the subjects’ vocal expression.
The key point in developing such scenarios is to define applications that strongly motivate the subjects: their involvement is in fact a decisive factor in their reactions, positive or negative, to the perturbations.
E-Wiz is written in Java language with a client-server architecture (Aubergé et al.,
2003). It enables the user to design induction scenarios, without any particular
computer-science knowledge. The common frame of such scenarios is to simulate the
behavior of a human-machine communication system using voice recognition in order
to collect direct emotional expressions in speech. Indeed, the hidden wizard is given the
possibility to remote control the application, according to the so-called ‘vocal
commands’ produced by the speaker. The platform is subdivided into three separate
applications, including an editor dedicated to the design of scenarios. This editor
application aims at generating configuration scripts describing the whole behavior of the
client-server applications for a given scenario. Then, a server program running jointly
with a client program directly uses those scripts for the actual corpus recording.
Scenarios designed with this software can handle several types of multimedia data, such as texts, images or sounds. Images and texts can be moved by the wizard to produce a kind of slideshow on the client side. In order to facilitate the placement of objects on pages with the editor, particular effort has been put into providing a user-friendly interface. For instance, editing and word-processing functionalities have been implemented to enable an intuitive use of the application. Moreover, the task of
the wizard may be lightened by making the behavior of some objects automatic. For instance, sounds to be played may be linked to the opening of particular slides, and object moves may be processed on the client side so as to seem machine-produced. In addition, automatic countdowns, whose behavior when specific values are reached can be predefined, may also be integrated into the slides.
2.2. The Sound Teacher application
E-Wiz scenarios developed for the collection of emotional speech are all based on the same basic principle: subjects have to interact with the computer using a command language. The use of a strictly restricted lexicon enables us to collect different emotional expressions on the same words, in order to facilitate the acoustic analysis. The first scenario, Top Logic, based on logical IQ tests and presented in the preceding report, did not motivate the subjects enough and yielded a poor corpus.
Sound Teacher (see fig. 2) is presented to the subject as a piece of software enabling him to improve his phonetic learning of languages. The subjects were chosen so as to be strongly motivated by this task. The software is supposedly grounded in the neuropsychological findings of perception-action theory. It is based on the teaching of 4 vocal tract parameters (opening, front/back, lip rounding, centralization). The subjects are trained to recognize the parameter values when hearing vowels, and to produce them. The scenario is organized in four steps, from less to more difficult from the pretext-task point of view, and from positive to negative feedback for the Wizard of Oz task.
Figure 2: E-Wiz situation with Sound Teacher
The first step checks the subjects' skills in the production and perception of French vowels (the subjects being French). Artificially positive feedback is given to the subject, noticeably higher than a supposed average score of the other subjects. Then, the subject must
learn vowels close to the French vowel system. The feedback is given as higher than the five best performances of the preceding subjects. He is informed that his high score enables him to move on to a phase of generalization to complex vowels. There, the feedback suddenly becomes negative: the subject is given a score much lower than the average. He is warned that these results are abnormal and that his skills for vowels of the French phonological system have to be checked again, since the Sound Teacher software may have perturbed his competence. The last step is thus similar to the first one, but the audio stimuli have been modified so as to perceptively strongly decrease the vocalic contrasts, so that the subject cannot perform the task. He is given the lowest scores of the preceding subjects. Some comments are regularly asked of the subjects, taking a beta-version of the software as a pretext. See table 1 for a summary of the scenario.
Table 1: Sound Teacher scenario
Each recording session lasted around 50 minutes. For each session, the speech data consist of the command words 'next page' (in French) repeated 50 to 60 times, and of five monosyllabic words (chosen to avoid timing and long-term prosodic effects), distributed over the phonological space and repeated 11 to 50 times.
17 subjects have been recorded. Some of them are professional actors, tricked into spontaneous expressions like the others. For them, an extra protocol was used: immediately after having been trapped by Sound Teacher, those subjects were asked to reproduce, using actors' methods, the expressions of the emotional states they had encountered during the experiment. This task was performed both on the utterances used in the spontaneous part and on semantically neutral sentences.
The emotions expressed by the 17 subjects are close to what was expected in the pre-planned scenario: concentration, satisfaction, joy, relief, stress, anger, discouragement, boredom, anguish. It has to be noted that highly coherent groups of
reactions appear among subjects, surely linked to their psychological profiles. A first emotional labeling is done by the subject himself after the experiment: he is given a VHS video tape, as well as a pre-filled grid, with the task of describing the different emotional states he felt along the experiment. This labeling is being validated by perceptive tests, as is the labeling of the acted productions.
2.3. The experimental protocol
Subjects were recorded on DAT tape in a soundproof room, with an AKG C1000S microphone, for high-quality speech recording. Some reference measurements are kept in order to validate the nature, the intensity and the temporal location of the expressions of emotional variation:
� visual signal, that is mainly movements of the face and the upper part of the
subject’s body ;
� bio-physiological signals (heart rate, galvanic skin response, respiration,
temperature, electromyography recorded with the Pro-Comp equipment) ;
� the articulatory signals related to voice quality (for now only electroglottographic
signal, recorded thanks to the experimental platform EVA2).
These signals can be analysed in parallel with the perception measurements. They constitute the main indices of "emotional timing", used to determine the instants at which the prosodic movements qualifying the emotion expressions must be measured. Figure 3 describes the experimental protocol.
Figure 3: the experimental protocol
3. Prosody modelling
3.1. Gradual cues vs. contours characterization
Bänziger et al. (2003) return to the problem of the "emotion signature in intonational patterns". They recall that this idea, proposed earlier, has been discussed and tested in parallel with co-variation models implying gradual parameter variations, independently for F0 values and voice quality values. The main point to be decided is whether the affective vs. the other cognitive information carried by the prosodic signal is extracted/implemented following different morphological mechanisms. The notion of pattern, particularly the intonational pattern in which the emotion signature can possibly be implemented, depends on the adopted theoretical approach.
Our central hypothesis is that the perceptive separation between affective and linguistic treatments comes at the end of the prosodic treatment, and not just after the "parameter extraction", that is, before the morphological (phonological) treatment. In this view, the identification of affective vs. linguistic information is precisely derived from the prosodic morphology; the prosodic analysis can decide on the nature of the encoded function. Following this hypothesis implies that (1) a cognitively relevant model of prosody is the key to identifying the kind of processing (emotion vs. high-level cognition) through which the information is treated after the prosodic extraction, and (2) this model must be built following morphological laws that are basically the same for all linguistic and non-linguistic functions encoded in the prosodic signal.
3.2. Acoustic analysis
Word and phoneme labelling of the spontaneous and acted corpus was performed
thanks to the Praat software by a single expert. Additionally, Praat scripts were
developed to extract stimuli together with corresponding labels.
F0 contours were calculated on vowels only, located from an expert phonetic labeling. Values were extracted by means of EdiProso, a prosodic editor developed at ICP and running in a Matlab environment. The F0 extraction algorithm counts, after signal filtering, the number of times the signal crosses a predefined threshold downwards, set to 10% of the amplitude for this study. Smoothed F0 contours, averaged on 32 ms frames shifted by 10 ms, were calculated from the algorithm output. Flattened contours, plotted on ten points to enable comparisons of vowels independently of duration, were also extracted.
Vowel duration values were calculated from the phonetic labeling. Those values were converted from time units to a percentage of variation around the mean (intrinsic) duration of the same vowel in the corpus, thus enabling cross-vowel comparisons.
Attack and final frequency values were also extracted and used to calculate the declination line. In order to avoid the calculation errors that frequently occur at signal limits, the attack and final locations were shifted by 10% inside the vowel prior to the extraction of values.
emotional label.
Table 2 presents the general characteristics of the contours. It is to be noted that the neutral contour for the acted emotions and the "nothing" contour for the authentic emotions confirm the hypothesis of minimal intonation (reduced segmentation/hierarchisation, focalization): the attacks of both are at the same level as the speaker's basic vocalic F0 (which is the intonation reference in our intonation model; i.e. we define here, as an anchor point of contours, the F0 level as the difference between the attack and the mean F0), the shape of the contour is flat, and the declination line corresponds to the "normal" articulatory effort on such monosyllables.
Emotion | Valence | Arousal | F0 level (semitones) | F0 decl (semitones) | F0 dyn (semitones) | norm dur (%)
A anxiety N B 10 -1 1 -15,9
A deception N S 1 1 1,5 85,6
A disgust N B 3 0 1 142,0
A fear N B -4 6 6 14,5
A hot anger P B 15 3 3 29,2
A joy P B 11 0 1,5 16,2
A pos conc P S 10 -2 3 18,6
A pos surp N B -2 8 8 30,2
A weariness N S 8 1 1 -2,9
A sadness N B 10 -3 3 0,4
A satisfaction P S 21 -3 7 77,7
A worried N S 0 11 11 17,9
A neutral – – 0 0 0,5 1,2
anxiety/fear N B 2 7 7 -6,6
confidence P S 3 -5 6 23,4
joy/surprise P B -1 5 5 -12,6
weariness N S -3 2 2 -14,3
neg conc N S 2 3 3 -20,6
nothing – – 0 -2 2 -14,1
pos conc P S 1 -4 6 -1,3
Joy P B 1 5,5 5,5 -5,5
dec/surp P B -1,5 7,5 7,5 -26,6
anxiety N S 1 7,5 7,5 -7,5
Table 2: Characteristic values of contours. F0 level is the difference in semitones between the attack and the mean speaker F0; norm duration is the percentage deviation from the vowel's mean (intrinsic) duration. "A" before an emotion means acted emotion. N is negative and P positive valence; B is big and S small arousal, as evaluated by the speaker himself.
The general dynamics of the acted contours is lower (3,7 semitones) than the general dynamics of the authentic contours (5,2 semitones). The general F0 level of the acted contours is higher (6,4 semitones) than that of the authentic ones, which are on average near 0 (but with significant variation). The duration of vowels (minimal rhythm) is strongly higher for acted speech (32% vs. –8,6%).
Figure 4: Acted satisfaction apart, the contours of acted joy, anxiety, sadness and positive concentration are close in form to the neutral contour. (F0 contours in semitones; curves: acted joy, acted positive concentration, acted anxiety, acted sadness, acted satisfaction, acted neutral.)
Figure 5: Acted disgust, weariness and deception have no specific prominence, but do not follow the neutral basic declination line. (F0 contours in semitones; curves: acted disgust, acted weariness, acted deception, acted neutral.)
Figure 6: Acted fear, surprise and worried show a similar rise with a final prominence. (F0 contours in semitones; curves: acted fear, acted worried, acted positive surprise, acted neutral.)
Figure 7: Authentic deception/surprise, joy/surprise, anxiety/fear, anxiety and joy on one hand, and negative concentration and weariness on the other hand, share similar shape cues. (F0 contours in semitones; curves: weariness, deception/surprise, joy/surprise, anxiety/fear, anxiety, nothing, joy, negative concentration.)
Figure 8: Authentic confidence and positive concentration have a similar shape, with a prominence in the first third of the vowel. (F0 contours in semitones; curves: confidence, positive concentration, nothing.)
The differences in shape and in gradient cues (summarised in table 2) should be interpreted as significant for expressing some cues of the emotional/mental states as labeled by the speaker himself.
In parallel, we symbolized the kinds of contours very roughly into 9 classical classes of contours (/ ; /¯¯ ; /¯ ; /� ; _/ ; ¯¯ ; ¯� ; �/ ; �_ ; �). In parallel, we quantitatively measured other parameters: for F0, the mean, standard deviation, range, percentiles, min/max and jitter; for the other source parameters, the NAQ as well as 11 spectral parameters. The only clear effect emerging from the ANOVA calculations is the effect of the contour /¥ on NAQ, jitter and spectral slope. But the choice of symbolic classes of contours is, first, not univocal to define from the dynamics of the shape and, second, may well not be these classical symbols: the symbolism must include some cues which can be observed in the preceding figures (4 to 8). In particular, the relevance of the place and threshold of the glissando, psycho-acoustically validated but irrelevant for linguistic prosody, could be evaluated for emotional prosody values, in particular when the timing is not linked to linguistic units.
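As an illustration of this kind of calculation, here is a minimal sketch (invented values; the exact jitter definition and contour classes used in the study are not reproduced here) of a one-way ANOVA testing the effect of the symbolic contour class on NAQ:

    import numpy as np
    from scipy.stats import f_oneway

    def local_jitter(periods):
        """Mean absolute difference between consecutive periods, relative to the
        mean period (one common local jitter definition)."""
        periods = np.asarray(periods, dtype=float)
        return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

    def contour_class_effect(values_by_class):
        """One-way ANOVA of an acoustic parameter (e.g. NAQ) grouped by the
        symbolic contour class assigned to each stimulus."""
        groups = [np.asarray(v, dtype=float) for v in values_by_class.values()]
        return f_oneway(*groups)

    # Example with made-up NAQ values for three contour classes
    naq_by_class = {"rise": [0.12, 0.14, 0.13, 0.12],
                    "fall": [0.11, 0.10, 0.12, 0.11],
                    "rise-fall": [0.22, 0.25, 0.21, 0.24]}
    print(contour_class_effect(naq_by_class))
    print(local_jitter([0.0050, 0.0052, 0.0049, 0.0051]))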
3.3. Analysis of the NAQ prosodic vs. phonemic variation
Two speakers were selected on the basis of their clear emotional and comparable productions. After the segmentation of the interesting stimuli from the raw corpus, the phonetic labeling was performed by an expert. Numerous productions, by those two speakers, of words supposed to be monosyllabic revealed the presence of an unexpected final schwa, making those words disyllabic. Schwas were therefore also included in the analyses, as well as the other vowels.
Acoustic analyses, implemented as Matlab routines, were carried out for every stimulus of the corpus. Fundamental frequency and intensity were estimated thanks to algorithms developed at ICP, and were used to calculate numerous distribution parameters: mean, standard deviation, jitter, shimmer, range and percentiles, as well as modeled F0 contours. Moreover, spectral analyses were implemented to calculate the spectral slope, the Hammarberg index and the average long-term voiced spectrum over 9 frequency bands. Finally, the durations of phonemes and syllables were calculated from the phonetic labeling.
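As one example of these spectral measures, the Hammarberg index is usually defined as the difference between the spectral energy maxima of the 0-2 kHz and 2-5 kHz bands; a minimal sketch of the standard definition (not necessarily the exact implementation used here):

    import numpy as np

    def hammarberg_index(frame, sr):
        """Difference (dB) between the spectral maximum in the 0-2 kHz band and
        the maximum in the 2-5 kHz band, computed on one windowed frame."""
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        db = 20.0 * np.log10(spectrum + 1e-12)
        low_max = db[(freqs >= 0.0) & (freqs < 2000.0)].max()
        high_max = db[(freqs >= 2000.0) & (freqs < 5000.0)].max()
        return low_max - high_max

    # Example on a 32 ms frame of a synthetic voiced-like signal at 16 kHz
    sr = 16000
    t = np.arange(int(0.032 * sr)) / sr
    frame = np.sin(2 * np.pi * 150.0 * t) + 0.2 * np.sin(2 * np.pi * 2500.0 * t)
    print(hammarberg_index(frame, sr))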
Amplitude-based parameters have been suggested to provide a more robust method
than time-based parameters for analyzing voice quality. The most widely used among
them is the Normalized Amplitude Quotient proposed by Alku et al (2003). NAQ can be
considered as a normalization of the "declination time", defined as

    NAQ = (UP / EE) × F0

where UP is the peak-to-peak amplitude of the glottal flow, −EE is the value of the negative peak of the glottal flow derivative and F0 is the fundamental frequency.
Automatic calculation of the Normalized Amplitude Quotient was performed thanks to the algorithm developed by Parham Mokhtari in the Nick Campbell CREST ESP research group. This algorithm calculates NAQ from the speech signal on automatically detected syllabic reliability centers. This enables a fully automated extraction of NAQ values, thus providing a measurement of voice quality on unlabelled spontaneous speech.
Gobl and Ní Chasaide (2003) have proposed to extend amplitude-based parameters to
the estimation of time-based parameters. Therefore, the open phase of the glottal pulse
can be estimated by

    T1A = (π / 2) × (UP / EI) + UP / EE

where EI is the value of the positive peak of the glottal flow derivative. π·UP/(2·EI) is considered as an estimation of the glottal flow opening-phase duration, and UP/EE corresponds to the closing-phase duration. Therefore, OQ is estimated by T1A × F0. The same algorithm was also used to implement the
calculation of Open Quotient from amplitude domain OQA. Moreover, the estimation of
F0 performed by the algorithm at every detected reliability center was extracted in order
to be compared to other estimations of pitch.
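A minimal sketch of these amplitude-domain voice-quality measures, following the formulas above (the UP, EE and EI values in the example are purely illustrative):

    import math

    def naq(up, ee, f0):
        """Normalized Amplitude Quotient (Alku et al., 2003): the closing-phase
        estimate UP/EE normalised by the fundamental period, i.e. multiplied by F0."""
        return (up / ee) * f0

    def open_phase_t1a(up, ee, ei):
        """Amplitude-based estimate of the glottal open phase (Gobl & Ni Chasaide,
        2003): opening phase ~ pi*UP/(2*EI) plus closing phase ~ UP/EE."""
        return (math.pi / 2.0) * (up / ei) + up / ee

    def open_quotient(up, ee, ei, f0):
        """Open quotient estimated from the amplitude domain: OQ ~ T1A * F0."""
        return open_phase_t1a(up, ee, ei) * f0

    # Illustrative values (arbitrary flow units): a modal-voice order of magnitude
    up, ee, ei, f0 = 0.5, 400.0, 600.0, 120.0
    print(round(naq(up, ee, f0), 3))              # 0.15
    print(round(open_quotient(up, ee, ei, f0), 3))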
Electroglottography is a measurement of impedance and gives information on the area of vocal fold contact. F0EGG can be reliably estimated from the EGG signal. Henrich (2002) proposes an autocorrelation method between the EGG signal and its derivative for the estimation of the duration of the glottal pulse open phase T1EGG and of the EGG Open Quotient (OQEGG).
When calculated from unlabeled continuous speech, NAQ is available only on reliability centers, i.e. vocoids as defined by Mokhtari [12]. Therefore, the locations of these reliability centers were also extracted and matched against the expert phonetic labeling of the corpus to ensure that the detected segments are actual vocoids. Table 3 presents the repartition of the reliability centers according to the phonemic labels. 68% of them are found in vowels, and 15% in sonorants. Apart from vowels, the nasal consonant [n] is often detected as a reliability center, and will hence be taken into account in further analyses.
i | e | a | o | u | schwa | nasal o | n | others
9.4 | 11.6 | 14.7 | 7.3 | 8.8 | 3.0 | 13.2 | 8.3 | 23.7
Table 3: Repartition (%) of the reliability centers according to phonemic labels.
Table 4 shows the mean values and confidence intervals of NAQ for each phoneme. NAQ ranges from 0.07 to 0.32, which is to be compared with the Alku et al. (2003) results obtained from five male speakers: pressed (0.08-0.11), modal (0.11-0.17) and breathy (0.23-0.35). Mean values of NAQ seem to be higher for higher oral vowels; however, this tendency is not significant. The schwa shows a higher NAQ. This trend is due to a clearly bimodal repartition of NAQ values. Speaker 1 adds a schwa at word endings with a high F0 and a high NAQ (0.28), which corresponds to a breathy voice. Speaker 2 produces schwas with a modal voice: NAQ values are about 0.12, as for [e].
The choice of producing or not a final schwa seems to reveal a speaker-strategy related
to speech-act expressive values. The nasal vowel [o] shows NAQ values similar to high
vowel ones. The nasal consonant [n] has NAQ values about 0.19, which can be
interpreted as a breathy voice. All differences are significant except between [n] and [e].
However, it seems unrealistic that the phoneme [n] in “Jean” is always produced with a
breathy voice, whereas the vowel [o] is not. This might be due to its final position, but
high NAQ values are also measured when [n] is followed by a schwa. A possible
explanation is that nasality produces mainly low frequencies, thus attenuating higher
frequencies and increasing the spectral slope. Both nasality and breathiness acoustically
correspond to an increase in the spectral slope induced by supra-laryngeal settings for
nasality and laryngeal settings for breathiness.
Table 4: Mean values and confidence intervals (p < 0.01) of NAQ for each phoneme
4. Perceptive validation of the corpus
4.1. The experimental protocol
In order to validate the emotional expressions collected through such a paradigm, a perceptive validation has to be carried out. It first has to validate the acted emotions: the "big six", and the emotions reported by the speaker himself. The results of this test give a first map of what listeners can efficiently perceive, and of which kinds of emotions cannot be differentiated. Then, the spontaneous data can be evaluated on a pre-tuned set of emotional categories. This section presents the results of the first step of the evaluation: the analysis of the acted emotional expressions.
Subjects
26 subjects participated in this experiment: 4 males and 22 females, aged from 19 to 45 years (25 years on average).
Figure 9: screenshot of the perception test answer page, showing the 14 emotional scales
The sentences proposed to the listeners were extracted from the recordings of one actor of the corpus described above. There are two reasons for that: first, the acoustic analyses made on the corpus are highly speaker-dependent; second, the set of spontaneous emotions reported by this speaker is quite broad.
Then, two expert listeners rated all his productions, in order to select only the best-acted performances and to restrict the corpus for the listening test. To rate each stimulus, the judges listened to each stimulus in a random order, first in audio only and then in the audio-video version, and gave each one a grade from 1 (very bad) to 4 (very good). Only the stimuli with a grade of 3 or 4 were kept for the test. Then, a subset of these stimuli was extracted according to the following criteria:
- Stimuli were selected so as to propose a systematic variation of their length. For each emotion, one stimulus is proposed with each of the following lengths: 1, 3, 5 and 7 syllables. This was done in order to test whether the length influences the perception of emotional expressions.
- One stimulus (the "page suivante" sentence) was selected to represent all the emotional variations, in order to test all emotional expressions on exactly the same linguistic structure.
- The 14 acted emotional expressions were tested: both the "big six" and the 8 reported at the end of the spontaneous phase, i.e.: amusement, anger, anxiety, deception, disgust, expectancy, fear, happiness, neutral, resignation, sadness, satisfaction, surprise, worried.
This gives 70 different stimuli, presented to listeners both in an audio and in an audio-visual modality (resulting in 140 different stimuli).
The perception test was carried out in a quiet room, using a computer to play the
stimuli and to record the answers. Subjects listened to the stimuli via headphones, at a
comfortable hearing level.
They first heard the audio-only stimuli, mixed in a random order (controlled in order to avoid the successive presentation of the same sentence) and different for each listener. Then, they perceived the audio-video stimuli, in a different random order. They always heard the audio-only stimuli before the audio-video ones, because the audio-video stimuli are used only as validation stimuli, to check whether the audio expressions match the facial ones.
When the listener heard a stimulus, he had to rate the perceived intensity of the emotional expression for each of the fourteen labels proposed, on a scale from 0 (the emotion was not perceived) to 10 (the emotion is very intense). In order to give his answer, he had to use a set of 14 sliders corresponding to the emotions (cf. fig. 9). Stimuli could only be heard once, and listeners were told to give their answers as spontaneously as they could.
4.2. Results
The results of this perception test were compared among the 26 listeners, in order to ensure the coherence of their answers. The correlation between their answers for each stimulus was calculated for all pairs of listeners: all are significantly correlated, with p<.05. Once the inter-listener coherence was checked, the overall dispersion matrices for the audio-only and audiovisual conditions were calculated (cf. fig. 10 & 11).
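The coherence check and the dispersion matrices can be computed along the following lines (a minimal sketch with invented ratings; variable names are hypothetical):

    import numpy as np
    from itertools import combinations
    from scipy.stats import pearsonr

    def interlistener_correlations(ratings):
        """Pearson correlation between the flattened answer vectors (stimuli x
        labels) of every pair of listeners, to check answer coherence."""
        return {(a, b): pearsonr(ratings[a].ravel(), ratings[b].ravel())
                for a, b in combinations(sorted(ratings), 2)}

    def dispersion_matrix(ratings, input_emotions):
        """Mean perceived intensity of each answer label per input emotion,
        averaged over listeners and over the stimuli of that input emotion."""
        mean_over_listeners = np.mean([ratings[l] for l in ratings], axis=0)
        rows = sorted(set(input_emotions))
        matrix = np.array([mean_over_listeners[
            [i for i, e in enumerate(input_emotions) if e == r]].mean(axis=0)
            for r in rows])
        return matrix, rows

    # Example: 2 listeners, 6 stimuli, 14 answer labels
    rng = np.random.default_rng(0)
    ratings = {"L01": rng.uniform(0, 10, (6, 14)), "L02": rng.uniform(0, 10, (6, 14))}
    input_emotions = ["anger", "anger", "joy", "joy", "neutral", "neutral"]
    print(interlistener_correlations(ratings))
    matrix, rows = dispersion_matrix(ratings, input_emotions)
    print(rows, matrix.shape)                      # 3 input emotions x 14 labels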
4.3. General analysis
These two dispersion matrices reflect the first (and expected) result of this perception test: results in the audiovisual condition are always equal to or better than those in the audio-only condition, but the two conditions are quite coherent. A first analysis of the
differences between the two conditions shows that:
- Disgust seems difficult to recognize in the audio-only condition, whereas the audio-visual condition is extremely efficient. These findings are completely in line with the conclusions of Scherer (2003) and Juslin & Laukka (2003) about disgust. However, we noticed an important difference in the listeners' ability to acoustically perceive disgust: about one half of them rated acoustic disgust as efficiently as audio-visual disgust, whereas the other half did not perceive acoustic disgust.
- Listeners did not use some categories: "expectancy" is recognized as "neutral" and "deception" is recognized as "resignation". For these two emotions, the face does not give more efficient information.
- Listeners in the audio-only condition mainly use the neutral category when they don't "understand" the emotional expression. This happens for emotions with a low activation, such as "expectancy", "happiness" (the happiness played by this speaker is a very low-activation one), and "resignation".
- "Anxiety", "worried" and "fear" are mixed together, and listeners could hardly make differences between them, in both conditions. There are also some confusions between "amusement", "happiness" and "satisfaction", but not systematically: "amusement" is discriminated in the audiovisual condition, whereas "satisfaction" is mixed between "happiness" and "satisfaction". As already said, "happiness" is reported as "neutral" in the audio-only condition, but it is distributed between "amusement", "happiness" and "satisfaction" in the audiovisual one.
The best recognized acoustic emotional expressions are "amusement" (even if it is mixed with happiness), "anxiety" (mixed with "worried" and "fear"), "anger", "neutral", "satisfaction" (mixed with happiness) and "surprise".
Figure 10: Dispersion matrix of the audio-only condition. The rows show input emotions, and the columns the mean answer of listeners. The intensity of the grayscale filling each square reflects the perceived intensity of each emotion.
Figure 11: Dispersion matrix of the audiovisual condition. The rows show input emotions, and the columns the mean answer of listeners. The intensity of the grayscale filling each square reflects the perceived intensity of each emotion.
In order to analyse the results of this experiment more precisely, we group together the results of the different emotions that were not distinguished by listeners, in order to extract the cognitively pertinent classes of vocal expression of emotion. The results concerning the influence of the stimulus length on the emotional expressions are then presented.
Figure 12: Dispersion matrix for the 8 new categories obtained after grouping together the non-pertinent ones. The rows show input emotions, and the columns the mean answer of listeners. The intensity of the grayscale filling each square reflects the perceived intensity of each emotion.
In order to have a more precise view of the relevant emotional expressions produced by this speaker, we have grouped 9 emotional labels into 3 new and more general labels, and exchanged the answers given in two categories ("Deception" and "Resignation"):
- "Fear", "Anxiety" and "Worried" are grouped together into a general "Fear" category.
- "Amusement", "Happiness" and "Satisfaction" are regrouped into the "Joy" category.
- "Neutral", "Expectancy" and "Deception" are grouped into a global "Neutral" category.
This results in a new dispersion matrix, with 8 emotional categories (cf. figure 12). The
perceptive distinction for these categories is quite good. Thus, this set of labels, and the grouping made to obtain it, is very important in the perspective of evaluating the spontaneous data.
Influence of the stimulus length on the perception of emotion
The last factor to be analysed is the effect of the stimulus length on the perception of emotional expression. To obtain this information, we grouped the answers obtained for all stimuli of a given length, resulting in 4 length groups (1, 3, 5 and 7 syllables). For each length group, the average intensity given to each of the 14 emotional labels was calculated, in order to test whether the listeners' answers differ from one length to another (cf. figure 10).
The correlations between the results of the four stimulus lengths were calculated, and all are significant (p<0.05), indicating that, for each emotional label, the length of the stimuli does not change the answers.
4.4. Summary of perceptive analysis
These results are conceived as a first sorting of the collected data, as the expression of emotion raises a lot of very basic questions, such as (1) the ability of humans to act an emotion, or to perceive the difference between acted and spontaneous speech; (2) the cognitive pertinence of each emotion label, in one language, in several languages, or even across different cultures; or (3) the relation between an emotion and its expression in speech (e.g. what is the intelligibility of the acoustic contours for each emotional function).
This experiment deals with the second question, by pointing out label groupings and by rating the relative efficiency of labels and acted productions. It could also bring some information to question 3, by comparing the acoustic analysis and the listeners' answers. Moreover, these first results underline that the length of the stimuli does not change the ability of listeners to rate the emotional expressions.
5. The Japanese attitudes
The attitudes are directly encoded into prosodic contours and cover a very large spectrum of affects. They can be related as much to the "Belief, Desire and Intention" premises of the theory of interaction dialog as to the linguistic or pragmatic features defined as intentions. Attitudes are not completely learned in the development of the child before 7 years (Clément, 97); some expressions of attitudes seem universal (such as the surprise value), some are completely different from one language to another, and some attitude values (that is, the social concepts represented) are specific to some languages. We studied how common Japanese attitudes can be perceived and interpreted by French listeners who are naïve in Japanese. The studied attitudes have been chosen as relevant in the literature (Schoshi, 04):
- doubt
- evidence
- surprise
- authority
- irritation
- admiration
- arrogant / impoliteness
- serious/sincerity
- politeness/kyoshuku
- politeness
- declaration
- question
In order to study the possible interference of lexical stress with the French perception, the corpus was built with lexical stress distributed over utterances of varying length, as shown in figure 14.
Figure 14: The linguistic corpus structure
No. | Stress position / syllable count | Phrase
1 | 1 / 1 | Me
2 | 1 / 2 | Nara
3 | 1 / 5 (3+2) | Narade neru
4 | 1 / 7 (4+3) | Nagoyade nomimas
5 | 2 / 7 (4+3) | Narashide nomimas
6 | 3 / 7 (4+3) | Matsuride nomimas
7 | 0 / 7 (4+3) | Naniwade nomimas
The corpus was recorded in a quiet room by a Japanese teacher.
5.1. Validation of the corpus on Japanese listeners
15 Japanese listeners took a perception test organised as a closed forced choice. Table 5 shows that each attitude was identified with a score significantly above chance (chance = 100/12, i.e. about 8.3%).
Attitude | χ² (df = 11, p < .001)
Admiration | 111.3
Arrogant/impoliteness | 579.0
Authority | 293.1
Declaration | 478.0
Doubt | 353.0
Evidence | 249.4
Surprise | 392.9
Irritation | 838.2
Kyoshuku | 127.3
Politeness | 453.7
Question | 657.9
Sincerity/serious | 176.5
Table 5: Japanese listeners. Results of the χ² test on the mean answers per attitude (all lengths and stress positions), compared to chance.
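Such χ² values can be reproduced from the raw answer counts of each input attitude with a test against the uniform chance distribution over the 12 response categories; a minimal sketch (the counts in the example are invented):

    import numpy as np
    from scipy.stats import chisquare

    def chi2_against_chance(answer_counts):
        """Chi-square test of the answer distribution for one input attitude
        against uniform chance over the 12 response categories (df = 11)."""
        counts = np.asarray(answer_counts, dtype=float)
        return chisquare(counts)                  # expected defaults to uniform

    # Example: 105 answers largely concentrated on the correct category
    stat, p = chi2_against_chance([90, 2, 1, 0, 3, 0, 2, 1, 0, 4, 1, 1])
    print(round(stat, 1), p < 0.001)              # large statistic, p < .001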
Figure 15: Identification scores for Japanese listeners, per attitude (declaration, simple question, evidence, irritation, arrogance, authority, doubt/incredulity, surprise, admiration, politeness, sincerity/serious, kyoshuku), on a 0-90% scale.
Percentages of answers, per input attitude (columns) and answered attitude (rows)
Input: AD AR AU DC DO EV SU IR KYO PO QS SIN
AD 21,9% 0,0% 0,0% 0,0% 1,9% 0,0% 1,0% 0,0% 1,9% 4,8% 0,0% 3,8%
AR 1,9% 72,4% 5,7% 10,5% 5,7% 21,0% 0,0% 2,9% 1,9% 0,0% 1,0% 0,0%
AU 0,0% 9,5% 51,4% 2,9% 1,0% 9,5% 1,0% 11,4% 15,2% 0,0% 1,9% 1,0%
DC 4,8% 5,7% 10,5% 65,7% 0,0% 12,4% 0,0% 0,0% 2,9% 8,6% 1,9% 9,5%
DO 0,0% 0,0% 0,0% 0,0% 56,2% 0,0% 14,3% 0,0% 0,0% 1,0% 3,8% 0,0%
EV 8,6% 5,7% 17,1% 7,6% 0,0% 45,7% 5,7% 0,0% 10,5% 1,9% 1,0% 3,8%
SU 9,5% 0,0% 0,0% 0,0% 14,3% 4,8% 59,0% 0,0% 1,0% 1,9% 1,0% 1,9%
IR 1,0% 5,7% 4,8% 1,0% 13,3% 2,9% 4,8% 85,7% 12,4% 0,0% 1,0% 1,0%
KYO 11,4% 0,0% 0,0% 0,0% 0,0% 0,0% 0,0% 0,0% 24,8% 9,5% 1,0% 27,6%
PO 26,7% 1,0% 4,8% 11,4% 0,0% 1,0% 0,0% 0,0% 2,9% 64,8% 8,6% 17,1%
QS 0,0% 0,0% 0,0% 0,0% 7,6% 2,9% 14,3% 0,0% 0,0% 1,0% 77,1% 1,9%
SIN 14,3% 0,0% 5,7% 1,0% 0,0% 0,0% 0,0% 0,0% 26,7% 6,7% 1,9% 32,4%
Figure 16: Confusion matrix, Japanese listeners. Rows give the answered attitudes, columns the input attitudes. AD (admiration), AR (arrogant/impoliteness), AU (authority), DC (declaration), DO (doubt), EV (evidence), SU (surprise), IR (irritation), KYO (kyoshuku), PO (politeness), QS (question) and SIN (sincerity/serious).
- Very well recognised attitudes: irritation, question, arrogance, declaration, politeness, surprise and doubt.
- Well recognised attitudes: authority, evidence, sincerity.
- Less well recognised attitudes: kyoshuku and admiration.
5.2. Perception of Japanese by French subjects
15 French listeners took the same perception test. Table 6 shows the scores compared to chance: the French subjects could not discriminate all of the attitudes.
Attitude | χ² (df = 11, p < .001)
Admiration | 120.9
Arrogant/impoliteness | 117.3
Authority | 556.4
Declaration | 291.9
Doubt | 120.3
Evidence | 112.3
Surprise | 362.3
Irritation | 467.7
Kyoshuku | 273.6
Politeness | 135.8
Question | 149.3
Sincerity/serious | 59.0
Table 6: French listeners. Results of the χ² test on the mean answers per attitude (all lengths and stress positions), compared to chance.
Figure 17: Identification scores of French vs. Japanese listeners, per attitude (in %).
Results show that most attitudes (authority, irritation, surprise, declaration, admiration, politeness, question and evidence) were identified above chance. However, the global score of the French listeners was lower than that of the Japanese listeners (35% for French vs. 55% for Japanese).
Percentages of answers, per input attitude (columns) and answered attitude (rows); chance level: 8,3% per cell
Input: AD AR AU DC DO EV SU IR KYO PO QS SIN
AD 32,4% 0,0% 0,0% 0,0% 9,5% 0,0% 1,0% 1,9% 0,0% 3,8% 1,0% 1,9%
AR 1,0% 19,0% 14,3% 0,0% 9,5% 11,4% 0,0% 9,5% 28,6% 0,0% 0,0% 5,7%
AU 0,0% 23,8% 70,5% 2,9% 0,0% 7,6% 0,0% 9,5% 20,0% 0,0% 0,0% 8,6%
DC 0,0% 21,9% 7,6% 50,5% 1,9% 21,9% 1,9% 0,0% 1,0% 15,2% 21,9% 14,3%
DO 6,7% 2,9% 1,0% 1,0% 25,7% 2,9% 21,9% 4,8% 4,8% 1,0% 3,8% 7,6%
EV 10,5% 16,2% 1,9% 19,0% 1,9% 28,6% 0,0% 1,9% 2,9% 7,6% 15,2% 5,7%
SU 16,2% 0,0% 0,0% 2,9% 26,7% 2,9% 52,4% 1,9% 0,0% 5,7% 1,0% 1,9%
IR 0,0% 7,6% 3,8% 0,0% 0,0% 0,0% 0,0% 65,7% 41,9% 0,0% 0,0% 1,0%
KYO 7,6% 1,9% 1,0% 1,9% 7,6% 3,8% 0,0% 0,0% 0,0% 12,4% 3,8% 17,1%
PO 11,4% 3,8% 0,0% 5,7% 1,0% 4,8% 1,0% 1,9% 0,0% 32,4% 12,4% 19,0%
QS 2,9% 0,0% 0,0% 2,9% 12,4% 2,9% 21,9% 0,0% 0,0% 2,9% 32,4% 1,9%
SIN 11,4% 2,9% 0,0% 13,3% 3,8% 13,3% 0,0% 2,9% 1,0% 19,0% 8,6% 15,2%
Figure 18: Confusion matrix, French listeners. Rows give the answered attitudes, columns the input attitudes (same abbreviations as in figure 16).
- Very well recognised: authority and irritation. Moreover, these attitudes show no confusion with the others.
- Well recognised: surprise, declaration, admiration, politeness, question and evidence.
- Poorly recognised attitudes: arrogance, doubt and sincerity.
- Not recognised: kyoshuku.
It is not surprising that the three politeness-related attitudes were especially difficult for the French listeners, since these social concepts are not conventionally encoded in French (sincerity 15%, kyoshuku 0%).
The negative attitudes authority and irritation are well interpreted by the French; authority is even better recognised by the French (70%) than by the Japanese (51%). On the contrary, arrogance/impoliteness is poorly recognised by the French (12%) and very well by the Japanese (72%). The analysis of the prosodic contours shows that this is not due to a difference in the prosodic encoding of arrogance, but surely to the social distance between the interpretations of arrogance in the two cultures.
�
���������� �����
���������� ��������������� ��� !��"#$%&'()
*�+�,-.�/0123�43"�56789:;<=1>�?@��?ABCDE�F#GHI+�
�:� ��,-.�/0123�43:����JK�L�M:9�1NOPQRAST
UVWXYZ�L[\��]^:_`��Da�FbY��cMdef:gh�Tijkl
=LDmghDE�noLpEqjN�r�sUmtu�OPvwNPxyzUVW{��
|}~N�rcMde:���m�bY���b]^Wgh}���Lqj��*1��671
9@UV��]^��f:s��1�]^:�OLZ��:��Da�F�
� ���56789����L�T:��D� ��F��:��'#��������_
`]^���+���:� $%(#$ ¡¢£¢¤¢� ¥¦� §¨� %©ªª¤ £«¨¢£© � (¨¬§¦+�k*�®1¯:� �%°
#�¬£ £¢±� %©§§¦²¦� °¤³§£ +?:´���Da�F��'D�µ:¶T:·¸WST���� ¹
º»¼�)½e¾\»/¿�ÀÁ»��eÃ:ÄhLZ��:ÅÆÇ�WÈÉ�F� ��
]^:Ê.6Ë��LTr�ÌÂ���W´��L�É@A��ÁÂ�ÄhN�59)*�
:���_��L��'DÍ�F$%(D��OPQR:ÎÏ�Ð���TijM:ÑÒ�OP
QR@Ó:ÔÕ?ÎÏUV[\N.¼�ÖWÍ�F�%°D��M×LØ�ÔÕ?Ù:�Ú[
\]^Nk�Û�ÜÀ:��WÍ�F�
�:��Lgh��cM¼�)�ÝÞ�ß:��àPcMDÙá:âãâ�Nä
Ö�åæWç}�´�}èPxvwNyzUVWéê��¼�)?@�F�¼�)ë�¿�
���D´��LÔìÖ}�íî:n:cM�1��D´���ïð?@�F��ñòó�
ôõ»]^�O@ÓLmghDE�m:Da�F�
�%°��ö÷����x��øLM×��Lù��úûb:ü2¿ý�9Da�A���
àPOP¼�)Nþ�Ël��/01Lù��M:�����Wà�?}�:��LTr�
�i�qj�Dr@rF�
$%(��ö÷����x��ÎÏ:.¼�Ö:úûbü2¿ý�9Da�A���n:
ÎÏLTr��i���}�r@rF��'��ö÷����x�OPQRUV[\Nǹ
�.¼�Ö:ü2¿ý�9Daj���:_`��?´L��è@Ähf:���:þ19�
Ë�/01:ÇdÖWw�F�
�� ]^�OPcM:c��åæL÷���M×UV�ÎÏUV�OP�zUV@Ó
:��».¼�ÖLØ���Px�y?àPvwUVW��}��1�]^?cMdef:
ÄhW��}��è@k�Û�ÜÀN[\.�Ë��W�O��F�:]^LØ���OPQ
R:�*���8¿NcM{��R?�vw»yz»sU@Ó:s�UV:XYZ�@��
1�Wǹ�L¹!}�"#?de??mL�$z=¼%k&':UV[\]^:bT?}��
Øj�1/�%(@cMUVLØ�)Ë��1*1)�*7�¿W����F�
�
���������������
����� !"#���
Speech technology research depends to a very large extent on the quality of the data
upon which it is based. Nowadays, very little research makes explicit use of heuristic
knowledge or intuition, yet almost all research that is directed for speech technology
uses data that is artificially limited. That is, engineers and scientists base their
research on speech recordings that have been specifically prepared for technology
research, and are, without exception, constrained in the types of speech that they
illustrate. They are usually recorded in clean conditions, usually using professional
speakers to produce clear examples (though sometimes noise is added, after recording,
by use of specialised ‘noise databases’). These databases are not representative of the
speech of ordinary people.
Even the large telephone-speech and ‘spontaneous’-speech databases that are currently
being used for speech recognition research are constrained to contain clear examples of
the target speech (e.g., proper names, numbers, command sequences, etc), or are
well-rehearsed beforehand (e.g., oral conference presentations), though the recent
developments in ‘Call-Home’ data collection do make use of unconstrained
conversational data (for English). By lacking a ‘real’ context, and by not having a
‘participating listener’ present, these databases fail to represent the ordinary speech of
the ordinary person, and as such provide unrepresentative data for speech technology in
an Advanced Media Society.
Our goal in producing a 1000-hour conversational speech corpus for the JST/CREST ESP
project was to minimise these constraints so that the normal everyday speech of
non-professional speakers could be analysed for their paralinguistic characteristics.
After extensive testing, we decided to use Minidisk recorders for their unintrusive
portability, rather than the higher-quality (but heavier) DAT Walkman. Tests proved
that the speech recordings, though
ATRAC compressed, yielded data that
could be processed using standard
speech-analysis techniques.
Volunteer (paid) subjects wore a small
head-mounted studio-quality
microphone for extended periods
throughout the day to record their
everyday spoken interactions (see for
example the photographs on the right).
Others attended special premises
where they could speak to each other
over a period of months without
face-to-face contact, using telephone links, while recording locally to DAT. We are
indebted to the generosity of these volunteers who provided us with so much speech
without restraint or embarrassment. Overcoming Labov’s famous Observer’s Paradox
in this way was our first important achievement: the recording of ‘natural’ interactive
speech.
Figure 3.7.1. Transcribing the corpus. Utterance breaks were determined using a
‘yen-per-line’ principle to maximize the number of lines while at the same time
producing minimal meaningful utterance units.
Public-domain ‘Transcriber’ software (figure 1) was used by a large number of
volunteers to produce a written representation of the recordings. Conventions were
agreed so that the resulting text would be readable to both humans and machines,
resolving ambiguities resulting from multiple possible kanji-kana mappings, and
annotating non-verbal speech noises as well as marking non-speech sounds. This was
the most time-consuming and expensive aspect of our research. Again, we are
thankful to the many hard-working volunteers who had to listen to every word (often
several times) to provide time-stamped access to the conversational speech utterances.
The difference between fluent interactive speech
and conventional text is very great, and it
required considerable effort to remain accurate
to the acoustics while also being readable as text.
This work produced a corpus from a database of
recordings.
Previous work at ATR resulted in the CHATR
system of waveform concatenation for
high-quality speech synthesis, and given a large
speech corpus, it is thereby possible to
reproduce the voice and speaking styles of any
speaker using this method. Our next task was
to adapt the technology to work with a conversational speech corpus of extremely large
size (in one case, including almost 5 years' worth of daily conversation from one
dedicated volunteer!). However, this required the development of many new techniques
for input and unit-selection.
Figure 3.7.2. Producing conversational speech by using a small speech synthesizer
from the same speaker as a bootstrap device for selection from a large speech corpus.
Candidate phrases are filtered using voice-quality and prosodic characteristics.
By having the speaker also read a relatively small (one-hour) phonetically-balanced text
to produce a CHATR synthesis database, we are able to synthesise any utterance from
text input and to use that acoustic signal as a target for searching the conversational
speech database (figure 2). For common A-type utterances, several candidates are
usually found, and these are filtered by their acoustic features to select the one having
the most desirable paralinguistic features to match the intended characteristics of the
synthesised utterance. I-type utterances are not commonly repeated and require a
standard CHATR synthesis interface for their generation. This too has been
implemented for the same speaker, though it still remains to integrate the two in a
smooth imperceptible manner.
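The filtering of candidate phrases can be pictured with the following minimal sketch (the feature set and the plain Euclidean distance are illustrative choices, not the actual selection algorithm; file names are invented):

    import numpy as np

    FEATURE_KEYS = ("f0_mean_semitones", "speaking_rate", "aq")

    def select_candidate(target_features, candidates):
        """Among corpus tokens sharing the target transcription, pick the one
        whose prosodic / voice-quality features are closest to the target."""
        target = np.array([target_features[k] for k in FEATURE_KEYS])
        best, best_dist = None, np.inf
        for cand in candidates:
            vec = np.array([cand["features"][k] for k in FEATURE_KEYS])
            dist = float(np.linalg.norm(vec - target))
            if dist < best_dist:
                best, best_dist = cand, dist
        return best

    # Example: two corpus tokens of the same A-type utterance
    target = {"f0_mean_semitones": 2.0, "speaking_rate": 5.5, "aq": 0.12}
    candidates = [
        {"wav": "token_0153.wav", "features": {"f0_mean_semitones": 1.8, "speaking_rate": 5.4, "aq": 0.13}},
        {"wav": "token_0920.wav", "features": {"f0_mean_semitones": 6.5, "speaking_rate": 7.0, "aq": 0.25}},
    ]
    print(select_candidate(target, candidates)["wav"])   # token_0153.wav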
Figures 3 & 4 illustrate one of the
acoustic features used in this
filtering process: ‘AQ’ is a measure
of voice quality that correlates with
paralinguistic differences in the
meaning or intended use of an
utterance. Until now, AQ was only
measured from clear sustained
vowels, but the software that we
have developed renders it accessible
even from fluent conversational
speech. The combination
of phonatory and other prosodic
features such as speaking-rate, pitch-range, and loudness, limits the list of potential
candidates that have an identical text-transcription to just the few that have the
appropriate intended meaning for an utterance to be used in conversational speech.
Figure 3.7.3. Different phonation types. Pressed voice often sounds tense, while
breathy voice can sound much more relaxed. These characteristics are present in the
speech waveform but can be very difficult to detect automatically. Our software
overcomes this problem to produce a measure of voice quality in running speech
automatically.
Figure 3.7.4. NAQ: Normalised Amplitude Quotient, proposed by Alku and adapted by
us for use with fluent conversational speech, distinguishes between different modes of
phonation. It can indicate speaker-state and listener-relationship information.
Figure 3.7.5. Software to detect reliable centres (quasi-syllables) in a fluent
conversational speech signal. By performing a robust formant analysis, we are able to
estimate the vocal-tract parameters and to generate a speech-source measure that
indicates the amount of vocal-tension in the speech.
By estimating the vocal-tract characteristics from the speech signal, we are able to
produce a parametric representation of the speech and speaking-style features that
enables us to tag the speech utterances in a way that can be matched with the subjective
paralinguistic impressions of our human labelers (figure 5). These features can be
used bi-directionally, both for labelling (i.e., recognition) and for synthesis (by
unit-concatenation of syllable or phrased-sized chunks of natural speech).
This work is still experimental but we are encouraged by our results and are actively
pursuing this method of unit-selection for large corpus-based conversational speech
synthesis. Figure 6 shows results for a Japanese female speaker. The inset below the
figure gives an indication of the unit size.
Figure 3.7.6. Speech-to-Speech synthesis, showing results from unit-selection based
on acoustic targets. The top signal shows a natural-speech waveform (and related
acoustic parameters). Below is an equivalent utterance generated by the new method.
Whereas CHATR showed that concatenative speech synthesis can almost perfectly
replicate the voice and given speaking style of a known human speaker, the present
method exceeds that performance by using not phone-sized segments for concatenation,
but in the case of A-type utterances, whole phrases. The ‘synthesis’ sounds completely
natural because it is concatenating large chunks of speech at natural pauses in the
speech, and no longer has to model phrasal prosody by rule, but can concentrate instead
on discourse appropriateness by selection from among multiple candidates.
Having improved the acoustic
quality of the speech synthesis, the
remaining task was to enable input
so that the paralinguistic features of
an utterance could be specified. In
a concept-to-speech system, these
can be generated as part of the input
and passed as markup along with the
text. However, for use in a
conversational speech synthesiser, it
is necessary to have a real-time interactive input interface.
Keyboard input (or an equivalent
assistive input device) is still required for I-type utterances, but for the A-type
utterances, which make up approximately half the number of transcriptions, the
interface designed by the Prosody Research Team (Group 3) was implemented and
tested using both portable telephones and notebook computers.
Figure 3.7.7. The tap-to-talk speech synthesis interface offers a layered menu of
A-type utterances that can be quickly accessed by the jog-dial to provide rapid
conversational speech utterances. Text input is not implemented, but would make use
of the existing key pad. Unfortunately, although the software works in real-time, the
network delay tested using a 3-G FOMA handset introduces an unacceptable lag in
speech output (see http://feast.his.atr.jp/i for the downloadable software interface).
Figure 7 shows one implementation of
our AESOP FOMA keitai speech-synthesis
interface. The vertical bar can be
adjusted to set the ‘Self’ parameter, and
the horizontal bar the ‘Other’ parameter
(currently at 5 levels each). The
discourse ‘Events’ are selected by choosing among the icons in the screen centre until an utterance has been fully
specified. The phone transmits the
parameterised feature-settings, and a
server sends back the appropriate speech
waveform for replay over the telephone’s
loudspeaker or earphone.
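The exchange between the handset and the synthesis server can be sketched as follows (the endpoint URL, field names and JSON encoding are hypothetical illustrations, not the actual AESOP protocol):

    import json
    from urllib import request

    def request_utterance(server_url, self_level, other_level, event):
        """Send the parameterised feature-settings chosen on the handset and
        receive the selected speech waveform for replay."""
        payload = json.dumps({"self": self_level,     # 'Self' parameter, 1-5
                              "other": other_level,   # 'Other' parameter, 1-5
                              "event": event}).encode("utf-8")
        req = request.Request(server_url, data=payload,
                              headers={"Content-Type": "application/json"})
        with request.urlopen(req) as response:
            return response.read()                    # raw waveform bytes

    # Example call (requires a running server at this hypothetical address):
    # wav_bytes = request_utterance("http://example.org/aesop/synthesise", 3, 4, "greeting")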
The notebook computer interface (see figure 7 in section 3.3) takes advantage of a larger screen size to offer a much wider range of icons for faster utterance selection, but employs the same underlying speech unit selection technology. Text input is provided on the notebook, using phrasal units when available and falling back to CHATR-style unit selection otherwise.
These novel interfaces allow fast retrieval of conversational chunks from the corpus and enable the user (whether human, computer, or robot) to take part in a conversation in real time, and to express human non-verbal speech sounds fluently and effectively.
�������� ��� � � ���
������� �� � � � �� � �� ��������������������
!�"#�$%&'()#�$*+'��
������� �� � ,-� �� � ,. �
)#�$%'/&0��12��34564
7849:;</����������
!��
��=���� � -> �� � �� �
")�$?+'@9:ABCD*:EFGH
/IJ�HKLMN2%OP�12���=
���
�$AB/QRDST��
��=����� �, �� � -�
URQVR�$")?%'(WVRQVR
�$")?%'DXYZY�[:�$")
?%'�;<�MN2\H&P�12���
=���
]^_`_�a��
bcd�=����� , ��� �
)#]^_`_*'�a�bcd�=��
���
efghij�$�
�������� ,� �� � ,k� �
)#%'/efghij�$�a��U/�
�����
efghij�$�
lmno���� � ��� �
")?*'/efghij�$�a��Ul
mno����apqrs/tefghij
�[R�$�a��Ulmno����
uvwxlmno��� � ��� �� �"#�$\'()#�$%*'�a� yz{�
&O+|4}~/lmno����
�����2�
uvwxlmno���� �. �� �
��p��������m�@�ap����C�/
����D���Y2uvwxlmno�
��"#�$%'()#�$+'��
y�� �$�l��� � � �����, |�
��� �R
y���$"#%'�a�(yz{&*&|4��a
���|4�R4����4URe�hw|4
��p����m������_�w� !�l
���
����� ¡�2�
�¢e�hw���� �� �� �
£¤¥/�¦��ST����§�/d�(
¨©��(ª©��(«©��(¬©��(
¨�«©��(¨�¬©��(ª�«©�
�(ª�¬©���F�¢e�hw���
)#�$*'��
������������ ������������������� ������� �!
���������� �����
������������ ������������������� !!"# �� $%&�'$ !!"()* +,
-./��01234356789:�;<=�> ��� !!"# ?�@ABCDE��FGHCIJK
�@LM567�M�NOCPQ�RS���TJK�@/7UV+W@X8MYZ[\/7
�
����������
� �������
+� IBM& A,-‘expressive speech synthesiser’WO{}�r�Fè�}���D���ü
2¿�.�/( "?}��sU:~:{�Da�F/0� IBM:1�À2�6�
(http://www.research.ibm.com/tts):
�Most speech synthesis has a neutral, one-size-fits-all expression, regardless of what it's saying.
The new IBM expressive speech synthesizer has a range of expressions, so you can tune the
speech to fit its content. Here are some examples."
���L��IEEE : Journal �Transactions on Speech & Audio Processing"�Special
Issue on Expressive Speech Synthesis :å4W�Dùj����5{xAGuest Editor?}
�:67W8��r�F#http://www.ewh.ieee.org/soc/sps/tap/sp_issue/ess.html+
9:D��EU:FP6� ;!�56789� ECESS,#European Centre of Excellence for Speech
Synthesis.+cMde:COE:<=>?:@AA�Dr�F/0�ECESS:1�À2�6
(http://www.ecess.org/) �
�Currently the main market segments of voice driven interfaces are: Network-based Servers,
Mobile Terminals, and Consumer Devices. Network-based Servers represent the largest market
segment, dominated by interactive voice response (IVR) systems. This market is predicted to
increase from 1440 Million € in the year 2001 to 2030 Million € in the year 2006. The
number of voice driven mobile phones, the largest market segment in mobile terminals, where
new services based on speech technology are most visible, will increase from 104 Million
phones in 2004 to 252 million phones in 2005. Comparing the various speech technologies with
respect to revenue ASR is dominant. Speech output is mostly realised by recorded prompts due
to limited speech quality of the available TTS systems. In future application adequate TTS
technology will gain comparable revenue as ASR."
1�À2�6Aç�Ø�L�cM]^Wgh}è�� ¿LØ��B-z:CD�EF�FGH
I�5WBC}�r�Fiè�cMde]^LØ�CD��JB�cM"#?�Q@gDWB
CDE�?KL�r�F
�
�������������
ìM��?}���ND�äOÌÂP:åQRS��#T+#UV; WXôY3�ÎÏL
Z[}ècM\ßUV[\:$zÖ"+?�<�ß���#]^ôY+:�Corpus of Spoken
Japanese"A_?��F`x?mLcM¼�ë�¿:C4�56789�a�ècM�a�
èOPQR:~Wbr�Ó=�m�cdeDa��f�:n:f�:P}Y"¼�)Wti
@rF�gD�+�:cM?ºhW?mLC4�� DARPA:�Lifelog"�56789A� i
�r�A*®8�jk*l1¼%1�mnLØ��i�<=>A��r@rF
�Known as LifeLog, the project has been put out for contractor bids by the Defense
Advanced Research Projects Agency, or DARPA, the agency that helped build the Internet and
that is now developing the next generation of anti-terrorism tools. ….. Each LifeLog user could
"decide when to turn the sensors on or off and who would share the data," she added. "The
goal ... is to 'see what I see,' rather than to 'see me.' "" Washington Post, June 2003.
( http://www.darpa.mil/ipto/Programs/lifelog/ )�
/0��ÎÏLZ[}ècM\ßUV[\:$zÖ"�56789:1�À2�6�3
�&':UVÖ:�oLpr�cM\ßq�r]^:$zÖAsÈDa�FtB:cM\
ßUV[\:��D��cM:ÎÏ�Auvi�wZLa�èA�ÎÏA\ßUV�x
y�vw�yz�sU?r�èý®»z\ßUV:{|L;<@��WÃè}�r��?Wé
ê��?�}~@QRÖL<�}èÎÏ:�d�@��A�<Da�F�:Ø�@����
�RS��ÎÏ:.¼�Ö��Q�ÔÕ�þ�ý¿½e?r�è_`W�É�??mL�ÎÏ
:���cMde�"#:��Z>Ww�Fiè�ÎÏ:JK�Äh?}�cMàP/¿
�Àa�r�����:��W�É�F����ÎÏ��:Oo??mL�cM\ßUV[
\]^:���@�oL��T�m:Da�F"
/0��CSJ"�56789:1�À2�6�3
����ßP}\�þ�ý¿�����ß:�OcMW;�LaTÉ���:��hUVW�
÷}èP}\���h:¼�)ë�¿Daj�2004-�:�eóL�×�?mLúû�;:
�OcM��h¼�)ë�¿L@�m:?BCi�ri�F�:þ�ý¿��UVÁ�;Â
:]^��ôYW��·¸x?}���<Í��n�<�ß���?�<Í��n���
d���A��}�E�r�äÌPÌÂ]^���������d�� z��¡n�P
}\�:\ß�»ý®\ß�¢�:£ÕL_¤��P}\�ÁÂ�:¢¥"�56789
(1999-2003):b�?}�¢¥i�E�ri�F�:�56789:[¦���@P}\�#�
OcM+WÁÂ�L[\��èÉ:_§]^W�¨���?Lù�ri�A����ßP
}\�þ�ý¿��Ù:èÉL�<©ª«@¼�)ë�¿?}�¬¤���ùj�Ù:¢
¥½��;?}��<�ß���AÔ®}�ri�F"
>¯�56789:cM¼�)�°±@��ïðDa�AÓ=�m�c�@àP²³LZ
��Pièm:D�@��M×:��@>�ü�/01Wtum:D�@rF�A
JST/CREST ESP¼�)?:;E@²´�Da�FPxvw»OP͵N²³�¶@ÓLØ�
�no�·v#LM:¸QWÍ�F��nL?��¹�@�?�A��iD:cMUV[
\]^D���:Ø�@UV�i�è�ti�r@rF
+�: TalkBank �56789?CHILDES mìM:��WÍ��r�3
TalkBank is an interdisciplinary research project funded by a 5-year grant from the National
Science Foundation (BCS-998009, KDI, SBE) to Carnegie Mellon University and the
University of Pennsylvania. Additional support comes from NSF ITR Grant 0324883 : “The
goal of TalkBank is to foster fundamental research in the study of human and animal
communication. It will construct sample databases within each of the subfields studying
communication. It will use these databases to advance the development of standards and tools
for creating, sharing, searching, and commenting upon primary materials via networked
computers.” ( http://talkbank.org/ )
The CHILDES system (Child Language Data Exchange System) provides tools for studying
conversational interactions. These tools include a database of transcripts, programs for computer
analysis of transcripts, methods for linguistic coding,and systems for linking transcripts to
digitized audio and video. ( http://childes.psy.cmu.edu )
�N��cM¼�)ë�¿¾\��Aº÷}�E�r�F+�: LDC (see
http:www.ldc.upenn.edu/)� 9:: ELRA (http://www.elra.info/) ?´L���D���»
GSK \ß¼½¾'� (http://www.gsk.or.jp/)� A¼�)¾\LTr���Q@��WÃè��
?W[�?}�¿À:eÃÁ:¾\:ÂÃ:ª��Aa�F
������������
����56789D�O}ècM[\]^����Lqj��*1��6719@UV��
]^��f:_§"#�1�]^:ghAª�Da�F/0�r�T:ÄW_?�F
9:D�2004-ÅÆØj�FÇÈÉz: FP6 ;!���56789� �CHIL - Computers In
the Human Interaction Loop " A � É � � r � F / 0 CHIL : 1 � À 2 � 6
(http://chil.server.de)�3
�CHIL - Computers in the Human Interaction Loop - is an Integrated Project (IP 506909) under
the European Commission’s Sixth Framework Programme. It is jointly coordinated by
Karlsruhe University and the Fraunhofer Institute. The project was launched on January, 1st
2004 and has a duration of 36 months. In total the project costs amount to more than 24 million
EUR. The vision of the CHIL project is to develop and explore a fundamental shift in the way
we use computers today. We aim to realize computer services that are delivered to humans in an
implicit, indirect and unobtrusive way. We wish to free people to interact with people. Therefore
we re-position machines to be in the background, discretely observing the humans and - like
electronic butlers - attempting to anticipate and serve their needs. Computers in the Human
Interaction Loop (CHIL) aims to introduce computers into a loop of humans interacting with
humans, rather than condemning a human to operate in a loop of computers. This will give
humans more time to do what they really like: communicate and work productively with other
humans."
Research on “Intelligent Spaces” is also active at MIT and elsewhere. From the homepage of the MIT project (http://www.ai.mit.edu/projects/aire/):
�Agent-based Intelligent Reactive Environments
A Research Group at the MIT Computer Science and Artificial Intelligence Laboratory
aire is dedicated to examining how to design pervasive computing systems and applications for
people. To study this, aire designs and constructs Intelligent Environments (IEs), which are
spaces augmented with basic perceptual sensing, speech recognition, and distributed agent logic.
aire forms a core component of MIT's pervasive computing project, Project Oxygen"
Similar research on recording and processing natural meetings is also in progress in the United States, for example at ICSI:
�The Meeting Recorder Project (http://www.icsi.berkeley.edu/Speech/mr/mtgrcdr.html)
Despite recent advances in speech recognition technology, successful recognition is limited to
co-operative speakers using close-talking microphones. There are, however, many other
situations in which speech recognition would be useful - for instance to provide transcripts of
meetings or other archive audio. Speech researchers at ICSI, UW, SRI, and IBM are very
interested in new application domains of this kind, and we have begun to work with recorded
meeting data. The first stage in investigating speech recognition for meetings is to collect some
data. At ICSI, we have equipped a meeting room with a multichannel, studio-quality recording
system and have begun to collect pilot recordings of meetings, primarily between speech group
members. At the time of writing (2001 February), we have collected 40 hours of 16 channel
pilot data, and ten hours has been hand-transcribed. See this information on Meeting Recorder
data collection including both the mechanics of the meeting recorder setup at ICSI and some
initial forays into processing the recordings. The data were then transcribed, using a set of
transcription conventions designed for speed and accuracy of data input and encoding. "
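As a rough illustration of a first-pass analysis that such multichannel meeting recordings invite, the sketch below estimates which close-talking channel is most active in each half-second frame. The file name meeting.wav, the 16-bit interleaved PCM assumption, and the frame length are illustrative assumptions only; this is not a description of the ICSI processing pipeline.

import wave
import numpy as np

def dominant_channel_per_frame(path, frame_sec=0.5):
    # Read an interleaved multichannel PCM file and return, for each frame,
    # the index of the channel with the highest RMS energy.
    with wave.open(path, "rb") as w:
        n_ch, rate = w.getnchannels(), w.getframerate()
        pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)  # assumes 16-bit samples
    samples = pcm.reshape(-1, n_ch).astype(np.float64)        # (time, channel)
    frame_len = int(frame_sec * rate)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len, n_ch)
    rms = np.sqrt((frames ** 2).mean(axis=1))                 # (frame, channel) energies
    return rms.argmax(axis=1)                                 # loudest channel per frame

if __name__ == "__main__":
    print(dominant_channel_per_frame("meeting.wav")[:20])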
������� !"�
��5{x:Îr�/0:Ø�Da�3
“I hope that ATR will be able to find funding to continue the support of this research so that
it can be implemented both in speech synthesis systems and in ambient intelligent environments.
While its applications to future speech synthesis are obvious, I also foresee useful applications
in sensor equipment for monitoring the affective states of people in intelligent spaces under the
ubiquitous-computing framework”.
“In addition to being used as a communication aid for the speaking-impaired, it is likely that
the speech synthesis component of this research will find application first of all in humanoid
robots, enabling them to communicate with humans in a more acceptable manner, softening the
interaction by increased use of non-verbal speech”.
ÌÂ]^N&'f:é���ÏÐÑÃLTr�
In the age of ‘ubiquitous computing’ and ‘ambient intelligence’ people will increasingly be
confronted with automated devices and services that are equipped with interactive speech
interfaces. Although current speech synthesis and recognition are well-tuned for linguistic
processing, they are not yet capable of processing paralinguistic information, such as
‘tone-of-voice’, and are concerned only with ‘what has been said’, rather than including (as
humans do without thinking) information about ‘how it was said’. Machines need to be made
aware of and sensitive to these differences in speaking-style and to the clues about the speaker’s
state and feelings that are present in human speech. The technology produced as a result of this
research will become an important component in this evolution towards an Advanced Media
Society.
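The “how it was said” information referred to above can be approximated with very simple acoustic descriptors. The sketch below computes per-frame RMS energy and a crude autocorrelation-based pitch estimate from a mono 16-bit WAV file; the file name utterance.wav, the frame length, and the pitch search range are assumptions for illustration, and this is not the feature set used in the project.

import wave
import numpy as np

def f0_and_energy(frame, rate, fmin=70.0, fmax=400.0):
    # Remove DC, measure RMS energy, then pick the strongest autocorrelation lag
    # inside the expected pitch range (the F0 value is meaningless for unvoiced frames).
    frame = frame - frame.mean()
    energy = float(np.sqrt((frame ** 2).mean()))
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(rate / fmax), int(rate / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return rate / lag, energy

with wave.open("utterance.wav", "rb") as w:                   # assumed mono, 16-bit PCM
    rate = w.getframerate()
    x = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16).astype(np.float64)

frame_len = int(0.04 * rate)                                  # 40 ms analysis frames
stats = [f0_and_energy(x[i:i + frame_len], rate)
         for i in range(0, len(x) - frame_len, frame_len)]
f0s, energies = np.array(stats).T
print("mean F0 (Hz):", round(f0s.mean(), 1), "mean RMS energy:", round(energies.mean(), 1))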
Furthermore:
ATR has been invited to join ECESS (http://www.ecess.org) as a non-European partner (the only
other such partner is the Pattern Recognition Laboratory of the Chinese Academy of Sciences in
Beijing) and we expect to see the implementation of our JST/CREST-funded technology in their
speech synthesizer. Since this project is being coordinated by Siemens and Nokia, we
foresee strong marketing potential and immediate applications in the European Community for
this emerging technology.
��������
���$%� � �
�
�
���������
�
� � �� ���������� �
� � � � ���������������
� �
� ��� !����
�
� � "#$%&'()��*+,� �
�
*+-./�
012�3456���
�
78��9:;<�����
�������=>� �
� �?@AB����
�
� � CD�����"#�E��� �
� � � � ���F���.G����
� �
� � �������� �
� � �ICPGrenoble� � �
� �����������
� �
� � HIJ0K�LM5NO���� �
� ������
� � � � �� !"#$%&'���
� �
� � P������� �
� ()��
� � � � *+,�-.���
� �
� �LQRSTU����
�
"#$%&'()��*+,� �
� � � � � � � � � � � � � � � /01"2345�67���
�
�
�
���&'(�)� �
*����+,�����
®'� ¯°� ±²� ³´µ�¶·¸¹� º»�¼�
½¾¿4vÀhÁj�VÂ�çÄ�
ÅÆÇȶ·¯�Éʶ·Ë ÌÍÎÏ�
�Ð� 0� -ÑÒ�
��-0� ,�
ÓÔÕ�Ö×Øh�
ÙÚz ¶·¯�¶·ÛÜ� Ý��±Þ�
�Ð� 0� .ÑÒ�
��-0� ,�
ßàáâ�VÂ�çÄ�
ÅÆÇȶ·¯�¶·ãäË ¶·å�$ãæ�
�Ð� 0� -ÑÒ�
��-0� ,�
ç[èé�VÂ�çÄ�
ÅÆÇȶ·¯�¶·ãäË êÑÒ`ëì�
�Ð��0� �ÑÒ�
���0� ,�
íîïð�VÂ�çÄ�
ÅÆÇȶ·¯�¶·ãäË ñØò�óhò�
�Ð��0� �ÑÒ�
��-0� ,�
-./01�����
®'� ¯°� ±²� ³´µ�¶·¸¹� º»�¼�
ôõö÷�øïùúû�ÇÈ
��ü���ýþ� ���Ð/�ï�
�Ð� Ñ� .ÑÒ�
��-� ,�
�����øïùúû�ÇÈ
��ü���ä<� =������
�Ð� Ñ� ÑÒ�
��-� ,�
���øïùúû�ÇÈ
��ü���ä<� ��ê_`Á_w�
�Ð�,Ñ� �ÑÒ
��-� ,�
àÔ�Å�øïùúû�ÇÈ
��ü��� B�� �$���êj¡�
�Ð� Ñ� >ÑÒ�
���� ,�
����øïùúû�ÇÈ
��ü��� B�� �������
�Ð� Ñ� >ÑÒ
���� ,�
��â�7�øïùúû�ÇÈ
��ü��� B�� �$���êj¡�
�Ð�,Ñ� �ÑÒ
���� ,�
�����øïùúû�ÇÈ
��ü����B�� ��/�� !4�Ð�
�Ð��Ñ� �ÑÒ�
���� ,�
�õè"�øïùúû�ÇÈ
��ü����B�� �$���êj¡�
�Ð��Ñ� �ÑÒ�
���� ,�
#Ó6$�øïùúû�ÇÈ
��ü����B��
���Ð%/&'(Ð�
�Ð��Ñ� �ÑÒ�
���� ,�
)õ*+�øïùúû�ÇÈ
��ü����B�� =� !4,-�
�Ð��Ñ� �ÑÒ�
���� ,�
W./0�øïùúû�ÇÈ
��ü����B�� fh`12fw/�ï�
�Ð��Ñ� �ÑÒ
��-� ,�
3Ô45�øïùúû�ÇÈ
��ü����B�� ���Ð��(Ð�
�Ð��Ñ� �ÑÒ�
���� ,�
î67�øïùúû�ÇÈ
��ü����B�� ���Ð��89�
�Ð��Ñ� �ÑÒ�
���� ,�
:;7� <=��� B�� Î>� !4,-��Ð� Ñ� >ÑÒ�
���� ,�
?@A7�øïùúû�ÇÈ
��ü����B�� �� !�
�Ð� Ñ� ÑÒ
��-� ,�
BÔCD�øïùúû�ÇÈ
��ü����B�� ���Ðê_`Á_w
�Ð�,Ñ� �ÑÒ�
���� ,�
EFGH7�øïùúû�ÇÈ
��ü����B�� ���êj¡�
�Ð�,Ñ� �ÑÒ�
���� ,�
IïJKL�øïùúû�ÇÈ
��ü����B�� bM�Ághò�
�Ð�,Ñ� �ÑÒ�
���� ,�
íîïð�øïùúû�ÇÈ
��ü����B�� �����
�Ð�,Ñ��ÑÒ
��-� ,�
NOPK�øïùúû�ÇÈ
��ü��� B�� =�,-�
�Ð��Ñ� ,ÑÒ
��-� ,�
Q[���øïùúû�ÇÈ
��ü����B�� =�,-�
�Ð��Ñ� -ÑÒ
���� ,�
RSKT�øïùúû�ÇÈ
��ü����B�� ���Ð�
�Ð��Ñ� ÑÒ
���� ,�
�UV7�øïùúû�ÇÈ
��ü����B�� d�&'(Ð�
�Ð��Ñ� ÑÒ
���� ,�
WXE$�øïùúû�ÇÈ
��ü����B�� ��/����4�Ð�
�Ð��Ñ� ÑÒ
���� ,�
ÔWYZ�øïùúû�ÇÈ
��ü����B�� ���Ághò�
�Ð��Ñ� ÑÒ
���� ,�
EF[ð�øïùúû�ÇÈ
��ü����B�� �������
�Ð��Ñ� ÑÒ
���� ,�
�\]ì�øïùúû�ÇÈ
��ü����B�� ���$#^_�
�Ð��Ñ� ÑÒ
��-� ,�
`C(�øïùúû�ÇÈ
��ü����B�� �a^_�
�Ð��Ñ� ÑÒ
���� ,�
�bc�øïùúû�ÇÈ
��ü����B�� ê_`Á_wdÐ�
�Ð��Ñ� ÑÒ
��-� ,�
eõf �øïùúû�ÇÈ
��ü����B�� ê_`Á_wdÐ�
�Ð��Ñ� ÑÒ
��-� ,�
WgÊ �øïùúû�ÇÈ
��ü����B�� ê_`Á_wdÐ�
�Ð��Ñ� -ÑÒ
��-� ,�
23456�����
®'� ¯°� ±²� ³´µ�¶·¸¹� º»�¼�
W�Eð� hà��� ýþ� iR��jkl��Ð� 0� £ÑÒ�
��-� ,�
mnoð� hà��� äýþ� =�d�����Ð� 0� £ÑÒ�
��-� ,�
p q� hà��� ýþ� ����jkl��Ð� 0� £ÑÒ�
��-� ,�
rÔ�q� hà��� B��� ê_`st4 !��Ð�,0� ÑÒ
���0� ��
uvw� hà��� xV:ýy iR��jkl��Ð�,0���ÑÒ�
��-� ,�
SÓzK� hà��� äýþ� iR��jkl��Ð�,0� ÑÒ�
��-� ,�
}�{é� hà��� äýþ� ê_`st4 !��Ð�,0� ÑÒ�
��-� ,�
Q[|â7� hà��� t�}~y =�d�����Ð�,0� ÑÒ�
��-� ,�
��D��� hà��� äýþ� ����jkl��Ð�,0� ÑÒ�
��-� ,�
W��7� hà��� B��� �� !��Ð�,0� ÑÒ�
��-� ,�
���7� hà��� B��� �� !��Ð�,0� ÑÒ�
��-� ,�
�Ôz7� hà��� B��� �� !��Ð�,0� ÑÒ�
��-� ,�
����� hà��� �B��� ��jkl����Ð� 0��ÑÒ�
��-� ,�
W��7� hà��� �B��� �� !��Ð��0���ÑÒ�
��-� ,�
ç�f�� hà��� �B��� �� !��Ð�,0� ÑÒ
���0� ,�
#��� hà��� B��� �� !��Ð��0� ÑÒ
���0� ,�
(��7� hà��� B��� �� !��Ð��0� �ÑÒ�
��-� ,�
���� hà��� �B��� �� !��Ð��0� �ÑÒ�
��-� ,�
�Ô�7� hà��� �B��� �� !��Ð��0��ÑÒ�
��-� ,�
���� hà��� B��� �� !��Ð��0��ÑÒ�
��-� ,�
�6â3å� hà��� t�}²Ë ê_`�!ãä��Ð��0��ÑÒ�
��-� ,�
�
���������� �
®'� ¯°� ±²� ³´µ�¶·¸¹� º»�¼�
Veronique Auberge ICP Grenoble ýþ� iR/��#��Ð� 0� -ÑÒ�
��-� ,�
Albert Rilliard ICP Grenoble ¶·Ë� �����_`��Ð� 0� >ÑÒ�
��-� ,�
Anna Tcherkassof ICP Grenoble B��� ���>��Ð�,0� ÑÒ�
��-� ,�
Amandine Fouard ICP Grenoble B��� ê_` !��Ð�,0� ÑÒ�
��-� ,�
Cecile Brichet ICP Grenoble B��� ê_` !��Ð�,0� ÑÒ�
��-� ,�
Marie Caithard ICP Grenoble B��� ñØò�×dÐ��Ð�,0� ÑÒ�
��-� ,�
Aude Noiray ICP Grenoble B��� ê_` !��Ð�,0� ÑÒ�
��-� ,�
Ludovic Lemaitre ICP Grenoble B��� ê_`st4 !��Ð� 0� >ÑÒ�
��-� ,�
Sylvie Mozziconacci ICP Grenoble B��� �ì 7����Ð� 0� ÑÒ
���0� ,�
Daniel Hirst Aix-en-Provence ýþ� ��iR�¡¢ì��Ð� 0� .ÑÒ�
��-� ,�
£�m¤ Edinburgh�� B��� d�w`fjj�a��Ð� 0� -ÑÒ
���0� ,�
7������������� �
®'� ¯°� ±²� ³´µ�¶·¸¹� º»�¼�
¥O§�� �¦§¨��� ýþ� fh`12_w��Ð� 0� .ÑÒ�
��-� ,�
©Ôuâ� �ª§¨���«{¬�z�
¶·Ë�ê_`�>4st�
�Ð� 0� .ÑÒ�
��-� ,�
®¯7� �¦§¨��� ¶·ãäË �Ághò��Ð� 0� ÑÒ�
���0� ,�
�
������� �
®'� ¯°� ±²� ³´µ�¶·¸¹� º»�¼�
:è°� ±¯��� äýþ� =�/|"²��Ð� 0��ÑÒ�
��-� ,�
³´âµ7� ±¯��� ¶·ãäË� d�¶·/¶·��Ð�,0� �ÑÒ�
��-� ,�
:;7� ±¯��� ¶·ãäË� Î>� !¸,-��Ð��0� ,ÑÒ�
��-� ,�
N�¹�7� `<��� B��� ê_` !��Ð��0� >ÑÒ�
���0��� �
ºõµ{� `<��� �»� ê_` !��Ð��0� >ÑÒ�
���0��� �
���� �� !
®'� ¯°� ±²� ³´µ�¶·¸¹� º»�¼�
¼½¾¿½À�ÁÂÿĽ¾�VÂ�çÄ�
ÅÆÇȶ·¯
«{¬�z�
¶·Ë���ÄÅ¢ì�
�Ð�,0� �ÑÒ�
���0� ��
ÆOfÇ�VÂ�çÄ�
ÅÆÇȶ·¯
«{¬�z�
¶·Ë�d�`fóhò�
�Ð�,0� >ÑÒ�
���0� ,�
Ó�ÈjØw�VÂ�çÄ�
ÅÆÇȶ·¯
«{¬�z�
¶·Ë����
�Ð��0� ÑÒ�
���0� �
z½É��¿½ÉÊ�VÂ�çÄ�
ÅÆÇȶ·¯Ë̶·Ë ���Ð�
�Ð��0� ÑÒ�
���0� -�
ÍÂÎÏÐÑÎ� ��VÂ�çÄ�
ÅÆÇȶ·¯
«{¬�z�
¶·Ë���ÄÅ¢ì�
�Ð�,0� �ÑÒ�
��,0� ��
��«¿ÑÎÏ�Ò½ÎÏ�VÂ�çÄ�
ÅÆÇȶ·¯
«{¬�z�
¶·Ë��� !4,-�
�Ð� 0��ÑÒ
��,0���
ÚÎϽ�ÚÓ¾½Ã¿À� �Ô��� B��� ����/ÕÖ��Ð� 0� .ÑÒ�
���0� ,�
�Ð��0� ÑÒ�
�Ð��0� �Ñ�×½¾ÄÀÑÄ�
¼ÊÄØÎÏÙ¾�óÚhÛh��� Ë̶·Ë ���Ð�
�Ð��0� ÑÒ�
���0� ��
ÜÉϽ�ÝÂÑÓν� óÚhÛh��� Ë̶·Ë ���Ð��Ð��0� ÑÒ�
���0� ��
Þ�ßà� �Ô��� B��� áâjd�����Ð� 0� .ÑÒ�
��,0� >�
�N§Ä�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð��0� �ÑÒ�
���0� ,�
ã�äâ�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð��0� �ÑÒ�
��-� ,�
å[�câ�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð�,0� .ÑÒ�
��-� ,�
�Ocâ�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð�,0� �ÑÒ�
���0� ��
¨Ô�æ�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð��0� �ÑÒ�
���0���
Wç7�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË sè�
�Ð�,0��ÑÒ�
��-� ,�
NOéjm�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð��0� �ÑÒ�
���0� ,�
@�ê�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð�,0� .ÑÒ�
��-� ,�
V æµ7�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË �Ághò�
�Ð��0� �ÑÒ�
���0� >�
ë�ìbíî7�VÂ�çÄ�
ÅÆÇȶ·¯¶·ãäË ê_`�!�
�Ð��0� ÑÒ�
���0� -�
�
�
����� �����
� ���8�9�:;�<�'=>?�@� �
VWX� YZ� [,�\]
^_`a�
�
�� 0>�
��ң��Speech & Emotion Belfast 95 Satellite of Eurospeech
��,0�
��Ñ ��ESP Groups Meeting
VÂ�çÄ
ÅÆÇȶ·¯ 25 Research Planning
���0�
.���CREST Meeting
øïùúû�ÇÈ
��ü�� 30 Research results
���0>�
�����
ATR-CREST
Workshop I,II
����
@U¥(±¯C 65 Satellite of LP2002
�Ð��0��Ñ
.�Ò>��
ESP/CREST
Group Meeting
VÂ�çÄ
ÅÆÇȶ·¯ 15
¶·ïðñò¡ó
k�ô_
���0 �
��Ò ��
1st
CREST ESP
International
Workshop
h� 85 Reporting Results
���0.�
-�Ò >��Voqual Workshop Geneva 75 Satellite of Eurospeech
���0>�
����
Eurospeech � Special
Session Geneva 80
Public discussion of results
and related research
���0�
���-��1st
JST Symposium JjÈêõJö÷Þ
@`<C 218 Poster presentation
���0�
��Ñ ���
CREST ESP
Symposium
ICP Grenoble,
France 80 French Team Workshop
���0,�
,�Ò ���SP2004 conference øïøùú�å 250
�mB���¶·�
jûA�
���0�
�Ñ����JST Symposium û�ßüý 178 Oral presentation
���0���
��Ò>��
ICSLP Special
Session Jeju Island, Korea 110
Speech & Affect
Session of ICSLP-04
��-0�
,Ñ�þ�
Crest ESP Final
Workshop
VÂ�çÄ
ÅÆÇȶ·¯ 50?
Depending on funds being
available
�
���ABCD��E@�
�� ������ �� �� ��� ����
Á¾�z½É��¿½ÉÊ�
@\yz{� ÇÈËC��yÙy�/���Ð/¶·�
VÂ�çÄÅÆ
Çȶ·¯�
���0�
ÑÒ-Ñ�
ݾk�¬Éٽξ½�yÉÓ½ÎÂ�
¼¾ÂÊÙ��¾�
�Î��«½À�ν���¾½ØÉ�
�_¿¾ñ/º»(k�ô_VÂ�çÄÅÆ
Çȶ·¯�×��0.Ñ�
ݾk�����
¼¾ÂÊÙ��¾�
�Î��«¿�½Ï � ���y�
�_¿¾ñ/º»(k�ô_VÂ�çÄÅÆ
Çȶ·¯�×��0.Ñ�
ݾ�«¿ÉÎ��¿¿� �
{Ù�Ù½¾�¿Ù¾�
yz�z ���y�
�_¿¾ñ/º»(k�ô_VÂ�çÄÅÆ
Çȶ·¯�×��0.Ñ�
ݾk��оÎ�;½Î�ľÂÀ�
ݾÙ�ľ� �
� �«zz��ÂÊ��z× ���Ù�ÙÎ� �
��z�«{¬�z��_¿¾ñ/�
º»(k�ô_�
VÂ�çÄÅ
ÆÇȶ·¯�×��0 Ñ�
ݾk��½ÎÙÄ�«½¿Î�
{Ù�Ù½¾�¿Ù¾�
ÁÂľÂɽ��½Ó¾½Ä¾Ù��
��z�«{¬�z��_¿¾ñ/�
º»(k�ô_�
VÂ�çÄÅ
ÆÇȶ·¯�×��0 Ñ�
¼¾ÂÊk�yÎÄÂÎÙ�yÑ�¿ÉÎ�
�Î�Ù¾�ÄÙ��Ù�ÍÙÎÙ�Ù�
��z�«{¬�z��_¿¾ñ/�
º»(k�ô_�
VÂ�çÄÅ
ÆÇȶ·¯�×��0,Ñ�
ݾk�yÉÓ¿Ù�«¿½�½�Ù�
{Ù�Ù½¾�¿Ù¾�
z¾ÎÄ��«ÂÉÉÙÏÙ �Ú¾ÙɽÎ��
��z�«{¬�z��_¿¾ñ/�
º»(k�ô_�
VÂ�çÄÅ
ÆÇȶ·¯�
��0 �
×��0,ÑÒ�Ñ�
Á�kÜÉϽ�ÝÂÑÓν�
{Ù�Ù½¾�¿Ù¾�
ÁÑÙÎ�¿ÙÎ��Îk�ÍÙ¾À½Î��
��z�«{¬�z��_¿¾ñ/�
º»(k�ô_��
���Ðfh`12_w/¶·
VÂ�çÄÅ
ÆÇȶ·¯�
��0 �
��0��
ݾk�×½¾ÄÀÑÄ�¼ÊÄØÎÏÙ¾�
{Ù�Ù½¾�¿Ù¾�
ÁÑÙÎ�¿ÙÎ��Îk�ÍÙ¾À½Î��
��z�«{¬�z��_¿¾ñ/�
º»(k�ô_��
���ÐÇÈ/¶·�
VÂ�çÄÅ
ÆÇȶ·¯�
×��0 ÑÒ�Ñ�
×��0,ÑÒ�Ñ�
�
�
�
���������� �
���������� ����������������������� !"�
���#$� ��%&'()�*+�,-.�/0123��4456�!78��,9
:;<=>&'9 9 ?@AB;C�DEFCG9
:H<IJ&'9
KLM�IJNO9 ?@APFC�DEQRCG9
STUV�&'9 9 ?@AHFC�DEQBCG9
W�XU&'YZ[\9]^_`9abcd_e?HRRHf�HRRBf�HRRFfG9
:B<ghij?@AkC�DElijmnCG9
:F<opqrs9 9
Kopqrt9;nC?DE�uvG9
Swx9 9 2y9
W��z9 9 {_|`b}_9~^__��t9�\��9HRRFe9�����`�9HRRBe����9HRRB�9
9 9 9 ��V�@������?HRRHf��G9
9 9 9 �X�Y~��c�_t{�`_}9:����_9[�<e9HRRF�9
9 9 9 LM�mN�Y~�_�_`e9{[ae9HRRFfe;R�9
9 9 9 Z[\9 ]^_`9abcd_tLM&'�HRRFf�;;�9
9 9 9 ��N�Y�Z�~[e9HRRFfeP�,9
9 9 9 s9
�
�������� �� ���������������
������� � ! "#$% &#'1. Mikiko Mashimo, Tomoki Toda, Hiromichi Kawanami, Kiyohiro Shikano, Nick Campbell,
"Cross-language Voice Conversion Evaluation Using Bilingual Database", IPSJ Journal, Vol.43, No.7,
pp.2177-2185 (2002-7)
2. Kazuki Adachi, Tomoki Toda, Hiromichi Kawanami, Hiroshi Saruwatari, and Kiyohiro Shikano,
"Designing target cost function based on prosody of speech database," IEICE Trans. Inf. and Syst.,
2005.
�
3. ���������� �����������“���������� !"#$"%&
'()*+,-�./01&23,”456789:;<=>�Vol.J87-D-II�No.2, pp.447-455
(2004-2)
�
�(�)*��
+,-)*./ � !"(#$% (#'1. Hiromichi Kawanami, Tsuyoshi Masuda, Tomoki Toda, Kiyohiro Shikano, "Designing speech
database with prosodic variety for expressive TTS system," Proceedings of International Conference
on Language Resources and Evaluation (LREC2002), pp.2039-2042 (2002-5)
2. Mikiko Mashimo, Tomoki Toda, Kawanami Hiromichi, Hideki Kashioka, Kiyohiro Shikano, Nick
Campbell, "EVALUATION OF CROSS-LANGUAGE VOICE CONVERSION USING
BILINGUAL AND NON-BILINGUAL DATABASES", Proceedings of 7th International
Conference on Spoken Language Processing (ICSLP2002), Denver, pp.293-296 (2002-9)
3. ?@AB5, CDEF, Nick Campbell, “=G&HI�J0*+,-&K�,” 67LM:;N
OP;*Q<=R (2001-9)
4. ����, ��, �� �, ���, ����, “CHATRST� U&STRAIGHTVWX
���YZ,” [\�]:;^_NOP;`a<=R, 1-2-20, pp.245-246 (2001-10)
5. bcde, fghi, Nick Campbell, “+jk6lm�./0�]�no,” [\�]:;^_
NOP;`a<=R, 2-2-3, pp.261-262 (2001-10)
6. ?@AB5, CDEF, Nick Campbell, “*+,-&K��J0*+pqr&st,” [\�]
:;^_NOP;`a<=R, 2-2-7, pp.269-270 (2001-10)
7. ����, ��, �� �, ���, ����, “*+,-&uv0!"#$"%VWX�
� wxyz&{|,” 456789:;}~��7�, Vol.101, No.603, SP2001-122, pp.61-68
(2002-1)
8. �� �, ����, “�������� !"#$"%&��)1&���no&st,” [\
�]:;�_NOP;`a<=R, 2-10-12, pp.287-288 (2002-3)
9. ����, ��, �� �, ���, ����, “������v!"#$"%&'()2
3,” [\�]:;�_NOP;`a<=R, 2-10-13, pp.289-290 (2002-3)
10. ���, �������$�, “������V�����"��H��&{|,” [\�]:;�_
NOP;`a<=R, pp.387-388 (2002-3)
11. ?@AB5, CDEF, �������$�, “�����./0uv0+,)*+%#��q&*
+pqrYZ� X¡,” [\�]:;�_NOP;`a<=R, 2-10-17, pp.297-298 (2002-3)
12. h¢£F5, ��, �� �, ����, �������$�, “[¤¥�q�./0 ¦K
§)¨�©q&b]�ª«0{| ,” [\�]:;�_NOP;`a<=R , 1-10-16,
pp.261-262 (2002-3)
13. ¬®¯, ��, �� �, ���, ����, "°±²!"#$"%VWX�*+³±²
wx&{|," 67LM:;��7�, 2002-SLP-42-5 (2002-7)
14. P�´z, h¢£F5, µ¶·, �� �, ���, ����, “¤��]�!��J0¤�:
¸j&*�23yz&{|,” [\�]:;`a<=R, 1-6-1 pp.209-210 (2002-9)
15. ¹º�», ��, �� �, ���, ����, “GMM�¼½ ¦K§VWX�k6Q
¾&YZ)1&23,” [\�]:;`a<=R, 1-10-24, pp.277-278 (2002-9)
16. ¬®¯, ��, �� �, ���, ����, “°±²VWX�¿"À%$"%ÁÂ�J0
*+³±²wxyz,” [\�]:;`a<=R, 2-10-15, pp.315-316 (2002-9)
17. ¹º�», ��, �� �, ���, ����, "GMM�¼½ ¦K§VWX�k6�
wx," 456789:;}~��7�, SP2002-171, pp.11-16 (2003-1)
18. ¹º�», ��, �� �, ���, ����, “GMM�¼½ ¦K§yzVWX��
&k6YZ,” [\�]:;`a<=R, 1-6-23, pp.267-268 (2003-3)
19. P�´z, h¢£F5, µ¶·, �� �, ���, ����, “¤�:¸j&*�23yz�
./0+jÃÄ}~&ÃW,” [\�]:;`a<=R, 3-6-18, pp.363-364 (2003-3)
20. cGÅ, CDEF, �������$�, “*+,-V������ÆÇpqrYZyz&{|,”
[\�]:;`a<=R, 1-6-2, pp.225-226 (2003-3)
21. È�ÉÊ, CDEF, �������$�, “�ËvrÌ&� �HVWX��HÍ�,” [\�]
:;`a<=R, 1-6-14, pp.249-250 (2003-3)
22. ���, CDEF, �������$�, “�ÎÏÇ�� wx�./0 F0&ÐÑV����ÏÇ
¿%�&{|,” [\�]:;`a<=R, 1-6-15, pp.251-252 (2003-3)
23. )õ*+, ����, ½¾¿4vÀhÁj, “d���K�/��iR�Á j�a� ¥j���
!,” �[����~�²|t, 2-6-8, pp.313-314 (2003-3)
24. EF[ð�½¾¿ vÀhÁj�����, "�89� speech-to-speech ���Ð/2�/�^
Ü�����a�����<"," �7�¡§Ä��Çȶ·¡ó, SP2003-82 (2003-8)
25. RSKT, àÔ�Å, ��, �³�, ôõö÷, "�¶T����F��ê_`Á_w��
s�2��^ !�=µ�µ7�"#," �[����~�²|t, pp.221-222 (2003-9)
26. ÔWY$������½¾¿4vÀhÁj, "=�������tR%��¡/Î>��êj¡/
&',”�[����20030(), pp.233-234 (2003-9)
27. ���, ����, ½¾¿4vÀhÁj, ”*�+�jF0/,*�-¹�2�89����Ð��
������/�ï/&',” �¡¢ì��¶·¡ó, 2003-SLP-50-9 (2004-2)
28. RSKT, àÔ�Å, ��, �³�, ôõö÷, "����� T��ê_`Á_w��s
�2�Ð��/¨.a¡," �[����~�²|t, 1-7-5, pp.221-222 (2004-3)
29. WXE$, àÔ�Å, ��, �³�, ôõö÷, "GMM�Å/��a^_����01Å2
34�5"," �[����~�²|t, 1-7-26, pp.263-264 (2004-3)
30. ÔWY$, ����, ½¾¿4vÀhÁj, "d�67������/tR%��¡/ !�ap
X/89," �[����~�²|t, 1-7-10, pp.231-232 (2004-3)
31. EF[ð, ����, ½¾¿4vÀhÁj, "�89�Speech-to-Speech ���Ð�os�2�
a^_��/�a:¡�;," �[����~�²|t, 1-7-25, pp.261-262 (2004-3)
32. RSKT, àÔ�Å, ��, �³�, ôõö÷, "����� T��ê_`Á_w��s
�2��,-�/.a:¡ª<<"/&'," �7�¡§Ä��Çȶ·¡ó, SP2003-199,
pp.37-42 (2004-3)
33. WXE$, àÔ�Å, ��, �³�, ôõö÷, "GMM�Å/��a^_%/01Å2�
5/=s," �7�¡§Ä��Çȶ·¡ó, SP2003-200, pp.43-48 (2004-3)
34. �\]ì, ��, �³�, ôõö÷, “F0�`_h/ª>?@Ð �������ñîA
#/µ7�âB,” �[����~�²|t, 3-2-20, pp.355-356 (2004-9)
35. �bc, ����, ½¾¿4vÀhÁj, “=���/C©��C©D�F©E/ !,” �[��
��~�²|t, 2-2-6, pp.283-284 (2004-9)
36. eõf , ����, ½¾¿4vÀhÁj, “12fwF_¿�s©2bGd��������#�
¡/ !,” �[����~�²|t, 2-2-7, pp.285-286 (2004-9)
�
-=�F�G)� � HI � �JKLM� NJO�1. Tomoki Toda, Hiroshi Saruwatari, Kiyohiro Shikano, ''Voice Conversion Algorithm Based on
Gaussian Mixture Model with Dynamic Frequency Warping of STRAIGHT Spectrum,” Proceedings
of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2001),
SPEECH-P8, pp.841-844 (2001-5)
2. Mikiko Mashimo, Tomoki Toda, Kiyohiro Shikano and Nick Campbell, “Evaluation of
Cross-Language Voice Conversion Based on GMM and STRAIGHT,” Proceedings of 7th
European Conference on Speech Communication and Technology (EUROSPEECH2001),
pp.361-364 (2001-9)
3. T.Toda, H.Saruwatari, K.Shikano, ''High Quality Voice Conversion Based on Gaussian Mixture
Model with Dynamic Frequency Warping'', Proceedings of 7th European Conference on Speech
Communication and Technology (EUROSPEECH2001), pp.349-352 (2001-9)
4. Hiromichi Kawanami, Tsuyoshi Masuda, Tomoki Toda, Kiyohiro Shikano, "DESIGNING
JAPANESE SPEECH DATABASE COVERING WIDE RANGE IN PROSODY", Proceedings of 7th
International Conference on Spoken Language Processing (ICSLP2002), pp.2425-2428 (2002-9)
5. Kei Fujii, Hideki Kashioka, Nick Campbell, "Target Cost of F0 Based on Polynomial Regression in
Concatenative Speech Synthesis", Proceedings of 15th International Congress of Phonetic Sciences
(2003-8)
6. T. Shiraishi, T. Toda, H. Kawanami, H. Saruwatari, K. Shikano, "Simple Designing Methods of
Corpus-Based Visual Speech Synthesis," Proceedings of 8th European Conference on Speech
Communication and Technology (Eurospeech2003), pp.2241-2244 (2003-9)
7. H. Kawanami, Y. Iwami, T. Toda, H. Saruwatari, K. Shikano, "GMM-based Voice Conversion
Applied to Emotional Speech Synthesis," Proceedings of 8th European Conference on Speech
Communication and Technology (Eurospeech2003), pp.IV-2401-2404 (2003-9)
8. Kazuki Adachi, Tomoki Toda, Hiromichi Kawanami, Hiroshi Saruwatari, Kiyohiro Shikano,
"Perceptual Evaluation of Quality Deterioration Owing to Prosody Modification," Proceedings of the
4th International Conference on Language Resources and Evaluation (LREC2004), pp.2159-2162
(2004-5)
9. ��â�7, àÔ�Å, ��, ôõö÷, Nick Campbell, “H�EI J�êj�Å/��
a^_"/�UiR�%/=s,” �[����()ÌV��~�²|t, 1-P-17, pp.389-390
(2001-10)
10. #Ó6$, àÔ�Å, ��, �³�, ôõö÷, “µ7�a���¢m§�Å/��_�wÁ
_wd�K&'�Ð,” �[����~�²|t, 2-Q-13, pp.399-400 (2003-3)
11. W./0, ����, ôõö÷, ½¾¿4vÀhÁj, “LMÍ6:N�ÕÖ�a�O����,”
�[����~�²|t, 3-Q-12, pp.175-176 (2003-3)
12. ���, ����, ½¾¿4vÀhÁj, “�89����Ð����F0/PQ�8R�289�
wx,” �[����~�²|t, 2-Q-6, pp.323-324 (2003-9)
13. ���������½¾¿4vÀhÁj, "�89����Ð/��������iR�¡jF0
/,*�Å/��ï," �[����~�²|t, 2-P-16, pp.361-362 (2004-3)
14. W./0, ����, ½¾¿4vÀhÁj, ôõö÷, "O����@NAM ��C����Shh
òD"/�;," �[����~�²|t, 3-Q-1, pp.145-146 (2004-3)
15. �UV7, àÔ�Å, ��, �³�, ôõö÷, "�&'�_�w�s©2d�K&'�Ð
������T/&'," �[����~�²|t, 2-P-12, pp.353-354 (2004-3)
PQ./����� KRSTUVUW�
���XVG)� � HI � YJKLM� YJO�
���
�
���Z[G)�
*A�KZ[\]� HI � �JKLM� �JO�1. Yasuharu Den (Chiba Univ.). Are word repetitions really intended by the speaker? Proceedings of the
ISCA tutorial and research workshop on Disfluency in Spontaneous Speech (pp. 25-28). Edinburgh,
UK. Aug, 2001.
2. Michiko Watanabe (Univ. of Tokyo/JST). The usage of fillers at discourse segment boundaries in
Japanese lecture-style monologues. Proceedings of ISCA tutorial and research workshop on
Disfluency in Spontaneous Speech (pp. 89-92). Edinburgh, UK. Aug, 2001.
3. ³´âµ7(`<�/JST). ~§/U�Sò�hxVW����1õ�_/Ö�PX�F©E.Y15
P�[����ÌV��Z[t (pp. 85-90). 200109Ñ.
4. ³´âµ7(`<�/JST). U�Sò�hxVW����1õ�_4\]^/������F©E.
�[����()¶·d��~�²|tI (pp. 277-278). 2001010Ñ.
5. ³´âµ7(`<�/JST). _`~�����1õ�_�sa1�`h/ ¥. Y16P�[���
�ÌV��Z[t (pp. 145-150). 200209Ñ.
6. Yasuharu Den (Chiba Univ.). Some strategies in prolonging speech segments in spontaneous
Japanese. Proceedings of the ISCA research workshop on Disfluency in Spontaneous Speech (pp.
87-90). Goteborg, Sweden. Sep, 2003.
7. ³´âµ7(`<�/Macquarie�/JST)4:è°(±¯�)4bcde(`<�)4fQÄ�(`<�).
gVW/¤¥j1õ�_/Ö�a1.Y18P�[����ÌV��Z[t (pp.65-70). 200409
Ñ.
8. ³´âµ7(`<�/Macquarie�/JST)4:è°(±¯�)4bcde(`<�)4fQÄ�(`<�).
h9i/jké�,µ�lQ</Zm�1õ�_Dnoµí�. �[����20040()¶·
d��~�²|tI (pp. 463-464). 200409Ñ.
9. ³´âµ7(`<�/Macquarie�/JST)4:è°(±¯�)4bcde(`<�)4fQÄ�(`<�).
gVW/p_q41õ�_DtQR�$/lQr��noµí�. �[����20050v)¶·
d��. 200503Ñ.
�
-=�F�G)� � HI � �JKLM� ^JO�1. Michiko Watanabe (Univ. of Tokyo/JST). The function of filled pauses as discourse segment
boundary markers in Japanese monologues. Proceedings of ISCA workshop on Temporal Integration
in the Perception of Speech (p. 51). Aix-en-Provence, France. Apr, 2002.
2. Michiko Watanabe (Univ. of Tokyo/JST). Fillers as indicators of discourse segment boundaries in
Japanese monologues. Proceedings of Speech Prosody 2002 (pp. 691-694). Aix-en-Provence, France.
Apr, 2002.
3. Michiko Watanabe (JST) & Yasuharu Den (Chiba Univ.). When and why do speakers prolong their
speech segments? Proceedings of the 1st JST/CREST International Workshop on Expressive Speech
(pp. 71-74). Kobe, Japan. Feb, 2003.
4. Michiko Watanabe (Macquarie Univ./Univ. of Tokyo/JST). The constituent complexity and types of
fillers in Japanese. Proceedings of the 15th International Congress of Phonetic Sciences (pp.
2473-2476). Barcelona, Spain. Aug, 2003.
5. Michiko Watanabe (Macquarie Univ./Univ. of Tokyo/JST), Yasuharu Den (Chiba Univ.), Keikichi
Hirose (Univ. of Tokyo), & Nobuaki Minematsu (Univ. of Tokyo). Clause types and filled pauses in
Japanese spontaneous monologues. Proceedings of 8th International Conference on Spoken
Language Processing (pp. 2981-2984). Jeju Island, Korea. Oct, 2004.
�
�
�
�
�
GQ_`����KIabcdefghi��j�klm�� �
���XVG)� � HI � �JKLM� YJO�1. vÀhÁj ½¾¿, d���/��s��|"¤�/tu, pp 161-182, in ��j|"III, �v�
�Ö�, 2002.
2. Carlos Toshinori Ishi, "Analysis of autocorrelation-based parameters in creaky voice", Acoustical
Science and Technology, pp. 299-302, 2004
3. Nick Campbell, Donna Erickson, "What do people hear? A study of the perception of non-verbal
affective information in conversational speech", ��¶·Y8wY1Å, pp9-28
��
���Z[G)�
*A�KZ[\]� HI �nJKLM�nJO�1. Li Chiung Yang., "Prosody and Topic Structuring in Spoken Dialogue", In Proceedings of 6th ICSLP
2000, Volume 1, pp. 126-129, 2000.
2. Mihoko Teshigawara, Emi Zuiki Murano, "Articulatory Correlates of Voice Qualities of Good Guys
and Bad Guys in Japanese Anime: an MRI study", INTERSPEECH2004-ICSLP, pp1249-1252
3. Mihoko Teshigawara, Random Splicing: A Method of Investigating the Effects of Voice Quality on
Impression Formation, Speech Prosody 2004
4. Mozziconacci, S., : "The Expression of Emotion Considered in the Framework of an intonation
Model", ISCA (International Speech Communication and Assosiation) ITRW on Speech and
Emotion, pp.45-52, 2000.
5. Yang, L., Campbell, N., Linking Form to Meaning: The Expression of Emotion and Recognition of
Emotions Through Prosody, 4th ISCA Tutorial and Research Workshop on Speech Synthesis, in
CD-Rom proceedings, 2001.
6. Yang, L., Prosody as Expression of Emotion, in CD-Rom proceedings, ORAGE 2001.
7. Yang, L., Visualizing Spoken Discourse: Prosodic Form and Discourse Functions of Interruptions,
2nd SIGdial Workshop on Discourse and Dialogue, 2001.
8. Nick Campbell, "Speech & Expression; the Value of a Longitudinal Corpus", 4th International
Conference on Language Resources and Evaluation, pp183-186
9. Carlos Toshinori Ishi, "A New Acoustic Measure for Aspiration Noise Detection", 8th International
Conference on Spoken Language Processing, pp941-944
10. Ishi, C. T., Mokhtari, P. and Campbell, N.: "Perceptually-related acoustic-prosodic features of phrase
finals in spontaneous speech", in Proceedings of the 8th European Conference on Speech
Communication and Technology (Eurospeech'03), Geneva, Switzerland, pp.405-408. (2003).
11. Ishi, C.T., Campbell, N., Analysis of acoustic-prosodic features of spontaneous Expressive Speech,
First International Phonetics & Phonology, UNICAMP Campinas, Brazil, 2002, 9
12. Ishi, C.T., Hirose, K., Minematsu, N., "Using Perceptually-related f0-and Power-based Parameters to
identify Accent types of Accentual Phrases", Speech Prosody2002, 2002.4
13. Campbell N., "Labelling natural conversational speech data" pp273-274 ASJ2002, 2002.9
14. Campbell, N. and Mokhtari, P.: "Voice quality: the 4th prosodic dimension", in Proceedings of the
15th International Congress of Phonetic Sciences (ICPhS'03), Barcelona, Spain, pp.2417-2420.
(2003).
15. Campbell, N., "Towards a grammar of spoken language: incorporating paralinguistic information",
ICSLP-2002, Denver, Colorado.
16. Campbell, N., Analysis of emotional speech - what constitutes a representative corpus?, II Seminario
Paranese de Processamento de Sinais, 2001, UFPA, Brazil.
17. Campbell, N., Systems for Speech Synthesis, II Seminario Paranese de Processamento de Sinais,
2001. UFPA, Brazil
18. Campbell, W.N., Marumoto, T., "Automatic labelling of voice-quality in speech databases for
synthesis", In Proceedings of 6th ICSLP 2000, pp. 468-471, 2000.
19. Nick Campbell, "Getting to the Heart of the Matter; Speech is more than just the Expression of Text
or Language", 4th International Conference on Language Resources and Evaluation, pp - ( *
Keynote Speech)
20. Nick Campbell, "Perception of Affect in Speech-towards an Automatic Processing of Paralinguistic
information in Spoken Conversation", 8th International Conference on Spoken Language Processing,
pp881-884
21. Nick Campbell, ACCOUNTING FOR VOICE-QUALITY VARIATION, Speech Prosody 2004
22. Nick Campbell, Communicating affect in our speech-analysis of a large acoustic database, Seminaire
de l’AFCP/l3 Journee Parole Expressive
23. Nick Campbell, Listening between the lines; a study of paralinguistic information carried by
tone-of-voice, International Symposium on Total Aspects of Languages (TAL2004)
24. Nick Campbell, Modelling affect in speech communication, The 1st Chinese Conference on
Affective Computing and Intelligent Interaction, Dec 2003
25. Nick Campbell, The role of speaker-listener relationships in determining speech prosody, 6th NWCL
International Conference PROSODY AND PRAGMATICS
26. Erickson, D., Mokhtari, P., Menezes, C. and Fujino, A.: "Voice quality and other acoustic changes in
sad speech (grief)", in Proceedings of the IEICE/ASJ/IEEE Interdisciplinary Workshop on Speech
Dynamics by Ear, Eye, Mouth and Machine, Kyoto, Japan, pp.43-48. (2003).
27. Erickson, D., Ohashi S., Makita Y., Kajimoto N., Mokhtari P., "Perception of naturally-spoken
expressive speech by American English and Japanese listeners" pp31-36 JST/CREST Workshop2003
28. Mokhtari, P., Iida, A., Campbell, N., Some articulatory correlates of emotion variability in speech: a
preliminary study on spoken Japanese vowels, ICSP-2001, pp.431-436, 2001.
29. Mokhtari, P., Pfitzinger, H. R. and Ishi, C. T.: "Principal components of glottal waveforms: towards
parameterisation and manipulation of laryngeal voice-quality", in Proceedings of the ISCA Tutorial
and Research Workshop on "Voice Quality: Functions, Analysis and Synthesis" (Voqual'03), Geneva,
Switzerland, pp.133-138. (2003).
30. ÆOfÇ, ½¾¿ vÀhÁj, d�«1�8R�2d�`fóhò !, 3-10-10, pp347-348, �
[����20020()¶·d��, 2002.9
31. ÆOfÇ, ½¾¿4vÀhÁj, "��<d��Å2j�2d�xy�z/ ! -�{©�j��K��
W��-", �[����20030(), pp.275-276(200309Ñ)
32. ÆOfÇ, ½¾¿4vÀhÁj, =�������¥|���`_h/d�uhp, �[���
� 20040 v)¶·d��~�²|t(229-230), 200403Ñ
33. ÆOfÇ, ½¾¿4vÀhÁj, }ÔK~, ��������d�`fóhò !, �¡¢ì��
¶·¡ó(2002-SLP-40-19), pp.109-114, 2002.
34. ÆOfÇ, ½¾¿4vÀhÁj, ¥|���`_hbMÕÖ<"(Ö�Zm)
35. ÆOfÇ, ½¾¿vÀhÁj, JST/CRESTd�¶·ñØ�2¿x/��=�ê_`Á_w, 3C5-11,
:�µ7��2002, 2002.5
36. íîïð, "�89����Ð/2�/���Áj�¡�a�ä8����<", " øïùú
û�ÇÈ��ü���B²| (2002- �3)
37. Mokhtari, P., Perceptual validation of a voice quality parameter AQ automatically measured in
acoustic islands of reliability, �[����20020v)¶·d��~�²|tpp.401-402, 2002.
38. íîïð, ����, Nick Campbell, "�$�������������, " �[����()
ÌV��~�²|t, pp.261-262 (2001-10)
39. ë�ìbíî7(Ó�ÈjØw��(½¾¿� vÀhÁj("Intraspeaker Voice-Quality Variability
with Interlocutor", �[����20040()¶·d��(pp279-280
40. Ó�ÈjØw�� "Creakyd�/�����/ ! ", �[���� 20030() ,
pp.235-236(200309Ñ)
41. Ó�ÈjØw��, ���Y/bM&Ö���������_`/���, ��¶·�2004/12/10
42. Ó�ÈjØw��, �Q���/��Y�,\µ������_`/&'�, �[����2004
0()¶·d��,
43. Ó�ÈjØw��, ½¾¿ vÀhÁj, i/67�±Þ, �[����20040v)¶·d�
�
44. Ó�ÈjØw��, ½¾¿ vÀhÁj, ���¤Td���/�����/ !, 1-10-23,
pp275-276, �[����20020()¶·d��, 2002.9
45. Ó�ÈjØw��, ����������i/��4����/ !, pp311-312�[���
�2003v)¶·d��
46. Ó�ÈjØw��(���Y/bM&Ö���������_`/���(�7�¡§Ä��2004
08Ñ(Ä�Ç¡Vol.104 No.253(pp19-23
47. Ó�ÈjØw��(�Q���/��Y�,\µ������_`/&'�(�[����2004
0()¶·d��(pp295-296
48. ë�ìbíî7(Ó�ÈjØw��(½¾¿� vÀhÁj("Intraspeaker Voice-Quality Variability
with Interlocutor", �[����20040()¶·d��(pp279-280
49. Campbell, N.: "Voice characteristics of spontaneous speech", �[���� 20030() ,
pp.231-232 (200309Ñ)
50. ½¾¿ vÀhÁj, :��DØ�¾xj��jQ, �KT����¤, Y4PK1�_�×s:/�, Ø
�¾x/� 2003 Sep
51. ½¾¿4vÀhÁj, "Predicting the prosody of speech for synthesis", Y14P�Ã,*��,N�
»\���, 2002.11
52. �bc(���T(½¾¿� vÀhÁj(�=���/C©��C©D�F©E/ !�(�[�
���20040()¶·d��(pp283-284
�
-=�F�G)� � HI � �JKLM� oJO�1. Mokhtari, P., Pfitzinger, H. R., Ishi, C. T. and Campbell, N. (2004). "Laryngeal voice quality
conversion by glottal waveshape PCA", in Proceedings of the Spring-2003 Meeting of the Acoustical
Society of Japan, Atsugi, Japan, Paper 2-P-6
2. Iida, A., Campbell, N. "Developing an AAC Device with Natural Speech Output - The First Stage:
Deciding on Discourse Labels, " CREST International Workshop on Expressive Speech Processing,
Kobe, Japan, pp. 103-106.
3. Li-chiung Yang. 2001. "Prosody as Expression of Emotion". Proceedings of ORAGE, 2001,
Aix-en-Provence, France, June 2001
4. Li-chiung Yang., "The Expression and Recognition of Emotions Through Prosody". Proceedings of
ICSLP2000. Volume 1, pp. 74-77, 2000.
5. Carlos T.C., Towards Automatic Detection of Creaky Voice, Workshop on Speech & Sound
Processing in Relation to Auditory Representation, 2003,Aug
6. Ishi C.T., Campbell N., "Acoustic-prosodic analysis of phrase finals in Expressive Speech" pp85-88,
JST/CREST Workshop2003
7. Ó�ÈjØw��, Using Perceptually-related F0 and Power-based Parameters to identify Accent
types of Accentual Phrases� Speech Prosody 2002, pp.407-410., 2002.
8. Campbell, W.N., "Expressive Speech Processing: The JST CREST ESP Project", 1st International
Symposium User-System Interaction, Institute for Perception Research (IPO) Holland, 2000.
�
pqrst��:'uv����� Kw�xyTUTUz�
���XVG)� � HI � �JKLM� �JO�1. Iida, A., Higuchi, F., Campbell, N., Yasumura, M., “A corpus-based speech synthesis system with
emotion,” Speech Communication, Vol. 40/1-2 pp. 161-187, 2003.
2. Iida, A. and Campbell, N., “Speech database design for a concatenative text-to-speech synthesis
system for nonspeaking individuals, ” International Journal of Speech Technology, Vol. 6, Issue 4,
pp.379-392, 2003.
3. ©Ôuâ, £��KL, ��|:, ½¾¿4vÀhÁj, ¥O§�, �=���/2�/����
�Ðwu×/�dj"#�, �Ú_Fhfh`12_w��²|�,� Vol.2, No.2, pp. 169-176,
2000.
4. Iida, A., “A Study on Corpus-based Speech Synthesis with Emotion,” �ª§¨����ü��4�
êõJ¶·û�Ö²|, 2002.
5. ©Ôuâ, �����j�:#�3��2���Ð<"j�óÚ½�_h��%/¦s�,
Sophia Linguistica, No. 50, pp.179-195, 2004.
�
�
�
�
�
��������
���� �� ���� ������ ����1. Iida, A., Iga, S., Higuchi, F., Campbell, N., Yasumura, M., “Designing and Developing a
Conversation Assistive System with Speech Synthesis and Emotional Speech Corpora,” In
Proceedings of ISCA (International Speech Communication and Association) ITRW on Speech and
Emotion, Belfast, U.K., 2000, 9, pp. 167-172.
2. Iida, A., Mokhtari, P., Campbell, N., “Acoustic correlates of monosyllabic utterances of Japanese in
different speaking styles,” In Proceedings of 15th ICPhS, Barcelona, Spain, pp. 2861-2864, 2003.8.7.
�
�������� ���� ������ ����1. Iida, A., Campbell, N., “A database design for a concatenative speech synthesis system for the
disabled,” In Proceedings of ISCA 4th International Workshop on Speech Synthesis, Perthshire, U.K.,
2001,8.31, pp. 188-194.
2. Iida, A., Sakurada, Y., Campbell, N., Yasumura, M., “Communication aid for non-vocal people using
corpus-based concatenative speech synthesis,” In Proceedings of Eurospeech 2001, Aalborg,
Denmark, 2001, 9, 7, pp. 2409-2412.
3. Iida, A., Campbell, N., “Developing an AAC Device with Natural Speech Output - The First Stage:
Deciding on Discourse Labels,” The First CREST International Workshop on Expressive Speech
Processing, Kobe, Japan, 2003, 2, 22, pp. 103-106.
�
������ !� �
��� �
• 2000,11,10 ������� �� ���������IT�����6��
• 2001,1,1 ������ !���"#$%&'(�)*+��39���
• 2001,1,1 ,-(��� ./0 IT120� �C 3��
�
�"#�
��34�
�
$%&'�
5678(�
• 2001,1,27 RKB 9�:;� 8(<=>?@ABCDE�%FGHIJKLM�
��NOPQRALS���STUV�
WXYZ�
�
• 2003,7,9 �[\]�/�O3Q^�P_�`a>bcCde@fghi�j
k�lmnopq�r@sBC�tC
• 2003,7,26 �u^vwxyz{|}~4P�������`a>bcCde@
fg�����lmpq����y8�����25����p+
• 2004,7,24 ���v����������d�5��jk���������
13�������q���+� ¡¢�£¤�lmpq
�
()*+,-�.� �/012�34526�
¥¦�§¨©Cª�«���¬q�`a>bcCde@������Q®q¯�°�¯
±�²�³´µ¶E·�OPP_����¸/¹�º�»¼\0½«¾½¿À/³³0�
Q®¾ÁÂy/ÃP/�ÄŽ�¸¹/ÆÇ_ÈQ0ÉÊ˽ÌÍPÎ�ÊϽ�ÐÑ
ÒÓ�ÔÕÖ×K� �«��½ØÈVÙÚÛÜ�ÔÕÖ×K� ���ÝCÞ�ßàV
ãäåÝæ! çÓèÕ�é¥êëìÖ×ØÙì,íî�ç!Nï�n^YãäåÝæ�ð
ñò! Ö×ØÙìóô´Ù�ç!N���õö��Y÷ø�ùÈ&!4úûó�Å5Ë
��rÂ4ü�ý�þ��ý�Å5Sb¿ �º¿�b¶��T`����¿�°Ë�
�
���XVG)� � HI �oJKLM� �JO�1. mnoð4�Ôz74W�Eð4Q[|â7� 20010�Ñ\�� �Èu�gjñØ�êõ�, ��[�µ
û��Y18P��d�²|t , pp. 52-53.
2. u vw� 20010\Ñ30� �¡<��R�¢£�,�ù¤���åRi�sY¥¦ÌV�åR�
��È�§ , ö¨��Ö�Ý, pp. 204-207.
3. Q[|â7� 2002011Ñ%�� �\Í�©g/ñØ�êõ���`J�J�[RýªVÂhp�«
ײ|t@nwC �WV¬3xVR�ü�pp.124-131.
4. Q[|â74mnoð� 20020%Ñ22�� ��[R/��������Qm�j(�[R�5$/ì
�1�, Department of Japanese Studies, The Chinese University of Hong Kong and Society of
Japanese Language Education (ed.), Quality Japanese Studies and Japanese Language Education in
Kanji-Using Areas in the New Century, Himawari Publishing Company, pp. 455-461.
5. W�Eð� 2002011Ñ%�� �WVR/M^���®�¤���ѯiR �Y31w�Y12Å�
pp.74-79.
6. W�Eð4uvw4�Ôz7� 20020+Ñ31�� ����`_hjR/kl��Chinese Language
and Culture, No.3, pp.1-34.
7. u vw 2002010Ñ15� �WVR/��sd�°1j�â/^±�, ��Ð140�Ã,*�
�,N�»\���~�²|t , S, 49.
8. mnoð(²)� 2002011ÑG�� ���K�j�X��/iR� . `<: ³F´µ¶.
9. mnoð� 2003010Ñ10�� �Í�jµ�·�óÚ½Èuõ¸4wx�u�_�, �¹|º·�»jý
¼/¶·· , Y48w, Y12Å, `<: º½Ý, pp. 54-64.
10. �Ôz74uvw4W�Eð 20030+Ñ31� ��±^\Í�©����|"j�����[R
|"��(²), ��[R|" , Y+w, Y%Å, `<: �v��Ö��pp.100-116.
11. Q[|â7� 20040+Ñ31�� �Ö¾FxI��1Y�¿nD�âfhxÀ_hs2Á�89��
��ÔÂÃ)7���[R¶·Sh`_¡ó , Y12Å, �ÔÂÃ)7���[R¶·Sh`_,
pp. 41-53.
12. Sadanobu, Toshiyuki 20040ÄÑ30� "A natural history of Japanese pressed voice," ���¶
· , Y\w, Y%Å, �[����, pp. 29-44.
13. mnoð� 20040&Ñ30�� ��[R/�Qm·2Á�89·�, ��|"¶·�(²), �|"j�
�IV , `<: �v���, pp. 35-52.
14. u� vw� 20040&Ñ30�� �M^ÅR|����WVR×_Æä^ a /��jkl¸67�, �
�|"��(²), �|"j��Ç , `<: �v��Ö�, pp. 53-76.
15. mnoð� 2004010Ñ25�� ����óÚ½�_hýª/ÈÉ#jÊË�, ��[Rýª , Y
123Å, �[Rýª��, pp. 1-16.
16. u� vw� 2004011Ñ��� �¡<��Qn�g¢£�Ì��� !�, �[WVR��(²),
�WVR� , Y251Å, pp. 1-13.
17. ����� 2005����� �� ��� , ����, �34�, ���, ��: �����, pp.
30-37.
�
��������
���� �� ���� ������ ����
� � !"#�� $%%$�&�$%� '()�� *� +,!+��--.���/�.012345�
-6 �
�
$ � ����� $%%7�����%� 8�9:;�<�=>? *� �<@A.01+5+�@AB6 �
�
�������� � ���������� ����
1. ����CD8EFG� 2000�11�26� 8��'(HIJK9:; LM8�-NO�P
QR , The Fifth International Symposium on Japanese Studies and Japanese Language Education
01ST!U�-(!+)6
2. ����� 2001�10�27� .V�!�WX XY , Z[��-.�26\�.]^_`ab
cdefLgh�U 01ij�-6
3. !"#�� 2002�8�4� !+��kl , mn@A]^_`abckl�mn 01opqr
+��-6
4. ����� 2002�s�15� tuvw�xyz , !+{\|}~����, !�8U
�C8���@A+,��^��01��r+�-�(!+)6
5. Sadanobu, Toshiyuki 2003. 2. 22. "Expressive speech and grammar: with special reference to
pressed voice," The 1st JST/CREST International Workshop on Expressive Speech Processing, Japan
Science and Technology Corporation, pp. 55-60.01op�-6
6. !"�GC����� 2003���22� 8�C!+�mVHIJK�����CL�����
��� , The 1st JST/CREST International Workshop on Expressive Speech Processing, Japan
Science and Technology Corporation, pp. 79-84.01op�-6
7. ���C!"#�C���G� 2003�2�22� !+����� ¡¢��£�yz�¤¥¦
��§; , The 1st JST/CREST International Workshop on Expressive Speech Processing, Japan
Science and Technology Corporation, pp. 49-54.01op�-6
8. Sadanobu, Toshiyuki 2004. 8. 22. "Voice quality and grammar: with special reference to Japanese
pressed voice," The 6th symposium of Nordic Association for Japanese and Korean Studies
(Goteborg, Sweden)
9. ����� 2004�10�30� .V�<L�V¨©�_¥ Hª«h , Z[��-.�29\�.
]^_`abc�V¨©�_¥L.V�< 01�¬r+��-6
10. ���GC!"�G� 2004�10�30� ¨©�®¯K¨©�,®¯°«¨©� , Z[��-.�
29\�.]^_`abc�V¨©�_¥L.V�< 01�¬r+��-6
11. ���� 2004�10�30� r+±-NOH�V¨©�_¥²³´µ�¯K¶:· , Z[��-
.�29\�.]^_`abc�V¨©�_¥L.V�< 01�¬r+��-6
12. !"�G� 2004�11�7� 8�C!+�=>�VHIJK¸¹º»��� , 8!+�-
.�54\¼+�.01�¬�-6.
�
�
��01V������ K������������H�O�
� ���XVG)� � HI � oJKLM� �JO�1. Rilliard A & Aubergé V (2003), Prosody evaluation as a diagnostic process: Subjective vs. objective
measurement. International Journal of Speech Technology, Kluwer Academic Publishers.
2. Aubergé V. (2002), Prosodie et émotion, 2e Assises nationales du GDR I3 (Information Interaction
Intelligence), Cépaduès-Editions, 263-274
3. Aubergé V., Cathiard M. (2003), Can we hear the prosody of smile? Numéro special Emotional
Speech, 40, Speech Communication Review.
4. Audibert N, Aubergé V, Rilliard A., Rossato S. (2003), Capturing the emotional prosody in live but in
lab, Prosody & Pragmatics.
5. Morlec, Y., G. Bailly, & V. Aubergé (2001) Generating prosodic attitudes in French: data, model and
evaluation. Speech Communication, 33(4): p. 357--371.
���Z[G)�
*A�KZ[\]� HI ��JKLM� �JO�1. Aubergé V. (2003), “Expressive Speech in France”, 1st JST/CREST Int WS on Expressive Speech
Processing, Kobe, 10-19.
2. Aubergé V. (2003), Expressions, attitudes et expressivité: une architecture cognitive distribuée pour
les voies parlées des émotions. Interfaces Prosodiques 2003, Nantes, 319.
3. Aubergé V. (2003), Integration of emotional, pragmatic and meta-linguistic affective information in a
superpositional functional Gestalt model of prosody, Prosody & Pragmatics, Preston.
4. Aubergé V. (2002), A Gestalt morphology of prosody directed by functions : the example of a step by
step model developed at ICP, Proc of 1st Int Conf on Speech Prosody 2002, Aix-en-Provence, France,
151-155
5. Aubergé V (2001), Le sourire parlé, Actes du Colloque Emotions, Interactions et Développements,
121-125.
6. Aubergé, V. (2000) Modélisation de la prosodie par formes globales : amont ou aval de la phonologie
tonale in Journées d'Etudes sur la Parole. Aussois - France. p. 281-284.
7. Aubergé, V. and L. Lemaître. (2000) The prosody of smile. in ISCA Workshop on Speech and
Emotion. Newcastle - Ireland. p. 122-126.
8. Aubergé V., Audibert N., Rilliard A., (2004) Acoustic Morphology of Expressive Speech: What about
Contours, Int Conf on Speech Prosody, 91-95, Nara.
9. Aubergé V., Audibert N., Rilliard A., (2004) E-Wiz: a trapper protocol for hunting the expressive
speech corpora in Labs, 179-182, LREC , Lisbon.
10. Aubergé V, Audibert N, Rilliard A. (2003) Why and how to control the authentic emotional speech
corpora, Proc of Eurospeech, Genève, 185-188.
11. Aubergé, V & Rilliard, A. (2000), Prosody evaluation: quality measurement or diagnostic?,
Workshop COST 258, Stockholm, février.
12. Brichet C. & Aubergé V. (2004) Domaine de la fonction de focus dans la perception prosodique, JEP,
Fès.
13. Brichet C. & Aubergé V. (2002) La prosodie de la focalisation en français : faits perceptifs et
morphogénétiques, XXIVèmes Journées d'Étude sur la Parole, Nancy, 24-27 juin 2002
14. Brichet, C. & V. Aubergé. (2001) La focalisation en français : morphologie de la prosodie. in Actes
des Journées Prosodie. Grenoble - France
15. Rilliard A. & Aubergé V. (2004) Evaluating an authentic AV expressive speech corpus, 175-178,
LREC.
16. Rilliard A. & Aubergé V. (2002) Towards a linguistic validation of a prosodic generation model,
Proceedings of the first International Conference on Speech Prosody, 607-610.
17. Rilliard, A. and V. Aubergé. (2001) Mesure de l'intelligibilité de la démarcation prosodique, Actes des
Journées Prosodie, Grenoble, 483-487.
18. Rilliard, A. & Aubergé, V., (2000), Perception and Analysis of a Reiterant Speech Paradigm: a
Functional Diagnostic of Synthetic Prosody, Proc of 2nd International Conference on Linguistic
Resources and Evaluation, Athènes, Grèce, pp.661-664.
19. Rilliard, A. & Aubergé, V., (2001), Prosody evaluation as a diagnostic process: subjective vs.
objective measurements, 4th ISCA Workshop on Speech Synthesis, Atholl, Scotland.
20. Rilliard, A. and V. Aubergé. (2000) Perception and Analysis of a Reiterant Speech Paradigm: a
Functional Diagnostic of Synthetic Prosody. in International Conference on Language Ressources
and Evaluation. Athens - Greece. p. 661-663.
�
-=�F�G)� � HI � YJKLM� nJO�1. Aubergé V (2001), Prosodie et fonctions: libertés morphologiques et contraintes fonctionnelles, Actes
des Journée Prosodie, 35-39
2. Aubergé V & Lemaître L. (2000), Audio-visual expression of amusement : some over-added
information, Workshop COST 258, Aix, septembre.
3. Audibert N, Aubergé V, Rilliard A (2004), EWiz : contrôle d’émotions authentiques, JEP, 49-52,
Fès.
4. Audibert N, Aubergé Rossato S. (2004), Paramétrisation de la qualité de voix : EGG vs. filtrage
inverse, JEP, Fès.
5. Rossato S., Audibert N. & V. Aubergé. (2004) Emotional Voice Measurement : A Comparison of
6. Articulatory-EGG and Acoustic-Amplitude Parameters, Int Conf on Speech Prosody, 53-57, Nara.
�
����������� KIabcdefghi��j�klm�� �
���XVG)� � HI � �JKLM� �JO��k� ½¾¿� vÀhÁj(���iR�¡��FYÍ/�����_`_�(|"j��Ç��|"¶·
�²(�v��Ö�(�� �Î,��
�
k� ½¾¿� vÀhÁj(���|"/ÏY¤�/�Ð�F©E�(|"j��Ç��|"¶·�²(�
v���(�� - Π->�
�
���Z[G)�
*A�KZ[\]� HI � nJKLM��JO�1. Campbell, N., "Building a corpus of natural speech - and tools for the processing of expressive
speech - the JST CREST ESP Project", �_¿¾ñ���i¯/û�j���, `<��, 2001
2. Campbell, N., "Databases for Concatenative Speech Synthesis", Univ. of Munich, 2000
3. Campbell, N., "Databases of Emotional Speech", in Proc ISCA (International Speech
Communication and Association) ITRW on Speech and Emotion, pp. 34-38, 2000.
4. Campbell, N., “Future Directions for Speech Synthesis-a personal view”, IEEE speech Coding
Workshop, Oct 2002 (* Keynote Speech)
5. Campbell, N., "Integrating Different Prosodic Systems in Speech Synthesis", Prosody, 2000: Speech
recognition and synthesis workshop, 2000
6. Campbell, N., "Recording Techniques for Capturing Natural Every-Day Speech" LREC2002, May
2002
7. Campbell, N., "What Type of Inputs will we need for Expressive Speech Synthesis?", IEEE2002
Speech Synthesis workshop, Santa Barbara, 2002.9
8. Campbell, N., "tap2talk: an Interactive Interface for Large Speech Corpora" pp223-224 ASJ2003.3
9. Campbell, N.: "Towards Synthesising Expressive Speech; Designing and Collecting Expressive
Speech Data", in Proceedings of the 8th European Conference on Speech Communication and
Technology (Eurospeech'03), Geneva, Switzerland, pp.1637-1640. (2003).
10. Gerard Bailly, Nick Campbell, Bernd Mobius, ISCA Special Session: Hot Topics in Speech Synthesis,
8th European Conf. On Speech Communication and Tech.(Eurospeech2003)
11. Nick Campbell, "Advances in Conversational Speech Synthesis", Advances in Speech Technology
2004(11th International Workshop),
12. Nick Campbell, "Extra-Semantic Protocols; Input Requirements for the Synthesis of Dialogue
Speech", Affective Dialogue Systems, pp221-228
13. Nick Campbell, "Speech & Expression; the Value of a Longitudinal Corpus", 4th International
Conference on Language Resources and Evaluation, pp183-186
14. Nick Campbell, Specifying Affect and Emotion for Expressive Speech Synthesis,
CICLing-2004(Fifth International Conference on Intelligent Text Processing and Computational
Linguistics) (* Keynote Speech)
15. Nick Campbell, "Synthesis Units for Conversational Speech - Using Phrasal Segments -", �[
����20040()¶·d��(pp337-338
16. Nick Campbell, User Interface for an Expressive Speech Synthesiser, �[���� 20040 v)
¶·d��~�²|t(pp253-254), 200403Ñ
17. ½¾¿� vÀhÁj(��_�wÁ_w���ÐÇÈ/MX[ 4Ñ]―�I_���_�w�a��
��Ð―�(�7�¡§Ä���Vol.87 No.6 pp497-500
18. ½¾¿4vÀhÁj, Collecting Really Spontaneous Speech, ����-¹�2��iR�¡¢ì/
¨1¡�Y2PÌÍ�§, pp.155-158, 2002.
19. ½¾¿4vÀhÁj, DAT vs. Minidisc Is MD recording quality good enough for prosodic analysis?,
�[����20020v)¶·d��~�²|t, pp.405-406, 2002.
20. Mokhtari, P. and Campbell, N.: "Quasi-syllabic and quasi-articulatory-gestural units for
concatenative speech synthesis", in Proceedings of the 15th International Congress of Phonetic
Sciences (ICPhS'03), Barcelona, Spain, pp.2337-2340. (2003).
21. Mokhtari, P., Campbell, N., "Automatic Detection of Acoustic Centres of Reliability for Tagging
Paralinguistic Information in Expressive Speech", LREC2002 2002.5
22. Mokhtari, P., Campbell, N., "Automatic Characterization of Quasi-Syllabic Units for Speech
Synthesis based on Acoustic Parameter Trajectories: a proposal and first results", 1-10-5, pp 233-234,
ASJ2002, 2002.9
23. Mokhtari, P., Campbell, N., "Some Properties of the Glottal AQ Parameter Automatically Measured
in Expressive Speech", LP2002
�
-=�F�G)� � HI � �JKLM� �JO�1. Carlos Toshinori Ishi, "A New Acoustic Measure for Aspiration Noise Detection", 8th International
Conference on Spoken Language Processing, pp941-944
2. Mokhtari P., "A proposal for acoustic-articulatory gestural units in concatenative speech synthesis"
pp253-254, ASJ2003.3
3. Campbell, N., "Recording and Storing of Speech Data" LREC2002, Satellite Workshop, May 2002
4. Mokhtari P. "Automatic processing of expressive speech: physiologically-motivated but robust
analysis" pp97-102, JST/CREST Workshop2003
�
2���G)�
� � � Long-Term Research
Since 1989, the Advanced Telecommunication Research Institute near Kyoto has conducted
some of the world's most significant, long-term research in human-machine communications.
Now the Institute is being restructured, and more than a decade of quiet research is coming to
fruition ... pp.12-23, Sept 2001.
bcd�effgh�
Computers get emotional
Kentucky.com, KY - Dec 9, 2004
... Nick Campbell, a speech synthesis researcher at the Advanced
Telecommunications Research Institute in Kyoto, Japan, says it
helps to understand how the speech ...
Synthesizing human emotions
Baltimore Sun (subscription), MD - Nov 29, 2004
... Nick Campbell, a speech synthesis researcher at the Advanced
Telecommunications Research Institute in Kyoto, Japan, says it
first helps to understand how the ...
Computers get emotional
Lexington Herald Leader, KY - Dec 9, 2004
... Nick Campbell, a speech synthesis researcher at the Advanced
Telecommunications Research Institute in Kyoto, Japan, says it
helps to understand how the speech ...
No laughing matter
The Scotsman, UK - Dec 9, 2004
... Nick Campbell, a speech synthesis researcher at the Advanced
Telecommunications Research Institute in Kyoto, Japan, says it
first helps to understand how the ...
No laughing matter
Electric New Paper, Singapore - Dec 8, 2004
... Mr Nick Campbell, a speech synthesis researcher at the
Advanced Telecommunications Research Institute in Kyoto, Japan,
says it first helps to understand how ...
�������HI � oJKLM� ������ �JO�
* I �d�$s¼½¾¿½À�ÁÂÿĽ¾ �Ò�Ã�«½À�ÓÙÉÉ�
'Ós���/���¨©ÄÔ#LÕµ» �Ömµ�2�/×z�apX/2
�/ñØò�×(��ÄÅ/���¨©ÄÔ#LÕµ» �Ömµ�2�/
×z�apñØò�×(T�p�`|�gØÕÖ×z�apñØò�×�
Ö�ÙÅs �� Ú���,>��
��s �� k��k���
�
d�$s½¾¿� vÀhÁj�
'Ós���Ð×znp�hÛÚ_`ñØò�×�
Ö�ÙÅs ��,Ú���� ��
��s ��,k�,k� �
�
d�$s�_Ü×��¿`g(Ó�ÈjØw��(� �
Ü_x×¾x1õÝõhi_(½¾¿� vÀhÁj�
'Ós�a�êj(ÐD"(�a^_D"(Þp�XY�/2�/�hÛÚ_`ñØ
ò�×(´ßñØò�×�àè�2àèáÍ(np´ßñØò�×�a�ñØò
�×éY2�hÛÚ_`�
Ö�ÙÅs ��,Ú�,� �>�
��s ��,k� k ��
�
d�$sÆOfÇ(½¾¿� vÀhÁj�
'Ós����¡ !×znpX/â¢ì×z�
Ö�ÙÅs ���Ú�����-�
��s ���k�,k�.�
�
d�$sÓ�ÈjØw��(½¾¿� vÀhÁj�
'Ós��ê_`/��YãäbM&Ö×z�ap��YãäbM&ÖñØò�×�
Ö�ÙÅs ���Ú �-..-�
��s ���k�>k���
-LM���� �� Ú���,>�/åæçÖ��
�Ö�ÙÅs¼«z��¼�,���>���
���s ��,k� k ��
� � � � �èmVsJ�gÈ(È]é�
� � � � �
����
���� ������@� ¡D¢�£K¤ �D���3x@�¥¦§¨K�
����©ª�
H5iJ�jklmnopqrst�uXvwxyz{�|mn}w�~�u�t�
�*+���2�w��t��r��k����w�mn���������#u�
s}wu�����u����xn����#��m�x�o��x�
����*+������r����q�� ¡��H�¢Q£¤¥¦�§¨q©
ªtG«¬�®¯�°�*+�±�(²³´w��t�ªµs�¶Uk·¸¹�n�
ºu»_�¼®½�a¾a¿k©ªÀUÁ��xn�
Â:.GÃP����ÄÅuÆx�t̂ Çuw��£���ÈÉq©ª�Fq©nkt
Ê��H5iJ�jkËxt©nx£H5iJ�jkÌÍÎn�£tÏÐÑÒÓÔÕÖ
�^Çw¼×�ØÙÚªÛq©ªÜÝwrnHÔ��q©n�
����uÆx�tÎÙuÞ[ußànáâkq��xrx�£ãäq©nkt�å
æçE�èqtÁ�uéarÈ·����TU*+wxy�ª£ê�ët��r*+�
°�±x�wxmn�ì5í�í�rHIJ0K�LM5:;<�tî���2�k
��x()(²*+�·ß���
ïð���������ñ�wòut��róP��ô����õ������kt
Ê��öÆ����÷�E£tøù�^kúûu¶Uq�n��w��tüý�TU*
+þ�wrn��x�m��tG«�q�n}w���u�sÎn�ªtG«q�rx
}w�~�uÎn�kî ¡�*+���2��?�q©ªt�æq©nw�mn�
�
��«� ��¬)EC®��¯>°9±²³´µ¶®�
���������� �
�¦£t�°w����÷�EÎn q©��kt����/�����/�÷
���/�����/r�����k��q©nw�½���ok�½�tÊ�-�ª
u��kµs�����n}wkq���G«q�t}��°�¶Ý����u�p
�k��ª�xw�ykt}��æk�VÇ�è½½nw�����½�t � !
���uÆx�t}��VÇ£"#æu$%k©nw�m����&xq©n�Ê�
èt ���'(½�£)*�+Æ*+R��q©nkt}��VÇq��,u��
�t}��'(��-Î}w.�/�0ª1p�w���n�
���������
^23�45uÆx�t:;���*+£64�45�ª^7q©n�8t*+3�
µs£H�¢QÒ�j�9�:�;}��<6�=|u>ª�p��t_µs�?õ*
+/ÃÔ�í@j�k}�AB�CuÏâ9r*+DE�F��tG�������H
Iu�$%k©���
JK½�LYr��/�¨r�MtºuAB�Nê?õ*+/�OP�OQ�Ëy}
wkq���î���2��Ò�j�R�S���¶UwòutT��mU�*+
Ô��V�WuAB����m��}�AB£�*+¼®uX��tÁ�uYZ
u[m��n�}�\7?õ*+/�kÃk�Ê�"q]�æ^_�+Æq©ëy�
`3uÆx�t"#"aÖb�\]u¶U�tî���2��*+ckT���
���u�.În}wkq���drn�E�ef�gÝu�hwijuî*+�k[
�[l�t�"#m$u���t���*+��mU�nEq������o�up
�n}wqtÁ�un7r"#æq�r�s�q���
Y;uÆx�tG«t"aKqi �Ò�j6�QF�kçË]q©n�}�£î
���2�k}��°u�tuuª¸�o��q©n�v�uw���¶w£�xë�
oktTuyx��}�*+�µ�E�µ��æz{¨��|În}w£tµ�r¶
w��n}wuÆrkn�}~æ^Ç�ÄÅ��-�t�x�£tXîæÄÅ��Lu
rn�
������������
*+��jut�xH5iJ�j��¼��<�În�ªt~�r��tø���<
�������u����x�Ê}qtÔ��5���Ã�Ç�XvP���Ò�j6
�Q�ºu���i r*+k�¹��xn�qtÊ��NO��x�
¹�*+-./t ^*+/�gݽ��t}����2��z½Î�8t��@
���íÃ��S������ír��ø������ª�x��ðt��uP��t
�yrn½��rx�qt��u*+������ÀUk~�r���� ��x�
�¡¢ ¡�*+���2����În}w£t*+-./uw��t½rª�£
¤�Â¥n�Ê�u·������ß�ti��è�¶w�ÕÖu¦CÎn�§�Â¥
n�
������� ��
}�*+�TUuP��t¨©æuö2�����2�Aªk©n�Ê�øÆ
£ºu�¹��yªt�yøÆ£G««¿]q©n��
=§¬�®æ:;&'*+±�¯ç°Ð±²³´µ¶·q£t}����2���
� !ÌÍ���(îw��t¸¹:;�º¨t»5���.�TUu�½y��
¼£t²³´µ¶�½¾J��5HIJ0K�LM5�ó[ôk¿8n�À1��*+
±�Á���S���½�Â�
�%���³�, a�a�õ^�CD�ághC��eDF���������"
·5�¡���mn_5����§¨��b�n^§�ì^��ý�|;�����
_5�¡�ŸcdAefFghC|;���«�E��z���¡�mn����_
5� %�ý����¿�C8E !Câ«F�"#��$�%û��&FD'(ì
P)ì���«�E������õ*�_5�+�ý¬,âD�¬�mn�-.þ
��ý&FD/F�����«�E���0m��1_�23�¿4567àF
�8â��9¬�:;|;ì��|;�<È�P)=>àF�^�?��:;
���|;µ¶����ì@���cdAefFghC|;�P)A9â�àF��
�|;� µ¶^�§�ì�ý�|;�P)B&B��àF�0^�������
¿ùC5^�D��^³EF�GH��&«ùIJ0^�������¿ùC5^��ì
R���Kýì�ýì{|�L��MN��&«��$�O 5��¶�'P��Ê
A9â�4Qn�RF !Câ�¡zg�8��²�SäT¸��4"b5��«�
E�����mn�U°��¶�½º,89âåbV�DWFf�F��¿��
�RI°���XY���}� Z�16�ý[18�ý!
«¿]�®æÃB*+¯ç�Ä� =>�6�*+ ICORPj@�����2�
£tóE-Å�LM534V4: Ambient Intelligenceu�p����.Gs!u�nÂÆ
:;<�LQRS�*+ôq©n�
óE-Å�LM534V4ô�*+£t��oF�Ò�j�Ò�juÇ��H5iJ
�ju0ª¸êóÂÆ»5�ô��wt��ÈxUu�nì5í�í�:;<����
*+q©n�^Ç��kÉÎÂ::;���ËÊ:;��/ËÐ:;r��óÂÆæ:
;ô�ÌU�u<�În�8�()*+q©n�øU£tÌÍu�p�t��ÍÎu�
n�æ:;��Ï�Ðß�ÅÒ�E�Ëxt�yøU£t��u�p�t�9Ç�Ô�
ÑqÒS�±��Ëy��Óu£.�rx��:;tƹª��aÔ���:;�è�
�� !q.GÎnì5í�í�rHIJ0K�LM5:;�(u�t¢<��:;�
����u]mn�}���£ÑÒÓÔÕ�.��oFâF�ÃÏÐÑÒÓÔÕÖ�z
À:;��.�TUu�p�t��.:�ºê.GÖ½r������UE�tÁ�u
£t��u0ª×s@5Rq��5�r:;&'����.�(²ÌÍ»5���u�
¶Uq�n��q©n�
�
��·� ���� ¸¹ºfg��»¼½¾´P¿�� 3¡<ÀÁ�
The research representative's comments are quoted below in his own words:
“It has been unfortunate that the researchers employed for this project by the JST have not
been entitled to the same rights and privileges as other ATR researchers working in the same
building. This has resulted in some of my best researchers opting to join ATR when the project
was coming to an end. Similarly, since my own salary has not been paid by the JST project,
this has raised some problems concerning the use of my time at ATR. I hope that these small
difficulties can be smoothed out for future projects with similar funding arrangements, for
although it has been a work of great interest to me, it should not place an undue burden on the
laboratory in which I am employed”.
óATRq� CREST�Ùc*+c£tãärk�ti¼®�Ú�Ùc*+/wdrn
Û¶¼u©ªt*+���2���j�ܽ�i¼®.�*+ÝuÞn}w�tßà
rsÁ���i¥st-./�á|kî���2�½�£Nâ��rx}wkåãw
r�����q©��Nä�rxêUq©n½���rxktüýtJSTk}�åã
uÆx�t��rå�±x�æ�xw�y�ô
“The staff of the Kyoto office of the JST have been extremely tolerant and helpful. I
would like to thank all the Kyoto staff for their continued patience with my requests and to
praise their efforts to comply with even the most difficult of them. This research would have
been impossible without their help. I would especially like to thank them for finding ways for
me to employ so many young people to help with this work, since I believe that such a
labour-intensive project is not the norm for JST-funded projects. They have made the official
arrangements of the project smooth and have greatly eased the burden of paperwork that might
otherwise have distracted from the research. They have also proved excellent at reading
English!”.
óJST�çè�§,�Qj1�uésÂê��x�¨�r�ëì±Äu}�.�rk
x¥��»írîïuP��ðñÐ�Tò�txëxëróqt}�*+kQS�Òu
�6q�n�yuô7��soÁ���Äutî:;�����2�u^æ0�Ò±½
rªµx?õQj1�ð�0ª¸ê°Ðè���ÁuP��tçè�§,�7rs��
£��q©��q©ëy�õ�q�PTu�ösT¥�soÁ��}w�té÷�Âê
În�ô
“Finally, I would like to thank the advisory committee, and the staff of the JST in Tokyo.
Their help and advice has been most encouraging, and their positive attitudes always most
refreshing. The responsibility of managing such a large research project has weighed very
heavily upon me at times, and it is a credit to their professional support and management that it
has run so smoothly. I have been quoted in the press as saying that one of the greatest
strengths of research in Japan is the breadth and depth of its funding. Without such long-term
fundamental support, we would be seeing only small incremental improvements to the
technology, instead of the paradigm shifts and new openings that come from deep basic
research”.
ó�ýutÔrí@øùcÖw JSTî�u�nú@û5QuÆx�t½ü�x*+q
©��Á�÷>��NOÃ�F£ýér��q©n�þ���yt ��G��8��
¹�tÙgm$£�x�w���x��Á�utJKw����tXî�����*+
��øÄÅ£t}��5�j�Stéx��5ÒÓ5�q©n�}�q��r¢<û@
SL��k���n�ô
�
��Â�Ã;�ÄÅÆ�
���