Post on 31-Jul-2015
SOFTCARDINALITY: Learning to Identify Directional Cross-Lingual
Entailment from Cardinalities and SMT
Sergio Jimenez and Claudia Becerra
(a participating system in the Cross-lingual Textual Entailment, CLTE, TASK-8)
Alexander Gelbukh
Instituto Politécnico Nacional, Mexico
Soft Cardinality
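Cardinality: the number of different elements in a collection, i.e. the set definition.
[Figure: three example collections A, B, and C; the element images are not recoverable.]
Classical (integer) cardinality: |A| = 3, |B| = 3, |C| = 1
Soft (real) cardinality: |A|' = 2.9, |B|' = 1.3, |C|' = 1.0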
Soft Cardinality
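$$|A|' = \sum_{i=1}^{|A|} w_i \left( \sum_{j=1}^{|A|} sim(a_i, a_j)^p \right)^{-1}$$
• sim(a_i, a_j): inter-element similarity
• w_i: element weights
• p: "softness" control
When the elements are words: sim is a word-to-word similarity and w_i is an idf term weight.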
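The formula can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `soft_cardinality` and `exact_match`, and the default of unit weights, are assumptions made here and are not taken from the slides.

```python
from typing import Callable, Optional, Sequence


def soft_cardinality(
    elements: Sequence[str],
    sim: Callable[[str, str], float],
    weights: Optional[Sequence[float]] = None,
    p: float = 1.0,
) -> float:
    """Compute sum_i w_i / sum_j sim(a_i, a_j)**p over a collection."""
    if weights is None:
        weights = [1.0] * len(elements)  # assumption: unit weights by default
    total = 0.0
    for a_i, w_i in zip(elements, weights):
        # The inner sum includes j == i, so it is >= 1 whenever sim(a, a) == 1.
        denom = sum(sim(a_i, a_j) ** p for a_j in elements)
        total += w_i / denom
    return total


def exact_match(a: str, b: str) -> float:
    """Crisp similarity: 1 for identical elements, 0 otherwise."""
    return 1.0 if a == b else 0.0
```

With the crisp `exact_match` similarity and unit weights, the soft cardinality reduces to the classical count of distinct elements; a graded similarity makes near-duplicates count as less than one extra element each.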
Word-to-word similarity functions
• Character q-grams: $sim(t_i, t_j) = \frac{|t_i \cap t_j| - bias}{\alpha \max(|t_i|, |t_j|) + (1 - \alpha) \min(|t_i|, |t_j|)}$
• Edit-distance: $sim(t_i, t_j) = 1 - \frac{EditDistance(t_i, t_j)}{\max[len(t_i), len(t_j)]}$
• Jaro-Winkler: $sim(t_i, t_j) = \frac{1}{3} \left( \frac{c}{len(t_i)} + \frac{c}{len(t_j)} + \frac{c - m}{c} \right)$
c is the number of characters in common within a sliding window of size
m is the number of order mismatches between the common characters
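As a rough illustration, the q-gram and edit-distance similarities might be implemented as below. The helper names, the unpadded bigram extraction, and the defaults q=2, alpha=0.5, bias=0 are assumptions chosen for this sketch, not values stated on the slides. Jaro-Winkler is left out.

```python
def qgrams(word: str, q: int = 2) -> set:
    """Set of character q-grams of a word (no padding; an assumption)."""
    if len(word) < q:
        return {word}
    return {word[i:i + q] for i in range(len(word) - q + 1)}


def qgram_sim(a: str, b: str, q: int = 2, alpha: float = 0.5, bias: float = 0.0) -> float:
    """(|A & B| - bias) / (alpha * max(|A|, |B|) + (1 - alpha) * min(|A|, |B|))."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    denom = alpha * max(len(ga), len(gb)) + (1 - alpha) * min(len(ga), len(gb))
    if denom == 0:
        return 0.0
    return (len(ga & gb) - bias) / denom


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def edit_sim(a: str, b: str) -> float:
    """1 - EditDistance(a, b) / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

For example, "night" and "nacht" share one of four bigrams, so `qgram_sim("night", "nacht")` is 0.25 with these defaults, while `edit_sim("kitten", "sitting")` is 1 - 3/7.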
Language-pair Model
[Diagram: T1 (EN) and T2 (FR) are each translated with SMT, giving T1t (FR) and T2t (EN).]
[Pipeline: text pre-processing → feature extraction → 4-way classification model (SVM trained on the gold standard).]
Text pre-processing:
• Tokenizing
• Stemming
• Stop-words removal
• idf term weighting
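A minimal sketch of the text pre-processing steps, assuming a regular-expression tokenizer, a tiny hypothetical stop-word list, and the common idf definition log(N / df). Stemming is omitted because it would need an external stemmer. None of these specific choices come from the slides.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}


def preprocess(text: str) -> List[str]:
    """Lowercase, tokenize on word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def idf_weights(documents: List[List[str]]) -> Dict[str, float]:
    """idf(t) = log(N / df(t)), where df counts documents containing t."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each term once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


docs = [preprocess("The cat sat"), preprocess("The dog sat")]
weights = idf_weights(docs)
```

Terms that occur in every document, such as "sat" here, receive weight 0, so they contribute nothing to a weighted cardinality.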
Submitted Systems
• RUN1: 4 language-pair models (es-en, fr-en, it-en, de-en), each one trained with 1,000 text pairs. SVM using C=1.0
• RUN2: same as RUN1 but optimizing C for max. accuracy.
Circular Pivoting Translations
[Diagram: repeated SMT rounds. T1 (EN) → T1t (FR) → T1tt (EN) → T1ttt (FR); T2 (FR) → T2t (EN) → T2tt (FR) → T2ttt (EN).]
Original feature set: 2 comparable text pairs × 14 features = 28 features
Extended feature set: 2+2 comparable text pairs × 14 features = 56 features
Extended feature set: 2+2+4 comparable text pairs × 14 features = 112 features
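The pair counts can be checked with a short enumeration. The pairing rule used here, every version of T1 paired with every same-language version of T2, is an assumption made for this sketch; it happens to reproduce the 2, 2+2, and 2+2+4 counts. The function names are hypothetical.

```python
from itertools import product
from typing import List, Tuple

FEATURES_PER_PAIR = 14  # features extracted per comparable text pair


def flip(lang: str) -> str:
    """Language after one SMT step between EN and FR."""
    return "FR" if lang == "EN" else "EN"


def versions(name: str, lang: str, rounds: int) -> List[Tuple[str, str]]:
    """A text plus its successive translations, as (label, language)."""
    out = [(name, lang)]
    for r in range(1, rounds + 1):
        lang = flip(lang)
        out.append((name + "t" * r, lang))
    return out


def comparable_pairs(rounds: int) -> List[Tuple[str, str]]:
    """Pairs of a T1 version and a T2 version written in the same language."""
    t1 = versions("T1", "EN", rounds)
    t2 = versions("T2", "FR", rounds)
    return [(a, b) for (a, la), (b, lb) in product(t1, t2) if la == lb]


for rounds in (1, 2, 3):
    pairs = comparable_pairs(rounds)
    print(rounds, len(pairs), len(pairs) * FEATURES_PER_PAIR)
```

Under this rule, one, two, and three SMT rounds give 2, 4, and 8 comparable pairs, i.e. 28, 56, and 112 features.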
Single Multilingual Model
[Diagram: the de-en, es-en, it-en, and fr-en data sets contribute 1,000 feature vectors each; together they form a training data set of 4,000 feature vectors for one 4-way classification model.]
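Pooling the per-language-pair data into one training set could look like the sketch below. The `Example` type, the function name, and the placeholder vectors and labels are all assumptions made for illustration; only the four language pairs and the 1,000 vectors per pair come from the slides.

```python
from typing import Dict, List, Tuple

# (feature vector, entailment label); the layout is an assumption.
Example = Tuple[List[float], str]


def pool_training_sets(per_pair: Dict[str, List[Example]]) -> List[Example]:
    """Concatenate the training examples of all language pairs."""
    pooled: List[Example] = []
    for pair in sorted(per_pair):  # fixed order for reproducibility
        pooled.extend(per_pair[pair])
    return pooled


# Placeholder data: 1,000 dummy examples per language pair.
per_pair = {
    pair: [([0.0] * 28, "placeholder_label") for _ in range(1000)]
    for pair in ("de-en", "es-en", "it-en", "fr-en")
}
training_set = pool_training_sets(per_pair)
```

Because the features are cardinalities rather than words, vectors from different language pairs share the same layout and can be stacked directly before training a single classifier.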
Single Multilingual Model Results
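[Chart: results annotated, in extraction order, with "4.6% better than best official", "5.3% better than best official", "1.3% better than best official", "4.4% below best official", "6.0% better than best official", and "baseline"; the data series these labels belong to are not recoverable.]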
Conclusions
• Soft Cardinality + SMT + SVM seems to be a good combination for CLTE.
• A single multilingual model produced better results than language-pair models.
• Additional circular pivoting translations produced slight but consistent improvements.
• Character q-grams seem to be better than Edit-distance and Jaro-Winkler.
Soft Cardinality at *SEM and SemEval
• STS-2012, official 3rd out of 89 systems
• STS-2013-CORE task, 18th out of 90 systems (4th unofficial)
• STS-2013-TYPED task, top-system UNITOR team
• CLTE-2012, 3rd out of 29 systems (1st unofficial)
• CLTE-2013, among the top 2 systems
• SRA-2013, among the top 2 systems