EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK: Entity-centric Data Resource for Arabic Knowledge. Mohamed H. Gad-Elrab, Mohamed Amir Yosef, Gerhard Weikum. Max-Planck-Institut für Informatik, Saarbrücken, Germany. 30th July 2015

Transcript of EDRAK: Entity-centric Data Resource for Arabic Knowledge

Page 1: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK: Entity-centric Data Resource for Arabic Knowledge

Mohamed H. Gad-Elrab Mohamed Amir Yosef Gerhard Weikum

Max-Planck-Institut für Informatik

Saarbrücken, Germany

30th July 2015

Page 2: EDRAK: Entity-centric Data Resource for Arabic Knowledge

If only we had a Comprehensive Arabic Data Resource!

Page 3: EDRAK: Entity-centric Data Resource for Arabic Knowledge

Outline

• Resources Use-cases

• Related Work

• EDRAK Resource

• EDRAK Creation

• EDRAK in Numbers

• Evaluation


Page 4: EDRAK: Entity-centric Data Resource for Arabic Knowledge

Resources Use-cases

• Entity Linking / Named Entity Disambiguation

Input text: ميركل تطالب البرلمان الألماني بدعم خطة إنقاذ اليونان
(Merkel calls the German parliament to support the Greece bailout plan)

Linked entity: Angela_Merkel

Names: أنجيلا دوروتيا ميركل - أنغيلا - أنجيلا - ميركل - المستشارة الألمانية – Angela Merkel – Merkel … etc.

Context: ألمانيا - السياسية - انتخابات - الحزب الديمقراطي الاجتماعي – Germany – Politics – Elections … etc.

Types: Person, German_Politician, etc.
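A minimal sketch of how names and context could work together when linking the mention above. The toy dictionaries, the second candidate entity and the overlap score are illustrative assumptions, not the content of EDRAK or the method of any particular linker.

```python
# Toy resources; every entry is illustrative and not taken from EDRAK.
NAMES = {
    "ميركل": ["Angela_Merkel", "Max_Merkel"],
}
KEYPHRASES = {
    "Angela_Merkel": {"ألمانيا", "انتخابات", "البرلمان"},
    "Max_Merkel": {"كرة", "القدم", "مدرب"},
}

def disambiguate(mention, context_tokens):
    """Return the candidate whose keyphrases overlap most with the context."""
    candidates = NAMES.get(mention, [])
    if not candidates:
        return None
    context = set(context_tokens)
    # Score each candidate by the number of shared context words.
    return max(candidates, key=lambda e: len(KEYPHRASES.get(e, set()) & context))

sentence = "ميركل تطالب البرلمان الألماني بدعم خطة إنقاذ اليونان"
print(disambiguate("ميركل", sentence.split()))
```

The names dictionary proposes the candidates and the context keyphrases choose among them; semantic types could be used as a further filter.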

Page 5: EDRAK: Entity-centric Data Resource for Arabic Knowledge

Resources Use-cases

• Entity Linking / Named Entity Disambiguation

• Dictionary-based NER

• Entity Summarization

• Question Answering

• Fine-grained Semantic Type Classifier

• ….


Page 6: EDRAK: Entity-centric Data Resource for Arabic Knowledge

Existing Resources

Resource Name | Entity-Aware? | Building Source | Arabic Names Size | Context Info?
JRC-Names (Steinberger et al. 2011) | No | Wikipedia + News | 17K | No
Arabic Lexical NEs (Attia et al. 2010) | No | Wikipedia + WordNet | 45K | No
CMUQ-Arabic-NET (Azab et al. 2013) | No | Wikipedia + News | 60K | No
Google-Word-To-Concept (Spitkovsky and Chang, 2012) | Yes | Wikipedia + Web | 800K | No
BabelNet (Navigli and Ponzetto, 2012) | Yes | Wikipedia + Concepts Translation | NA | Yes
AIDArabic (Yosef et al. 2014) | Yes | Wikipedia (Eng & Ar) | 495K | Yes

Page 7: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK

Entity Catalog (2.4M Entities)

• Names Dictionary
• Keyphrases Dictionary
• Weights
• Semantic Types
• Entity-Entity Similarity
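One way to picture these components is as a record per entity. The sketch below is only an illustration: the field names are hypothetical, and since the slide does not say how weights or entity-entity similarity are computed, Jaccard overlap of keyphrases is used as a stand-in.

```python
from dataclasses import dataclass, field

@dataclass
class EntityRecord:
    """One catalog entry; field names are hypothetical, not EDRAK's schema."""
    entity_id: str
    names: dict = field(default_factory=dict)       # name -> weight
    keyphrases: dict = field(default_factory=dict)  # keyphrase -> weight
    types: set = field(default_factory=set)         # semantic types

def keyphrase_similarity(a, b):
    """Jaccard overlap of keyphrase sets (an assumed similarity measure)."""
    ka, kb = set(a.keyphrases), set(b.keyphrases)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0

merkel = EntityRecord(
    "Angela_Merkel",
    names={"ميركل": 1.0},
    keyphrases={"ألمانيا": 0.9, "انتخابات": 0.4},
    types={"PERSON"},
)
germany = EntityRecord(
    "Germany",
    names={"ألمانيا": 1.0},
    keyphrases={"ألمانيا": 0.8, "برلين": 0.5},
    types={"LOCATION"},
)
print(keyphrase_similarity(merkel, germany))
```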

Page 8: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK Creation

YAGO3 (English & Arabic)

• Culture-specific entities
• Prominent entities

Entity Catalog

Page 9: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK Creation

Names Dictionary

• Manually from the Arabic Wikipedia
  • Page Titles
  • Anchor Text
  • Redirects
  • Disambiguation Pages
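A minimal sketch of how names from these four Wikipedia sources could be merged into one dictionary. The triple layout and the sample records are assumptions made for illustration; the slide does not describe the actual extraction code.

```python
from collections import defaultdict

def build_names_dictionary(records):
    """records: iterable of (name, entity, source) triples.
    Returns name -> {entity -> set of sources that support the pair}."""
    dictionary = defaultdict(lambda: defaultdict(set))
    for name, entity, source in records:
        dictionary[name.strip()][entity].add(source)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {name: dict(entities) for name, entities in dictionary.items()}

# Illustrative records, one per source type.
records = [
    ("أنجيلا ميركل", "Angela_Merkel", "title"),
    ("ميركل", "Angela_Merkel", "anchor"),
    ("ميركل", "Angela_Merkel", "redirect"),
    ("ميركل", "Max_Merkel", "disambiguation"),
]
names = build_names_dictionary(records)
print(sorted(names["ميركل"]))
```

Keeping the source of each name makes it possible to assess precision per source later on.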

Page 10: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK Creation Names Dictionary

• Limitation

[Figure: entities missing Arabic names vs. entities that have Arabic names]

Page 11: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK Creation Names Dictionary

Populate Arabic names for entities that exist only in the English Wikipedia, and compile more names for entities in the Arabic Wikipedia.

[Diagram: En. Entity Names → External Resources / Named Entity Translation / Transliteration → Generated Ar. Names → Names Dictionary]

Page 12: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK Creation

Names Dictionary

• Approach 1: External Resources
  • Entity-aware: Google Word to Concept (GW2C)
    • Web Hypertext Anchors to Wikipedia
  • Name-Dictionaries:
    • JRC-Names
    • CMUQ-Arabic-NET (Azab et al. 2013)

[Figure: Web pages]
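A rough sketch of how Arabic names could be harvested from web anchors that point to Wikipedia pages. The row layout, the count threshold and the sample data are assumptions; this does not reproduce the actual GW2C file format.

```python
import re

# Matches any character in the basic Arabic Unicode block.
ARABIC = re.compile(r"[\u0600-\u06FF]")

def arabic_anchor_names(rows, min_count=2):
    """rows: (anchor_text, entity, count) tuples (hypothetical layout).
    Keep anchors that contain Arabic script and occur at least min_count times."""
    names = {}
    for anchor, entity, count in rows:
        if count >= min_count and ARABIC.search(anchor):
            names.setdefault(entity, set()).add(anchor)
    return names

rows = [
    ("ميركل", "Angela_Merkel", 5),   # Arabic anchor, frequent: kept
    ("Merkel", "Angela_Merkel", 9),  # Latin script: dropped
    ("أنجيلا", "Angela_Merkel", 1),   # too rare: dropped
]
print(arabic_anchor_names(rows))
```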

Page 13: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK Creation

Names Dictionary

• Approach 2: Entity Names Translation
  • Statistical Machine Translation (SMT)
    • Services target full text
    • Named entities are mistranslated
      • E.g. “Nolan North is an American actor”
      • E.g. “Robert Green”
    • SMT systems do not consider types

Page 14: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK Creation

Names Dictionary

• Approach 2: Entity Names Translation
  • Entity-Names SMT

Christian Dior → كريستيان ديور
Eric Schmidt → إريك إشميت
Christian Schmidt → ?

Page 15: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK Creation

Names Dictionary

• Approach 2: Entity Names Translation
  • Type-Aware Entity Names SMT
    • Wikipedia cross-language links + CMUQ-Arabic-NET
    • Persons, Non-persons and Fallback

[Diagram: English Entity → is Person? → yes: Persons Translation Model (trained on parallel PERSON names); no: Non-Persons Translation Model (trained on parallel NON-PERSON names) → candidate Arabic Entity Names → Pick top-k → Arabic Entity Name]
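The routing in this diagram can be sketched as follows. The models here are stand-in lookup tables rather than trained SMT systems, and using the fallback only when the type-specific model returns nothing is an assumption; the slide does not say exactly how the fallback model is combined.

```python
def translate_name(name, is_person, person_model, non_person_model,
                   fallback_model, k=3):
    """Route a name to the model for its type and return the top-k outputs.
    Each model is a callable: name -> list of (arabic_name, score)."""
    model = person_model if is_person else non_person_model
    # Assumed policy: consult the fallback only if the typed model is empty.
    candidates = model(name) or fallback_model(name)
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [arabic for arabic, _score in ranked[:k]]

def lookup_model(table):
    """Build a stand-in 'model' from a dictionary (illustration only)."""
    return lambda name: table.get(name, [])

persons = lookup_model({"Eric Schmidt": [("إريك إشميت", 0.8)]})
non_persons = lookup_model({})
fallback = lookup_model({"Christian Dior": [("كريستيان ديور", 0.5)]})

print(translate_name("Eric Schmidt", True, persons, non_persons, fallback))
print(translate_name("Christian Dior", False, persons, non_persons, fallback))
```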

Page 16: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK Creation

Names Dictionary

• Approach 3: Persons Names Transliteration
  • Persons Names
  • Unseen Names
  • Capturing several Arabic possibilities
    • Ex: Tony (توني / طوني)
  • Transliteration as Character-Level SMT
    • Training: En-Ar Persons Names

A n g e l a SPACE M e r k e l → أ ن ج ي ل ا SPACE م ي ر ك ل

A l b e r t SPACE E i n s t e i n → أ ل ب ر ت SPACE أ ي ن ش ت ا ي ن
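The character-level representation in the two examples above can be produced with a small preprocessing step. This is a sketch of the data preparation only (the function names are hypothetical); the SMT training itself is not shown.

```python
def to_char_tokens(name, space_token="SPACE"):
    """Split a name into space-separated characters, marking word
    boundaries with an explicit token so the SMT system sees them."""
    words = name.split()
    return f" {space_token} ".join(" ".join(word) for word in words)

def from_char_tokens(tokens, space_token="SPACE"):
    """Inverse step: rebuild a name from a character token sequence."""
    words = tokens.split(f" {space_token} ")
    return " ".join(word.replace(" ", "") for word in words)

print(to_char_tokens("Angela Merkel"))
print(to_char_tokens("أنجيلا ميركل"))
print(from_char_tokens("A l b e r t SPACE E i n s t e i n"))
```

Treating each character as a "word" lets a standard phrase-based SMT pipeline learn character mappings, so it can handle names it has never seen as a whole.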

Page 17: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK Creation

Keyphrases Dictionary

• Manually from the Arabic Wikipedia
  • In-link Pages Titles
  • Anchor Texts
  • Categories
  • Citations

Page 18: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK Creation

Keyphrases Dictionary

• Arabic Keyphrases Generation

[Diagram: En. In-link Titles → Named Entity Translation / Transliteration → Keyphrases Dictionary; En. Categories → Named Entity Translation → Keyphrases Dictionary]

Page 19: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK in Numbers

• AIDArabic Resource (Yosef et al. 2014) vs EDRAK

[Bar chart, logarithmic scale from 1 to 1E+09, comparing AIDArabic and EDRAK on: Entities Count, Unique Names, Entity-Name Entries, Unique Keyphrases, Entity-Keyphrase Entries]

Page 20: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK in Numbers

• Entities per Semantic Type

[Pie chart. Slice labels: 1,220,032 (52%); 360,108 (16%); 359,071 (15%); 199,846 (9%); 196,305 (8%). Legend: PERSON, LOCATION, ARTIFACT, EVENT, ORGANIZATION]

Page 21: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK in Numbers

• Example


Page 22: EDRAK: Entity-centric Data Resource for Arabic Knowledge

EDRAK in Numbers

• Example


Page 23: EDRAK: Entity-centric Data Resource for Arabic Knowledge

Evaluation

• Manual Assessment
  • 55 native Arabic speakers
  • Distributed over many areas
  • 150 names per annotator
  • Fairly distributed sample

Page 24: EDRAK: Entity-centric Data Resource for Arabic Knowledge

Evaluation

• Manual Assessment
  • Precision @1

[Bar chart: Precision@1 (0 to 100) per name source (First/Last Names, Labels PERSON, Labels NON-PERSON, Redirects PERSON, Redirects NON-PERSON, Categories) for the Type-Aware, Combined, Transliterated and Categories Translation approaches]
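For reference, Precision@1 per name source can be computed from annotator judgments as sketched below. The judgments are made-up placeholders, not results from this evaluation, and the function names are hypothetical.

```python
from collections import defaultdict

def precision_at_1(judgments):
    """judgments: list of booleans, True when the annotator accepted the
    top-ranked Arabic name. Returns a percentage."""
    if not judgments:
        return 0.0
    return 100.0 * sum(judgments) / len(judgments)

def precision_at_1_by_source(rows):
    """rows: (source, accepted) pairs. Returns source -> Precision@1."""
    grouped = defaultdict(list)
    for source, accepted in rows:
        grouped[source].append(accepted)
    return {source: precision_at_1(j) for source, j in grouped.items()}

# Placeholder judgments for illustration only.
rows = [
    ("First/Last Names", True),
    ("First/Last Names", True),
    ("Redirects NON-PERSON", True),
    ("Redirects NON-PERSON", False),
]
print(precision_at_1_by_source(rows))
```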

Page 25: EDRAK: Entity-centric Data Resource for Arabic Knowledge

Evaluation

• Manual Assessment
  • Precision varies with the source of the names.
    • Highest: First/Last Names
    • Lowest: Redirects (NON-PERSON)
  • No real difference between Type-Aware and Combined SMT.
  • Transliterated names cause confusion
    • Ex. Johannes, Friedrich

Page 26: EDRAK: Entity-centric Data Resource for Arabic Knowledge

Conclusion

• EDRAK offers 2.4M Entities with Potential names, Contextual keyphrases and Semantic Types.

• EDRAK is not limited to the Arabic Wikipedia
  • External Resources
  • Type-Aware Entity Names Translation
  • Person Names Transliteration

Page 27: EDRAK: Entity-centric Data Resource for Arabic Knowledge

Download EDRAK

http://www.mpi-inf.mpg.de/yago-naga/aida/


Thank you!
