
ACE: Automated Content Enhancement

An evaluation of natural language processing approaches to

generating metadata from the full text of ABC stories

© 2015 Australian Broadcasting Corporation

ABC Research & Development Technical Report ABC-TRP-2015-T1



Principal researcher: Charlie Szasz

Report prepared by: Viveka Weiley & Charlie Szasz

It seems obvious, but we have to put the audience at the centre of what we do. If we are not delivering distinctive and quality content that finds its way to the people who pay for us, then we are not fulfilling our basic function.

ABC Strategy 2015 – The 2nd pillar – "audience at the centre"

Introduction: The Challenge of Discoverability

Audience at the centre is more than a goal – it’s a reality. Audiences already expect to find the stories they are most interested in delivered to them at the time, place, device, platform and format of their choice.

The ABC produces a huge volume of high-quality stories on a broad range of topics. However, with the proliferation of alternative media outlets, new platforms for consuming and producing media, and increasingly diverse audience expectations, traditional means of bringing those stories to audiences are no longer sufficient.


Diagram: from "Broadcast at the centre" and "Home page at the centre" to "Audience at the centre".


Solving for relevance

The task then is to help people find the most relevant stories, however they want them. This is more than just a technical challenge. To respond we must look into what people want, and into our own methods for organising and presenting stories to them.

ACE is focused on the latter question. In this project we are prototyping and demonstrating systems to radically improve discoverability of relevant ABC stories. Our first prototype is available now and has been demonstrated to stakeholders across the ABC.

We looked into the former question with the Spoke (Location, News and Relevance) pilots.¹ The Spoke engine aggregated stories from across the ABC as well as from third parties, tagged according to location and topic. We then built a mobile app to present those stories to pilot users according to their preferences. Within Spoke we implemented a machine learning system so that it could improve its responses over time, and embedded comprehensive metrics, followed up by audience interviews, to tell us how well the Spoke content reflected their preferences.

This analysis taught us a lot about what people care about; in particular their strong preference for highly relevant stories based on their locations and topics of interest, and the granularity of that expectation. For example, we discovered that few people are interested in sport as a category; instead they tend to be interested in certain sports and averse to others; they may want to read every article available on a particular team, but find any information on a sport that they don’t like to be intensely irritating. Similarly, people may be interested in science but not technology or vice versa, and very interested in local business but not at all in national or global business.

This also gave us insight into how well the ABC’s current and future metadata practices can help meet these emerging audience expectations.
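As an illustration of that granularity, a preference model has to support interest and aversion at the level of specific topics rather than broad categories. The following is a minimal sketch (all names hypothetical; Spoke's actual machine learning system was more sophisticated than this):

```python
# Minimal sketch of granular preference matching: an averse topic vetoes a
# story outright, otherwise interest matches are counted.
def story_score(story_topics, interests, aversions):
    """Any averse topic vetoes the story; otherwise count interest matches."""
    if any(topic in aversions for topic in story_topics):
        return -1  # intensely irritating content: suppress entirely
    return sum(1 for topic in story_topics if topic in interests)

interests = {"Collingwood FC", "local business", "science"}
aversions = {"cricket", "technology"}

story_score(["sport", "Collingwood FC"], interests, aversions)  # 1: surface it
story_score(["sport", "cricket"], interests, aversions)         # -1: suppress it
```

Note that the broad category "sport" on its own neither surfaces nor suppresses a story here; only the specific topics do, which mirrors the pilot finding.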

Current metadata practices

The curated home page is no longer the sole entry point; as users turn to search, aggregators and emerging platforms, new discovery methods must be accounted for. This situation complicates the task of delivering the greatest value for audiences.

Every ABC story is tagged with metadata, primarily used by authors to control how content appears in current websites. To the extent that it accurately describes stories it can also be used as discovery metadata, to help search engines and aggregators find and present those stories.

There are two main gaps in our metadata. First, the existing fields and ontologies do not afford sufficiently detailed description for discovery and recommendation engines to deliver personalised results. Second, those fields that are available are often left empty by authors and editors.

For personalised content delivery and location and topic-based recommendations to deliver audience value, much more granular and comprehensive metadata will be required.

¹ See ABC-WHP-2015-A1


Solving the metadata problem

Sufficiently granular and comprehensive metadata could be delivered entirely manually: by training authors and editors in more comprehensive metadata entry, and through policy direction. This would however be a labour-intensive solution, in a context where those people have many competing demands on their time and attention.

Another manual option would be the employment of metadata subeditors. This sufficed for the Spoke pilot, where a single part-time editor was able to add sufficient location metadata to create localised feeds for two regional centres. To scale this up to the whole country, however, while also increasing metadata granularity, would not be sustainable.

At the other end of the scale is a fully automated system, using artificially intelligent expert systems to extract meaning from the full text of the articles and apply metadata tags. As the success of algorithm-driven solutions such as Google's PageRank demonstrates, machine learning systems and automated content analysis can be a powerful and scalable means of improving discoverability.

As computation becomes exponentially cheaper and more powerful, a wide variety of useful machine learning and expert systems are emerging. These systems first appear in the world as research projects; and then as each approach matures its products coalesce into proprietary solutions and finally into open source and commodity platforms.

Better metadata through Natural Language Processing

Natural Language Processing (NLP) techniques in particular have the potential to become an important and useful tool for sorting through large volumes of stories, increasing discoverability.

NLP is a technology based on artificial intelligence. It can take large volumes of text, and summarise and codify it using a deep understanding of the structure of language as well as databases that link words and phrases to their meanings. It allows metadata to be created automatically. In the past this has been too computationally expensive for our uses, but in recent years advances in NLP techniques and the raw speed of computing have changed that reality.

This presents an opportunity, as NLP is now at the point where open source and commodity platforms are becoming available.

In recent years a number of NLP engines have been released as SaaS (Software as a Service) offerings, presenting APIs (Application Programming Interfaces) which can be used to analyse large datasets and provide sample results. This makes it possible to use third-party NLP engines as a core component of a technology stack that provides customisable content analysis services for an organisation.
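In outline, the SaaS pattern is an HTTP POST of article text to the vendor's endpoint, returning structured metadata. A sketch using Python's standard library follows; the endpoint, header and parameter names are hypothetical, since each real engine (AlchemyAPI, OpenCalais, TextRazor) has its own API shape:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Illustrative only: builds (but does not send) a request in the typical
# NLP-SaaS shape -- article text plus a list of requested extractors.
def build_nlp_request(api_key, text, extractors=("entities", "concepts", "topics")):
    body = urlencode({"text": text, "extractors": ",".join(extractors)}).encode()
    return Request(
        "https://nlp.example.com/v1/analyse",   # hypothetical endpoint
        data=body,
        headers={"x-api-key": api_key},         # hypothetical auth header
        method="POST",
    )

req = build_nlp_request("SECRET", "BHP Billiton posted record iron ore shipments.")
# The prepared request can then be POSTed and the JSON response parsed
# into entity, concept and topic fields.
```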

Applications include increasing discoverability for textual content like news stories, but also anything for which a transcript is available, such as radio interviews and iView content.


Evaluation of automatically extracted metadata

In order to explore the options for NLP engines, ABC R&D first conducted an overview of all offerings available in the open market. We then selected the top three candidates for testing by building a prototype to facilitate analysis.

The ABC ACE prototype

The result is the ABC ACE prototype, a custom-built research apparatus which connects the three NLP systems to a corpus of ABC content aggregated from across the organisation and provides a user interface for exploring the results generated by each NLP engine.

The prototype presents a query interface to afford exploration of the dataset as augmented by the NLP systems. For example, a user can find stories that are in the category of Business_Finance, which must contain the entity BHP Billiton with relevance at least 0.7 (where 1 is highest) and should contain the concepts Iron Ore and Mining with relevance at least 0.5.
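The query semantics in that example can be sketched as follows; the field names and record shape are illustrative, not the prototype's internals. "Must" constraints act as hard filters, while "should" constraints contribute to a ranking score without excluding stories:

```python
# Sketch of must/should query matching over NLP-generated story metadata.
def match_and_score(story, category, must_entities, should_concepts):
    """Hard-filter on category and 'must' entities; rank on 'should' concepts."""
    if category not in story["categories"]:
        return (False, 0.0)
    for name, min_rel in must_entities:
        if story["entities"].get(name, 0.0) < min_rel:
            return (False, 0.0)
    score = 0.0
    for name, min_rel in should_concepts:
        rel = story["concepts"].get(name, 0.0)
        if rel >= min_rel:
            score += rel
    return (True, score)

story = {
    "categories": {"Business_Finance"},
    "entities": {"BHP Billiton": 0.82},
    "concepts": {"Iron Ore": 0.66, "Mining": 0.58},
}
match_and_score(story, "Business_Finance",
                must_entities=[("BHP Billiton", 0.7)],
                should_concepts=[("Iron Ore", 0.5), ("Mining", 0.5)])
# matches, with a ranking score of ~1.24
```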

This system is designed to expose as much data and functionality as possible in order to support comparative evaluation of the three systems, and to reveal the breadth of possibilities NLP opens up. Accordingly it presents a comprehensive range of controls, which may be daunting for the non-expert user.

Future prototypes may explore specific use cases for NLP, intended for specific user populations and with custom-built user interfaces. In the meantime the ACE prototype system is live and available to use on request for further exploration. To gain access or schedule a demo please contact us.

Charlie at the ACE console. Left: code. Centre: ACE prototype. Right: DBPedia entity detail page.


Research Procedure

Using the ACE prototype, over 2,600 ABC stories were analysed with the three chosen NLP services: AlchemyAPI, OpenCalais and TextRazor.

All stories were analysed for:

• Named entities, fuzzy and disambiguated, e.g. John Howard – Person (fuzzy); John Howard – Australian politician, http://en.wikipedia.org/wiki/John_Howard (disambiguated)

• Concepts, e.g. Prime Minister – http://dbpedia.org/resource/Prime_minister. Concepts don't necessarily match exact words or phrases in the text; they are derived from meaning and linked to entries in various knowledge bases (Wikipedia, DBpedia).

• Categories or topics (taxonomy), e.g. law, govt and politics / government. Only AlchemyAPI provides hierarchical categories; TextRazor and OpenCalais only derive the top level. The taxonomy extracted by all three services loosely matches the ABC's own.

• Sentiment. Only AlchemyAPI provides sentiment analysis. Due to limited API calls, only named entities were analysed for sentiment; the overall sentiment of stories was not established.

Engine Comparison – Initial Conclusion

As a result of this analysis we determined that all engines detected and identified a similar number of named entities, concepts and topics (within an order of magnitude). Only AlchemyAPI could provide sentiment.

Of the NLP APIs available, AlchemyAPI stands out as the most robust and promising choice at this point, largely due to its companion News API, which connects to a corpus collected from over 75,000 news organisations. Its capacity for sentiment analysis could also be a useful feature, particularly with regard to providing metadata for a recommendation system.

Further: as all NLP engines are driven by the same underlying concepts, it is possible to build an application architecture which is independent of which NLP engine provides the underlying results. The ACE prototype demonstrates this principle in practice, allowing users to switch between any of the three engines under review.
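The engine-independent principle can be sketched as an adapter layer; the names and stand-in engine below are illustrative, not the ACE codebase. Each engine's raw response is normalised into one common record shape, so queries and the UI never see engine-specific details:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EntityMention:
    text: str                         # surface form, e.g. "BHP Billiton"
    relevance: float                  # 0..1, engine-reported
    dbpedia_uri: Optional[str] = None # present only if disambiguated

class EngineAdapter:
    """Common interface each NLP engine is wrapped behind."""
    def analyse(self, text: str) -> List[EntityMention]:
        raise NotImplementedError

class FakeEngine(EngineAdapter):
    """Stand-in for a real engine: maps its response format to EntityMention."""
    def analyse(self, text):
        raw = [{"surface": "Indonesia", "score": 0.72,
                "link": "http://dbpedia.org/resource/Indonesia"}]
        return [EntityMention(r["surface"], r["score"], r.get("link")) for r in raw]

def analyse_with(engine: EngineAdapter, text: str):
    return engine.analyse(text)   # callers can switch engines freely
```

Swapping engines then means swapping one adapter object, which is how the prototype lets users flip between the three services.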


Confirmation study with prototype system users

We identified one potential confounding factor: false positives. Any data analysis procedure, including the natural language processing systems under review, can be considered on a range from more specific to more sensitive. A highly specific test will produce fewer results more accurately, whereas a highly sensitive test will produce more results, but with a higher chance of detecting false positives.

False positives will be missed by entirely automated systems and can pollute results. For example: a story may mention the ABC, and the engine may produce a disambiguated link for it which is wrong – ABC Learning Centres when the article means the Australian Broadcasting Corporation, or vice versa. Another example: a story about a police sting operation may be misclassified as Arts and Entertainment, because the word "sting" has been misidentified as the musician. Sometimes false positives are obvious – "Council of the European Union" for a story about a local council. Others are not: for example "A Private Function" will appear as a topic match when the story mentions a private function at a club. Only on further investigation will you see that this match links to a DBpedia entry about the 1984 British comedy film starring Michael Palin and Maggie Smith.

To account for this factor we chose a subset of analysed stories (150) and recruited test users from the R&D team to check them for false positives. This was a time-consuming process – appropriate for our experimental testing but not for real-world use in editorial workflows. We also determined that while the number of genuine false positives is low, they can have damaging editorial effects: for example, flagging a story on a death as entertainment.
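The arithmetic behind such a confirmation study is simple: precision is the share of engine-suggested tags that the human reviewers did not mark as false positives. A sketch with hypothetical counts (not the study's actual figures):

```python
# Estimate precision from human verdicts on a review sample.
def precision(verdicts):
    """verdicts: list of True (correct tag) / False (false positive)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# e.g. 150 reviewed tags of which 10 were judged false positives (illustrative)
sample = [True] * 140 + [False] * 10
precision(sample)   # ~0.93
```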

Quality Issues in Detected Metadata

Through these automated and user-driven analyses we identified the following concerns:

Inconsistent named entity disambiguation, e.g. AlchemyAPI disambiguated "ABC" in over 20 different ways in the stories analysed, while over 90% of the instances found refer to the Australian Broadcasting Corporation.

Too many fuzzy named entities, not enough disambiguated ones. This creates noise.

Ambiguous and vague ontologies for entity types, e.g. "driver" was detected as a Position (an OpenCalais type meaning a person's occupation). In the context of a traffic accident story this is misleading.

Errors with potential for editorial harm The machine learning systems can produce errors that a human editor would find embarrassing: for example AlchemyAPI categorising a murder story as “Arts and Entertainment”.
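Inconsistencies like the "ABC" example above can be surfaced automatically by grouping disambiguations by surface form and flagging forms that map to many different targets. A minimal sketch (corpus and URIs illustrative):

```python
from collections import defaultdict

# Flag surface forms whose disambiguations are inconsistent across the corpus.
def inconsistent_forms(mentions, threshold=3):
    """mentions: (surface_form, uri) pairs collected across analysed stories."""
    targets = defaultdict(set)
    for surface, uri in mentions:
        targets[surface].add(uri)
    return {s: uris for s, uris in targets.items() if len(uris) >= threshold}

corpus = [
    ("ABC", "dbpedia:ABC_Learning_Centres"),
    ("ABC", "dbpedia:American_Broadcasting_Company"),
    ("ABC", "dbpedia:Australian_Broadcasting_Corporation"),
    ("Jakarta", "dbpedia:Jakarta"),
]
inconsistent_forms(corpus)   # flags "ABC" (3 distinct targets), not "Jakarta"
```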


Metadata Quality Findings

Even accounting for confounding factors, all the engines in this study are within the same order of magnitude in results and accuracy. None of them can be reliably used without editorial oversight, especially for Australian content. Even highly customised, proprietary solutions trained on Australian media (for example Fairfax’s Fizzing Panda) could only achieve 84% accuracy when it comes to disambiguating entities.

Notwithstanding this issue, the right workflow could enable content creators and editors to select quality, NLP generated metadata with minimal effort, enhancing the power of authors and editors to make content richer and more discoverable.

Conclusion

Our initial aim was to explore the usefulness of NLP systems for augmenting metadata at the point of publication. We are now satisfied that those systems are effective at the point of publication, and that their value can extend beyond it. NLP systems can provide tools useful at each stage of the story production process: research, writing and publishing.

We have also demonstrated that by including a human editor in the loop, results can be obtained that are more useful than either a purely manual or entirely automated approach would deliver. Our recommendation therefore is to ensure that any system using NLP results to augment news stories includes a human in the loop. Future research can therefore most fruitfully focus on the human experience of using the engines to enrich content.

Our early user tests indicate that this person should select the strongest matches for retention, rather than specifically excluding mismatches. We plan to go on to produce a proof-of-concept prototype to show how such a system could be designed.

The experimental process has already uncovered some qualitative results that provide insight into the opportunities and challenges of deploying NLP engines to augment editorial workflows. For example, we discovered that the engines could be useful in the research phase – revealing related stories while an author is still drafting – as well as in writing and publication; but that some terminology used in the field is obscure and, if shown, should be translated into more recognisable terms for non-expert users.

Accordingly, future phases will explore practical scenarios for integration of NLP techniques in the news gathering and editorial workflow through interactive prototypes focused on specific use cases.

Demos of the ACE prototype have garnered strong interest from Digital Networks Technology, the LRS project, ABC News, Splash, iView and the WCMS project. We anticipate that any division with a significant content corpus in need of better discovery and analysis tools could benefit from the considered application of NLP techniques.


Opportunities for future development

If we can design an effective system that allows users to select the most useful metadata from an automatically extracted list, then we can substantially improve the quality of metadata in the ABC's systems while simultaneously providing a simpler and more effective editorial workflow.

This would demonstrate the usefulness of the system, but some opportunities for greater value would be missed. These improvements go beyond the scope of an initial proof of concept, but would be fruitful avenues for further research.

1. Identifying consistent errors

By only approving NLP metadata that is correct, we would miss out on metadata that is consistently identified erroneously by the NLP service. If we had the ability to teach the system the correct response, those matches would become useful.

For example, the “ABC” is consistently misidentified as something other than the Australian Broadcasting Corporation. In our proposed proof of concept those mismatches would be ignored by the editor, and discarded by the system.

Further research could incorporate a “correction” process into the workflow which may include presenting multiple choices of disambiguation and/or creating our own definition.
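One way the correction step could work is an editor-maintained override table applied before metadata is stored, so an error the engines make consistently is fixed once rather than discarded every time. A sketch (the mapping entries are illustrative, not a real ABC table):

```python
# Editor-approved overrides: surface form -> correct disambiguation.
OVERRIDES = {
    "ABC": "http://dbpedia.org/resource/Australian_Broadcasting_Corporation",
}

def apply_overrides(entities):
    """entities: list of dicts with 'text' and 'uri'; returns corrected copies."""
    return [{**e, "uri": OVERRIDES.get(e["text"], e["uri"])} for e in entities]

raw = [{"text": "ABC", "uri": "http://dbpedia.org/resource/ABC_Learning_Centres"}]
corrected = apply_overrides(raw)
# corrected[0]["uri"] now points at the Australian Broadcasting Corporation
```

A richer version of this table, fed by the editors' choices, is essentially the learning system described in point 3 below.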

2. Identifying missing entities

Some entities are missed by the NLP engines, or identified but not disambiguated as there is no matching database entry. The ability for ABC users to create new database entries for those missing entities would lead to richer results in future.

3. Building a learning system

By teaching the system to correct consistent errors and enriching its concept and topic database, we would over time be building a system that learned from the collective knowledge of ABC contributors, content authors and editors. This would result in a continual opening up of our content to easier discovery and distribution.


Appendix 1: Natural Language Processing Case Study

Introduction

As part of our Automatic Content Classification Engine (ACE) project we used three commercially available NLP services – AlchemyAPI, OpenCalais and TextRazor – to analyse a subset of ABC content.

These services were used to extract named entities, categorise content into topics, generate concepts/tags and, in the case of AlchemyAPI, identify the sentiment associated with the extracted named entities.

In this document we explore in detail the results generated by AlchemyAPI for a single selected story published by the ABC. We chose this example as it is illustrative of the issues, matches and mismatches common to NLP analysis of Australian news stories.

Terminology

1. Named Entities

Named-entity recognition (NER), also known as entity identification, entity chunking and entity extraction, is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. – Wikipedia

2. Named Entity Disambiguation

In natural language processing, entity linking, named entity disambiguation (NED), named entity recognition and disambiguation (NERD) or named entity normalization (NEN) is the task of determining the identity of entities mentioned in text. It is distinct from named entity recognition (NER) in that it identifies not the occurrence of names (and a limited classification of those), but their reference. – Wikipedia

3. Sentiment Analysis

Sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader). – Wikipedia


Case study example: Bali Nine

Bali Nine[1] families and diplomats en route to Cilacap[2] amid emotional pleas

Myuran Sukumaran's[3] sister has issued an emotional plea for his life to be spared, appearing in a YouTube[4] video clutching a photograph of her brother as a young boy wearing a school uniform.

"My brother made a mistake 10 years ago and he's paid for this mistake every single day since then," Brintha Sukumaran[5] said.

"My brother is now a good man and after 10 years in prison, he has taught so many Indonesian prisoners about art and about how to live outside in the world and have a good and productive life.

"From the bottom of my heart, please President Widodo[6] have mercy on my brother ... change punishment for humanity."

Sukumaran[7] and his co-charged Andrew Chan[8] were sentenced to death in Indonesia[9] in 2006, as ringleaders of the Bali Nine[1] drug smuggling gang.

Some of their family members are on the way to Cilacap[2].

Consular officials from the countries whose citizens face execution have also started arriving in Cilacap[2], which is close to the high-security prison island of Nusakambangan[10] where all of the death row convicts are now housed.

Australian and Indonesian officials have met and it is understood they discussed final requests from the condemned men and their funeral arrangements.

Foreign Minister[11] Julie Bishop[12] said Australian officials had been told the execution of the Bali Nine[1] pair was imminent.

Key (colour-coded in the original layout): fuzzy named entities; named entities correctly disambiguated; named entities incorrectly disambiguated; named entities missed or ignored. Numbered markers refer to the entity notes following the story.


"Indonesian authorities today advised Australian consular officials that the executions of Andrew Chan[8] and Myuran Sukumaran[3] will be scheduled imminently at Nusakambangan[10] prison in central Java[13]," she said in a statement on Saturday.

However, she said the Australian Government[14] would still seek clemency from Indonesian president Joko Widodo[15].

Jakarta[16] has said an exact date for the executions could not be decided yet, as a judicial review was still pending for the sole Indonesian in the group of 10 people who face death by firing squad.

Indonesia[9]'s Supreme Court[17] said the ruling on that case could be made as early as Monday, paving the way for the executions to proceed.

Filipina on death row given execution notice: lawyer

A Filipina on death row in Indonesia[9] has been informed that she will be executed on Tuesday, her lawyer said.

"We were informed by Mary Jane[18] herself that she received the notice that the sentence will be implemented on April 28," Veloso's[19] lawyer Minnie Lopez[20] told news agency AFP[21].

Veloso's[19] father and mother, her two sons aged six and 12, and sister pushed through a scrum of waiting journalists.

"If anything bad happens to my daughter, I will hold many people accountable. They owe us my daughter's life," Veloso's[19] 55-year-old mother, Celia[22], told a Philippine radio station.

"I hope my appeal reaches President Widodo[6]."

Lawyers for Veloso[19] have also filed another court bid to halt her execution.

Authorities said on Thursday they had ordered prosecutors to start making preparations for the executions.

However convicts must be given 72 hours' notice before executions are carried out, and this notice is yet to be given.

Lawyers for the Australians say the legal process is not complete, with both a constitutional court challenge and judicial commission still in progress; however Indonesia[9] says all judicial reviews and appeals for clemency have been exhausted, and that the legal manoeuvres amount to delaying tactics.


The 10 inmates facing execution, including Chan[23], Sukumaran[7], Veloso[19], one each from Brazil[24] and France[25] and four from Africa[26], have all lost appeals for clemency from Mr Widodo[27], who has argued that Indonesia[9] is fighting a drugs emergency.

Mr Widodo[27] has turned a deaf ear to increasingly desperate appeals on the convicts' behalf from their governments, from social media and from others such as band Napalm Death[28] (the president is a huge heavy metal fan).

Julian McMahon[29] (centre), the lawyer for the Bali Nine[1] pair on death row, leaves the Cilacap[2] district prosecutor's[30] office. (AAP[31]: Darma Semito[32])

Highlights

• Mary-Jane Veloso, one of the subjects of this story, is never identified by her full name, leading to a number of misidentifications.
• The entity database is light on entries regarding Indonesian political figures, sometimes misidentifying them as entertainers with similar names.
• The entity database is better on Australian political figures, but still incomplete.
• The entity database (while very knowledgeable about Australian landmarks) does not recognise many significant Indonesian landmarks.
• The entity database is unaware of topical phrases such as "Bali Nine", which it detected and, through text analysis, defined as an unknown "organisation".


1. Bali Nine (0.55 negative). Identified as "Organization", somewhat incorrect. Did not disambiguate as: http://dbpedia.org/page/Bali_Nine

2. Cilacap (0.55 negative). Identified as "City". Did not disambiguate as: http://dbpedia.org/page/Cilacap_Regency

3. Myuran Sukumaran. Did not recognise as entity. http://dbpedia.org/page/Myuran_Sukumaran

4. YouTube (0.37 neutral). Correctly disambiguated as: http://dbpedia.org/resource/YouTube

5. Brintha Sukumaran (0.75 negative). Incorrectly disambiguated as: http://dbpedia.org/resource/Sukumaran. No entry exists in DBpedia.

6. President Widodo (0.81 negative). Identified as "Person", somewhat incorrect (should be just Widodo). Did not disambiguate as: http://dbpedia.org/resource/Joko_Widodo. See: 15

7. Sukumaran. See: 3

8. Andrew Chan (0.54 negative). Correctly disambiguated as: http://dbpedia.org/resource/Andrew_Chan

9. Indonesia (0.72 negative). Correctly disambiguated as: http://dbpedia.org/resource/Indonesia

10. Nusakambangan (0.31 negative). Identified as "City". Did not disambiguate as: http://dbpedia.org/resource/Nusa_Kambangan

11. Foreign Minister (0.30 neutral). Identified as "FieldTerminology", quite ambiguous.

12. Julie Bishop (0.29 negative). Identified as "Person". Did not disambiguate as: http://dbpedia.org/page/Julie_Bishop

13. Java. Did not recognise entity. http://dbpedia.org/resource/Java

14. Australian Government (0.34 neutral). Correctly disambiguated as: http://dbpedia.org/resource/Government_of_Australia

15. Joko Widodo (0.49 negative). Correctly disambiguated as: http://dbpedia.org/resource/Joko_Widodo. See: 6


16. Jakarta (0.34 negative). Correctly disambiguated as: http://dbpedia.org/resource/Jakarta

17. Supreme Court (0.32 neutral). Identified as "Organization".

18. Mary Jane (0.33 neutral). Incorrectly disambiguated as: http://dbpedia.org/resource/Mary_Jane_Croft

19. Veloso (0.83 negative). Identified as "Person". No entry exists in DBpedia.

20. Minnie Lopez (0.26 neutral). Identified as "Person". No entry exists in DBpedia.

21. AFP (0.30 neutral). Incorrectly disambiguated as: http://dbpedia.org/resource/Philippines. The correct disambiguation is: http://dbpedia.org/page/Agence_France-Presse

22. Celia (0.28 positive). Identified as "Person".

23. Chan. Did not recognise as entity. Did not identify it to be the same as 8.

24. Brazil (0.26 neutral). Incorrectly disambiguated as: http://dbpedia.org/resource/Brazilian_military_government. Correct disambiguation: http://dbpedia.org/page/Brazil

25. France (0.22 neutral). Identified as "County". Did not disambiguate as: http://dbpedia.org/page/France

26. Africa (0.28 negative). Correctly disambiguated as: http://dbpedia.org/resource/Africa

27. Mr Widodo. Did not recognise as entity. Did not identify it to be the same as 15.

28. Napalm Death. Did not recognise as entity. No entry exists in DBpedia for heavy metal band Napalm Death.

29. Julian McMahon (0.33 negative). Incorrectly disambiguated as: http://dbpedia.org/resource/Julian_McMahon

30. Prosecutor (0.28 negative). Identified as "JobTitle".

31. AAP. Did not recognise entity. Did not disambiguate as: http://dbpedia.org/page/Australian_Associated_Press

32. Darma Semito. Did not recognise entity.


Named entities in order of detected relevance:

1. Veloso – 0.83

2. President Widodo – 0.81

3. Brintha Sukumaran – 0.75

4. Indonesia – 0.71

5. Bali Nine – 0.55

6. Cilacap – 0.55

7. Andrew Chan – 0.54

8. Joko Widodo – 0.49

9. YouTube – 0.37

10. Australian Government – 0.34

11. Jakarta – 0.34

12. Mary Jane – 0.33

13. Julian McMahon – 0.33

14. Supreme Court – 0.32

15. Nusakambangan – 0.31

16. Foreign Minister – 0.30

17. AFP – 0.30

18. Julie Bishop – 0.29

19. Celia – 0.28

20. Africa – 0.28

21. Prosecutor – 0.28

22. Brazil – 0.26

23. Minnie Lopez – 0.26

24. France – 0.22
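The prototype's query form exposes these scores through a "minimum relevance threshold" control. As an illustrative sketch (not the ACE implementation itself), filtering a relevance-ranked entity list looks like this:

```python
# Illustrative sketch: keep only entities at or above a minimum
# relevance score, returned best-first, as the ACE prototype's
# "minimum relevance threshold" control does.

def filter_by_relevance(entities, threshold):
    """Keep (name, score) pairs at or above the threshold, best first."""
    kept = [(name, score) for name, score in entities if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

detected = [("Veloso", 0.83), ("Indonesia", 0.71), ("France", 0.22)]
print(filter_by_relevance(detected, 0.5))
# → [('Veloso', 0.83), ('Indonesia', 0.71)]
```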


Appendix 2: System documentation

ACE report 2, Appendix 2

Figure: ACE ABC Corpus Explorer

Through the ACE prototype interface you can filter stories by the ABC's explicit metadata tags at left, which returns a list of stories aggregated from across the ABC using a parsing system originally developed for the Spoke project. You can then go on to investigate automatically generated output from our selected NLP engines.


Entity detail on rollover

Identified entity (not disambiguated)

Disambiguated entity

Generated topics

Generated concepts

Entity detail

NLP engine chooser

NLP query form

Figure: ACE NLP result and query UI

Story marked up with links to detected metadata

Figure: ACE detail showing detected entity


Human check for false positives

Display documents containing exact entity

Use this entity in a more detailed query

Links to knowledge base entries on this entity

Figure: ACE entity: further detail on click

Display documents containing exact concept

Use this concept in a more detailed query

Links to knowledge base entries on this concept

Figure: ACE concept: further detail on click


Figure: ACE detailed query form

Auto-detected topics

Minimum relevance threshold. Checkbox = MUST CONTAIN; unchecked = SHOULD CONTAIN

Delete from query parameters
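The MUST CONTAIN / SHOULD CONTAIN distinction in the query form follows the usual boolean-query semantics: required terms filter, optional terms only affect ranking. A hypothetical sketch of that logic (the actual ACE backend is not specified in this report):

```python
# Hypothetical sketch of the query semantics described above: checked
# topics MUST appear in a document; unchecked topics SHOULD appear
# and contribute only to the ranking score.

def match_and_score(doc_topics, must, should):
    """Return (matches, score): reject on any missing MUST topic,
    otherwise score by the number of SHOULD topics present."""
    if not set(must) <= set(doc_topics):
        return False, 0
    return True, len(set(should) & set(doc_topics))

doc = {"executions", "indonesia", "politics"}
print(match_and_score(doc, must=["indonesia"], should=["politics", "sport"]))
# → (True, 1)
print(match_and_score(doc, must=["sport"], should=["politics"]))
# → (False, 0)
```

This mirrors the `must`/`should` clauses of a boolean query in search engines such as Elasticsearch or Lucene, which is one plausible way to implement the form.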


Figure: ACE topic detection. ABC editorially chosen categories (headings) are shown with Alchemy API auto-detected topics for those stories below, in order of frequency.

Note that the engine will choose multiple topics and weight them according to confidence. The chart below shows only the single most relevant topic, according to the engine.

Using the ACE prototype you can follow each of these topic links to display a list of the connected articles.
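A sketch of how a chart like the one described above could be assembled: take each story's single most confident topic, then count topic frequency within each editorially chosen category. The field names here are assumptions for illustration, not the ACE data model.

```python
# Illustrative sketch: per-category frequency of each story's single
# most confident auto-detected topic, as in the chart described above.

from collections import Counter

def top_topic(weighted_topics):
    """Pick the topic with the highest confidence weight."""
    return max(weighted_topics, key=lambda pair: pair[1])[0]

def topics_by_category(stories):
    """Count each category's stories by their top detected topic."""
    counts = {}
    for story in stories:
        category = story["category"]
        counts.setdefault(category, Counter())[top_topic(story["topics"])] += 1
    return counts

stories = [
    {"category": "News", "topics": [("law & crime", 0.9), ("travel", 0.2)]},
    {"category": "News", "topics": [("law & crime", 0.7)]},
]
print(topics_by_category(stories)["News"].most_common(1))
# → [('law & crime', 2)]
```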


Figure: ACE concept detection

From the query UI the ACE prototype can display a complete list of detected concepts, and allow you to see which stories they are linked from.
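One plausible mechanism behind that concept-to-stories navigation, sketched here as an assumption rather than the ACE code, is an inverted index from each detected concept to the stories mentioning it:

```python
# Minimal sketch (an assumption, not the ACE implementation): an
# inverted index from each detected concept to the story ids that
# contain it, which the concept list in the UI can link through.

def build_concept_index(stories):
    """Map each concept to the sorted list of story ids containing it."""
    index = {}
    for story_id, concepts in stories.items():
        for concept in concepts:
            index.setdefault(concept, []).append(story_id)
    return {concept: sorted(ids) for concept, ids in index.items()}

index = build_concept_index({
    "story-1": ["capital punishment", "diplomacy"],
    "story-2": ["diplomacy"],
})
print(index["diplomacy"])
# → ['story-1', 'story-2']
```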


Figure: ACE NLP engine comparison of results from a corpus of 2,278 documents


Appendix 3: Cross-divisional feedback

All R&D projects, proposals and demonstrations are designed to align with ABC strategic priorities. As these priorities evolve we check back in with stakeholders regularly to ensure that our projects are correctly pitched. They need to be forward-thinking enough not to duplicate the work of other groups, while also being tactically driven to facilitate implementation by those groups.

R&D projects span a time horizon ranging from 3 to 7 years. ACE is on the near-future end of that range. It investigates the use of an emerging technology which has left the lab and is becoming widely available, but which is not yet on existing product roadmaps.

Feedback to date indicates that the problem that ACE seeks to solve – metadata quality – is increasingly important, and is not otherwise being addressed. Some technical groups and product teams can see near immediate application of the results generated by the current ACE prototype, while others would like to see further work on data quality, tuning and localisation. In response, we have made those issues the focus of the forthcoming ACE Reporter prototype.

Over the coming weeks we will reach out beyond the technical implementation and product teams that have been shown the initial ACE demo. That more wide-ranging cross-divisional feedback will be included in future reports.

Digital Network / WCMS

Very impressed with the work and it is definitely usable in the short term, especially as we migrate content from legacy CMSs to WCMS. Great work.

Ciaran Forde, Head of Digital Architecture and Development, ABC Digital Network

It is exactly Charlie’s focus on business outcomes that makes working with him on his projects so appealing. We will create tools that have real business outcomes from the NLP work he has been doing.

Neil Wilkinson, Manager, Content Services, ABC Digital Network

iView:

Really interesting work and could be useful in iview for examining program metadata including:

* series title
* episode title
* cast and director list (if we had these)
* description
* closed captions - this last one might be the most valuable


iView (continued):

A natural language parser could be used not just for adding metadata for searching but also for finding related shows.

One concern is how badly each of the three systems behaved in some circumstances. This suggests significant tuning and localisation needs to be done.

I wonder if a much simpler approach might produce results that are perhaps not as good as the best of what you’ve got, but also not as bad as the weird cases we saw.

I think we should focus on a particular requirement, say personalised recommendations, and mock up several solutions to compare.

Great work.

Peter Marks, iView mobile development lead

Localisation and Recommendations System (LRS)

The Recommendations Engine uses a number of different techniques to deliver refined recommendations based on data sources including content metadata, audience behaviour and content analysis.

What ACE could provide to Recommendations is a valuable data source, by either directly attributing topics, keywords and sentiment to content or by improving content metadata during the publish process.

The content metadata then forms an integral data source for Recommendations both for generic recommendations as well as personalised recommendations based on user behaviour when combined with content metadata.

From a timing point of view, Recommendations is currently undergoing a series of prototypes to further elaborate business and architectural requirements.

Although the Recommendations roadmap has not been devised as yet, ACE could be immediately useful in driving discussions with Product Managers about what kind of recommendations are possible, particularly for text-based content such as News.

I’ve added ACE/NLP into the following table, which summarises the smorgasbord of techniques that the Recommendations Engine could potentially provide, so that when the initial Recommendations Engine roadmap is defined, NLP can be considered in setting priorities.
