Local Memory Project

Providing tools to build collections of stories for local events from local sources

1

This work was made possible in part by IMLS LG-71-15-0077-15 and support from the Harvard Law School Library. We are grateful for the support.

Local Memory Project (LMP)
http://www.localmemory.org/, https://twitter.com/localmem

Alexander C. Nwala, Michele C. Weigle, and Michael L. Nelson (@webscidl)

Old Dominion University

Adam B. Ziegler and Anastasia Aizman (@harvardlil)

Harvard Library Innovation Lab

Presented by: Alexander C. Nwala (@acnwala)
Computer Science Ph.D. student

Media Cloud Intern, Berkman Klein Center for Internet & Society, Harvard University

JCDL 2017, June 21, 2017

2

LMP: Outline

1. Introduction
2. LMP local stories collection building
   a. Geo: Nearby news media discovery
   b. Chrome Extension: Collection building
   c. Collection archiving
   d. Community collection building
3. Evaluation
   a. Dataset
   b. Metrics/Results
4. Conclusions

3

Local Michigan media first reported on the Flint water changeover in 2014
http://www.mlive.com/opinion/flint/index.ssf/2014/04/editorial_switch_to_flint_rive.html

● April 2014: Officials in Flint, Michigan switched the city’s water source from Lake Huron (Detroit water system) to the Flint River

● This news was reported by local media such as Michigan Radio, the Flint Journal-MLive, and local TV affiliates in Flint (WEYI, WJRT, WSMH, and WNEM) [1]

[1] Denise Robbins. 2016. ANALYSIS: How Michigan And National Reporters Covered The Flint Water Crisis. https://mediamatters.org/research/2016/02/02/analysis-how-michigan-and-national-reporters-co/208290. (2016).

4

http://www.mlive.com/news/flint/index.ssf/2014/05/state_says_flint_river_water_m.html

● May 23, 2014: City residents complained about the water’s taste and smell

● This news was reported by Ron Fonger of the Flint Journal-MLive (local media) [2]

[2] Ron Fonger. 2014. State says Flint River water meets all standards but more than twice the hardness of lake water. http://www.mlive.com/news/flint/index.ssf/2014/05/state_says_flint_river_water_m.html. (2014).

City residents complained about the water’s taste and smell

5

Between August and September 2014: the city issued three boil advisories to residents of Flint after finding fecal coliform bacteria (E. coli) in the water [1]

http://www.mlive.com/news/flint/index.ssf/2014/09/flint_says_drinking_water_advi.html
http://www.mlive.com/news/flint/index.ssf/2014/09/flint_lifts_boil_water_advisor.html
http://www.mlive.com/news/flint/index.ssf/2014/09/flint_flushes_out_latest_water.html

Flint issues three boil advisories after finding E. coli in the water

[1] Denise Robbins. 2016. ANALYSIS: How Michigan And National Reporters Covered The Flint Water Crisis. https://mediamatters.org/research/2016/02/02/analysis-how-michigan-and-national-reporters-co/208290. (2016).

6

January 5, 2016: Governor Rick Snyder declared a state of emergency for the city of Flint, due to dangerously high levels of lead contamination in the drinking water

https://www.democracynow.org/2016/1/8/poisoned_democracy_how_an_unelected_official

January 2016, Governor Rick Snyder declared a state of emergency for Flint

7

● A chain of events about the Flint water crisis was reported by local media, but most of the non-local media did not report this crucial story until 2016 [1]

● Local media is fundamental to journalism, but is in decline [3]

LMP attempts to shed some light on local media

[1] Denise Robbins. 2016. ANALYSIS: How Michigan And National Reporters Covered The Flint Water Crisis. https://mediamatters.org/research/2016/02/02/analysis-how-michigan-and-national-reporters-co/208290. (2016).
[3] Rasmus Kleis Nielsen. 2015. Local journalism: the decline of newspapers and the rise of digital media. IB Tauris.

Non-local media did not report this crucial story until 2016
https://cloudfront.mediamatters.org/static/uploader/image/2016/02/03/flinttimeline1.png

8

Local and non-Local media have different priorities

Non-local news organizations such as CNN cover stories of a broader (national/international) scope such as Obamacare and the Syrian refugee migrant crisis

Local media such as the Caloosa Belle Newspaper (LaBelle, FL) cover stories that would not naturally be of interest to another locality, such as the annual Swamp Cabbage Festival

http://caloosabelle.com/?s=swamp+cabbage
http://www.cnn.com/specials/world/migration-crisis

9

LMP: Introduction

LMP provides a suite of tools (beginning with two) to help users and small communities discover, collect, build, archive, and share collections of stories for important local events by leveraging local news sources

10

Geo: Nearby news media discovery

● Given a zip code, Geo returns a list of newspapers, TV stations, and radio stations in order of proximity to the location associated with the zip code.

● For example, given the zip code “23529” (Norfolk, Virginia, USA), Geo lists 10 news media for Norfolk

12

Geo: Nearby news media discovery

● US local news repository:
○ 5,992 Newspapers
○ 1,061 TV stations, and
○ 2,539 Radio stations
■ Scraped from http://www.usnpl.com/

● Non-US local news repository:
○ 6,638 Newspapers
○ 183 Countries
○ 3,151 Cities
■ Scraped from https://www.thepaperboy.com/
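
As a rough illustration of the proximity ordering described for Geo, the sketch below ranks a few news sources by great-circle (haversine) distance from the coordinates associated with a zip code. The `ZIP_CENTROIDS` table, the `SOURCES` entries, their coordinates, and the function names are hypothetical placeholders; the actual Geo service may resolve zip codes and rank media differently.

```python
import math

# Hypothetical lookup tables with approximate, illustrative coordinates.
ZIP_CENTROIDS = {"23529": (36.8855, -76.3058)}  # roughly Norfolk, VA
SOURCES = [
    {"name": "Source A", "type": "newspaper", "lat": 36.8508, "lon": -76.2859},
    {"name": "Source B", "type": "tv", "lat": 37.5407, "lon": -77.4360},
    {"name": "Source C", "type": "radio", "lat": 36.8529, "lon": -75.9780},
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearby_media(zip_code, sources, limit=10):
    """Return up to `limit` sources, nearest to the zip code first."""
    lat, lon = ZIP_CENTROIDS[zip_code]
    ranked = sorted(
        sources,
        key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]),
    )
    return ranked[:limit]


ranked_names = [s["name"] for s in nearby_media("23529", SOURCES)]
```

With these placeholder coordinates, the nearest source comes first and the farthest last; a real repository would hold thousands of entries, so a spatial index could replace the full sort.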

14

SearchEngine(q = “protesters and police site:whro.org”)
...
SearchEngine(q = “protesters and police site:pilotonline.com”)
SearchEngine(q = “protesters and police site:wtkr.com”)

Chrome Extension: Collection building
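
The queries on this slide pair one user query with a `site:` restriction per local news source. A minimal sketch of that query construction follows; the function name `build_site_queries` is hypothetical, and how the extension actually submits the queries to a search engine is not shown in the slides.

```python
def build_site_queries(query, domains):
    """Return one site-restricted query string per local news domain."""
    return [f"{query} site:{domain}" for domain in domains]


# Domains taken from the slide; the list would normally come from Geo.
local_domains = ["whro.org", "pilotonline.com", "wtkr.com"]
queries = build_site_queries("protesters and police", local_domains)

for q in queries:
    print(q)
```

Restricting each query to a single domain keeps the results confined to the selected local sources, at the cost of issuing one search per source.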

15

Local vs Non-Local

Local news sources from Virginia, such as the Virginian-Pilot, WHRO-TV, and WTKR-TV

Non-Local sources (e.g., CNN and NBC News), Local sources (e.g., ABC7 Chicago and Chicago Tribune), and a YouTube source

A non-Local collection mixes Local and non-Local sources

Chrome Extension: Collection building

17

To mitigate the problems of content drift and link rot, as well as preserve collections for future users and researchers, the LMP extension implements collection archiving

18

[Figure: collection archiving, present vs ideal implementation]

PRESENT IMPLEMENTATION: archive.is (e.g., https://archive.is/0hQQG)

IDEAL IMPLEMENTATION: public archive 0, public archive 1, ..., public archive n-1 → archive-uri 0, archive-uri 1, ..., archive-uri n-1

Chrome Extension: Collection archiving
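
The ideal implementation, in which each story is pushed to n public archives and one archive URI is kept per archive, can be sketched as a simple fan-out. This is an assumption-laden illustration: the archivers are passed in as plain callables, and the two stub archivers below are hypothetical and only fabricate URIs on the reserved `.example` domain. No real archive API is called or implied.

```python
from typing import Callable, Dict, List

# An archiver is any callable that takes a URI and returns an archive URI.
Archiver = Callable[[str], str]


def archive_collection(
    uris: List[str], archivers: Dict[str, Archiver]
) -> Dict[str, Dict[str, str]]:
    """Submit every URI to every archiver; map URI -> {archive name: archive URI}."""
    results: Dict[str, Dict[str, str]] = {}
    for uri in uris:
        results[uri] = {}
        for name, archiver in archivers.items():
            try:
                results[uri][name] = archiver(uri)
            except Exception:
                # One failing archive should not block the others.
                continue
    return results


# Hypothetical stubs standing in for "public archive 0" and "public archive 1".
def stub_archive_0(uri: str) -> str:
    return f"https://archive0.example/{uri}"


def stub_archive_1(uri: str) -> str:
    return f"https://archive1.example/{uri}"


def failing_archive(uri: str) -> str:
    raise RuntimeError("archive unavailable")


archived = archive_collection(
    ["http://example.com/story"],
    {"archive0": stub_archive_0, "archive1": stub_archive_1, "down": failing_archive},
)
```

Keeping copies in several archives means a collection stays reachable even if one archive is unavailable, which is the motivation the figure suggests for the ideal implementation.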

20

Community collection building

● We believe there is value when multiple users contribute to the same collection

● This is similar in spirit to the Internet Archive’s request to the public to contribute URIs for the 2016 Orlando Nightclub Shooting Web Archive:

21

The LMP Extension enables users to share collections on Twitter

● We believe there is value when multiple users contribute to the same collection

● The LMP Extension enables users to share collections on Twitter. Shared collections may be tagged with a hashtag

● The hashtag provides a means for thematically-related collections to be organized

22

The hashtag provides a means for thematically-related collections to be organized

23

Evaluation

● We claim that Local collections have less exposure than non-Local collections

● Through collection building, archiving, and sharing, LMP could help increase the exposure of Local news sources

● To assess the validity of our claim, we measured the degree of exposure Local collections have compared to non-Local collections

25

Evaluation: Dataset

● Our evaluation dataset comprised 20 pairs (Local and non-Local) of collections corresponding to 20 different stories

● Each collection (Local and non-Local) was further split into two classes:
○ G - extracted from the default Google SERP, and
○ NV - extracted from the Google News vertical SERP

26

Evaluation: Metrics

● For each collection we measured:
○ Archival coverage and tweet index rate to approximate the exposure of the Local and non-Local collections

● We also measured:
○ Temporal range,
○ Precision, and
○ Sub-collection overlap for experimentation

29

Archival coverage: Non-Local collections produced higher archive rates than Local collections (claim confirmed)

● Definition: The archival coverage is the fraction of a collection that is archived

● Claim: We claim that non-Local collections possess higher archive rates than Local collections

● Extraction: The binary archived state of a story in a collection was extracted by utilizing the MemGator utility (http://memgator.cs.odu.edu/)

● Result:
○ Non-Local collections G and NV produced archive rates of 0.83 and 0.80, respectively
○ Local collections G and NV produced archive rates of 0.52 and 0.63, respectively
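
Since archival coverage is defined as the fraction of a collection that is archived, it reduces to a ratio over per-story binary states. The sketch below assumes those states have already been obtained (for example from MemGator) and are supplied as booleans; the function name and the example flags are hypothetical and are not data from the evaluation.

```python
def archival_coverage(archived_flags):
    """Fraction of stories in a collection whose archived state is True."""
    flags = list(archived_flags)
    if not flags:
        raise ValueError("collection is empty")
    return sum(1 for flag in flags if flag) / len(flags)


# Hypothetical collection of five stories, three of which are archived.
example_flags = [True, True, False, True, False]
coverage = archival_coverage(example_flags)  # 3 archived out of 5 stories
```

The same ratio applies to any per-story binary state, so a rate such as 0.83 means that 83% of the stories in that collection had at least one archived copy.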

30

Tweet index rates: Non-Local collections produced higher tweet index rates than Local collections (claim confirmed)

● Definition: The tweet index rate is the fraction of a collection which could also be found embedded in a tweet

● Claim: We claim that non-Local collections possess higher tweet index rates than Local collections

● Extraction: The binary tweet index state of a story in the collection was extracted by searching Twitter

● Result:
○ Non-Local collections G and NV produced tweet index rates of 0.71 and 0.80, respectively
○ Local collections G and NV produced tweet index rates of 0.44 and 0.59, respectively

31

Temporal range: Non-Local-NV collections possessed the highest probability of producing the newest document with a probability of 0.75 (claim confirmed)

● Definition: The temporal range of a collection is the distribution of the creation datestamps of the stories in the collection

● Claim: We claim that non-Local collections are temporally biased to produce newer stories than Local collections

● Extraction: Most news stories have creation datestamps. We extracted these datestamps from the SERPs

● Result:
○ Local-G collections produce the oldest documents with a probability of 0.7
○ The consequences of these probabilities are crucial: One must sample Local-G collections in order to maximize the chances of finding the first reports about a story or event

32
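
One plausible reading of "produce the oldest documents with a probability of 0.7" is the fraction of stories for which a given collection type holds the earliest datestamp. The sketch below implements that reading only; the slides do not spell out the computation, so the tie handling, the function name, and the example dates are all assumptions.

```python
from collections import Counter
from datetime import date


def oldest_document_probabilities(stories):
    """For each collection type, the fraction of stories where it holds the oldest datestamp.

    `stories` is a list of dicts mapping a collection type (e.g. "Local-G")
    to the list of creation dates found in that collection for one story.
    Ties credit every type that holds the oldest date (an assumption).
    """
    wins = Counter()
    for collections in stories:
        earliest = {kind: min(dates) for kind, dates in collections.items() if dates}
        if not earliest:
            continue  # no datestamps were extracted for this story
        oldest = min(earliest.values())
        for kind, first_date in earliest.items():
            if first_date == oldest:
                wins[kind] += 1
    return {kind: count / len(stories) for kind, count in wins.items()}


# Hypothetical datestamps for two stories.
example_stories = [
    {"Local-G": [date(2014, 4, 1), date(2014, 5, 1)], "non-Local-NV": [date(2016, 1, 5)]},
    {"Local-G": [date(2015, 1, 1)], "non-Local-NV": [date(2014, 12, 1)]},
]
probabilities = oldest_document_probabilities(example_stories)
```

Swapping `min` for `max` gives the corresponding probability of producing the newest document.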

Precision: Type-G collections produce documents at a higher precision than NV (claim partially confirmed)

● Definition: The precision of a collection is the fraction of stories in the collection that are relevant to the collection query based on the judgement of human evaluators. We considered a story relevant or non-relevant only if the relevant and non-relevant votes differed by a margin of 2 votes or more

● Claim: We claim that non-Local collections possess a higher precision than Local collections

● Extraction: 14 evaluators evaluated our dataset. For each story in a collection, an evaluator scored the story as relevant if the story was on topic with respect to the collection query, and non-relevant otherwise

● Result:
○ Local-G precision: 0.84, non-Local-G: 0.72, Local-NV: 0.71, and non-Local-NV: 0.68

Relevance Margin of 2 Votes or more
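
The vote-margin rule can be sketched as follows. The sketch assumes that a story whose votes do not differ by at least the margin is left undecided and excluded from the precision denominator; the slides do not state how undecided stories were handled, and the function names and vote counts are hypothetical.

```python
def classify(relevant_votes, non_relevant_votes, margin):
    """Label a story, or return None when the votes do not differ by the margin."""
    difference = relevant_votes - non_relevant_votes
    if difference >= margin:
        return "relevant"
    if -difference >= margin:
        return "non-relevant"
    return None


def precision(vote_pairs, margin=2):
    """Fraction of decided stories labelled relevant (undecided ones are dropped)."""
    labels = [classify(rel, non_rel, margin) for rel, non_rel in vote_pairs]
    decided = [label for label in labels if label is not None]
    if not decided:
        raise ValueError("no story met the vote margin")
    return decided.count("relevant") / len(decided)


# Hypothetical (relevant, non-relevant) vote counts for four stories.
votes = [(3, 0), (2, 1), (0, 3), (3, 1)]
precision_margin_2 = precision(votes, margin=2)
precision_margin_1 = precision(votes, margin=1)
```

Lowering the margin lets more stories count as decided, which is why the reported precision values differ between a margin of 2 votes and a margin of 1 vote.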

33

Precision: Type-G collections produce documents at a higher precision than NV (claim partially confirmed)

● Result:
○ non-Local-G precision: 0.84, Local-G: 0.79, non-Local-NV: 0.71, and Local-NV: 0.70

Relevance Margin of 1 Vote or more

34

Sub-collection overlap: Local collections showed a higher overlap rate than non-Local collections (claim confirmed)

● Definition:
○ Given a collection evaluation dataset, let sub-collection sets LG and LNV define sets populated from Local-G and Local-NV, respectively
○ Similarly, let sub-collection sets NLG and NLNV define sets populated from non-Local-G and non-Local-NV, respectively
○ The overlap of 2 sets X, Y, overlap(X, Y) = [formula not captured in the transcript]

● Claim: We claim Local sub-collections LG and LNV have more in common (more overlap) than non-Local sub-collections NLG and NLNV

● Result: Local collections showed a higher overlap rate than non-Local collections
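
The formula for overlap(X, Y) is not preserved in this transcript. The sketch below therefore assumes the widely used overlap coefficient, |X ∩ Y| / min(|X|, |Y|), which may differ from the definition on the original slide. The URIs in the example sets are hypothetical.

```python
def overlap(x, y):
    """Overlap coefficient |X ∩ Y| / min(|X|, |Y|); an assumed definition."""
    x, y = set(x), set(y)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


# Hypothetical sub-collections of story URIs.
LG = {"uri-a", "uri-b", "uri-c", "uri-d"}
LNV = {"uri-b", "uri-c", "uri-e"}
NLG = {"uri-p", "uri-q", "uri-r"}
NLNV = {"uri-r", "uri-s", "uri-t", "uri-u"}

local_overlap = overlap(LG, LNV)        # 2 shared URIs, smaller set has 3
non_local_overlap = overlap(NLG, NLNV)  # 1 shared URI, smaller set has 3
```

Under this definition, a higher value for overlap(LG, LNV) than for overlap(NLG, NLNV) would correspond to the claim that Local sub-collections have more in common.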

35

e1: Local collections overlap
e2: Non-Local collections overlap
e3: e1 and e2 overlap

Conclusions

● We cannot rely exclusively on non-Local sources to build our collections

● Local news sources are fundamental to journalism, but less exposed

● LMP’s tools could help expose local news sources
○ Geo (http://www.localmemory.org/geo/)
○ Chrome Extension - Local stories collection generator (http://www.localmemory.org/)

● Our tools, local news repository, and evaluation results are publicly available (https://github.com/harvard-lil/local-memory)

37

Follow: @localmem
Download Chrome Extension: http://www.localmemory.org/

Thank you!

@acnwala @webscidl @harvardlil

38