Computational Journalism Hong Kong
8/9/2019 Computational Journalism Hong Kong
1/23
FEATURED
INTRODUCTION: COMPUTER SCIENCE AND JOURNALISM
FEBRUARY 14, 2013
Maybe it's not obvious that computer science and journalism go together, but they do!
Computational journalism combines classic journalistic values of storytelling and public
accountability with techniques from computer science, statistics, the social sciences, and the
digital humanities.
This course, given at the University of Hong Kong during January-February 2013, is an
advanced look at how techniques from visualization, natural language processing, social
network analysis, statistics, and cryptography apply to four different areas of journalism:
finding stories through data mining, communicating what you've learned, filtering an
overwhelming volume of information, and tracking the spread of information and effects.
The course assumes knowledge of computer science, including standard algorithms and linear
algebra. Several of the assignments require students to write Python code at an intermediate
level. But this introductory video, which explains the topics covered, is for everyone.
Slides here. For more, see the syllabus, or jump directly to a lecture:
1. Basics. Feature vectors, clustering, projections.
2. Text analysis. Tokenization, TF-IDF, topic modeling.
3. Algorithmic filters. Information overload. Newsblaster and Google News.
4. Hybrid filters. Social networks as filters. Collaborative Filtering.
5. Social network analysis. Using it in journalism. Centrality algorithms.
6. Knowledge representation. Structured data. Linked open data. General Q&A.
7. Drawing conclusions. Randomness. Competing hypotheses. Causation.
8. Security, surveillance, and privacy. Cryptography. Threat modeling.
LECTURES
LECTURE 8: SECURITY, SURVEILLANCE, AND PRIVACY
FEBRUARY 13, 2013
Who is watching our online activities? How do you protect a source in the 21st Century? Who
gets access to all of this mass intelligence, and what does the ability to survey everything
all the time mean both practically and ethically for journalism? In this lecture we will talk
about who is watching and how, and how to create a security plan using threat modeling.
Topics: How is email transmitted? Who has access to your emails? Mass surveillance and its
legal status. How cryptography works. Encryption versus authentication. Man-in-the-middle
attacks. Secure communications using OTR. Case study: the leaked Wikileaks cables. Threat
modeling. Security planning.
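To make the "how cryptography works" topic a little more concrete, here is a toy sketch of a Diffie-Hellman key exchange in Python. The tiny prime and generator are assumptions chosen for readability, and the function names are made up for this example; real systems rely on vetted libraries and far larger parameters.

```python
import secrets

# Toy public parameters (assumed for illustration; far too small for real use).
P = 23  # a small prime modulus
G = 5   # a generator modulo P


def make_keypair():
    """Return (private, public) for one party."""
    private = secrets.randbelow(P - 2) + 1  # secret exponent in [1, P-2]
    public = pow(G, private, P)             # value sent over the open channel
    return private, public


def shared_secret(my_private, their_public):
    """Combine my private exponent with the other party's public value."""
    return pow(their_public, my_private, P)


alice_private, alice_public = make_keypair()
bob_private, bob_public = make_keypair()

# Both sides derive the same secret although only public values were exchanged.
alice_key = shared_secret(alice_private, bob_public)
bob_key = shared_secret(bob_private, alice_public)
print(alice_key == bob_key)  # True
```

Nothing in this exchange proves who sent each public value, which is why key exchange alone does not prevent a man-in-the-middle attack and why authentication is listed as a separate topic from encryption.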
Slides
Readings
Chris Soghoian, Why secrets aren't safe with journalists, New York Times 2011
Hearst New Media Lecture 2012, Rebecca MacKinnon
Recommended
CPJ journalist security guide section 3, Information Security
Global Internet Filtering Map, OpenNet Initiative
The NSA is building the country's biggest spy center, James Bamford, Wired
Cryptographic security
Unplugged: The Show part 9: Public Key Cryptography
Diffie-Hellman key exchange, ArtOfTheProblem
Anonymity
Tor Project Overview
Who is harmed by a real-names policy, Geek Feminism
Assignment: Threat modeling and security planning. Use threat modeling to come up with a
security plan for a given scenario.
ASSIGNMENTS
ASSIGNMENT 6: THREAT MODELING AND SECURITY PLANNING
FEBRUARY 8, 2013
For this assignment, each of you will pick one of the four reporting scenarios below and
design a security plan. More specifically, you will flesh out the scenario, create a threat
model, come up with a plausible security plan, and analyze the weaknesses of your plan.
Start by creating a threat model, which must consider:
What must be kept private? Specify all of the information that must be secret, including
notes, documents, files, locations, and identities, and possibly even the fact that
someone is working on a story.
Who is the adversary and what do they want to know? It may be a single person, or an
entire organization or state, or multiple entities. They may be very interested in certain
types of information, e.g. identities, and uninterested in others. List each adversary and
their interests.
What can they do to find out? List every way they could try to find out what you want
secret, including technical, legal, and social methods.
What is the risk? Explain what happens if an adversary succeeds in breaking your security.
What are the consequences, and to whom? Which of these is it absolutely necessary to
avoid?
Once you have specified your threat model, you are ready to design your security plan.
The threat model describes the risk, and the goal of the security plan is to reduce that risk as
much as possible.
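One way to keep the four threat-model questions organized is to record the answers as structured data before writing the plan. The Python sketch below is only an illustration: the class names, fields, and example entries are hypothetical and are not part of the assignment.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Adversary:
    name: str
    interests: list[str]      # what they want to know
    capabilities: list[str]   # technical, legal, and social methods


@dataclass
class ThreatModel:
    secrets: list[str]                                    # what must be kept private
    adversaries: list[Adversary] = field(default_factory=list)
    risks: dict[str, str] = field(default_factory=dict)   # secret -> consequence if exposed

    def unassessed_secrets(self):
        """Secrets that have no documented consequence yet."""
        return [s for s in self.secrets if s not in self.risks]


# Hypothetical entries, loosely in the spirit of a whistleblower scenario.
model = ThreatModel(
    secrets=["source identities", "interview notes"],
    adversaries=[
        Adversary(
            name="employer of the source (hypothetical)",
            interests=["source identities"],
            capabilities=["legal demands for records", "monitoring work email"],
        )
    ],
    risks={"source identities": "source loses job"},
)
print(model.unassessed_secrets())  # ['interview notes']
```

Listing secrets that still lack a documented consequence is a simple check that the "what is the risk?" question has been answered for everything that must stay private.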
Your plan must specify appropriate software tools, plus how these tools must be used. Pay
particular attention to necessary habits: specify who must do what, and in what way, to keep
the system secure. Explain how you will educate your sources and collaborators in the proper
use of your chosen tools, and how hard you think it will be to make sure everyone does
exactly the right thing.
Also document the weaknesses of your plan. What can still go wrong? What are the critical
assumptions that will cause failure if it turns out you have guessed wrong? What is going to
be difficult or expensive about this plan?
The scenarios you can choose from are:
1. You are a photojournalist in Syria with digital images you want to get out of the country.
Limited internet access is available at a cafe. Some of the images may identify people
working with the rebels who could be targeted by the government if their identity is revealed.
In addition you would like to remain anonymous until the photographs are published, so that
you can continue to work inside the country for a little longer, and leave without difficulty.
2. You are working on an investigative story about the CIA conducting operations in the U.S.,
in possible violation of the law. You have sources inside the CIA who would like to remain
anonymous. You will occasionally meet with these sources in person but mostly communicate
electronically. You would like to keep the story secret until it is published, to avoid pre-
emptive legal challenges to publication.
3. You are reporting on insider trading at a large bank, and talking secretly to two
whistleblowers. If these sources are identified before the story comes out, at the very least you
will lose your sources, but there might also be more serious repercussions: they could lose
their jobs, or the bank could attempt to sue. This story involves a large volume of proprietary
data and documents which must be analyzed.
4. You are working in Europe, assisting a Chinese human rights activist. The activist is
working inside China with other activists, but so far the Chinese government does not know
they are an activist and they would like to keep it this way. You have met the activist once
before, in person, and have a phone number for them, but need to set up a secure
communications channel.
These scenario descriptions are incomplete. Please feel free to expand them, making any
reasonable assumptions about the environment or the story, though you must document
your assumptions, and you can't assume that you have unrealistic resources or that your
adversary is incompetent.
LECTURES
LECTURE 7: DRAWING CONCLUSIONS FROM DATA
FEBRUARY 5, 2013
You've loaded up all the data. You've run the algorithms. You've completed your analysis.
But how do you know that you are right? It's incredibly easy to fool yourself, but fortunately,
there is a long history of fields grappling with the problem of determining truth in the face of
uncertainty, from statistics to intelligence analysis.
Topics: What does randomness look like? Variation from rolling dice. Base rate
fallacy. Conditional probability. Bayes' theorem. Cognitive biases. Method of competing
hypotheses. Probabilistic scoring of hypotheses. Correlation and causation. Finding alternate
hypotheses for the NYPD stop and frisk data.
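The base rate fallacy and Bayes' theorem from the topic list can be illustrated with a short calculation. The numbers below are invented for illustration and are not taken from the lecture; the function name is likewise just a label for this sketch.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(hypothesis | positive evidence), computed with Bayes' theorem."""
    p_evidence = (true_positive_rate * prior
                  + false_positive_rate * (1.0 - prior))
    return true_positive_rate * prior / p_evidence


# Hypothetical screening scenario: 1 person in 1,000 is of interest,
# the test flags 99% of them, and it wrongly flags 1% of everyone else.
p = posterior(prior=0.001, true_positive_rate=0.99, false_positive_rate=0.01)
print(round(p, 3))  # 0.09
```

Even with a test that sounds 99% accurate, only about 9% of flagged people are actually of interest in this made-up scenario, because the base rate is so low. Ignoring that prior is the base rate fallacy.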
Slides
Readings
Correlation and causation, Business Insider
The Psychology of Intelligence Analysis, chapters 1, 2, 3 and 8, Richards J. Heuer
Graphical Inference for Infovis, Hadley Wickham et al.
If correlation doesn't imply causation, then what does?, Michael Nielsen
Why most published research findings are false, John P. A. Ioannidis
Assignment: Statistical inference. Analyze international homicide rate vs. gun ownership
data.
ASSIGNMENTS
ASSIGNMENT 5: STATISTICAL INFERENCE
FEBRUARY 5, 2013
For this assignment you will analyze global data on the number of homicides versus the
number of guns in each country. I'm giving you the data; your job is to tell me what it
means. You will interpret a few different plots, and then implement the visual randomization
procedure from the paper we discussed in class to examine a tricky case more closely.
The data is from The Guardian Data Blog. I simplified the header names, dropped a few
unnecessary columns, and added an OECD column.
1. I've written most of the code you will need for this assignment, available from this GitHub
repo. (You can git clone if you like, otherwise just click here to download all files as a zip
archive.)
2. We are going to use the R language for this assignment. This is mostly because it has really
nice built-in charts (doing this in Python is a real pain), but also because you are likely to
encounter R out in the real world of data journalism. Download and install it. To start R,
enter R on the command line. To run a program, enter source("filename.R") at the R command
prompt. A full language manual is here. You will only need to use a few basic concepts, such
as random number generation and for loops.
3. Plot the data for all countries, homicide rate (per 100,000) versus number of privately-
owned firearms (per 100), by running source("plot-all-countries.R") at the R prompt. What do
you see? Please report on the general patterns here, the outliers, and what this all might mean.
4. Now take a look at only the OECD countries, by uncommenting the indicated line in the
source. Re-run the file. What does the chart show now?
5. Now plot only the non-OECD countries, by uncommenting the indicated line in the source
(be sure to re-comment the line that selects only OECD countries). What does the chart show
now?
6. It looks like there might be a pattern among the OECD countries, but the United States is
such an outlier that it's hard to tell. Is this pattern still significant without the US? To find out,
you're going to apply a randomization test. (We'll also remove Mexico since it's not a
developed country and thus not really comparable to the other OECD countries.)
Start with the file randomization-test.R. You need to write the code that performs the actual
randomization, filling eight of the columns of charts with random permutations of the
original y values (homicide rates), but putting the original data in the realchart column. To
prevent sneak peeks, the code is currently set up to use testing data. When your permutations
are working right, you should see something like this when you run the file:
[Screenshot of the resulting grid of charts]
After pressing Enter, the program will tell you which chart has the real (un-permuted) data.
Here, with fake data, it's obvious. It won't always be.
7. Now that your program works, try it on the real data by commenting out the two lines that
generate the fake data. Re-run, and look at the plots carefully. Which one do you think is the
real data? Write down the number of the chart. Then hit enter, and see if you got it right.
8. This isn't quite fair, because you were already looking at the data in step 4. So get someone
else to look at it fresh. Explain to them that you are charting firearms versus homicides and
that one of the charts is real but the rest are fakes, and ask them to spot the real chart.
9. Did you guess right? Did your fresh observer guess right? Did you and your observer guess
differently? If so, why do you think that is? Was it difficult for you to choose? Based on all of
this, do you think there is a correlation between gun ownership and homicide rate for the
OECD countries? If so, how strong is it (effect size) and how strong is the evidence (statistical
significance)?
10. What does all this mean? Please write a short journalistic analysis of the
global relationship between firearms ownership and homicide rate, for a general audience.
Your editor has asked you to do this analysis and is very interested in whether there is a
causal relationship (whether more guns cause more crime), so you will have to include
something about that.
Turn in: answers to questions in steps 3, 4, 5, 7, 8, 9, your code, and your final short analysis
article.
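The randomization idea in step 6 and the evidence question in step 9 can also be sketched outside R. The Python below is not the assignment's R code: the function names, the default of nine panels (eight permuted plus one real), and the toy numbers are assumptions made for illustration only.

```python
import random


def make_lineup(x, y, n_panels=9, rng=random):
    """Return (panels, real_index): the real (x, y) pairing hidden among permuted ones."""
    real_index = rng.randrange(n_panels)
    panels = []
    for i in range(n_panels):
        if i == real_index:
            panels.append((list(x), list(y)))   # real pairing, untouched
        else:
            shuffled = list(y)
            rng.shuffle(shuffled)               # breaks any link between x and y
            panels.append((list(x), shuffled))
    return panels, real_index


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_permutations=10000, rng=random):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return hits / n_permutations


# Made-up values, NOT the Guardian data.
guns = [5.0, 10.0, 15.0, 20.0, 30.0, 45.0]
homicides = [0.6, 0.9, 0.8, 1.2, 1.1, 1.6]

rng = random.Random(0)
panels, real_index = make_lineup(guns, homicides, rng=rng)
effect_size = pearson_r(guns, homicides)
p_value = permutation_p_value(guns, homicides, n_permutations=2000, rng=rng)
print(len(panels), round(effect_size, 2), p_value)
```

The correlation coefficient speaks to effect size, while the permutation p-value speaks to how often shuffled data looks at least as correlated as the real data. The visual lineup asks a human observer to make roughly the same comparison by eye.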
LECTURES
LECTURE 6: STRUCTURED JOURNALISM AND KNOWLEDGE REPRESENTATION
FEBRUARY 1, 2013
Is journalism in the text/video/audio business, or is it in the knowledge business? In this class
we'll look at this question in detail, which gets us deep into the issue of how knowledge is
represented in a computer. The traditional relational database model is often inappropriate for
journalistic work, so we're going to concentrate on so-called linked data representations.
Such representations are widely used and increasingly popular. For example, Google recently
released the Knowledge Graph. But generating this kind of data from unstructured text is still
very tricky, as we'll see when we look at the Reverb algorithm.
Topics: Structured and unstructured data. Article metadata and schema.org. Linked open data
and RDF. Entity extraction. Propositional representation of knowledge. Extracting structured
data from unstructured text. The Reverb algorithm. DeepQA. Automatic story writing from
data.
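As a rough illustration of the linked data idea, knowledge can be stored as subject-predicate-object statements rather than as rows in fixed tables. The sketch below uses plain Python tuples; the identifiers and facts are placeholder examples, and the `match` helper is hypothetical. A real system would use RDF with full URIs and a proper triple store.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers and facts here are invented placeholders.
triples = {
    ("ex:Alice", "ex:worksFor", "ex:ExampleBank"),
    ("ex:Bob", "ex:worksFor", "ex:ExampleBank"),
    ("ex:ExampleBank", "ex:locatedIn", "ex:HongKong"),
}


def match(pattern, store):
    """Return the triples that fit a pattern, sorted; None acts as a wildcard."""
    return sorted(
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


employees = [s for s, _, _ in match((None, "ex:worksFor", "ex:ExampleBank"), triples)]
print(employees)  # ['ex:Alice', 'ex:Bob']
```

Because every fact has the same three-part shape, new kinds of statements can be added without changing a schema, which is one reason this style suits open-ended journalistic material.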
Slides (PDF)
Readings
A fundamental way newspaper websites need to change, Adrian Holovaty
The next web of open, linked data, Tim Berners-Lee TED talk
Identifying Relations for Open Information Extraction, Fader, Soderland, and Etzioni
(Reverb algorithm)
Recommended
Standards-based journalism in a semantic economy, Xark
What the semantic web can represent, Tim Berners-Lee
Building Watson: an overview of the DeepQA project
Can an algorithm write a better story than a reporter?, Wired, 2012.
Assignment: Entity extraction. Text enrichment experiments using OpenCalais.
ASSIGNMENT 4: ENTITY EXTRACTION
FEBRUARY 1, 2013
For this assignment you will evaluate the performance of OpenCalais, a commercial entity
extraction service. You'll do this by building a text enrichment program, which takes plain
text and outputs HTML with links to the detected entities. Then you will take five random
articles from your data set, enrich them, and manually count how many entities OpenCalais
missed or got wrong.
1. Get an OpenCalais API key, from this page.
2. Install the python-calais module. This will allow you to call OpenCalais from Python
easily. First, download the latest version of python-calais. To install it, you just need calais.py
in your working directory. You will probably also need to install the simplejson Python
module. Download it, then run python setup.py install. You may need to execute this as
super-user.
3. Call OpenCalais from Python. Make sure you can successfully submit text and get the
results back, following these steps. The output you want to look at is in the entities array,
which would be accessed as result.entities using the variable names in the sample code. In
particular you want the list of occurrences for each entity, in the instances field.
>>> result.entities[0]['instances']
[{u'suffix': u' is the new President of the United States',
  u'prefix': u'of the United States of America until 2009. ',
  u'detection': u'[of the United States of America until 2009.]Barack Obama[ is the new President of the United States]',
  u'length': 12, u'offset': 75, u'exact': u'Barack Obama'}]
>>> result.entities[0]['instances'][0]['offset']
75
>>>
Each instance has offset and length fields that indicate where in the input text the entity
was referenced. You can use these to determine where to place links in the output HTML.
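As a minimal sketch of how offset and length could be used, the function below wraps spans of the input text in links. It is written for Python 3, the function name link_spans and the second URL are made up for illustration, and the spans are hand-made stand-ins rather than real OpenCalais output. It assumes spans do not overlap, and it works from the last span back to the first so that earlier offsets stay valid while the string grows.

```python
import html

def link_spans(text, spans):
    """Wrap each (offset, length, url) span of text in an <a> tag.

    spans is a list of (offset, length, url) tuples, for example built from
    the 'offset' and 'length' fields of each entity instance.
    Assumes the spans do not overlap.
    """
    out = text
    # Go right to left so inserting tags never shifts an offset still to come.
    for offset, length, url in sorted(spans, reverse=True):
        exact = out[offset:offset + length]
        link = '<a href="%s">%s</a>' % (html.escape(url, quote=True), exact)
        out = out[:offset] + link + out[offset + length:]
    return out

sample = "Barack Obama likes Starbucks."
spans = [
    (0, 12, "http://en.wikipedia.org/wiki/Barack_Obama"),
    (19, 9, "http://example.org/starbucks"),  # hypothetical URL
]
print(link_spans(sample, spans))
```

Note that this sketch does not escape the article text itself, so characters like & or < in the input would need extra handling.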
4. Read a text file, create hyperlinks, and write it out. Your Python program should read
text from stdin and write HTML with links on all detected entities to stdout. There are two
cases to handle, depending on how much information OpenCalais gives back.
In many cases, like the example in step 3, OpenCalais will not be able to give you any
information other than the string corresponding to the entity, result.entities[x]['name']. In this
case you should construct a Wikipedia link by simply appending the name to a Wikipedia
URL, converting spaces to underscores, e.g.
http://en.wikipedia.org/wiki/Barack_Obama
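The Wikipedia case might be sketched like this in Python 3. The function name wikipedia_url is made up for illustration, and the percent-encoding of non-ASCII characters is my addition, not something the steps above require.

```python
from urllib.parse import quote

WIKIPEDIA_BASE = "http://en.wikipedia.org/wiki/"

def wikipedia_url(name):
    """Build a Wikipedia URL from an entity name, with spaces as underscores."""
    title = name.strip().replace(" ", "_")
    # Percent-encode characters that are not safe in a URL path (e.g. accents).
    return WIKIPEDIA_BASE + quote(title, safe="_()',")

print(wikipedia_url("Barack Obama"))
```

Nothing here checks that the page exists, so some of the links produced this way will be broken.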
In other cases, especially companies and places, OpenCalais will supply a link to an RDF
document that contains more information about the entity. For example:
>>> result.entities[0]
{u'_typeReference': u'http://s.opencalais.com/1/type/em/e/Company',
 u'_type': u'Company', u'name': u'Starbucks',
 '__reference': u'http://d.opencalais.com/comphash-1/6b2d9108-7924-3b86-bdba-7410d77d7a79',
 u'instances': [{u'suffix': u' in Paris.',
                 u'prefix': u'of the United States now and likes to drink at ',
                 u'detection': u'[of the United States now and likes to drink at]Starbucks[ in Paris.]',
                 u'length': 9, u'offset': 156, u'exact': u'Starbucks'}],
 u'relevance': 0.314, u'nationality': u'N/A',
 u'resolutions': [{u'name': u'Starbucks Corporation', u'symbol': u'SBUX.OQ', u'score': 1,
                   u'shortname': u'Starbucks', u'ticker': u'SBUX',
                   u'id': u'http://d.opencalais.com/er/company/ralg-tr1r/f8512d2d-f016-3ad0-8084-a405e59139b3'}]}
>>> result.entities[0]['resolutions'][0]['id']
u'http://d.opencalais.com/er/company/ralg-tr1r/f8512d2d-f016-3ad0-8084-a405e59139b3'
>>>
In this case the resolutions array will contain a hyperlink for each resolved entity, and this is
where your link should go. The linked page will contain a series of triples (assertions) about
the entity, which you can obtain in machine-readable form by changing the .html at the end of
the link to .json. The sameAs: links are particularly important because they tell you that this
entity is equivalent to others in DBpedia and elsewhere.
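Putting the two cases together, one possible sketch in Python 3 is below. The function names entity_link and json_variant are made up, the entity dictionaries are trimmed-down copies of the examples above, and the .html URL in the last line is a placeholder, not a real OpenCalais address.

```python
WIKIPEDIA_BASE = "http://en.wikipedia.org/wiki/"

def entity_link(entity):
    """Pick the link target for one entity dictionary.

    Uses the first resolution's id when OpenCalais resolved the entity,
    and falls back to a Wikipedia URL built from the name otherwise.
    """
    resolutions = entity.get("resolutions") or []
    if resolutions and resolutions[0].get("id"):
        return resolutions[0]["id"]
    return WIKIPEDIA_BASE + entity["name"].replace(" ", "_")

def json_variant(url):
    """Swap a trailing .html for .json; leave any other URL unchanged."""
    return url[:-len(".html")] + ".json" if url.endswith(".html") else url

starbucks = {
    "name": "Starbucks",
    "resolutions": [{"id": "http://d.opencalais.com/er/company/ralg-tr1r/"
                           "f8512d2d-f016-3ad0-8084-a405e59139b3"}],
}
obama = {"name": "Barack Obama"}  # no resolutions, so Wikipedia is used

print(entity_link(starbucks))
print(entity_link(obama))
print(json_variant("http://example.org/entity.html"))  # placeholder URL
```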
Here is more on OpenCalais entity disambiguation and use of linked data.
The final result should look something like the example below. Note that some links go to
OpenCalais entity pages with RDF links on them (London), some go to Wikipedia (politician), and
some are broken links when Wikipedia doesn't have the topic (Aarthi Ramachandran). And
of course Mr Gandhi is an entity that was not detected, three times.
The latest effort to decode Mr Gandhi comes in the form of a limited yet rather well written
biography by a political journalist, Aarthi Ramachandran. Her task is a thankless one. Mr
Gandhi is an applicant for a big job: ultimately, to lead India. But whereas any other job
applicant will at least offer minimal information about his qualifications, work experience,
reasons for wanting a post, Mr Gandhi is so secretive and defensive that he won't respond to
the most basic queries about his studies abroad, his time working for a management
consultancy in London, or what he hopes to do as a politician.
Don't worry about producing a fully valid HTML document with headers and a <body> tag,
just wrap each entity with <a href="..."> and </a>. Your browser will load it fine.
5. Pick five random news stories and enrich them. First pick a news site with many stories
on the home page. Then generate five random numbers from 1 to the number of stories on the
page. Cut and paste the text of each article into a separate file, and save as plain text (no
HTML, no formatting.)
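One way to generate the five random numbers, sketched in Python 3. The function name pick_story_numbers is made up, and sampling without replacement (so the same story is never picked twice) is my assumption about what is wanted.

```python
import random

def pick_story_numbers(num_stories, k=5, seed=None):
    """Return k distinct story numbers between 1 and num_stories inclusive."""
    rng = random.Random(seed)  # pass a seed to make the choice repeatable
    return sorted(rng.sample(range(1, num_stories + 1), k))

# Example: a home page with 40 stories (a made-up count).
print(pick_story_numbers(40))
```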
6. Read the enriched documents and count to see how well OpenCalais did. You need to
read each output document very carefully and count three things:
Entity references. Count each time the name of a person, place, or organization
appears, or another reference to one of these things (e.g. "the president").
Detected references. How many of these references did OpenCalais find?
Correct references. How many of the links go to the right page? Did our hyperlinking
strategy (OpenCalais RDF pages where possible, Wikipedia when not) fail to correctly
disambiguate any of the references, or, even worse, disambiguate any to the wrong object?
Also, a broken link counts as an incorrect reference.
7. Turn in your work. Please turn in:
Your code
The enriched output from your documents
A brief report describing your results.
The report should include a table of the three numbers (references, detected, correct) for
each document, plus the totals of these three numbers across all documents. Also report on
any patterns in the failures that you see. Where is OpenCalais most accurate? Where is it
least accurate? Are there predictable patterns to the errors?
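The table of counts could be tallied with something like the Python 3 sketch below. The counts are made up purely for illustration (and only three documents are shown to keep it short), and the two ratios printed at the end are optional extras that the assignment does not ask for.

```python
# Made-up counts for illustration only: (document, references, detected, correct)
rows = [
    ("article1.txt", 40, 30, 24),
    ("article2.txt", 25, 20, 18),
    ("article3.txt", 35, 21, 15),
]

def totals(rows):
    """Sum each of the three count columns across all documents."""
    return tuple(sum(row[i] for row in rows) for i in (1, 2, 3))

def format_table(rows):
    """Return the per-document counts plus a totals line as plain text."""
    fmt = "%-14s %10s %10s %10s"
    lines = [fmt % ("document", "references", "detected", "correct")]
    for row in rows:
        lines.append(fmt % row)
    lines.append(fmt % (("total",) + totals(rows)))
    return "\n".join(lines)

refs, detected, correct = totals(rows)
print(format_table(rows))
# Optional extras, not required by the assignment:
print("share of references detected: %.2f" % (detected / refs))
print("share of detected links correct: %.2f" % (correct / detected))
```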
This assignment is due before class on Monday, February 4.
LECTURE 5: SOCIAL NETWORK ANALYSIS
JANUARY 29, 2013
Network analysis (aka social network analysis, link analysis) is a promising and popular
technique for uncovering relationships between diverse individuals and organizations. It is
widely used in intelligence and law enforcement, but not so much in journalism. We'll look at
basic techniques and algorithms and try to understand the promise, and the many practical
problems.
Topics: What's a social network? Link analysis. Homophily and structural determinants of
behavior. Centrality measurements. Community detection and the modularity algorithm.
K-core decomposition. SNA in journalism. SNA that could be in journalism.
Slides (PDF)
Readings
Analyzing the Data Behind Skin and Bone, ICIJ
Identifying the Community Power Structure, an old handbook for community development
workers about figuring out who is influential by very manual processes.
Centrality and Network Flow, Borgatti
Recommended
Visualizing Communities, Jonathan Stray
The network of global corporate control, Vitali et al.
The Dynamics of Protest Recruitment through an Online Network, Sandra González-Bailón et al.
Sections I and II of Community Detection in Graphs, Fortunato
Exploring Enron, Jeffrey Heer
Examples:
Galleon's Web, Wall Street Journal
Muckety
Theyrule.net
Who Runs Hong Kong?, South China Morning Post
Assignment: Social network analysis. Compare different centrality metrics in Gephi.
ASSIGNMENT 3: SOCIAL NETWORK ANALYSIS
JANUARY 29, 2013
For this assignment you will analyze a social network using three different centrality
algorithms, and compare the results.
1. Download and install Gephi, a free graph analysis package. It is open source and runs on
any OS.
2. Download the data file lesmis.gml from the UCI Network Data Repository. This is a
network extracted from the famous French novel Les Miserables; you may also be familiar
with the musical and the recent movie. Each node is a character, and there is an edge between
two characters if they appear in the same chapter. Les Miserables is written in over 300 short
chapters, so two characters that appear in the same chapter are very likely to meet or talk in
the plot of the book. Actually, the edges are weighted, and the weight is the number of
chapters those characters appear together in.
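To make the edge weights concrete, here is a small Python 3 sketch of how such a co-occurrence network could be built from chapter casts. The chapter lists are an invented toy example, not the real lesmis.gml data, and the function name cooccurrence_edges is made up.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(chapters):
    """Count, for each pair of characters, how many chapters they share.

    chapters is a list of character lists, one per chapter. The result maps
    an alphabetically ordered (a, b) pair to its edge weight.
    """
    weights = Counter()
    for cast in chapters:
        # set() so a character listed twice in a chapter counts only once
        for a, b in combinations(sorted(set(cast)), 2):
            weights[(a, b)] += 1
    return weights

# Invented toy data: three chapters and who appears in each.
chapters = [
    ["Valjean", "Javert"],
    ["Valjean", "Cosette"],
    ["Valjean", "Javert", "Cosette"],
]
for (a, b), weight in sorted(cooccurrence_edges(chapters).items()):
    print(a, b, weight)
```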
3. Open this file in Gephi, by choosing File->Open. When the dialog box comes up, set the
Graph Type to Undirected. The graph will be plotted. What do you see? Can you
discern any patterns?
4. Now arrange the nodes in a nicer way, by choosing the Force Atlas 2 layout algorithm
from the Layout menu at left and pressing the Run button. When things settle down, hit the
Stop button. The graph will be arranged nicely, but it will be quite small. You can zoom in
using the mouse wheel (or two fingers on the trackpad on a Mac) and pan using the right
mouse button.
5. Select the Edit tool from the bottom of the toolbar on the left. It looks like a mouse
pointer with a question mark next to it.
6. Now you can click on any node to see its label, which is the name of the character it
represents. This information will appear in the Edit menu in the upper left. Here's the
information for the character Gavroche.
Click around the various nodes in the graph. Which characters have been given the most
central locations? If you are familiar with the story of Les Miserables, how does this
correspond to the plot? Are the most central nodes the most important characters?
7. Make Gephi color nodes by degree. Choose the Ranking tab from the panel at the upper left,
then select the Nodes tab, then Degree from the drop-down menu. Press the Apply
button.
Now the nodes with the highest degree will be darker. Do these high degree nodes correspond
to the nodes that the layout algorithm put in the center? Are they the main characters in the
story?
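For reference, degree is simply the number of distinct neighbors a node has. A Python 3 sketch on an invented toy edge list (not the real lesmis.gml data; the function name degree is made up):

```python
from collections import defaultdict

def degree(edges):
    """Return each node's degree in an undirected graph given as (a, b) pairs."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {node: len(others) for node, others in neighbors.items()}

# Invented toy edges for illustration.
edges = [
    ("Valjean", "Javert"),
    ("Valjean", "Cosette"),
    ("Valjean", "Marius"),
    ("Cosette", "Marius"),
]
for node, d in sorted(degree(edges).items(), key=lambda item: -item[1]):
    print(node, d)
```

This ignores the edge weights; a weighted variant would sum the weights instead of counting neighbors.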
8. Now make Gephi compute betweenness and closeness centrality by pressing the Run
button for the Network Diameter option under Network Overview at the right of the
screen.
You will get a report with some graphs. Just click Close. Now betweenness and closeness
centrality will appear in the drop-down under Ranking, in the same place where you
selected degree centrality earlier, and you can assign colors based on either run by clicking
the Apply button.
Also, the numerical values for betweenness centrality and closeness centrality will now
appear in the Edit window for each node.
Select Betweenness Centrality from the drop-down menu and hit Apply. What do you
see? Which characters are marked as important? How does it differ from the characters which
are marked as important by degree?
Now select Closeness Centrality and hit Apply. (Note that this metric uses a scale which
is the reverse of the others: closeness measures average distance to all other nodes, so small
values indicate more central nodes. You may want to swap the black and white endpoints of
the color scale to get something which is comparable to the other visualizations.) How does
closeness centrality differ from betweenness centrality and degree? Which characters differ
between closeness and the other metrics?
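To see how the three metrics can disagree, here is a Python 3 sketch on an invented toy graph: two triangles joined through a bridge node D. Closeness is computed as the average distance to all other nodes, as described above (some libraries, such as NetworkX, report the reciprocal instead), and betweenness uses Brandes' algorithm for unweighted graphs. This is not Gephi's code, and the function names are made up. It assumes the graph is connected.

```python
from collections import deque

def average_distance(adj, source):
    """Closeness as average shortest-path distance (small = more central)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    others = [d for node, d in dist.items() if node != source]
    return sum(others) / len(others)

def betweenness(adj):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, dist = [], {s: 0}
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:  # accumulate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each pair was counted twice

# Invented toy graph: triangles A-B-C and E-F-G joined through bridge D.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("E", "F"), ("E", "G"), ("F", "G")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

bc = betweenness(adj)
closeness = {v: average_distance(adj, v) for v in adj}
for v in sorted(adj):
    print(v, "degree", len(adj[v]), "betweenness", bc[v],
          "avg distance %.2f" % closeness[v])
```

In this toy graph D has a lower degree than C or E, yet it comes out as the most central node by both betweenness and closeness, because every path between the two triangles runs through it.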
9. Turn in: your answers to the questions in steps 3, 6, 7 and 8, plus screenshots for the graph
plotted with degree, betweenness centrality, and closeness centrality. (To take a screenshot:
on Windows, use the Snipping Tool. On Mac, press Cmd + Shift + 4. If you're on
Linux, you get to tell me.)
What I am interested in here is how the values computed by the different algorithms
correspond to the plot of Les Miserables (if you are familiar with it), and how they compare to
each other. Telling me that "Jean Valjean has a closeness centrality of X" is not a
high-enough-level interpretation; you couldn't publish that in a finished story, because your
readers won't know what that means.
Due: before class on Friday, 1 February.
LECTURE 4: SOCIAL AND HYBRID FILTERS
JANUARY 27, 2013
It's possible to build powerful filtering systems by combining software and people,
incorporating both algorithmic content analysis and human actions such as follow, share, and
like. We'll look at recommendation systems, the Facebook news feed, and the socially-driven
algorithms behind them. We'll finish by looking at an example of using human preferences to
drive machine learning algorithms: Google Web search.
Topics: Social filtering. The network structure of Twitter. Social software. Comment ranking
on Reddit. Confidence sorting. User-item recommendation and collaborative filtering. Hybrid
filters. What makes a good filter?
Slides (PDF)
Readings
Finding and Assessing Social Information Sources in the Context of Journalism, Nick
Diakopoulos et al.
Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et al.
How Reddit Ranking Algorithms Work, Amir Salihefendic
Recommended
Google News Personalization: Scalable Online Collaborative Filtering, Das et al.
Slashdot Moderation, Rob Malda
What is Twitter, a Social Network or a News Media?, Haewoon Kwak et al.
The Netflix Prize, Wikipedia
How does Google use human raters in web search?, Matt Cutts
Assignment: Hybrid filter design. Design a filtering algorithm for status updates.
ASSIGNMENT 2: FILTER DESIGN
JANUARY 25, 2013
For this assignment you will design a hybrid filtering algorithm. You will not implement it,
but you will explain your design criteria and provide a filtering algorithm in sufficient
technical detail to convince me that it might actually work, including pseudocode.
1. Decide who your users are. Journalists? Professionals? General consumers? Someone else?
2. Decide what you will filter. You can choose:
Facebook status updates, like the Facebook news feed
Weibos, like Weiboscope
Tweets, like Trending Topics or the many Tweet discovery tools
The whole web, like Prismatic
something else, but ask me first
3. List all the information that you have available as input to your algorithm. If you
want to filter Facebook or Twitter or Weibos, you may pretend that you are the company
running the service, and have access to all posts and user data from every user. You may
also assume you have a web crawler or a firehose of every RSS feed or whatever you like, but
you must be specific and realistic about what data you are operating with.
4. Argue for the design factors that you would like to influence the filtering, in terms of what
is desirable to the user, what is desirable to the publisher (e.g. Facebook or Prismatic), and
what is desirable socially. Explain as concretely as possible how each of these (probably
conflicting) goals might be achieved in software. Since this is a hybrid filter, you can
also design social software that asks the user for certain types of information (e.g. likes, votes,
ratings) or encourages users to act in certain ways (e.g. following) that generate data for you.
5. Write pseudo-code for a function that produces a top stories list. This function will be
called whenever the user loads your page or opens your app, so it must be fast and frequently
updated. You can assume that there are background processes operating on your servers if you
like. Your pseudo-code does not have to be executable, but it must be specific and
unambiguous, such that a good programmer could actually go and implement it. You can
assume that you have libraries for classic text analysis and machine learning algorithms. So,
you don't have to spell out algorithms like TF-IDF or item-based collaborative filtering, or
anything else you can dig up in the research literature, but simply say how you're going to use
such building blocks. If you use an algorithm we haven't discussed in class, be sure to provide
a reference to it.
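To give a feel for the level of detail expected, here is one possible shape for such a function, sketched in Python 3. Everything in it is hypothetical: the field names (id, published, topic_score, social_score), the weights, and the half-life are invented for illustration, and the two scores are assumed to have been computed already by background processes. It is one design among many, not a model answer.

```python
import time

def top_stories(items, now=None, limit=10, half_life_hours=6.0,
                w_topic=0.6, w_social=0.4):
    """Rank items by a weighted score that decays with age.

    Each item is a dict with hypothetical fields:
      id            unique identifier
      published     publication time in seconds since the epoch
      topic_score   0..1 content match to the user, precomputed offline
      social_score  0..1 signal from follows, shares, likes, precomputed offline
    """
    now = time.time() if now is None else now
    scored = []
    for item in items:
        age_hours = max(0.0, (now - item["published"]) / 3600.0)
        decay = 0.5 ** (age_hours / half_life_hours)  # halves every half-life
        score = (w_topic * item["topic_score"]
                 + w_social * item["social_score"]) * decay
        scored.append((score, item["id"]))
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:limit]]

# Invented example items.
NOW = 100000.0
items = [
    {"id": "a", "published": NOW, "topic_score": 0.2, "social_score": 0.2},
    {"id": "b", "published": NOW - 6 * 3600, "topic_score": 0.9, "social_score": 0.9},
    {"id": "c", "published": NOW - 24 * 3600, "topic_score": 1.0, "social_score": 1.0},
]
print(top_stories(items, now=NOW))
```

Because the expensive work is assumed to happen offline, the function itself only does a weighted sum and a sort, which keeps it fast enough to run on every page load.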
6. Write up steps 1-5. The result should be no more than three pages. However, you must
be specific and plausible. You must be clear about what you are trying to accomplish, what
your algorithm is, and why you believe your algorithm meets your design goals (though of
course it's impossible to know for sure without testing; but I want something that looks good
enough to be worth trying.)
The assignment is due before class on Tuesday, January 29.
LECTURE 3: ALGORITHMIC FILTERS
JANUARY 23, 2013
This class we begin our study of filtering with some basic ideas about its role in journalism.
There's just way too much information produced every day, more than any one person can
read by a factor of millions. We need software to help us deal with this flood. In this lecture,
we discuss purely algorithmic approaches to filtering, with a look at how the Newsblaster
system works (similar to Google News.)
Topics:How bad information overload actually is. The Newsblaster system, a precursor to
Google News. Clustering together stories on the same event. Sorting stories into topics.
Personalization. The filter bubble, and the filter design problem.
Slides (PDF)
Readings
Who should see what when? Three design principles for personalized news, Jonathan Stray
Tracking and summarizing news on a daily basis with Columbia Newsblaster, McKeown
et al.
Recommended
Are we stuck in filter bubbles? Here are five potential paths out, Jonathan Stray
Guess what? Automated news doesn't quite work, Gabe Rivera
The Hermeneutics of Screwing Around, or What You Do With a Million Books, Stephen
Ramsay
Can an algorithm be wrong?, Tarleton Gillespie
LECTURE 2: TEXT ANALYSIS
JANUARY 20, 2013
Can we use machines to help us understand text? In this class we will cover basic text analysis
techniques, from word counting to topic modeling. The algorithms we will discuss this class
are used in just about everything: search engines, document set visualization, figuring out
when two different articles are about the same story, finding trending topics. The vector space
document model is fundamental to algorithmic handling of news content, and we will need it
to understand how just about every filtering and personalization system works.
Topics: Telling stories from quantitative analysis of language, word frequencies, the bag-of-words
document vector model, cosine distance, TF-IDF, and a demonstration of the Overview
document set mining tool.
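As a rough illustration of the bag-of-words model and cosine distance named in the topics above, here is a minimal Python 3 sketch. The three sample sentences and the function names `bag_of_words` and `cosine_similarity` are invented for this example and do not come from the lecture materials.

```python
import math
import string
from collections import Counter

def bag_of_words(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, and count words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Invented sample texts: two about the same story, one unrelated.
doc1 = bag_of_words("The senate passed the budget bill.")
doc2 = bag_of_words("Budget bill passed by the senate!")
doc3 = bag_of_words("Heavy rain floods the city centre.")

sim_same_story = cosine_similarity(doc1, doc2)
sim_other_story = cosine_similarity(doc1, doc3)
print(sim_same_story, sim_other_story)
```

Cosine distance is one minus this similarity, so the two articles about the same story end up closer together than the unrelated pair. Word order is ignored entirely, which is the defining simplification of the bag-of-words model.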
Slides (PDF)
Readings
Online Natural Language Processing Course, Stanford University
Week 7: Information Retrieval, Term-Document Incidence Matrix
Week 7: Ranked Information Retrieval, Introducing Ranked Retrieval
Week 7: Ranked Information Retrieval, Term Frequency Weighting
Week 7: Ranked Information Retrieval, Inverse Document Frequency Weighting
Week 7: Ranked Information Retrieval, TF-IDF weighting
Recommended
Probabilistic Topic Models, David M. Blei
General purpose computer-assisted clustering and conceptualization, Justin Grimmer, Gary
King
A full-text visualization of the Iraq war logs, Jonathan Stray
Introduction to Information Retrieval Chapter 6, Scoring, Term Weighting, and The Vector
Space Model, Manning, Raghavan, and Schütze.
Examples
Watchwords: Reading China Through its Party Vocabulary, Qian Gang
Message Machine, ProPublica
Assignment: TF-IDF. Analyze the topics of the U.S. State of the Union addresses over the
decades.
LECTURE 1: BASICS
JANUARY 20, 2013
We'll try to define computational journalism as the application of computer science to four
different areas: data-driven reporting, story presentation, information filtering, and effect
tracking. But first we have to figure out how to represent the outside world as data. We do this
using the feature vector representation. One of the most useful things we can do with such
vectors is compute the distances between two of them. We can also visualize the entire vector
space, but to do this we have to project the high-dimensional space down to the two
dimensions of the screen.
Topics: The definition of computational journalism, encoding the world as feature vectors,
distance metrics, clustering algorithms, and visualization using multi-dimensional scaling.
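To make the idea of measuring distance between feature vectors concrete, here is a small Python 3 sketch. The voting records and the encoding (+1 for yes, -1 for no, 0 for absent) are invented for illustration; Euclidean distance is only one of several possible metrics, and nothing here is taken from the lecture slides.

```python
import math

# Hypothetical voting records: one feature per bill
# (+1 = voted yes, -1 = voted no, 0 = absent).
votes = {
    "member_a": [1, 1, -1, 0, 1],
    "member_b": [1, 1, -1, 1, 1],
    "member_c": [-1, -1, 1, 0, -1],
}

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

dist_ab = euclidean_distance(votes["member_a"], votes["member_b"])
dist_ac = euclidean_distance(votes["member_a"], votes["member_c"])
print(dist_ab, dist_ac)
```

Members who vote alike end up a short distance apart, and it is these pairwise distances that clustering algorithms and multi-dimensional scaling take as input.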
Slides (PDF)
Readings
Computational Journalism, Cohen, Turner, Hamilton
Sections 1 and 2 of The Challenges of Clustering High Dimensional Data, Steinbach,
Ertöz, Kumar
Recommended
What should the digital public sphere do?, Jonathan Stray
Precision Journalism, Ch. 1, Journalism and the Scientific Tradition, Philip Meyer
Using clustering to analyze the voting blocs in the UK House of Lords, Jonathan Stray
Examples
The Jobless rate for People Like You, New York Times
Dollars for Docs, ProPublica
What did private security contractors do in Iraq and document mining methodology,
Jonathan Stray
The network of global corporate control, Vitali et al.
GOP 5 make strange bedfellows in budget fight, Chase Davis, California Watch
ASSIGNMENT 1: TF-IDF
JANUARY 18, 2013
Update: Henry Williams has kindly made available his code for the solution
(https://github.com/digitalhen/speechAnalysis/) to this assignment.
In this assignment you will implement the TF-IDF formula and use it to study the topics in
State of the Union speeches given every year by the U.S. president.
1. Download the source data file state-of-the-union.csv
(http://jonathanstray.com/papers/state-of-the-union.csv). This is a standard CSV file with one
speech per row. There are two columns: the year of the speech, and the text of the speech.
You will write a Python program that reads this file and turns it into TF-IDF document
vectors, then prints out some information. Here is how to read a CSV in Python:
http://docs.python.org/2/library/csv.html
2. Tokenize the text of each speech, to turn it into a list of words. As we discussed in class, we're
going to tokenize using a simple scheme:
convert all characters to lowercase
remove all punctuation characters
split the string on spaces
3. Compute a TF (term frequency) vector for each document. This is simply how many times
each word appears in that document. You should end up with a Python dictionary from terms
(strings) to term counts (numbers) for each document.
4. Count how many documents each word appears in. This can be done after computing
the TF vector for each document, by incrementing the document count of each word that
appears in the TF vector. After reading all documents you should now have a dictionary from
each term to the number of documents that term appears in.
5. Turn the final document counts into IDF (inverse document frequency) weights by
applying the formula IDF(term) = log(total number of documents / number of documents that
term appears in).
6. Now multiply the TF vectors for each document by the IDF weights for each term, to
produce TF-IDF vectors for each document. Then normalize each vector, so the sum of
squared weights is 1.
7. Congratulations! You have a set of TF-IDF vectors for this corpus. Now it's time to see
what they say. Take the speech you were assigned in class, and print out the highest weighted
20 terms, along with their weights. What do you think this particular speech is about? Write
your answer in at most 200 words.
8. Your task now is to see if you can understand how the topics changed since 1900. For each
decade since 1900, do the following:
sum all of the TF-IDF vectors for all speeches in that decade
print out the top 20 terms in the summed vector, and their weights
Now take a look at the terms for each decade. What patterns do you see? Can you connect the
terms to major historical events? (wars, the great depression, assassinations, the civil rights
movement, Watergate) Write up what you see in narrative form, no more than 500 words,
referring to the terms for each decade.
9. Hand in:
your code
the printout and analysis from step 7
the printout and narrative from step 8.
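The computational steps above (2 through 8) can be sketched roughly as follows in Python 3. This is one possible reading of the assignment, not the reference solution: the function names are hypothetical, and the three tiny "speeches" are invented stand-ins so that the sketch runs without downloading the CSV file. In a real run the (year, text) rows would come from state-of-the-union.csv via the csv module.

```python
import math
import string
from collections import Counter, defaultdict

def tokenize(text):
    """Step 2: lowercase, remove ASCII punctuation, split on whitespace."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

def tf_vector(tokens):
    """Step 3: dictionary from term to raw count in one document."""
    return dict(Counter(tokens))

def document_counts(tf_vectors):
    """Step 4: number of documents each term appears in."""
    df = Counter()
    for tf in tf_vectors:
        for term in tf:
            df[term] += 1
    return df

def idf_weights(df, n_docs):
    """Step 5: IDF(term) = log(n_docs / document count of term)."""
    return {term: math.log(n_docs / count) for term, count in df.items()}

def tfidf_vector(tf, idf):
    """Step 6: multiply TF by IDF, then scale so squared weights sum to 1."""
    weights = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights  # every term occurs in every document
    return {term: w / norm for term, w in weights.items()}

def top_terms(vector, n=20):
    """Steps 7 and 8: the n highest-weighted (term, weight) pairs."""
    return sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:n]

def sum_by_decade(years, vectors):
    """Step 8: add up the TF-IDF vectors of all speeches in each decade."""
    totals = defaultdict(Counter)
    for year, vec in zip(years, vectors):
        decade = (year // 10) * 10
        for term, weight in vec.items():
            totals[decade][term] += weight
    return totals

# Invented stand-in rows (year, text); not real speech excerpts.
rows = [
    (1941, "War and defense. The war demands sacrifice."),
    (1942, "The war effort grows; production grows."),
    (1965, "Civil rights and the vote. Rights for all."),
]
years = [year for year, _ in rows]
tfs = [tf_vector(tokenize(text)) for _, text in rows]
idf = idf_weights(document_counts(tfs), len(tfs))
vectors = [tfidf_vector(tf, idf) for tf in tfs]
decades = sum_by_decade(years, vectors)
for decade in sorted(decades):
    print(decade, top_terms(decades[decade], n=3))
```

Note that a word appearing in every document, such as "the" in this toy corpus, gets an IDF of log(1) = 0 and drops out of the top terms, which is why TF-IDF tends to surface distinctive vocabulary rather than common words.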
SYLLABUS
This class will cover, in great detail, some of the most advanced techniques used by
journalists to understand digital information, and communicate it to users. We will focus on
unstructured text information in large quantities, and also cover related topics such as how to
draw conclusions from data without fooling yourself, social network analysis, and online
security for journalists. These are the algorithms used by search engines and intelligence
agencies and everyone in between.
Due to our short schedule (eight classes over three weeks), this will be an intense course.
You will be given a homework assignment every class, which should take you 3-6 hours to
complete. About half of the assignments will involve some programming in Python. This
course will be quite technical; it is, after all, a course about applying computer science to
journalism. Aside from being able to program, I assume you know basic computer science
theory, and mathematics up to linear algebra. However, the assignments will also require you
to explain, in plain English, what the algorithmic result means in journalism terms. The code
will not be enough.
Please note that the JMSC is also offering a more accessible data journalism course in May,
taught by Irene Jay Liu. You may find that course a better fit if you do not have programming
experience. If you are not taking this course for credit you are welcome to sit in on the
lectures, but I will not mark your assignments.
You will be assigned readings to study before each lecture. These will typically be research
papers. There are also recommended readings that will tell you much more about the topics
we cover, and examples of stories that use these techniques.
The course will be graded as follows:
Assignments: 60%, weighted equally
Class participation: 10%
Final project: 30%
Lecture 1.Basics
We'll try to define computational journalism as the application of computer science to four
different areas: data-driven reporting, story presentation, information filtering, and effect
tracking. But first we have to figure out how to represent the outside world as data. We do this
using the feature vector representation. One of the most useful things we can do with such
vectors is compute the distances between two of them. We can also visualize the entire vector
space, but to do this we have to project the high-dimensional space down to the two
dimensions of the screen.
Required
Computational Journalism, Cohen, Turner, Hamilton
Sections 1 and 2 of The Challenges of Clustering High Dimensional Data, Steinbach,
Ertöz, Kumar
Recommended
What should the digital public sphere do?, Jonathan Stray
Precision Journalism, Ch. 1, Journalism and the Scientific Tradition, Philip Meyer
Using clustering to analyze the voting blocs in the UK House of Lords, Jonathan Stray
Examples
The Jobless rate for People Like