Computational Journalism Hong Kong


  • 8/9/2019 Computational Journalism Hong Kong


    FEATURED

    INTRODUCTION: COMPUTER SCIENCE AND JOURNALISM

    FEBRUARY 14, 2013

    Maybe it's not obvious that computer science and journalism go together, but they do!

    Computational journalism combines classic journalistic values of storytelling and public

    accountability with techniques from computer science, statistics, the social sciences, and the

    digital humanities.

    This course, given at the University of Hong Kong during January-February 2013, is an

    advanced look at how techniques from visualization, natural language processing, social

    network analysis, statistics, and cryptography apply to four different areas of journalism:

    finding stories through data mining, communicating what you've learned, filtering an overwhelming volume of information, and tracking the spread of information and effects.

    The course assumes knowledge of computer science, including standard algorithms and linear

    algebra. Several of the assignments require students to write Python code at an intermediate

    level. But this introductory video, which explains the topics covered, is for everyone.

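    Slides here (http://www.slideshare.net/jonathanstray1/computer-science-and-journalism-two-great-tastes-that-taste-great-together). For more, see the syllabus (http://courses.jmsc.hku.hk/jmsc6041spring2013/syllabus/), or jump directly to a lecture: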

    1. Basics. Feature vectors, clustering, projections.

    2. Text analysis. Tokenization, TF-IDF, topic modeling.

    3. Algorithmic filters. Information overload. Newsblaster and Google News.

    4. Hybrid filters. Social networks as filters. Collaborative filtering.

    5. Social network analysis. Using it in journalism. Centrality algorithms.

    6. Knowledge representation. Structured data. Linked open data. General Q&A.

    7. Drawing conclusions. Randomness. Competing hypotheses. Causation.

    8. Security, surveillance, and privacy. Cryptography. Threat modeling.

    LECTURES

    LECTURE 8: SECURITY, SURVEILLANCE, AND PRIVACY

    FEBRUARY 13, 2013

    Who is watching our online activities? How do you protect a source in the 21st Century? Who

    gets access to all of this mass intelligence, and what does the ability to survey everything

    all the time mean both practically and ethically for journalism? In this lecture we will talk

    about who is watching and how, and how to create a security plan using threat modeling.

    Topics: How is email transmitted? Who has access to your emails? Mass surveillance and its

    legal status. How cryptography works. Encryption versus authentication. Man-in-the-middle

    attacks. Secure communications using OTR. Case study: the leaked Wikileaks cables. Threat

    modeling. Security planning.
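
    To make the "encryption versus authentication" topic concrete, here is a minimal sketch using only Python's standard library. It shows authentication on its own: an HMAC tag lets the receiver detect tampering, but the message still travels in the clear. The key and messages are invented for illustration, and this is not a substitute for a vetted tool such as OTR.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> bytes:
    """Return an HMAC-SHA256 tag that authenticates the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(key, message), tag)


# Hypothetical shared secret and messages, for illustration only.
shared_key = b"hypothetical key agreed in person"
message = b"Meet at the usual place at 9"
tag = sign(shared_key, message)

# Anyone on the wire can still read the message: this is not encryption.
# But a man in the middle who alters it cannot forge a matching tag.
tampered = b"Meet at the usual place at 7"
print(verify(shared_key, message, tag))
print(verify(shared_key, tampered, tag))
```

    Hiding the content as well would require encryption on top of this, which is why secure messaging protocols combine both.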


    Slides

    Readings

    Chris Soghoian, Why secrets aren't safe with journalists, New York Times 2011

    Hearst New Media Lecture 2012, Rebecca MacKinnon

    Recommended

    CPJ journalist security guide section 3, Information Security

    Global Internet Filtering Map, Open Net Initiative

    The NSA is building the country's biggest spy center, James Bamford, Wired

    Cryptographic security

    Unplugged: The Show part 9: Public Key Cryptography

    Diffie-Hellman key exchange, ArtOfTheProblem

    Anonymity

    Tor Project Overview

    Who is harmed by a real-names policy, Geek Feminism

    Assignment: Threat modeling and security planning. Use threat modeling to come up with a security plan for a given scenario.

    ASSIGNMENT 6: THREAT MODELING AND SECURITY PLANNING

    FEBRUARY 8, 2013

    For this assignment, each of you will pick one of the four reporting scenarios below and

    design a security plan. More specifically, you will flesh out the scenario, create a threat

    model, come up with a plausible security plan, and analyze the weaknesses of your plan.

    Start by creating a threat model, which must consider:

    What must be kept private? Specify all of the information that must be secret, including

    notes, documents, files, locations, and identities, and possibly even the fact that

    someone is working on a story.

    Who is the adversary and what do they want to know? It may be a single person, or an

    entire organization or state, or multiple entities. They may be very interested in certain

    types of information, e.g. identities, and uninterested in others. List each adversary and their interests.

    What can they do to find out? List every way they could try to find out what you want

    secret, including technical, legal, and social methods.

    What is the risk? Explain what happens if an adversary succeeds in breaking your security.

    What are the consequences, and to whom? Which of these is it absolutely necessary to

    avoid?

    Once you have specified your threat model, you are ready to design your security plan.

    The threat model describes the risk, and the goal of the security plan is to reduce that risk as

    much as possible.
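
    One way to keep the four threat-model questions in front of you is to write them down as a small data structure and check which ones are still blank. The sketch below is only an illustration: the class and field names are hypothetical, and the example draft is loosely based on the bank scenario in this assignment.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Adversary:
    name: str
    interests: List[str] = field(default_factory=list)  # what they want to know
    methods: List[str] = field(default_factory=list)    # technical, legal, social


@dataclass
class ThreatModel:
    secrets: List[str] = field(default_factory=list)
    adversaries: List[Adversary] = field(default_factory=list)
    risks: Dict[str, str] = field(default_factory=dict)  # consequence -> who bears it

    def unanswered(self) -> List[str]:
        """List the threat-model questions that have no answer yet."""
        gaps = []
        if not self.secrets:
            gaps.append("What must be kept private?")
        if not self.adversaries:
            gaps.append("Who is the adversary and what do they want to know?")
        for adv in self.adversaries:
            if not adv.interests:
                gaps.append(f"What does {adv.name} want to know?")
            if not adv.methods:
                gaps.append(f"What can {adv.name} do to find out?")
        if not self.risks:
            gaps.append("What is the risk?")
        return gaps


# A deliberately incomplete draft: no methods and no risks listed yet.
draft = ThreatModel(
    secrets=["identities of the two whistleblowers"],
    adversaries=[Adversary("the bank", interests=["source identities"])],
)
for question in draft.unanswered():
    print(question)
```

    An empty result only means every question has some answer, not that the answers are complete.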


    Your plan must specify appropriate software tools, plus how these tools must be used. Pay

    particular attention to necessary habits: specify who must do what, and in what way, to keep

    the system secure. Explain how you will educate your sources and collaborators in the proper

    use of your chosen tools, and how hard you think it will be to make sure everyone does

    exactly the right thing.

    Also document the weaknesses of your plan. What can still go wrong? What are the critical assumptions that will cause failure if it turns out you have guessed wrong? What is going to

    be difficult or expensive about this plan?

    The scenarios you can choose from are:

    1. You are a photojournalist in Syria with digital images you want to get out of the country.

    Limited internet access is available at a cafe. Some of the images may identify people

    working with the rebels who could be targeted by the government if their identity is revealed.

    In addition you would like to remain anonymous until the photographs are published, so that

    you can continue to work inside the country for a little longer, and leave without difficulty.

    2. You are working on an investigative story about the CIA conducting operations in the U.S.,

    in possible violation of the law. You have sources inside the CIA who would like to remain

    anonymous. You will occasionally meet with these sources in person but mostly communicate

    electronically. You would like to keep the story secret until it is published, to avoid pre-

    emptive legal challenges to publication.

    3. You are reporting on insider trading at a large bank, and talking secretly to two

    whistleblowers. If these sources are identified before the story comes out, at the very least you

    will lose your sources, but there might also be more serious repercussions: they could lose their jobs, or the bank could attempt to sue. This story involves a large volume of proprietary

    data and documents which must be analyzed.

    4. You are working in Europe, assisting a Chinese human rights activist. The activist is

    working inside China with other activists, but so far the Chinese government does not know

    they are an activist and they would like to keep it this way. You have met the activist once

    before, in person, and have a phone number for them, but need to set up a secure

    communications channel.

    These scenario descriptions are incomplete. Please feel free to expand them, making any

    reasonable assumptions about the environment or the story, though you must document

    your assumptions, and you can't assume that you have unrealistic resources or that your

    adversary is incompetent.


    LECTURE 7: DRAWING CONCLUSIONS FROM DATA


    FEBRUARY 5, 2013

    You've loaded up all the data. You've run the algorithms. You've completed your analysis.

    But how do you know that you are right? It's incredibly easy to fool yourself, but fortunately,

    there is a long history of fields grappling with the problem of determining truth in the face of

    uncertainty, from statistics to intelligence analysis.

    Topics: What does randomness look like? Variation from rolling dice. Base rate

    fallacy. Conditional probability. Bayes' theorem. Cognitive biases. Method of competing

    hypotheses. Probabilistic scoring of hypotheses. Correlation and causation. Finding alternate

    hypotheses for the NYPD stop and frisk data.
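
    The base rate fallacy and Bayes' theorem are easiest to see with numbers. The sketch below uses a made-up screening example (the rates are hypothetical and do not come from the lecture): even a filter that catches 99% of true cases produces mostly false alarms when true cases are rare.

```python
def posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Bayes' theorem: P(H | E) from the prior P(H) and the two likelihoods."""
    numerator = p_e_given_h * prior
    evidence = numerator + p_e_given_not_h * (1.0 - prior)
    return numerator / evidence


# Hypothetical numbers: 1 record in 1,000 is fraudulent (the base rate),
# the filter flags 99% of fraudulent records and 5% of legitimate ones.
p_fraud_given_flag = posterior(prior=0.001, p_e_given_h=0.99, p_e_given_not_h=0.05)
print(f"{p_fraud_given_flag:.3f}")  # prints 0.019
```

    Ignoring the 1-in-1,000 prior and reading the 99% detection rate as "a flagged record is 99% likely to be fraud" is exactly the base rate fallacy.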

    Slides

    Readings

    Correlation and causation, Business Insider

    The Psychology of Intelligence Analysis, chapters 1, 2, 3 and 8. Richards J. Heuer

    Graphical Inference for Infovis, Hadley Wickham et al.

    If correlation doesn't imply causation, then what does?, Michael Nielsen

    Why most published research findings are false, John P. A. Ioannidis

    Assignment: Statistical inference. Analyze international homicide rate vs. gun ownership data.

    ASSIGNMENT 5: STATISTICAL INFERENCE

    FEBRUARY 5, 2013

    For this assignment you will analyze global data on the number of homicides versus the number of guns in each country. I'm giving you the data; your job is to tell me what it

    means. You will interpret a few different plots, and then implement the visual randomization

    procedure from the paper we discussed in class to examine a tricky case more closely.

    The data is from The Guardian Data Blog. I simplified the header names, dropped a few

    unnecessary columns, and added an OECD column.

    1. I've written most of the code you will need for this assignment, available from this GitHub

    repo (https://github.com/jstray/permutation-test). (You can git clone if you like, otherwise just download all files as a zip

    archive from https://github.com/jstray/permutation-test/archive/master.zip.)

    2. We are going to use the R language for this assignment. This is mostly because it has really

    nice built-in charts (doing this in Python is a real pain), but also because you are likely to

    encounter R out in the real world of data journalism. Download and install it (http://cran.rstudio.com/). To start R, enter R on the command line. To run a program, enter source("filename.R") at the R command

    prompt. A full language manual is at http://cran.r-project.org/doc/manuals/R-intro.html. You will only need to use a few basic concepts, such

    as random number generation and for loops.

    3. Plot the data for all countries: homicide rate (per 100,000) versus number of privately

    owned firearms (per 100), by running source("plot-all-countries.R") at the R prompt. What do

    you see? Please report on the general patterns here, the outliers, and what this all might mean.

    4. Now take a look at only the OECD countries, by uncommenting the indicated line in the

    source. Re-run the file. What does the chart show now?

    5. Now plot only the non-OECD countries, by uncommenting the indicated line in the source

    (be sure to re-comment the line that selects only OECD countries). What does the chart show

    now?


    6. It looks like there might be a pattern among the OECD countries, but the United States is

    such an outlier that it's hard to tell. Is this pattern still significant without the US? To find out,

    you're going to apply a randomization test. (We'll also remove Mexico since it's not a

    developed country and thus not really comparable to the other OECD countries.)

    Start with the file randomization-test.R. You need to write the code that performs the actual

    randomization, filling eight of the columns of charts with random permutations of the original y values (homicide rates), but putting the original data in the realchart column. To

    prevent sneak peeks, the code is currently set up to use testing data. When your permutations

    are working right, you should see something like this when you run the file:

    After pressing Enter, the program will tell you which chart has the real (un-permuted) data.

    Here, with fake data, it's obvious. It won't always be.

    7. Now that your program works, try it on the real data by commenting out the two lines that

    generate the fake data. Re-run, and look at the plots carefully. Which one do you think is the

    real data? Write down the number of the chart. Then hit enter, and see if you got it right.

    http://courses.jmsc.hku.hk/jmsc6041spring2013/files/2013/02/Screen-Shot-2013-02-05-at-3.52.58-PM.png

    8. This isn't quite fair, because you were already looking at the data in step 4. So get someone

    else to look at it fresh. Explain to them that you are charting firearms versus homicides and

    that one of the charts is real but the rest are fakes, and ask them to spot the real chart.

    9. Did you guess right? Did your fresh observer guess right? Did you and your observer guess

    differently? If so, why do you think that is? Was it difficult for you to choose? Based on all of

    this, do you think there is a correlation between gun ownership and homicide rate for the OECD countries? If so, how strong is it (effect size) and how strong is the evidence (statistical

    significance)?

    10. What does all this mean? Please write a short journalistic analysis of the

    global relationship between firearms ownership and homicide rate, for a general audience.

    Your editor has asked you to do this analysis and is very interested in whether there is a

    causal relationship (whether more guns cause more crime), so you will have to include

    something about that.

    Turn in: answers to questions in steps 3,4,5,7,8,9, your code, and your final short analysis

    article.
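
    The randomization idea in steps 6 to 9 can be sketched outside of R. The Python below is not the assignment's code and does not use the real data: the function names are hypothetical and the x and y values are fake. It builds a nine-chart lineup with the real y values hidden among eight permutations, then estimates effect size (Pearson correlation) and the strength of the evidence (a permutation p-value).

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient, used here as the effect size."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def make_lineup(ys, n_charts=9, rng=random):
    """Return (charts, real_index): one chart keeps the real y values,
    every other chart gets a random permutation of them."""
    real_index = rng.randrange(n_charts)
    charts = []
    for i in range(n_charts):
        if i == real_index:
            charts.append(list(ys))
        else:
            charts.append(rng.sample(ys, len(ys)))  # shuffled copy
    return charts, real_index


def permutation_p_value(xs, ys, n_perm=2000, rng=random):
    """Fraction of shuffles whose |correlation| is at least the observed one."""
    observed = abs(pearson(xs, ys))
    hits = sum(
        abs(pearson(xs, rng.sample(ys, len(ys)))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)  # add-one keeps the estimate above zero


# Fake, perfectly linear data so the real chart is obvious.
fake_x = list(range(1, 21))
fake_y = [2 * x + 1 for x in fake_x]

rng = random.Random(0)
charts, real_index = make_lineup(fake_y, rng=rng)
print("effect size r =", round(pearson(fake_x, fake_y), 3))
print("p-value =", permutation_p_value(fake_x, fake_y, rng=rng))
```

    If an observer who has never seen the data picks the real chart out of nine, that alone would happen by chance only about one time in nine, which is the visual counterpart of the p-value computed here.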


    LECTURE 6: STRUCTURED JOURNALISM AND KNOWLEDGE REPRESENTATION

    FEBRUARY 1, 2013

    Is journalism in the text/video/audio business, or is it in the knowledge business? In this class we'll look at this question in detail, which gets us deep into the issue of how knowledge is

    represented in a computer. The traditional relational database model is often inappropriate for

    journalistic work, so we're going to concentrate on so-called linked data representations.

    Such representations are widely used and increasingly popular. For example, Google recently

    released the Knowledge Graph. But generating this kind of data from unstructured text is still

    very tricky, as we'll see when we look at the Reverb algorithm.

    Topics: Structured and unstructured data. Article metadata and schema.org. Linked open data

    and RDF. Entity extraction. Propositional representation of knowledge. Extracting structured

    data from unstructured text. The Reverb algorithm. DeepQA. Automatic story writing from

    data.
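
    As a rough illustration of what a linked data representation looks like, the sketch below stores facts as (subject, predicate, object) triples, the basic unit of RDF, and answers simple pattern queries. The "ex:" identifiers and the facts themselves are placeholders made up for this example, not real URIs or a real vocabulary, and a real project would more likely use an RDF library and a query language.

```python
# Each fact is one (subject, predicate, object) triple.
# All "ex:" names are hypothetical placeholders.
triples = [
    ("ex:HKU", "rdf:type", "ex:University"),
    ("ex:HKU", "ex:locatedIn", "ex:HongKong"),
    ("ex:JMSC", "ex:partOf", "ex:HKU"),
    ("ex:HongKong", "rdf:type", "ex:City"),
]


def match(pattern, store):
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in store:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple


# Everything the store says about ex:HKU as a subject.
for s, p, o in match(("ex:HKU", None, None), triples):
    print(s, p, o)

# Everything that has a declared type.
typed = [s for s, _, _ in match((None, "rdf:type", None), triples)]
print(typed)
```

    Unlike a relational table, nothing here fixes the set of columns in advance: a new kind of fact is just another triple with a new predicate.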

    Slides (PDF)

    Readings

    A fundamental way newspaper websites need to change, Adrian Holovaty

    The next web of open, linked data, Tim Berners-Lee TED talk

    Identifying Relations for Open Information Extraction, Fader, Soderland, and Etzioni

    (Reverb algorithm)

    Recommended

    Standards-based journalism in a semantic economy, Xark

    What the semantic web can represent, Tim Berners-Lee

    Building Watson: an overview of the DeepQA project


    Can an algorithm write a better story than a reporter? Wired, 2012.

    Assignment: Entity extraction. Text enrichment experiments using OpenCalais.

    ASSIGNMENT 4: ENTITY EXTRACTION

    FEBRUARY 1, 2013

    For this assignment you will evaluate the performance of OpenCalais, a commercial entity extraction service. You'll do this by building a text enrichment program, which takes plain text and outputs HTML with links to the detected entities. Then you will take five random articles from your data set, enrich them, and manually count how many entities OpenCalais missed or got wrong.

    1. Get an OpenCalais API key, from this page (http://www.opencalais.com/user/register).

    2. Install the python-calais module (http://code.google.com/p/python-calais/). This will allow you to call OpenCalais from Python easily. First, download the latest version of python-calais. To install it, you just need calais.py in your working directory. You will probably also need to install the simplejson Python module (http://pypi.python.org/pypi/simplejson). Download it, then run python setup.py install. You may need to execute this as super-user.

    3. Call OpenCalais from Python. Make sure you can successfully submit text and get the results back, following these steps. The output you want to look at is in the entities array, which would be accessed as result.entities using the variable names in the sample code. In particular you want the list of occurrences for each entity, in the instances field.

    >>> result.entities[0]['instances']

    [{u'suffix': u' is the new President of the United States',
      u'prefix': u'of the United States of America until 2009. ',
      u'detection': u'[of the United States of America until 2009.]Barack Obama[ is the new President of the United States]',
      u'length': 12, u'offset': 75, u'exact': u'Barack Obama'}]

    >>> result.entities[0]['instances'][0]['offset']

    75

    >>>

    Each instance has offset and length fields that indicate where in the input text the entity

    was referenced. You can use these to determine where to place links in the output HTML.
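
    As a minimal sketch of how the offset and length fields could drive link placement, the function below wraps each detected span in an anchor tag. It is written for Python 3, unlike the Python 2 session above. The name link_spans and the (offset, length, url) tuple format are assumptions made for illustration and are not part of python-calais; the sketch also assumes the offsets index directly into the text you pass in.

```python
from html import escape

def link_spans(text, spans):
    """Return HTML for text with each (offset, length, url) span hyperlinked."""
    pieces = []
    pos = 0  # end of the last span written so far
    for offset, length, url in sorted(spans):
        if offset < pos:
            continue  # skip a span that overlaps one already linked
        pieces.append(escape(text[pos:offset]))
        pieces.append('<a href="%s">%s</a>'
                      % (escape(url), escape(text[offset:offset + length])))
        pos = offset + length
    pieces.append(escape(text[pos:]))  # remaining text after the last span
    return "".join(pieces)

sample = "Barack Obama visited Paris."
print(link_spans(sample, [(0, 12, "http://en.wikipedia.org/wiki/Barack_Obama")]))
```

    Working from left to right and slicing the original string keeps the offsets valid, since nothing is inserted into the text that the offsets refer to.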

    4. Read a text file, create hyperlinks, and write it out. Your Python program should read

    text from stdin and write HTML with links on all detected entities to stdout. There are two

    cases to handle, depending on how much information OpenCalais gives back.

    In many cases, like the example in step 3, OpenCalais will not be able to give you any

    information other than the string corresponding to the entity, result.entities[x]['name']. In this

    case you should construct a Wikipedia link by simply appending the name to a Wikipedia

    URL, converting spaces to underscores, e.g.


    http://en.wikipedia.org/wiki/Barack_Obama
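
    The Wikipedia fallback can be sketched in a few lines of Python 3. The helper name wikipedia_url is hypothetical, and the percent-encoding step is an addition beyond what the assignment asks for, included so that names with accented characters still produce a usable URL.

```python
from urllib.parse import quote

WIKIPEDIA_BASE = "http://en.wikipedia.org/wiki/"

def wikipedia_url(name):
    """Build a Wikipedia URL from an entity name, with spaces becoming underscores."""
    title = name.strip().replace(" ", "_")
    # quote() percent-encodes non-ASCII characters as UTF-8 and leaves underscores alone.
    return WIKIPEDIA_BASE + quote(title)

print(wikipedia_url("Barack Obama"))
```

    Note that nothing here checks whether the page exists, which is why some links in the enriched output will be broken.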

    In other cases, especially companies and places, OpenCalais will supply a link to an RDF document that contains more information about the entity. For example:

    >>> result.entities[0]

    {u'_typeReference': u'http://s.opencalais.com/1/type/em/e/Company',
     u'_type': u'Company', u'name': u'Starbucks',
     '__reference': u'http://d.opencalais.com/comphash-1/6b2d9108-7924-3b86-bdba-7410d77d7a79',
     u'instances': [{u'suffix': u' in Paris.',
                     u'prefix': u'of the United States now and likes to drink at ',
                     u'detection': u'[of the United States now and likes to drink at]Starbucks[ in Paris.]',
                     u'length': 9, u'offset': 156, u'exact': u'Starbucks'}],
     u'relevance': 0.314, u'nationality': u'N/A',
     u'resolutions': [{u'name': u'Starbucks Corporation', u'symbol': u'SBUX.OQ',
                       u'score': 1, u'shortname': u'Starbucks', u'ticker': u'SBUX',
                       u'id': u'http://d.opencalais.com/er/company/ralg-tr1r/f8512d2d-f016-3ad0-8084-a405e59139b3'}]}

    >>> result.entities[0]['resolutions'][0]['id']

    u'http://d.opencalais.com/er/company/ralg-tr1r/f8512d2d-f016-3ad0-8084-a405e59139b3'

    >>>

    In this case the resolutions array will contain a hyperlink for each resolved entity, and this is where your link should go. The linked page will contain a series of triples (assertions) about the entity, which you can obtain in machine-readable form by changing the .html at the end of the link to .json. The sameAs: links are particularly important because they tell you that this entity is equivalent to others in DBpedia and elsewhere.
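
    Putting the two cases together, a sketch of the link-selection logic might look like the following (Python 3). The function names entity_url and json_url are hypothetical; the dictionary keys are the ones visible in the sample output above, and the sketch assumes that the first resolution is the one to use.

```python
def entity_url(entity):
    """Pick a link target for one entity dict shaped like the sample output."""
    resolutions = entity.get("resolutions") or []
    if resolutions and resolutions[0].get("id"):
        # Resolved entity: link to the OpenCalais linked-data page.
        return resolutions[0]["id"]
    # Unresolved entity: fall back to a Wikipedia URL built from the name.
    return "http://en.wikipedia.org/wiki/" + entity["name"].replace(" ", "_")

def json_url(page_url):
    """Swap a trailing .html for .json to request the machine-readable form."""
    if page_url.endswith(".html"):
        return page_url[:-len(".html")] + ".json"
    return page_url  # leave any other URL unchanged

print(entity_url({"name": "Barack Obama"}))
```

    Keeping the choice in one function makes it easy to count, later on, how many links came from each strategy.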

    Here is more on OpenCalais entity disambiguation and use of linked data.

    The final result should look something like below. Note that some links go to OpenCalais entity pages with RDF links on them (London), some go to Wikipedia (politician), and some are broken links when Wikipedia doesn't have the topic (Aarthi Ramachandran). And of course Mr Gandhi is an entity that was not detected, three times.

    The latest effort to decode Mr Gandhi comes in the form of a limited yet rather well written biography by a political journalist, Aarthi Ramachandran. Her task is a thankless one. Mr Gandhi is an applicant for a big job: ultimately, to lead India. But whereas any other job applicant will at least offer minimal information about his qualifications, work experience, reasons for wanting a post, Mr Gandhi is so secretive and defensive that he won't respond to the most basic queries about his studies abroad, his time working for a management consultancy in London, or what he hopes to do as a politician.

    Don't worry about producing a fully valid HTML document with headers and a <body> tag; just wrap each entity with <a href="..."> and </a>. Your browser will load it fine.


    5. Pick five random news stories and enrich them. First pick a news site with many stories

    on the home page. Then generate five random numbers from 1 to the number of stories on the

    page. Cut and paste the text of each article into a separate file, and save as plain text (no

    HTML, no formatting.)
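
    One way to make the random choice in Python 3 is sketched below. The function name pick_stories is hypothetical, and sampling without replacement is an assumption, made so that the five numbers point at five different articles.

```python
import random

def pick_stories(num_stories, k=5, seed=None):
    """Return k distinct story positions between 1 and num_stories inclusive."""
    rng = random.Random(seed)  # pass a seed to make the choice repeatable
    return sorted(rng.sample(range(1, num_stories + 1), k))

# For example, if the home page lists 40 stories:
print(pick_stories(40))
```

    Recording the seed along with the chosen numbers lets someone else check that the articles were not hand-picked.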

    6. Read the enriched documents and count to see how well OpenCalais did. You need to read each output document very carefully and count three things:

    Entity references. Count each time the name of a person, place, or organization appears, or another reference to these things (e.g. "the president").

    Detected references. How many of these references did OpenCalais find?

    Correct references. How many of the links go to the right page? Did our hyperlinking

    strategy (OpenCalais RDF pages where possible, Wikipedia when not) fail to correctly

    disambiguate any of the references, or, even worse, disambiguate any to the wrong object?

    Also, a broken link counts as an incorrect reference.

    7. Turn in your work. Please turn in:

    Your code

    The enriched output from your documents

    A brief report describing your results.

    The report should include a table of the three numbers (references, detected, correct) for each document, plus the totals of these three numbers across all documents. Also report on any patterns in the failures that you see. Where is OpenCalais most accurate? Where is it least accurate? Are there predictable patterns to the errors?
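
    The table and totals can be produced with a short Python 3 helper like the sketch below. The function name summarize and the two ratios it prints are my own framing rather than something the assignment specifies, and the numbers in the example are placeholders to show the layout, not real results.

```python
def summarize(counts):
    """counts: list of (document, references, detected, correct) tuples."""
    totals = tuple(sum(row[i] for row in counts) for i in (1, 2, 3))
    lines = ["%-12s %10s %10s %10s" % ("document", "references", "detected", "correct")]
    for name, refs, det, cor in counts + [("TOTAL",) + totals]:
        lines.append("%-12s %10d %10d %10d" % (name, refs, det, cor))
    refs, det, cor = totals
    if refs and det:
        # Share of references found, and share of found references linked correctly.
        lines.append("detected / references: %.2f" % (det / refs))
        lines.append("correct / detected:    %.2f" % (cor / det))
    return totals, "\n".join(lines)

# Placeholder numbers only; replace them with your own manual counts.
example = [("article1", 20, 15, 12), ("article2", 10, 9, 9)]
totals, table = summarize(example)
print(table)
```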

    This assignment is due before class on Monday, February 4.

    LECTURE 5: SOCIAL NETWORK ANALYSIS

    JANUARY 29, 2013

    Network analysis (aka social network analysis, link analysis) is a promising and popular technique for uncovering relationships between diverse individuals and organizations. It is widely used in intelligence and law enforcement, but not so much in journalism. We'll look at basic techniques and algorithms and try to understand the promise, and the many practical problems.

    Topics: What's a social network? Link analysis. Homophily and structural determinants of behavior. Centrality measurements. Community detection and the modularity algorithm. K-core decomposition. SNA in journalism. SNA that could be in journalism.

    Slides (PDF)

    Readings

    Analyzing the Data Behind Skin and Bone, ICIJ

    Identifying the Community Power Structure, an old handbook for community development workers about figuring out who is influential by very manual processes.

    Centrality and Network Flow, Borgatti

    Recommended

    Visualizing Communities, Jonathan Stray

    The network of global corporate control, Vitali et al.


    The Dynamics of Protest Recruitment through an Online Network, Sandra González-Bailón, et al.

    Sections I and II of Community Detection in Graphs, Fortunato

    Exploring Enron, Jeffrey Heer

    Examples:

    Galleon's Web, Wall Street Journal

    Muckety

    Theyrule.net

    Who Runs Hong Kong?, South China Morning Post

    Assignment: Social network analysis. Compare different centrality metrics in Gephi.

    ASSIGNMENT 3: SOCIAL NETWORK ANALYSIS

    JANUARY 29, 2013

    For this assignment you will analyze a social network using three different centrality algorithms, and compare the results.

    1. Download and install Gephi (http://gephi.org/), a free graph analysis package. It is open source and runs on any OS.

    2. Download the data file lesmis.gml (http://networkdata.ics.uci.edu/data/lesmis/lesmis.gml) from the UCI Network Data Repository. This is a network extracted from the famous French novel Les Miserables; you may also be familiar with the musical and the recent movie. Each node is a character, and there is an edge between two characters if they appear in the same chapter. Les Miserables is written in over 300 short chapters, so two characters that appear in the same chapter are very likely to meet or talk in the plot of the book. Actually, the edges are weighted, and the weight is the number of chapters those characters appear together in.
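
    To make the edge weights concrete, here is a small Python 3 sketch of how such a co-occurrence network could be built from chapter cast lists. The function name cooccurrence_edges and the three toy chapters are invented for illustration; they are not the actual lesmis.gml data or the method used to create it.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(chapters):
    """Map each unordered character pair to the number of chapters they share."""
    weights = Counter()
    for cast in chapters:
        # sorted(set(...)) gives each pair one canonical order and ignores repeats.
        for a, b in combinations(sorted(set(cast)), 2):
            weights[(a, b)] += 1
    return weights

# Toy cast lists, one list per chapter (made up for this example).
toy_chapters = [
    ["Valjean", "Javert"],
    ["Valjean", "Javert", "Cosette"],
    ["Cosette", "Marius"],
]
for pair, weight in sorted(cooccurrence_edges(toy_chapters).items()):
    print(pair, weight)
```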

    3. Open this file in Gephi, by choosing File->Open. When the dialog box comes up, set the Graph Type to Undirected. The graph will be plotted. What do you see? Can you discern any patterns?

    4. Now arrange the nodes in a nicer way, by choosing the Force Atlas 2 layout algorithm from the Layout menu at left and pressing the Run button. When things settle down, hit the Stop button. The graph will be arranged nicely, but it will be quite small. You can zoom in using the mouse wheel (or two fingers on the trackpad on a Mac) and pan using the right mouse button.

    5. Select the Edit tool from the bottom of the toolbar on the left. It looks like a mouse pointer with a question mark next to it.


    6. Now you can click on any node to see its label, which is the name of the character it represents. This information will appear in the Edit menu in the upper left. Here's the information for the character Gavroche.

    Click around the various nodes in the graph. Which characters have been given the most central locations? If you are familiar with the story of Les Miserables, how does this correspond to the plot? Are the most central nodes the most important characters?

    7. Make Gephi color nodes by degree. Choose the Ranking tab from the panel at the upper left, then select the Nodes tab, then Degree from the drop-down menu. Press the Apply button.

    Now the nodes with the highest degree will be darker. Do these high degree nodes correspond

    to the nodes that the layout algorithm put in the center? Are they the main characters in the

    story?

    8. Now make Gephi compute betweenness and closeness centrality by pressing the Run button for the Network Diameter option under Network Overview on the right of the screen.


    You will get a report with some graphs. Just click Close. Now betweenness and closeness

    centrality will appear in the drop-down under Ranking, in the same place where you

    selected degree centrality earlier, and you can assign colors based on either run by clicking

    the Apply button.

    Also, the numerical values for betweenness centrality and closeness centrality will now

    appear in the Edit window for each node.

    Select Betweenness Centrality from the drop-down menu and hit Apply. What do you see? Which characters are marked as important? How does it differ from the characters which are marked as important by degree?

    Now select Closeness Centrality and hit Apply. (Note that this metric uses a scale which is the reverse of the others: closeness measures average distance to all other nodes, so small values indicate more central nodes. You may want to swap the black and white endpoints of the color scale to get something which is comparable to the other visualizations.) How does closeness centrality differ from betweenness centrality and degree? Which characters differ between closeness and the other metrics?
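
    To see what the three metrics measure, it can help to compute them by hand on a tiny graph. The Python 3 sketch below does so for a made-up five-node network. It is not Gephi's implementation: it ignores edge weights, assumes a connected undirected graph, reports closeness as the average distance described above, and counts each pair of nodes once for betweenness, using a breadth-first accumulation in the style of Brandes' algorithm. All function names are illustrative.

```python
from collections import deque

def build_adjacency(edges):
    """Turn a list of undirected (a, b) edges into a dict of neighbour sets."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def degree(adj):
    """Number of neighbours of each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def closeness(adj):
    """Average shortest-path distance to the other nodes (smaller = more central)."""
    result = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        others = len(dist) - 1
        result[s] = sum(dist.values()) / others if others else 0.0
    return result

def betweenness(adj):
    """Number of shortest paths through each node, fractional when paths tie."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                               # nodes in order of distance from s
        preds = {v: [] for v in adj}             # predecessors on shortest paths
        sigma = dict.fromkeys(adj, 0.0)          # number of shortest paths from s
        sigma[s] = 1.0
        dist = dict.fromkeys(adj, -1)
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):                # accumulate from the farthest node back
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was visited from both ends, so halve the totals.
    return {v: b / 2.0 for v, b in bc.items()}

# A made-up graph: a chain A-B-C with two extra leaves D and E attached to C.
toy = build_adjacency([("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")])
for name, metric in (("degree", degree), ("closeness", closeness),
                     ("betweenness", betweenness)):
    print(name, sorted(metric(toy).items()))
```

    In this toy graph C comes out on top under all three measures, while B has the same closeness-style role as a bridge but a lower degree; looking for characters where the metrics disagree in this way is the point of the comparison.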

    9. Turn in: your answers to the questions in steps 3, 6, 7 and 8, plus screenshots for the graph plotted with degree, betweenness centrality, and closeness centrality. (To take a screenshot: on Windows, use the Snipping Tool. On Mac, press Cmd + Shift + 4. If you're on Linux, you get to tell me.)

    What I am interested in here is how the values computed by the different algorithms

    correspond to the plot of Les Miserables (if you are familiar with it), and how they compare to

    each other. Telling me that Jean Valjean has a closeness centrality of X is not a high-enough-level interpretation. You couldn't publish that in a finished story, because your readers won't know what that means.

    Due: before class on Friday, 1 February.


    LECTURE 4: SOCIAL AND HYBRID FILTERS

    JANUARY 27, 2013

    It's possible to build powerful filtering systems by combining software and people, incorporating both algorithmic content analysis and human actions such as follow, share, and like. We'll look at recommendation systems, the Facebook news feed, and the socially-driven algorithms behind them. We'll finish by looking at an example of using human preferences to drive machine learning algorithms: Google Web search.

    Topics: Social filtering. The network structure of Twitter. Social software. Comment ranking

    on Reddit. Confidence sorting. User-item recommendation and collaborative filtering. Hybrid

    filters. What makes a good filter?

    Slides (PDF)

    Readings

    Finding and Assessing Social Information Sources in the Context of Journalism, Nick Diakopoulos et al.

    Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et al.

    How Reddit Ranking Algorithms Work, Amir Salihefendic

    Recommended

    Google News Personalization: Scalable Online Collaborative Filtering, Das et al.

    Slashdot Moderation, Rob Malda

    What is Twitter, a Social Network or a News Media?, Haewoon Kwak, et al.

    The Netflix Prize, Wikipedia

    How does Google use human raters in web search?, Matt Cutts

    Assignment: Hybrid filter design. Design a filtering algorithm for status updates.

    ASSIGNMENT 2: FILTER DESIGN

    JANUARY 25, 2013

    For this assignment you will design a hybrid filtering algorithm. You will not implement it,

    but you will explain your design criteria and provide a filtering algorithm in sufficient

    technical detail to convince me that it might actually work, including pseudocode.

    1. Decide who your users are. Journalists? Professionals? General consumers? Someone else?

    2. Decide what you will filter. You can choose:

    Facebook status updates, like the Facebook news feed

    Weibos, like Weiboscope

    Tweets, like Trending Topics or the many Tweet discovery tools

    The whole web, like Prismatic

    something else, but ask me first


    3. List all the information that you have available as input to your algorithm. If you want to filter Facebook or Twitter or Weibos, you may pretend that you are the company running the service, and have access to all posts and user data, from every user. You may also assume you have a web crawler or a firehose of every RSS feed or whatever you like, but you must be specific and realistic about what data you are operating with.

    4. Argue for the design factors that you would like to influence the filtering, in terms of what

    is desirable to the user, what is desirable to the publisher (e.g. Facebook or Prismatic), and

    what is desirable socially. Explain as concretely as possible how each of these (probably

    conflicting) goals might be achieved in software. Since this is a hybrid filter, you can

    also design social software that asks the user for certain types of information (e.g. likes, votes,

    ratings) or encourages users to act in certain ways (e.g. following) that generate data for you.

    5. Write pseudo-code for a function that produces a top stories list. This function will be called whenever the user loads your page or opens your app, so it must be fast and frequently updated. You can assume that there are background processes operating on your servers if you like. Your pseudo-code does not have to be executable, but it must be specific and unambiguous, such that a good programmer could actually go and implement it. You can assume that you have libraries for classic text analysis and machine learning algorithms. So, you don't have to spell out algorithms like TF-IDF or item-based collaborative filtering, or anything else you can dig up in the research literature, but simply say how you're going to use such building blocks. If you use an algorithm we haven't discussed in class, be sure to provide a reference to it.

    6. Write up steps 1-5. The result should be no more than three pages. However, you must

    be specific and plausible. You must be clear about what you are trying to accomplish, what

    your algorithm is, and why you believe your algorithm meets your design goals (though of

    course it's impossible to know for sure without testing; I want something that looks good

    enough to be worth trying).

    The assignment is due before class on Tuesday, January 29.
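To illustrate the level of specificity step 5 asks for, here is a rough Python sketch of a top-stories function. It is only one possible shape, not a model answer: the signals (votes, recency, whether the user follows the author), the record fields, the 6-hour half-life, and the doubling for followed authors are all assumptions invented for this example. It also assumes a background process has already assembled the candidate records, so the function itself only scores and selects.

```python
import heapq


def top_stories(candidates, user_follows, now, n=10, half_life_hours=6.0):
    """Rank precomputed story records for one user and return the best n ids.

    candidates: list of dicts with hypothetical keys
        "id", "published" (unix seconds), "votes", "author"
    user_follows: set of author ids the user follows (hypothetical signal)
    """
    def score(story):
        age_hours = max(0.0, (now - story["published"]) / 3600.0)
        # Exponential decay: the score halves every half_life_hours.
        freshness = 0.5 ** (age_hours / half_life_hours)
        # Assumed social boost for authors the user follows.
        social = 2.0 if story["author"] in user_follows else 1.0
        return (1 + story["votes"]) * freshness * social

    best = heapq.nlargest(n, candidates, key=score)
    return [story["id"] for story in best]


# Made-up example data.
now = 1_000_000
stories = [
    {"id": "a", "published": now, "votes": 1, "author": "x"},
    {"id": "b", "published": now - 6 * 3600, "votes": 9, "author": "y"},
    {"id": "c", "published": now, "votes": 1, "author": "z"},
]
ranked = top_stories(stories, {"z"}, now, n=2)
```

Using heapq.nlargest avoids sorting the whole candidate list when only a short list is shown, which is one way to keep the per-request work small.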

    LECTURE 3: ALGORITHMIC FILTERS

    JANUARY 23, 2013

    In this class we begin our study of filtering with some basic ideas about its role in journalism.

    There's just way too much information produced every day, more than any one person can

    read by a factor of millions. We need software to help us deal with this flood. In this lecture,

    we discuss purely algorithmic approaches to filtering, with a look at how the Newsblaster

    system works (similar to Google News).

    Topics: How bad information overload actually is. The Newsblaster system, a precursor to

    Google News. Clustering together stories on the same event. Sorting stories into topics.

    Personalization. The filter bubble, and the filter design problem.

    Slides (PDF)

    Readings

    Who should see what when? Three design principles for personalized news, Jonathan Stray

    Tracking and summarizing news on a daily basis with Columbia Newsblaster, McKeown et al.

    Recommended

    Are we stuck in filter bubbles? Here are five potential paths out, Jonathan Stray

    Guess what? Automated news doesn't quite work, Gabe Rivera

    The Hermeneutics of Screwing Around, or What You Do With a Million Books, Stephen Ramsay

    Can an algorithm be wrong?, Tarleton Gillespie

    LECTURE 2: TEXT ANALYSIS

    JANUARY 20, 2013

    Can we use machines to help us understand text? In this class we will cover basic text analysis

    techniques, from word counting to topic modeling. The algorithms we will discuss this class

    are used in just about everything: search engines, document set visualization, figuring out

    when two different articles are about the same story, finding trending topics. The vector space

    document model is fundamental to algorithmic handling of news content, and we will need it

    to understand how just about every filtering and personalization system works.

    Topics: Telling stories from quantitative analysis of language, word frequencies, the bag-of-

    words document vector model, cosine distance, TF-IDF, and a demonstration of the Overview

    document set mining tool.
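As a rough illustration of the bag-of-words model and cosine distance named in the topics, the sketch below represents each document as a dictionary of word counts and compares two such vectors. The example sentences are made up, and the function name cosine_similarity is hypothetical. One common convention, assumed here, is that cosine distance equals 1 minus cosine similarity.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Bag-of-words vectors: word order is discarded, only counts remain.
doc1 = Counter("the cat sat".split())
doc2 = Counter("the cat ran".split())
doc3 = Counter("stocks fell sharply".split())

sim_close = cosine_similarity(doc1, doc2)  # two of three words shared
sim_far = cosine_similarity(doc1, doc3)    # no words shared
```

Documents that share no terms get a similarity of 0, and documents with identical word proportions get 1, regardless of length.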

    Slides (PDF)

    Readings

    Online Natural Language Processing Course, Stanford University

    Week 7: Information Retrieval, Term-Document Incidence Matrix

    Week 7: Ranked Information Retrieval, Introducing Ranked Retrieval

    Week 7: Ranked Information Retrieval, Term Frequency Weighting

    Week 7: Ranked Information Retrieval, Inverse Document Frequency Weighting

    Week 7: Ranked Information Retrieval, TF-IDF weighting

    Recommended

    Probabilistic Topic Models, David M. Blei

    General purpose computer-assisted clustering and conceptualization, Justin Grimmer, Gary

    King

    A full-text visualization of the Iraq war logs, Jonathan Stray

    Introduction to Information Retrieval Chapter 6, Scoring, Term Weighting, and The Vector

    Space Model, Manning, Raghavan, and Schütze.

    Examples

    Watchwords: Reading China Through its Party Vocabulary, Qian Gang

    Message Machine, ProPublica

    Assignment: TF-IDF. Analyze the topics of the U.S. State of the Union addresses over the

    decades.

    LECTURE 1: BASICS

    JANUARY 20, 2013

    We'll try to define computational journalism as the application of computer science to four

    different areas: data-driven reporting, story presentation, information filtering, and effect

    tracking. But first we have to figure out how to represent the outside world as data. We do this

    using the feature vector representation. One of the most useful things we can do with such

    vectors is compute the distances between two of them. We can also visualize the entire vector

    space, but to do this we have to project the high-dimensional space down to the two

    dimensions of the screen.

    Topics: The definition of computational journalism, encoding the world as feature vectors,

    distance metrics, clustering algorithms, and visualization using multi-dimensional scaling.
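A small sketch of the feature vector idea, assuming a made-up encoding in which each legislator is a vector of votes on three bills (1 for yes, 0 for no). The names and votes are hypothetical, and Euclidean distance is only one of several possible distance metrics.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical voting records on three bills: 1 = yes, 0 = no.
alice = [1, 0, 1]
bob = [1, 1, 1]
carol = [0, 1, 0]

d_ab = euclidean_distance(alice, bob)    # they differ on one bill
d_ac = euclidean_distance(alice, carol)  # they differ on all three
```

A clustering algorithm would group alice with bob before carol, because the first pair is closer together in this vector space.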

    Slides (PDF)

    Readings

    Computational Journalism, Cohen, Turner, Hamilton

    Sections 1 and 2 of The Challenges of Clustering High Dimensional Data, Steinbach,

    Ertöz, Kumar

    Recommended

    What should the digital public sphere do?, Jonathan Stray

    Precision Journalism, Ch. 1, Journalism and the Scientific Tradition, Philip Meyer

    Using clustering to analyze the voting blocs in the UK House of Lords, Jonathan Stray

    Examples

    The Jobless rate for People Like You, New York Times

    Dollars for Docs, ProPublica

    What did private security contractors do in Iraq and document mining methodology,

    Jonathan Stray

    The network of global corporate control, Vitali et al.

    GOP 5 make strange bedfellows in budget fight, Chase Davis, California Watch

    ASSIGNMENT 1: TF-IDF

    JANUARY 18, 2013

    Update: Henry Williams has kindly made available his code for the solution to this

    assignment.

    In this assignment you will implement the TF-IDF formula and use it to study the topics in

    State of the Union speeches given every year by the U.S. president.

    1. Download the source data file state-of-the-union.csv. This is a standard CSV file with one

    speech per row. There are two columns: the year of the speech, and the text of the speech.

    You will write a Python program that reads this file and turns it into TF-IDF document

    vectors, then prints out some information. Here is how to read a CSV in Python.
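A minimal sketch of the reading step in Python 3, assuming the file has no header row and that the columns come in the order year, text (the assignment does not say whether a header is present). The function name read_speeches is hypothetical, and an in-memory sample with made-up text stands in for the real file.

```python
import csv
import io


def read_speeches(csv_file):
    """Return a list of (year, text) pairs from an open CSV file object."""
    speeches = []
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        year, text = row[0], row[1]
        speeches.append((year.strip(), text))
    return speeches


# In-memory stand-in for state-of-the-union.csv (made-up text).
sample = io.StringIO('1900,"Peace, and prosperity."\n1901,"War and peace."\n')
speeches = read_speeches(sample)
```

With the real file, the same function could be called on open("state-of-the-union.csv", newline="", encoding="utf-8"); the encoding is a guess. If a very long speech exceeds the csv module's default field size limit, csv.field_size_limit can raise it.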

    2. Tokenize the text of each speech, to turn it into a list of words. As we discussed in class, we're

    going to tokenize using a simple scheme:

    convert all characters to lowercase

    remove all punctuation characters

    split the string on spaces
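The three tokenization rules above can be sketched as follows. Two details are assumptions rather than part of the assignment: only ASCII punctuation (string.punctuation) is removed, so curly quotes and similar characters would survive, and split() with no argument splits on any run of whitespace, which is slightly broader than splitting on spaces alone.

```python
import string


def tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    lowered = text.lower()
    # A translation table with a third argument deletes those characters.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()


tokens = tokenize("The Union's state is strong, strong!")
```

Note that removing the apostrophe turns "union's" into "unions", a side effect of this simple scheme.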

    3. Compute a TF (term frequency) vector for each document. This is simply how many times

    each word appears in that document. You should end up with a Python dictionary from terms

    (strings) to term counts (numbers) for each document.
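One way to build the term-count dictionary for a single document is with collections.Counter from the standard library; the function name term_frequencies is hypothetical.

```python
from collections import Counter


def term_frequencies(tokens):
    """Map each term to the number of times it occurs in one document."""
    return dict(Counter(tokens))


tf = term_frequencies(["war", "peace", "war"])
```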

    4. Count how many documents each word appears in. This can be done after computing

    the TF vector for each document, by incrementing the document count of each word that

    appears in the TF vector. After reading all documents you should now have a dictionary from

    each term to the number of documents that term appears in.
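A sketch of the document-count step, assuming the TF vectors from the previous step are available as a list of dictionaries. The function name document_frequencies is hypothetical.

```python
def document_frequencies(tf_vectors):
    """Map each term to the number of documents it appears in."""
    df = {}
    for tf in tf_vectors:
        # Each term is a key at most once per TF vector, so it is counted
        # once per document no matter how often it occurs there.
        for term in tf:
            df[term] = df.get(term, 0) + 1
    return df


df = document_frequencies([{"war": 2, "peace": 1}, {"peace": 3}])
```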

    5. Turn the final document counts into IDF (inverse document frequency) weights by

    applying the formula IDF(term) = log(total number of documents / number of documents that

    term appears in).
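The IDF formula translates almost directly into code. The assignment does not state the base of the logarithm; this sketch assumes the natural log, which changes the weights only by a constant factor. The function name idf_weights is hypothetical.

```python
import math


def idf_weights(df, num_docs):
    """IDF(term) = log(num_docs / number of documents containing term)."""
    return {term: math.log(num_docs / count) for term, count in df.items()}


idf = idf_weights({"war": 1, "peace": 2}, num_docs=2)
```

A term that appears in every document gets an IDF of log(1) = 0, so very common words such as "the" drop out of the TF-IDF vectors.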

    6. Now multiply the TF vectors for each document by the IDF weights for each term, to

    produce TF-IDF vectors for each document. Then normalize each vector, so the sum of

    squared weights is 1.
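A sketch of the multiply-and-normalize step for one document, with small made-up numbers so the result is easy to check by hand. Leaving an all-zero vector untouched is an assumption made here to avoid dividing by zero; the function name tfidf_vector is hypothetical.

```python
import math


def tfidf_vector(tf, idf):
    """Multiply term counts by IDF weights, then scale to unit length."""
    weights = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights  # every term had IDF 0; nothing to normalize
    return {term: w / norm for term, w in weights.items()}


# Counts 3 and 4 with IDF 1.0 give length 5, so the result is 0.6 and 0.8.
vec = tfidf_vector({"war": 3, "peace": 4}, {"war": 1.0, "peace": 1.0})
```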

    7. Congratulations! You have a set of TF-IDF vectors for this corpus. Now it's time to see

    what they say. Take the speech you were assigned in class, and print out the highest weighted

    20 terms, along with their weights. What do you think this particular speech is about? Write

    your answer in at most 200 words.
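Picking the highest-weighted terms is a sort on the dictionary's values. The vector below is made up, and the function name top_terms is hypothetical.

```python
def top_terms(vector, n=20):
    """Return the n (term, weight) pairs with the largest weights."""
    return sorted(vector.items(), key=lambda item: item[1], reverse=True)[:n]


top = top_terms({"war": 0.6, "peace": 0.8, "the": 0.0}, n=2)
for term, weight in top:
    print(term, weight)
```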

    8. Your task now is to see if you can understand how the topics changed since 1900. For each

    decade since 1900, do the following:

    sum all of the TF-IDF vectors for all speeches in that decade

    print out the top 20 terms in the summed vector, and their weights

    Now take a look at the terms for each decade. What patterns do you see? Can you connect the

    terms to major historical events? (wars, the great depression, assassinations, the civil rights

    movement, Watergate) Write up what you see in narrative form, no more than 500 words,

    referring to the terms for each decade.
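The per-decade sum in step 8 can be sketched as below. It assumes a decade is the year rounded down to a multiple of ten (so 1941 and 1945 both fall in 1940) and that speeches before 1900 are skipped. The function name sum_by_decade and the sample vectors are made up.

```python
from collections import defaultdict


def sum_by_decade(year_vectors):
    """Sum TF-IDF vectors per decade.

    year_vectors: iterable of (year, tfidf_dict) pairs.
    Returns {decade_start_year: {term: summed_weight}}.
    """
    decades = defaultdict(lambda: defaultdict(float))
    for year, vector in year_vectors:
        year = int(year)  # years read from a CSV arrive as strings
        if year < 1900:
            continue
        decade = (year // 10) * 10
        for term, weight in vector.items():
            decades[decade][term] += weight
    return {decade: dict(terms) for decade, terms in decades.items()}


sums = sum_by_decade([
    (1899, {"gold": 1.0}),
    (1941, {"war": 0.5}),
    ("1945", {"war": 0.25, "peace": 0.5}),
])
```

The top 20 terms of each summed vector can then be printed in the same way as for a single speech.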

    9. Hand in:

    your code

    the printout and analysis from step 7

    the printout and narrative from step 8.

    SYLLABUS

    This class will cover, in great detail, some of the most advanced techniques used by

    journalists to understand digital information, and communicate it to users. We will focus on

    unstructured text information in large quantities, and also cover related topics such as how to

    draw conclusions from data without fooling yourself, social network analysis, and online

    security for journalists. These are the algorithms used by search engines and intelligence

    agencies and everyone in between.

    Due to our short schedule (eight classes over three weeks), this will be an intense course.

    You will be given a homework assignment every class, which should take you 3-6 hours to

    complete. About half of the assignments will involve some programming in Python. This

    course will be quite technical; it is, after all, a course about applying computer science to

    journalism. Aside from being able to program, I assume you know basic computer science

    theory, and mathematics up to linear algebra. However, the assignments will also require you

    to explain, in plain English, what the algorithmic result means in journalism terms. The code

    will not be enough.

    Please note that the JMSC is also offering a more accessible data journalism course in May,

    taught by Irene Jay Liu. You may find that course a better fit if you do not have programming

    experience. If you are not taking this course for credit you are welcome to sit in on the

    lectures, but I will not mark your assignments.

    You will be assigned readings to study before each lecture. These will typically be research

    papers. There are also recommended readings that will tell you much more about the topics

    we cover, and examples of stories that use these techniques.

    The course will be graded as follows:

    Assignments: 60%, weighted equally

    Class participation: 10%

    Final project: 30%

    Lecture 1. Basics

    We'll try to define computational journalism as the application of computer science to four

    different areas: data-driven reporting, story presentation, information filtering, and effect

    tracking. But first we have to figure out how to represent the outside world as data. We do this

    using the feature vector representation. One of the most useful things we can do with such

    vectors is compute the distances between two of them. We can also visualize the entire vector

    space, but to do this we have to project the high-dimensional space down to the two

    dimensions of the screen.

    Required

    Computational Journalism, Cohen, Turner, Hamilton

    Sections 1 and 2 of The Challenges of Clustering High Dimensional Data, Steinbach,

    Ertöz, Kumar

    Recommended

    What should the digital public sphere do?, Jonathan Stray

    Precision Journalism, Ch. 1, Journalism and the Scientific Tradition, Philip Meyer

    Using clustering to analyze the voting blocs in the UK House of Lords, Jonathan Stray

    Examples

    The Jobless rate for People Like