Collaborative Publishing:
Wiki and Wikipedia
By Qi Li
Agenda
• Overview of Wiki and Wikipedia
• Knowledge Organization of Wikipedia
• Improving Wikipedia’s Accuracy
• Wikipedia in Natural Language Processing
Overview of Wiki and
Wikipedia
Reference:
Keshava P Subramanya
Roopa Kannan
What is Wikipedia?
• Wikipedia is a freely licensed encyclopedia written by thousands of volunteers in many languages
• Free license allows others to freely
copy, redistribute, and modify our work
commercially or non-commercially
• Founded January 15, 2001
wikipedia.org
What is a wiki?
• A wiki is software that allows users to
create, edit, and link web pages easily.
• Wikis are often used to create
collaborative websites and to power
community websites.
• Ward Cunningham, developer of the first
wiki, WikiWikiWeb, originally described it
as "the simplest online database that could
possibly work".
wikipedia.org
What is the Wikimedia Foundation?
• Non-profit foundation
• Aims to distribute a free encyclopedia to every single person on the planet in their own language
• Wikipedia and its sister projects
• Funded by public donations
• Applying for grants
wikimediafoundation.org
Wikimedia Foundation
Governed by a Board of Directors (5 positions: 1 permanent (Jimmy Wales), 2 Bomis reps, 2 community reps)
Foundation coordinates official (volunteer) positions:
Fundraising, legal, technical development, press, etc
MediaWiki (software)
And the projects:
Local chapters: English (en); German (de); Italian (it); etc.: 215 languages in total
Wiktionary Wikinews Wikipedia Wikiversity Wikiquote Wikisource Commons
English-language Wikipedia
Admins
Long-term users, lots of contribs, heavy community participation
Logged-in users with some contributions
less community participation
Anonymous IP edits
Vandals, trolls, sockpuppets
Foundation
board
Developers, stewards, bureaucrats
Advantages of Free License
• Remains non-proprietary
• Decreases individual sense of
ownership
• Increases a sense of shared ownership
• Enhances the popularity of Wikipedia
• Attribution requirement extends brand
Free Software
• MediaWiki is GPL
• We use all free software on the website
• GNU/Linux
• Apache
• MySQL
• PHP
How big is Wikipedia?
• English Wikipedia is largest and has over 130 million words
• English Wikipedia larger than Britannica
and Microsoft Encarta combined
• In 15 months the publicly distributed
compressed database dumps may reach 1
terabyte total size
How big is Wikipedia Globally?
• English – 533,000 articles
• German – 220,000 articles
• Japanese – 110,000 articles
• French – 100,000 articles
• Swedish – 71,000 articles
• Nearly 1.5 million across 200 languages
• 20+ with >10,000. 50+ with >1000
How popular is Wikipedia?
Wikimedia Projects
• Wikipedia
• Wiktionary
• Wikibooks
• Wikisource
• Wikiquote
• Wikispecies
• Wikimedia Commons
• Wikinews
Wikimedia’s Hardware
• 40+ servers
• Squid caching servers in front to serve
cached objects quickly
• Apache/PHP webservers in the middle
• Database backend (MySQL)
MediaWiki
• MediaWiki is one of many wiki engines
• Collaborative software that allows users to
add or edit content
• Primarily developed for Wikipedia from
2002 onwards
• Scalable and multilingual
• Free license
MediaWiki features
• Quality control features (versioning)
• Editing features (simple markup)
• Community features (talk pages, profiles,
access levels)
Jakob Voss: Knowledge Organization with Wikipedia. 5th NKOS Workshop, Sep 21, 2006
Knowledge Organization
with Wikipedia
Reference:
1. Jakob Voss Common Library Network (GBV) at 5th
NKOS Workshop, Alicante September 21, 2006
2. Phoebe Ayers: UC Davis, Physical Sciences &
Engineering Library, phoebe.ayers @ gmail.com
en.wikipedia.org/wiki/User:Phoebe Ayers
[[Outline]]
• Wikipedia: namespace
• Wikipedia's Category system
• Mapping
• Indexing with Wikipedia articles
[[What are Wikipedia namespaces]]
• Main: The main namespace or article namespace
is the encyclopedia proper. It is the default
namespace and does not use a prefix.
• Portal (prefix Portal:) is for reader-oriented
portals that help readers find and browse through
articles related to a specific subject.
• User (prefix User:) is a namespace that provides
pages for Wikipedia users' personal
presentations and auxiliary pages for personal
use, for example containing bookmarks to favorite
pages.
• Image (prefix Image:, also called image
description pages) is a namespace that provides
info about images and sound clips, one page for
each, with a link to the image or sound clip itself.
Wikipedia Namespace (cont.)
• Category contains categories of pages, with each displaying
a list of pages in that category and optional additional text.
• Help: the basic, technical features of Wikipedia.
• Talk namespaces are used to discuss changes to the
corresponding page in the associated namespace. Pages in
the user talk namespace are used to leave messages for a
particular user.
– the talk namespace associated with the main article
namespace has the prefix Talk:,
– while the talk namespace associated with the user
namespace has the prefix User talk:
Wikipedia Namespace (cont.)
• MediaWiki (prefix MediaWiki:) is a namespace
containing interface texts such as link labels and
messages. They are used for adjusting the localisation
(i.e. local version) of interface messages without waiting
for a new LanguageXx.php file to get installed.
• Template (formerly part of the MediaWiki namespace) is
used to define a standard text which can then be
conveniently added within pages, either the text itself at
the time of adding, or a reference to the text at the time
of viewing the page. The latter way effectively changes
all such occurrences of the standard text automatically
by just editing the page where the text is defined.
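The prefix convention described on the namespace slides can be sketched in a few lines. This is only an illustration: the function name `split_title` and the hard-coded prefix set are assumptions made here, and a real MediaWiki installation defines more namespaces and normalizes titles in ways this sketch ignores.

```python
# Namespace prefixes taken from the slides above; real wikis define more.
NAMESPACES = {"Portal", "User", "Image", "Category", "Help",
              "Talk", "User talk", "MediaWiki", "Template"}


def split_title(full_title):
    """Return (namespace, page_name); the main namespace is the empty string."""
    prefix, sep, rest = full_title.partition(":")
    if sep and prefix in NAMESPACES:
        return prefix, rest
    # No known prefix: the page lives in the main (article) namespace,
    # even if its title happens to contain a colon.
    return "", full_title


print(split_title("User talk:Phoebe Ayers"))
print(split_title("Category:Information Science"))
print(split_title("Star Wars: Episode I"))
```

Only a known prefix is treated as a namespace, so an article title that merely contains a colon stays in the main namespace.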
How do articles get written?
• Someone starts it
• Someone else checks it
• A (possibly third) party edits it…
http://en.wikipedia.org/wiki/Help:Contents/
Editing_Wikipedia
Article Criteria
• Notable (encyclopedic)
• Not vanity
• Not duplication
• Community consensus…
Edit wars… and other things
that go boom
Predictable vandalism… posted and reverted the same minute (10:31)
How to edit Wikipedia
categories
• Tagging by linking
[[Category:Information Science]]
...
• Open for all
• Blind tagging
• Multi-hierarchical relations
• High connectivity
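Because categories are assigned by writing a link in the page text, the tags of a page can be recovered with a simple pattern match. The sketch below assumes the English `Category:` prefix and the optional `|sort key` suffix; the helper name `extract_categories` is hypothetical and the regular expression is a simplification of real wikitext parsing.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]].
CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")


def extract_categories(wikitext):
    """Return the category names tagged on a page, in order of appearance."""
    return [name.strip() for name in CATEGORY_LINK.findall(wikitext)]


page = (
    "A thesaurus is a controlled vocabulary. See also [[Ontology]].\n"
    "[[Category:Information Science]]\n"
    "[[Category:Knowledge representation|Thesaurus]]\n"
)
print(extract_categories(page))
```

A page may carry any number of such links, which is what makes the category system multi-hierarchical rather than a strict tree.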
[[Wikipedia categories]]
[[Category system as KOS]]
Collaboratively edited, general thesaurus(en, Jan 2006: 91,502 categories, 923,196 articles)
Distribution of descriptor levels
[Figure: percentage of descriptors (0% to 35%) at each hierarchy level (1 to 12), comparing DDC (ext.) with Wikipedia (en)]
Voss (2006): Collaborative thesaurus tagging the Wikipedia way
http://arxiv.org/abs/cs.IR/0604036
Distribution of descriptors per record
[Figure: percentage of records (pages or posts; axis ticks at 1%, 10%, 100%) by number of descriptors (categories or tags, 1 to 9), comparing Wikipedia and del.icio.us with an exponential curve (λ=0.6)]
What is Collaborative Publishing?
• Collaborative: works are created by
multiple people together rather than
individually
• Publishing: knowledge
• Some projects are overseen by an editor
or editorial team
• Many grow without any top-down oversight
Characteristics 1: access
control
• Allows all users to edit any page, but with access control
• Access control
                 Create  Edit  Link  Browse
Administration     ×      ×     ×      ×
Group
Individual
Public
Characteristics 2: Revision
control
Wikipedia’s Accuracy
Criticisms
• Could a collaborative project that anyone can edit be “a public good”?
– Contribute articles
– Quality of articles is close to Encyclopaedia Britannica
• Vandalism
• Creeping bureaucracy & growing instances of infighting among editors
• The community’s anti-intellectual attitude
• “digital Maoism”
• “faith-based encyclopaedia”
Further criticisms
• Entries for pop cultural figures vs. those
for great literary figures, scientists, etc.
• Entry for Britney Spears longer than entry
for St. Augustine
• Seinfeld longer than Shakespeare; Barbie
longer than Bellow
• Response: Nothing to get exercised about
80/10 Rule
• Counting only logged in users, and even excluding some prominent approved bot users
• 10 percent of all users make 80% of all edits
• 5 percent of all users make 66% of edits
• Half of all edits are made by just 2 1/2 percent of all users
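The concentration figures above are all instances of one calculation: rank users by edit count and ask what share of edits the top fraction accounts for. A minimal sketch follows; the edit counts are hypothetical and chosen only to make the arithmetic easy to follow, not taken from Wikipedia.

```python
def share_of_edits(edit_counts, top_fraction):
    """Fraction of all edits made by the most active `top_fraction` of users."""
    ranked = sorted(edit_counts, reverse=True)
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Hypothetical edit counts for ten logged-in users (not real Wikipedia data).
counts = [800, 60, 40, 30, 20, 20, 10, 10, 5, 5]

print(share_of_edits(counts, 0.10))  # one user out of ten
print(share_of_edits(counts, 0.50))  # five users out of ten
```

In this toy data the single most active user makes 800 of 1,000 edits, so the top 10 percent of users account for 80 percent of edits.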
Edits by Anons
• Controversial, intriguing
• Yes, you can edit this page
• Without logging in!
Edits by Anons - %
• Anonymous users, identified only by IP address, can edit
Wikipedia, and do
• But these edits make up a total of around
18% of all edits, with some evidence of a
downward trend over time
• Anecdotally, many regular users report
sometimes editing anonymously by
accident or as a quiet form of
sockpuppeting
Edits across namespaces
• Articles 85%
• Talk pages 8%
• User Page 3%
• User Talk Pages 4%
These percentages were stable across 2003
and 2004
Studying the Accuracy of Wikipedia
• Study by Nature
– “factual errors, omissions or misleading
statements”: Wikipedia vs Britannica: 162 vs
123; major : 4 vs 4
• Survey: whether they think sample articles
are accurate
– 76% -- accurate
Separate the wheat from the chaff
• Proposal 1: Based on explicit article
validation
– “trusted user” (defined using various criteria)
explicitly marks an article as “good”
– Peer-based explicit system: allow users to
choose which of their peers to trust, thus
providing different results for each user
– Shortcoming: requires explicit input from reviewers
• Proposal 2: automatically assess information quality by calculating metrics based on metadata recorded and stored by Wikipedia
– Metrics: # of edits made for the article and # of unique editors for the article
– Distinguishing two classes of pages
– Link ratio analysis
– Quality of editors
– Trustworthiness or reputation of authors and articles
– Segments instead of articles
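The two metadata metrics named under Proposal 2, the number of edits and the number of unique editors per article, can be computed directly from a revision log. The sketch below assumes a simplified log of (article, editor) pairs; the `Revision` record and the sample history are invented for illustration and do not reflect Wikipedia's actual database schema.

```python
from collections import namedtuple

# Simplified revision record; the real revision table has many more fields.
Revision = namedtuple("Revision", ["article", "editor"])


def quality_metrics(revisions):
    """Map each article to (number of edits, number of unique editors)."""
    edits, editors = {}, {}
    for rev in revisions:
        edits[rev.article] = edits.get(rev.article, 0) + 1
        editors.setdefault(rev.article, set()).add(rev.editor)
    return {article: (edits[article], len(editors[article])) for article in edits}


# Hypothetical edit history.
history = [
    Revision("Jaguar", "alice"),
    Revision("Jaguar", "bob"),
    Revision("Jaguar", "alice"),
    Revision("Puma", "carol"),
]
print(quality_metrics(history))
```

How these counts are turned into a quality judgment (thresholds, weighting, or combination with the other signals listed) is left open here, as it is on the slide.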
• Surprisingly successful
• Large/Complete/Coverage
• Again: Free
Wikipedia and copyright
• Need for copyright less than we imagined?
• Do our empirical assumptions about the
need for copyright need adjustment?
• Take open source software like Linux
(Surowiecki, 2004).
References
• Cohen, Noam. “Courts Turn to Wikipedia, but Selectively.” The New York Times
January 29 (2007): Section C, page 3.
• Economist. “Battle of Britannica.” Economist 378.8471 (April 1, 2006): 65-66.
• Fallis, Don. “The Epistemic Benefits and Costs of Collaboration.” Southern Journal of Philosophy 44.S (2006): 197-208.
• Fallis, Don. “On Verifying the Accuracy of Information: Philosophical Perspectives.” Library Trends 52.3 (2004): 463-487.
• Fricke, Martin and Don Fallis. “Indicators of Accuracy of Consumer Health Information on the Internet.” Journal of the American Medical Informatics Association 9 (2002): 73-79.
• Giles, J. “Internet Encyclopedias Go Head to Head.” Nature 438.7069 (December 15, 2005): 900-901.
• Hettinger, Edwin. “Justifying Intellectual Property.” Philosophy and Public Affairs 18 (1989): 31-52.
• Paine, Lynn Sharp. “Trade Secrets and the Justification of Intellectual Property: A Comment on Hettinger.” Philosophy and Public Affairs 20 (1991): 247-263.
• Poe, Marshall. “The Hive.” Atlantic Monthly 298.2 (September 2006): 86-94.
• Resnik, David. “A Pluralistic Account of Intellectual Property.” Journal of Business Ethics 46 (2003): 319-335.
• Schiff, Stacy. “Know it All.” New Yorker 82.23 (July 31, 2006).
• Sunstein, Cass. “Mobbed up.” New Republic 230.24 (June 28, 2004): 40-45.
• Surowiecki, James. The Wisdom of Crowds. New York: Anchor Books, 2004.
Wikipedia in NLP
Ontology
Thesauri
Categorization
Topic Detection
Information Retrieval (Query Expansion)
Word Sense Disambiguation
Question Answering
Translation (CLIR)
…
Wikitology !
• Using Wikipedia as an ontology offers the
best of both approaches
–Each article is a concept in the
ontology
–Terms linked via Wikipedia’s category
system and inter-article links
• It’s a consensus ontology created, kept
current and maintained by a diverse
community
• Overall content quality is high
Wikitology features
• Terms have unique IDs (URLs) and are “self describing” for people
• Several underlying graphs provide structure: categories, article links
• Article history contains useful meta-data (e.g., for trust)
• External sources provide more info (e.g., Google’s pagerank)
• Some of the data available in structured form, e.g., in RDF from DBpedia
[[Semantic Wikipedia]]
Typed links: [[is capital of::England]]
=> RDF triples
Völkel et al (2006): Semantic Wikipedia. WWW2006 conference
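To make the step from typed links to triples concrete, the sketch below extracts `[[property::value]]` links from page text and emits (subject, property, object) tuples. It assumes that the page containing the link is the subject of the triple; the function name and regular expression are illustrative and are not the Semantic Wikipedia implementation.

```python
import re

# Matches typed links of the form [[property::value]]; ordinary links
# such as [[city]] or [[Category:Cities]] contain no "::" and are skipped.
TYPED_LINK = re.compile(r"\[\[([^\]:|]+)::([^\]|]+)\]\]")


def typed_links_to_triples(page_title, wikitext):
    """Return (subject, property, object) tuples for the typed links on a page."""
    return [
        (page_title, prop.strip(), target.strip())
        for prop, target in TYPED_LINK.findall(wikitext)
    ]


text = "London [[is capital of::England]] and is a large [[city]]."
print(typed_links_to_triples("London", text))
```

Serializing the tuples as actual RDF would additionally require URIs for the page, the property, and the target, which this sketch leaves out.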
Thesauri
• Reference:
– Mining Domain-Specific Thesauri from
Wikipedia: A case study, Milne, D., Medelyan,
O., and Witten, I. H. 2006. Proceedings of the
2006 IEEE/WIC/ACM International
Conference on Web Intelligence
– Milne, D., Witten, I. H., & Nichols, D. M.
(2007). Extracting corpus specific knowledge
bases from Wikipedia. CIKM. Lisbon,
Portugal.
Thesauri
• Thesauri:
– an indexed compilation of words with similar,
related, broader, narrower and opposite
meanings.
• Wikipedia
– Each article - a concept
– Hyperlinks - relations
• Equivalence - USE, USE FOR
• Hierarchical - BT, NT
• Associative - RT
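One way to picture how Wikipedia structures could populate these thesaurus relations is sketched below. The mapping used here is an assumption made for illustration and is not stated on the slide: redirects supply equivalence (USE FOR, abbreviated UF), category membership supplies broader and narrower terms (BT, NT), and mutual hyperlinks supply related terms (RT). All data is a toy example.

```python
def thesaurus_entry(concept, redirects, parents, links):
    """Assemble UF/BT/NT/RT lists for one concept from toy Wikipedia structures."""
    # Equivalence: every redirect that points at the concept is a USE FOR term.
    use_for = sorted(name for name, target in redirects.items() if target == concept)
    # Hierarchy: parents are broader terms, children are narrower terms.
    broader = sorted(parents.get(concept, ()))
    narrower = sorted(child for child, ps in parents.items() if concept in ps)
    # Association: keep only links that are reciprocated.
    related = sorted(
        other for other in links.get(concept, ())
        if concept in links.get(other, ())
    )
    return {"UF": use_for, "BT": broader, "NT": narrower, "RT": related}


redirects = {"Morning Star": "Venus", "Evening Star": "Venus"}
parents = {"Venus": {"Planet"}, "Mars": {"Planet"}}
links = {"Venus": {"Mars", "Sun"}, "Mars": {"Venus"}}

print(thesaurus_entry("Venus", redirects, parents, links))
```

Requiring links to be mutual is one simple filter against incidental mentions; other choices are possible.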
Topic Detection
• Reference:
– Identifying document topics using the Wikipedia category network,
Peter Schönhofen, Proceedings of the 2006 IEEE/ACM International
Conference on Web Intelligence (WI 2006 Main Conference
Proceedings)
• Topic Detection:
– utilize an ontology to detect concepts in the document
– select the most dominant concepts to present the
document.
• Ontology from Wikipedia
– Coverage of Wikipedia is general purpose and very
wide,
– Structure is rich and consistent
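The two steps listed under Topic Detection (detect concepts, then select the dominant ones) can be sketched as a voting scheme: each article title found in the document votes for that article's categories. This is a deliberately naive stand-in rather than the method of the cited paper; the substring matching, the function name, and the toy category data are all assumptions.

```python
from collections import Counter


def dominant_categories(text, article_categories, top_n=1):
    """Rank categories by how many of their article titles occur in the text."""
    lowered = text.lower()
    votes = Counter()
    for title, categories in article_categories.items():
        if title.lower() in lowered:  # naive substring match on the title
            votes.update(categories)
    return [category for category, _ in votes.most_common(top_n)]


# Toy slice of the article-to-category mapping (hypothetical).
article_categories = {
    "Jaguar E-Type": ["Jaguar vehicles", "Sports cars"],
    "Jaguar XJ": ["Jaguar vehicles"],
    "V12": ["Engines"],
    "Leopard": ["Felines"],
}

document = "The Jaguar E-Type and the Jaguar XJ were both offered with a V12."
print(dominant_categories(document, article_categories))
```

Here two matched titles share the category "Jaguar vehicles", so it outvotes the categories supported by a single title.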
Wikipedia structure
• Components: articles, image pages, discussion about article contents, authors, page component templates and so on.
• Articles: titles, categories, refer to other articles
• Categories: hierarchically into sub- and super-categories (not just tree)
• Author: links between articles, hierarchy of categories.
Wikipedia structure
Wikipedia for classification
• Reference:
– Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text
Categorization with Encyclopedic Knowledge. Evgeniy Gabrilovich and
Shaul Markovitch, American Association for Artificial Intelligence 2006
– Banerjee, S., Ramanathan, K., & Gupta, A. (2007), Clustering short text
using Wikipedia, SIGIR
– Meyer, M., & Rensing, C. (2007). Categorizing Learning Objects based
on Wikipedia as Substitute Corpus. Proceedings of the First International
Workshop on Learning Object Discovery and Exchange.
• Deals with automatic assignment of category labels to natural language documents
• Represent document as bags of words
• Features from words
• Limitation of BOW: documents are represented only by individual word occurrences in the training set
– Wal-Mart supply chain goes real time
– Wal-Mart manages its stock with RFID technology
• Effective for medium-difficulty categorization, but poor for small categories or short documents
• Use an encyclopedia to endow the machine with the breadth of knowledge available to humans
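The bag-of-words limitation is easy to see on the two Wal-Mart headlines from the slide: they describe closely related events, yet they share only one token. The sketch below uses a simple lowercase tokenizer, which is an assumption; any tokenizer that keeps "Wal-Mart" as one token gives the same picture.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its tokens (hyphenated words stay whole)."""
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))


a = bag_of_words("Wal-Mart supply chain goes real time")
b = bag_of_words("Wal-Mart manages its stock with RFID technology")

shared = a & b                      # multiset intersection
jaccard = len(set(a) & set(b)) / len(set(a) | set(b))

print(sorted(shared))
print(round(jaccard, 3))
```

A classifier that sees only these counts has no way to know that "supply chain" and "stock", or "real time" and "RFID", are related, which is the gap the encyclopedic features are meant to fill.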
Text Categorization
• Auxiliary text classifier:
– matching documents with the most relevant articles of Wikipedia
– conventional bag of words + new features
• Examples for idea of auxiliary text classifier:
– “Bernanke takes charge”
– BEN BERNANKE, FEDERAL RESERVE, CHAIRMAN OF THE FEDERAL RESERVE, ALAN GREENSPAN, MONETARISM, …
• Using Wikipedia
– Use text similarity algorithms to automatically identify encyclopedia articles relevant to each document
– Leverage the knowledge gained from these
• “jaguar car models”,
• the Wikipedia-based feature generator returns:
– JAGUAR (CAR),
–DAIMLER and BRITISH LEYLAND MOTOR CORPORATION (companies merged with Jaguar),
–V12 (Jaguar’s engine),
–JAGUAR E-TYPE
–JAGUAR XJ.
• “jaguar Panthera onca”,
–JAGUAR,
–FELIDAE (feline species family), related felines such as LEOPARD,
–PUMA and BLACK PANTHER, as well as KINKAJOU
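The idea of a feature generator that maps a text to its most relevant Wikipedia articles can be sketched with plain cosine similarity over term counts. This is a stand-in for whatever similarity measure the cited work uses, and the three article texts below are invented one-line summaries, not real Wikipedia content.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Term-count vector over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def generate_features(document, articles, top_n=2):
    """Return the titles of the articles most similar to the document."""
    doc_vec = vectorize(document)
    scored = [(cosine(doc_vec, vectorize(body)), title) for title, body in articles.items()]
    scored.sort(reverse=True)
    return [title for score, title in scored[:top_n] if score > 0]


# Hypothetical article texts used only for this example.
ARTICLES = {
    "JAGUAR (CAR)": "jaguar is a british car maker known for luxury car models",
    "JAGUAR": "the jaguar panthera onca is a large cat of the americas",
    "KINKAJOU": "the kinkajou is a rainforest mammal",
}

print(generate_features("jaguar car models", ARTICLES, top_n=1))
print(generate_features("jaguar Panthera onca", ARTICLES, top_n=1))
```

The returned titles would then be appended to the document's bag of words as extra features, so the surrounding words decide which sense of "jaguar" contributes.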
• Some names denote multiple entities:
– “John Williams and the Boston Pops conducted a
summer Star Wars concert at Tanglewood.”
John Williams⇒ John Williams (composer)
– “John Williams lost a Taipei death match against
his brother, Axl Rotten.”
John Williams⇒ John Williams (wrestler)
– “John Williams won a Victoria Cross for his
actions at the battle of Rorke’s Drift.”
John Williams⇒ John Williams (VC)
Word Sense Disambiguation
• Some entities have multiple names:
– John Williams (composer)⇐ John Williams
– John Williams (composer)⇐ John Towner
Williams
– John Williams (wrestler)⇐ John Williams
– John Williams (wrestler)⇐ Ian Rotten
– Venus (planet)⇐ Venus
– Venus (planet)⇐ Morning Star
– Venus (planet)⇐ Evening Star
WSD
• Web searches
– Queries about Named Entities (NEs) constitute a significant portion of popular web queries.
– Ideally, search results are clustered such that:
• In each cluster, the queried name denotes the same entity.
• Each cluster is enriched by querying the web with alternative names of the corresponding entity.
• Web-based Information Extraction (IE)
– Aggregating extractions from multiple web pages can lead to improved accuracy in IE tasks (e.g. extracting relationships between NEs).
– Named entity disambiguation is essential for performing a meaningful aggregation.
Wikipedia Structures
• In general, there is a many-to-many
relationship between names and entities,
captured in Wikipedia through:
–Redirect articles.
–Disambiguation articles.
• Hyperlinks: An article may contain links to
other articles in Wikipedia.
• Categories: each article belongs to at
least one Wikipedia category.
Redirect Articles
• Redirect article:
– exists for each alternative name used to refer to an
entity in Wikipedia.
– Example: The article titled John Towner Williams
consists of a pointer to the article John Williams
(composer).
• Disambiguation article:
– lists all Wikipedia entities (articles) that may be
denoted by an ambiguous name.
– Example: The article titled John Williams
(disambiguation) lists 22 entities (articles).
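Redirect and disambiguation articles together give the many-to-many mapping between names and entities. A minimal sketch of building that mapping is shown below, using the examples from the slides; the function names are hypothetical, and the disambiguation list contains only the three John Williams entries mentioned in this deck rather than all 22.

```python
from collections import defaultdict


def build_name_index(redirects, disambiguations):
    """Map each name to the set of entities (article titles) it may denote."""
    index = defaultdict(set)
    for name, entity in redirects.items():       # one target per redirect
        index[name].add(entity)
    for name, entities in disambiguations.items():  # many targets per name
        index[name].update(entities)
    return index


def names_of(index, entity):
    """Inverse lookup: all names under which an entity can be found."""
    return sorted(name for name, entities in index.items() if entity in entities)


redirects = {
    "John Towner Williams": "John Williams (composer)",
    "Ian Rotten": "John Williams (wrestler)",
    "Morning Star": "Venus (planet)",
    "Evening Star": "Venus (planet)",
}
disambiguations = {
    "John Williams": [
        "John Williams (composer)",
        "John Williams (wrestler)",
        "John Williams (VC)",
    ],
}

index = build_name_index(redirects, disambiguations)
print(sorted(index["John Williams"]))
print(names_of(index, "John Williams (composer)"))
```

The index supplies the candidate entities for a name; choosing among them for a given sentence still requires the context-based disambiguation discussed on the preceding slides.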
Conclusion
• Overview of Wikipedia
• Knowledge organization in Wikipedia
• Accuracy of Wikipedia
• Application of Wikipedia in NLP