Quality, Quantity, Web and Semantics
![Page 1: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/1.jpg)
Quality to Quantity to Quality on the Web
Andraž Tori, CTO at Zemanta, @andraz
![Page 2: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/2.jpg)
Topics
- a bit about Zemanta
- how advanced “data tools” and spammers interact
![Page 3: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/3.jpg)
We are all trying to organize the web
![Page 4: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/4.jpg)
Making it right,
making it useful
and linked
![Page 5: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/5.jpg)
![Page 6: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/6.jpg)
![Page 7: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/7.jpg)
Not so long ago, in a city not far away...
![Page 8: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/8.jpg)
some other people
![Page 9: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/9.jpg)
are trying to do the opposite
![Page 10: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/10.jpg)
trying to disorganize it,
make it confusing,
and to profit from that
![Page 11: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/11.jpg)
![Page 12: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/12.jpg)
using the tools we have built!
![Page 13: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/13.jpg)
![Page 14: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/14.jpg)
Their motives are not sinister (mostly)
![Page 15: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/15.jpg)
it is about profit
![Page 16: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/16.jpg)
Profit
- publish as much content as possible
- quality is not (that) important
- get traffic or high page ranking for certain terms
- sell clicks, links or whole “fully built” sites to the highest bidder
- users and search engines are a necessary evil, to be tricked as cheaply as possible
![Page 17: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/17.jpg)
![Page 18: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/18.jpg)
So, why do I care?
![Page 19: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/19.jpg)
Job opening
You will get a spreadsheet with 180 blog url’s and logins. You will log into each blog and schedule 2 posts per week ...
You will spice up every post with images and/or related links within the content, using a Wordpress plugin called Zemanta
https://www.odesk.com/jobs/Wordpress-Blog-Poster_~~c8c04549b8e6b600
![Page 20: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/20.jpg)
And why might you care?
- organized information is a great tool for those who try to disorganize it
- they are poisoning “our web”, including Twitter and Facebook
- and it's hard to see in the fog they are causing
- it is just a matter of time before they start poisoning linked data too
![Page 21: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/21.jpg)
![Page 22: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/22.jpg)
What do we do at Zemanta?
![Page 23: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/23.jpg)
- Zemanta is a “personal writing assistant”
- suggesting content while you write (your blog)
- analyzing your text
- connecting it with background knowledge, other stories on the web, images
- you choose what suggestions to include
- to make your writing more informative, vivid and useful
![Page 24: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/24.jpg)
![Page 25: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/25.jpg)
![Page 26: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/26.jpg)
![Page 27: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/27.jpg)
Opening up the hood
![Page 28: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/28.jpg)
![Page 29: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/29.jpg)
![Page 30: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/30.jpg)
the reality
![Page 31: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/31.jpg)
![Page 32: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/32.jpg)
How it works
- Plain text (article) → Analysis → Semantic search → Content suggestions
- Data sources: RSS feeds, Linked data
![Page 33: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/33.jpg)
Main design goals
- Input is a meaningful chunk of text (not a keyword or a phrase)
- Input is in (semi-)English
- Has to work across all domains in the open world
- music, celebrities, finance, entertainment, politics, gardening, parenting, …
![Page 34: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/34.jpg)
Analysis pipeline
- Named entity extraction
- Known phrases extraction (Aho-Corasick)
- Triple store
- Surface form features evaluation
- Statistical comparison to background knowledge
- Semantic coherence and hand-tuned heuristics
- Disambiguated entities
- etc.
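The slide names Aho-Corasick for the known-phrases step but does not show an implementation. The following is a minimal sketch of that algorithm in plain Python, assuming the known phrases are simple lowercase strings; the function names `build_automaton` and `find_phrases` and the sample phrases are illustrative, not Zemanta's code.

```python
from collections import deque


def build_automaton(phrases):
    """Build the trie, failure links and output lists for Aho-Corasick."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix
    out = [[]]    # out[node] -> phrases that end at this node
    for phrase in phrases:
        node = 0
        for ch in phrase:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(phrase)
    # Breadth-first pass: shallower nodes are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit phrases that end at the suffix node.
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_phrases(text, automaton):
    """Return (start_index, phrase) for every known phrase found in text."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for phrase in out[node]:
            matches.append((i - len(phrase) + 1, phrase))
    return matches


# Hypothetical phrase list and input text, for illustration only.
automaton = build_automaton(["health insurance", "insurance", "barack obama"])
matches = find_phrases("barack obama on health insurance", automaton)
```

The text is scanned once regardless of how many phrases are known, which is presumably why this kind of matcher suits a dictionary of millions of aliases.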
![Page 35: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/35.jpg)
Analysis pipeline
- Named entity extraction
- Known phrases extraction (Aho-Corasick)
- Triple store
- Surface form features evaluation
- Statistical comparison to background knowledge
- Semantic coherence and hand-tuned heuristics
- Disambiguated entities
- Categorization to Dmoz categories
- Ambiguous named entities
- etc.
![Page 36: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/36.jpg)
Background knowledge
- Data from Wikipedia, MusicBrainz, Freebase… and world wild web
- Includes linguistic and semantic properties + unstructured data
- Present in two forms:
- in the “original” custom-built triple store on top of MySQL (150 GB)
- processed into a 7 GB optimized “memory-mapped dump”
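The slides do not describe the format of the memory-mapped dump. As a rough sketch of the general idea only, the example below writes sorted fixed-width records to a file and looks them up by binary search over an `mmap`, so the operating system pages data in on demand. The record layout (32-byte alias, 4-byte entity id), the file name and the sample aliases are all assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: 32-byte alias (null padded) + unsigned 32-bit id.
RECORD = struct.Struct("<32sI")


def write_dump(path, alias_to_id):
    """Write records sorted by encoded alias so they can be binary searched."""
    with open(path, "wb") as f:
        for alias in sorted(alias_to_id, key=lambda a: a.encode("utf-8")):
            f.write(RECORD.pack(alias.encode("utf-8"), alias_to_id[alias]))


def lookup(mm, alias):
    """Binary search the memory-mapped records; return the id or None."""
    key = alias.encode("utf-8")
    lo, hi = 0, len(mm) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        raw, entity_id = RECORD.unpack_from(mm, mid * RECORD.size)
        found = raw.rstrip(b"\0")
        if found == key:
            return entity_id
        if found < key:
            lo = mid + 1
        else:
            hi = mid
    return None


# Illustrative data: two aliases pointing at the same entity id.
path = os.path.join(tempfile.mkdtemp(), "aliases.dump")
write_dump(path, {"Obama": 1, "Barack Obama": 1, "Lifetime": 2})
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    obama_id = lookup(mm, "Obama")
    missing = lookup(mm, "Gardening")
```

A read-only mapping like this can be shared by several worker processes without each one loading its own copy, which is one plausible reason to prefer it over querying the MySQL store at request time.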
![Page 37: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/37.jpg)
Background knowledge
- 7M mined and linked up entities and concepts
- 30M aliases
- Refreshed about once a month
- want to make it real-time
- Input data quality is really important
![Page 38: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/38.jpg)
After analysis
- Text → Solr (articles) → Related articles
- Text → Solr (images) → Images
![Page 39: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/39.jpg)
Example SOLR query
![Page 40: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/40.jpg)
boost(((wiki_entities:Health insurance wiki_entities:Medical underwriting wiki_entities:United States wiki_entities:Affordable Care Act wiki_entities:Barack Obama wiki_entities:Lifetime (TV network) wiki_entities:Insurance wiki_entities:Preventive medicine wiki_entities:Child wiki_entities:Patient Protection and Affordable Care Act) ^3.0)
(text:zemhealthinsurq^0.68 text:health^0.62 text:premium^0.36 text:zeminsurcompaniq^0.56 text:increas^0.29 text:rate^0.27 text:zemhealthinsurcompaniq^0.35 text:zempreventcareq^0.26 text:medic^0.26 text:compani^0.23 text:obamacar^0.21 text:todai^0.21 text:polici^0.21 text:care^0.19) ^105.0
((dmoz_categories:Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Health dmoz_categories:Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Health/United_States dmoz_categories:Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Health/United_States/California) ^0.1),
(1 - 0.2) * sqrt(1.0/(1.15E-8*float(1285185600000 - date(published_datetime) ms)+1.0)) + 0.2)
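The last line of the query reads as a recency factor that multiplies the text score: it appears to take the article's age in milliseconds and decay from 1.0 toward a floor of 0.2. The sketch below evaluates that expression in Python with the constants copied from the slide; the function name `recency_boost` and the interpretation of the subtraction as "age in milliseconds" are assumptions.

```python
import math

FLOOR = 0.2             # lowest value the factor can approach (from the slide)
DECAY_PER_MS = 1.15e-8  # decay constant per millisecond of age (from the slide)


def recency_boost(age_ms):
    """Recency factor for an article that is age_ms milliseconds old."""
    return (1 - FLOOR) * math.sqrt(1.0 / (DECAY_PER_MS * age_ms + 1.0)) + FLOOR


DAY_MS = 24 * 60 * 60 * 1000
boosts = {days: recency_boost(days * DAY_MS) for days in (0, 1, 7, 30, 365)}
```

With these constants a fresh article gets a factor of about 1.0 and a one-day-old article about 0.77, so the decay is steep in the first days and flattens out afterwards, never pushing old articles below 0.2.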
![Page 41: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/41.jpg)
Solr
- We adapted Solr for “query by document”
- 52% precision (at 10) on internal evaluations
- plain Lucene MLT comes to 44%
- the difference comes from a “bag of terms” approach rather than “bag of words” (the terms come from the analysis step)
- Our live index is 5M articles
- Solr is really not optimized to handle 50 terms in a single query
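The precision figures above are "at 10", meaning the share of the first ten suggestions that are judged relevant. A minimal sketch of that metric follows; the function name and the sample identifiers are hypothetical, and how relevance was judged in the internal evaluations is not stated in the slides.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked results that are in the relevant set."""
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / k


# Illustrative data: ten suggested articles, five of them judged relevant.
ranked = ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "a10"]
relevant = {"a1", "a3", "a4", "a8", "a10", "a42"}
score = precision_at_k(ranked, relevant)
```

Averaging this score over a fixed set of test articles gives a single number that can be compared between releases or against a baseline such as Lucene's More Like This.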
![Page 42: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/42.jpg)
Lucene plain “More Like This”
![Page 43: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/43.jpg)
Metrics & tests
- Every part of the system is being constantly evaluated
- Precision/recall at 5 different points in the system
- Mostly bi-weekly releases of new datasets and the engine
![Page 44: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/44.jpg)
Overview
- We do pretty deep processing to deliver the simple user experience of a “personal authoring assistant”
- And everything is available over the web API
- tagging
- named entity recognition and disambiguation to Linked Open Data URIs
![Page 45: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/45.jpg)
What does the API offer?
- Tags
- Categories
- Concepts and entities
- Related articles
- Related images
Slide callouts: “Most used”, “Most interesting”
![Page 46: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/46.jpg)
![Page 47: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/47.jpg)
So mash-ups happen...
![Page 48: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/48.jpg)
Some API users
![Page 49: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/49.jpg)
We are just one of the many people offering services based on large amounts of web data
each spending man-years trying to organize their data, trying to offer the best possible service
![Page 50: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/50.jpg)
now back to the bad guys
![Page 51: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/51.jpg)
![Page 52: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/52.jpg)
Job opening
You will get a spreadsheet with 180 blog url’s and logins. You will log into each blog and schedule 2 posts per week ...
You will spice up every post with images and/or related links within the content, using a Wordpress plugin called Zemanta
https://www.odesk.com/jobs/Wordpress-Blog-Poster_~~c8c04549b8e6b600
![Page 53: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/53.jpg)
There's more than meets the eye
![Page 54: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/54.jpg)
- Gather search terms (extensions, logs, guess)
- Analyze → what people search for?
- Find / create such content
- Cover your tracks
- Use Zemanta or OpenCalais to add tags, images, links
- Pull additional content from Freebase
- Publish
- Use Zemanta to find similar blogs
- Amazon Mechanical Turk to post comments and links back to your site
- Profit?
![Page 55: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/55.jpg)
Warnings
- I have not seen a single system using the whole pipeline as described; however, all parts were found in the wild
- Examples used are from all kinds of sites – good, bad and ugly
- I am not trying to imply that all of the steps in the diagram are bad, but bad guys can use them efficiently
![Page 56: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/56.jpg)
![Page 57: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/57.jpg)
Finding their keywords, niches
- Domain expertise
- Users like to install extensions and say “yes”
- You observe referrers on sites you control
- You buy the data on the black market
![Page 58: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/58.jpg)
![Page 59: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/59.jpg)
The sophisticated part of the market
“Demand Media relies on a proprietary algorithm to help editors best determine what subjects their writers should tackle.”
Factors:
- Keyword competition
- Revenue
- Driving traffic to/from existing content
http://emediavitals.com/article/16/demand-media-s-content-assembly-line
![Page 60: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/60.jpg)
![Page 61: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/61.jpg)
Find / create content
- Steal
- Take from “open article directories”
- Have your own “content assembly line” like Demand Media
![Page 62: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/62.jpg)
Open article directories
![Page 63: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/63.jpg)
![Page 64: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/64.jpg)
Tһiѕ iѕ nοt the text you аre lookinɡ for.
![Page 65: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/65.jpg)
Tһiѕ iѕ nοt the text you аre lookinɡ for. (substituted look-alike characters: һ, ѕ, ο, а, ɡ)
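The sentence on these two slides swaps some Latin letters for look-alike characters from other scripts, so it reads normally to a person but no longer matches the original text byte for byte. As a small sketch of how such substitutions could be spotted and undone, the example below uses a hand-made table covering only the five characters seen on the slide; a real detector would need a far larger confusables table, and the function names are illustrative.

```python
# Look-alike characters taken from the slide text, mapped to plain ASCII.
# This tiny table is an assumption for illustration, not a complete list.
CONFUSABLES = {
    "\u04bb": "h",  # CYRILLIC SMALL LETTER SHHA
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0261": "g",  # LATIN SMALL LETTER SCRIPT G
}


def normalize(text):
    """Replace known look-alike characters with their ASCII counterparts."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)


def suspicious_chars(text):
    """Return the distinct look-alike characters that occur in the text."""
    return sorted({ch for ch in text if ch in CONFUSABLES})


disguised = "T\u04bbi\u0455 i\u0455 n\u03bft the text you \u0430re lookin\u0261 for."
restored = normalize(disguised)
found = suspicious_chars(disguised)
```

Normalizing before comparison lets a duplicate detector match the disguised copy against the original, and a high count of such characters in otherwise English text is itself a useful warning sign.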
![Page 66: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/66.jpg)
Translate it to random language and back to English
- Übersetzen sie zufällig Sprache und wieder auf Englisch → Language and translate it happen again in English
- Μεταφράστε αυτό σε δειγματοληπτικούς γλώσσα και πίσω στην αγγλική γλώσσα → Translate this random language back to English
- Traduisez au langage aléatoire et revenir à l'anglais → Translate to random language to English and back
- 它翻译成随机的语言和回英文 → Translate it back into the English language and random
![Page 67: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/67.jpg)
Covering their tracks
- Trying to fool search engines or people?
- Search engines are catching up
- Google Translate API is being closed due to “abuse”?
- The trend is “rewriting” by human editors, procured on the global market
![Page 68: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/68.jpg)
![Page 69: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/69.jpg)
![Page 70: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/70.jpg)
Spammers say the darndest things
![Page 71: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/71.jpg)
![Page 72: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/72.jpg)
![Page 73: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/73.jpg)
![Page 74: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/74.jpg)
Remixing linked data and spam
- Currently mostly the good guys are using Linked Data
- However, it's just too tempting to be left alone
- Fully synthetic articles using factual information from linked data?
- Using advanced tools to form proper natural-language sentences and maybe even a storyline?
![Page 75: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/75.jpg)
![Page 76: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/76.jpg)
Publish
- On hosted third party platforms
- eating their resources
- Platforms have a hard time killing spammers
- Smaller ones don't necessarily have the incentive
- If they remove a spammer too fast, it is easier for the spammer to probe the limits
- Platforms use “kill with delay”
- Spam detection is resource intensive
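The slides mention "kill with delay" without saying how platforms implement it. One way to picture it, purely as an assumed sketch, is a queue that schedules each flagged account for removal after a random delay, so a spammer cannot easily tell which action triggered detection. The class name, the delay bounds and the use of a random delay are all illustrative assumptions.

```python
import random


class DelayedRemoval:
    """Schedules flagged accounts for removal after a random delay."""

    def __init__(self, min_delay, max_delay, rng=None):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.rng = rng or random.Random()
        self.remove_at = {}  # account -> time at which it should be removed

    def flag(self, account, now):
        # Keep the first scheduled time so repeated flags do not postpone removal.
        if account not in self.remove_at:
            delay = self.rng.uniform(self.min_delay, self.max_delay)
            self.remove_at[account] = now + delay

    def due(self, now):
        """Return accounts whose removal time has passed and forget them."""
        ready = sorted(a for a, t in self.remove_at.items() if t <= now)
        for account in ready:
            del self.remove_at[account]
        return ready


# Illustrative run with abstract time units.
queue = DelayedRemoval(min_delay=10, max_delay=20, rng=random.Random(0))
queue.flag("spam-blog-1", now=0)
early = queue.due(now=5)
late = queue.due(now=20)
```

The trade-off is the one the bullets describe: the delay hides the detection threshold from the spammer, but the spam stays online and keeps consuming resources until the removal fires.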
![Page 77: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/77.jpg)
![Page 78: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/78.jpg)
Valuable comments
As I write this post, Zemanta is showing me 5 different articles that are related to my post. I could visit each one of these sites and reach out to the owner to see if they would be interested in linking to my post, or I could leave a valuable comment on the page and include a link back to my post.
http://www.mainelyseo.com/zemanta-review-seo-link-building-with-the-zemanta-plugin/
![Page 79: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/79.jpg)
- The guy on the previous slide is honest and well-meaning
- But what if you automate that via Amazon Mechanical Turk or oDesk?
![Page 80: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/80.jpg)
![Page 81: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/81.jpg)
Profit?
- sell ads
- sell links
- sell “fully developed site”
- to the highest bidder
![Page 82: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/82.jpg)
Search engines to the rescue?
- Mahalo cut 10% of the staff the day after Google announced ranking changes
- Demand Media's stock isn't doing that well anymore
- However, this is a never-ending story; we'll have co-evolution for the foreseeable future
![Page 83: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/83.jpg)
Ecosystem
- Very sophisticated, large players
- moving to higher-quality content, video?
- Small-time operations
- using more and more sophisticated tools available on the market cheaply (modern asymmetric warfare?)
- Dark industry specifically building tools to poison the web and sell them to small-time operators
![Page 84: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/84.jpg)
Food for thought
![Page 85: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/85.jpg)
Can we make spammers (and others) work for us, making linked data better?
(think reCAPTCHA)
![Page 86: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/86.jpg)
Could article directories be fruitfully used?
eZineArticles.com, GoArticles.com, etc...
![Page 87: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/87.jpg)
Find rewritten articles and use them as parallel corpus?
![Page 88: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/88.jpg)
Could we use the global workforce market more efficiently to get more linked data?
![Page 89: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/89.jpg)
Thesis, antithesis, synthesis?
http://xkcd.com/810/
![Page 90: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/90.jpg)
Thank you!
Questions?
![Page 91: Quality, Quantity, Web and Semantics](https://reader034.fdocuments.net/reader034/viewer/2022051413/554c9c55b4c905c10d8b4fd8/html5/thumbnails/91.jpg)
Image sources
http://www.flickr.com/photos/dzingeek/4587871752/
http://www.flickr.com/photos/25101572@N02/4393474025/
http://www.flickr.com/photos/billward/4740384434/
http://www.flickr.com/photos/jurvetson/542500748
http://www.flickr.com/photos/legofenris/4288913574
http://www.flickr.com/photos/ekilby/3733627940
http://www.flickr.com/photos/ekilby/3732799269/
http://www.flickr.com/photos/cipherswarm/38354452
http://xkcd.com/810/