Social Data and Log Analysis Using MongoDB
2011/03/01 (Tue) #mongotokyo
doryokujin
Self-Introduction
• doryokujin (Takahiro Inoue), Age: 25
• Education: Keio University; Master of Mathematics, March 2011 (maybe...)
• Major: Randomized Algorithms and Probabilistic Analysis
• Company: Geisha Tokyo Entertainment (GTE); Data Mining Engineer (the only one, part-time)
• Communities I organize: MongoDB JP, Tokyo Web Mining
My Job
• I’m a Fledgling Data Scientist
• Development of analytical systems for social data
• Development of recommendation systems for social data
• My Interest: Big Data Analysis
• How to collect logs scattered across many servers
• How to store and access the data
• How to analyze and visualize billions of records
Agenda
• My Company’s Analytic Architecture
• How to Handle Access Logs
• How to Handle User Trace Logs
• How to Collaborate with Front Analytic Tools
• My Future Analytic Architecture
Of Course Everything With
Hadoop, Mongo Map Reduce,
Schema Free,
REST Interface, JSON,
Capped Collection, Modifier Operation
My Company’s Analytic Architecture
Social Game (Mobile): Omiseyasan
• Enjoy arranging your own shop (and avatar)
• Communicate with other users through shopping, part-time jobs, ...
• Buy seeds of items to display in your own shop
Data Flow
Flash / Compose Server
User Game Save Data
Access Logs
User Registration / Charge
User Trace Logs
Access
Back-end Architecture
User Registration / Charge
User Trace Logs / Access Logs / User Game Save Data
Pretreatment: Trimming, Validation, Filtering, ...
As a Central Data Server
Back Up To S3
PyMongo
Dumbo (Hadoop Streaming)
Front-end Architecture
Social Data Analysis Data Analysis
Web UI
sleepy.mongoose(REST Interface)
PyMongo
Environment
• MongoDB: 1.6.4
• PyMongo: 1.9
• Hadoop: CDH2 ( soon update to CDH3 )
• Dumbo: Simple Python Module for Hadoop Streaming
• Cassandra: 0.6.11
• R, Neo4j, jQuery, Munin, ...
• [Data Size (a rough estimate)]
• Access Log 15GB / day ( gzip ) - 2,000M PV
• User Trace Log 5GB / day ( gzip )
How to Handle Access Logs
How to Handle Access Logs
User Registration / Charge
User Trace Logs / Access Logs / User Game Save Data
Pretreatment: Trimming, Validation, Filtering, ...
As a Data Server
Back Up To S3
Access Data Flow
user_access
user_pageview
daily_pageview / agent_pageview
hourly_pageview
Access Logs
Pretreatment
1st Map Reduce
2nd Map Reduce
Group by
Caution: need MongoDB >= 1.7.4
Hadoop
• Using Hadoop: Pretreatment of Raw Records
• [Map / Reduce]
• Read all records
• Split each record by whitespace ('\s')
• Filter unnecessary records (such as *.swf)
• Check whether each record is correct
• Insert (save) records into MongoDB
※ write operations won’t yet fully utilize all cores
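The pretreatment steps above can be sketched as a plain Python function, as it might look inside a Hadoop Streaming mapper. The regex and field names here are assumptions, not the talk's actual code:

```python
import re

# Combined-log-format pattern (simplified; the real job's parser is
# not shown in the talk, so treat this as an assumption):
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+)')

def pretreat(line):
    """Parse one raw access-log record; return a dict ready to insert
    into MongoDB, or None when the record should be filtered out."""
    m = LOG_RE.match(line)
    if m is None:                  # validation: drop malformed records
        return None
    doc = m.groupdict()
    path = doc["path"].split(";")[0].split("?")[0]
    if path.endswith(".swf"):      # filtering: unnecessary records
        return None
    doc["size"] = int(doc["size"])
    return doc
```

In a streaming mapper you would read lines from stdin and either emit the parsed doc or insert it via PyMongo.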
110.44.178.25 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/battle/
BattleSelectAssetPage.html;jsessionid=9587B0309581914AB7438A34B1E51125-n15.at3?collec\
tion=12&opensocial_app_id=00000&opensocial_owner_id=00000 HTTP/1.0" 200 6773 "-"
"DoCoMo/2.0 ***"
110.44.178.26 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/shopping/battle/
ShoppingBattleTopPage.html;jsessionid=D901918E3CAE46E6B928A316D1938C3A-n11.a\
p1?opensocial_app_id=00000&opensocial_owner_id=11111 HTTP/1.0" 200 15254 "-"
"DoCoMo/2.0 ***"
110.44.178.27 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/battle/
BattleSelectAssetDetailPage;jsessionid=202571F97B444370ECB495C2BCC6A1D5-n14.at11?asse\
t=53&collection=9&opensocial_app_id=00000&opensocial_owner_id=22222 HTTP/1.0" 200
11616 "-" "SoftBank/***"
...(many records)
Access Logs
> db.user_trace.find({user: "7777", date: "2011-02-12"}).limit(0)
.forEach(printjson)
{
"_id" : "2011-02-12+05:39:31+7777+18343+Access",
"lastUpdate" : "2011-02-19",
"ipaddr" : "202.32.107.166",
"requestTimeStr" : "12/Feb/2011:05:39:31 +0900",
"date" : "2011-02-12",
"time" : "05:39:31",
"responseBodySize" : 18343,
"userAgent" : "DoCoMo/2.0 SH07A3(c500;TB;W24H14)",
"statusCode" : "200",
"splittedPath" : "/avatar2-gree/MyPage",
"userId" : "7777",
"resource" : "/avatar2-gree/MyPage;jsessionid=...?
battlecardfreegacha=1&feed=...&opensocial_app_id=...&opensocial_viewer_id=...&
opensocial_owner_id=..."
}
Collection: user_trace
1st Map Reduce
• [Aggregation]
• Group by url, date, userId
• Group by url, date, userAgent
• Group by url, date, time
• Group by url, date, statusCode
• Map Reduce operations run in parallel on all shards
map = Code("""
function(){
emit({
path:this.splittedPath,
userId:this.userId,
date:this.date
},1)}
""")
reduce = Code("""
function(key, values){
var count = 0;
values.forEach(function(v) {
count += 1;
});
return {"count": count, "lastUpdate": today};
}
""")
• this.userId
• this.userAgent
• this.timeRange
• this.statusCode
1st Map Reduce with PyMongo
# ( mongodb >= 1.7.4 )
result = db.user_access.map_reduce(map,
                                   reduce,
                                   merge_output="user_pageview",
                                   full_response=True,
                                   query={"date": date})
• For the output collection there are 4 options (MongoDB >= 1.7.4):
• out: overwrite the output collection if it already exists
• merge_output: merge the new data into the old output collection
• reduce_output: a reduce operation is performed on the two values (the same key in the new result and the old collection) and the result is written to the output collection
• full_response (=False): if True, return full stats on the operation
• inline: no collection is created and the whole map-reduce operation happens in RAM; the result set must fit within the 8MB/doc limit (16MB/doc in 1.8?)
Map Reduce (>=1.7.4):out option in JavaScript
• "collectionName" : If you pass a string indicating the name of a collection, then the output will replace any existing output collection with the same name.
• { merge : "collectionName" } : This option will merge new data into the old output collection. In other words, if the same key exists in both the result set and the old collection, the new key will overwrite the old one.
• { reduce : "collectionName" } : If documents exists for a given key in the result set and in the old collection, then a reduce operation (using the specified reduce function) will be performed on the two values and the result will be written to the output collection. If a finalize function was provided, this will be run after the reduce as well.
• { inline : 1} : With this option, no collection will be created, and the whole map-reduce operation will happen in RAM. Also, the results of the map-reduce will be returned within the result object. Note that this option is possible only when the result set fits within the 8MB limit.
http://www.mongodb.org/display/DOCS/MapReduce
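Which of these out options a given PyMongo release exposes as map_reduce keyword arguments varies, so one portable way is to build the mapreduce command document yourself and run it with db.command(). A hedged sketch (option names follow the JS docs above; the helper is my own):

```python
def map_reduce_command(source, map_js, reduce_js, out, query=None):
    """Build a `mapreduce` command document for db.command().
    `out` may be a collection name (replace), {"merge": name},
    {"reduce": name}, or {"inline": 1} on MongoDB >= 1.7.4.
    Note: the command name must come first in the document; on
    Pythons without ordered dicts use bson.SON instead."""
    cmd = {"mapreduce": source, "map": map_js, "reduce": reduce_js,
           "out": out}
    if query is not None:
        cmd["query"] = query
    return cmd

# e.g. merge today's result into the existing user_pageview collection
merge_cmd = map_reduce_command("user_access", "function(){...}",
                               "function(k,v){...}",
                               {"merge": "user_pageview"},
                               query={"date": "2011-02-12"})
```

You would then run it as `db.command(merge_cmd)` on a live connection.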
> db.user_pageview.find({
    "_id.userId": "7777",
    "_id.path": /.*MyPage$/,
    "_id.date": {$lte: "2011-02-12"}
  }).limit(1).forEach(printjson)
{
"_id" : {
"date" : "2011-02-12",
"path" : "/avatar2-gree/MyPage",
"userId" : "7777",
},
"value" : {
"count" : 10,
"lastUpdate" : "2011-02-19"
}
}
• Regular Expression
• <, >, <=, >=
Collection: user_pageview
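To see exactly what the regular expression and range operator select, here is a tiny pure-Python evaluator covering just the operators used above (an illustration only; this matcher is not part of the talk):

```python
import re

# The shell query restated in PyMongo terms (field names from the slides):
query = {
    "_id.userId": "7777",
    "_id.path": re.compile(r".*MyPage$"),   # regular expression
    "_id.date": {"$lte": "2011-02-12"},     # range operator; ISO date
}                                           # strings sort chronologically

def matches(doc, query):
    """Evaluate dotted paths, compiled-regex values, and $lte."""
    for path, cond in query.items():
        value = doc
        for part in path.split("."):        # resolve "_id.userId" etc.
            value = value[part]
        if isinstance(cond, dict) and "$lte" in cond:
            if not value <= cond["$lte"]:
                return False
        elif hasattr(cond, "match"):        # compiled regex
            if not cond.match(value):
                return False
        elif value != cond:                 # plain equality
            return False
    return True
```

The $lte comparison works on these dates precisely because "YYYY-MM-DD" strings sort lexicographically in chronological order.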
map = Code("""
function(){
emit({
"path" : this._id.path,
"date": this._id.date,
},{
"pv": this.value.count,
"uu": 1
});
}
""")
reduce = Code("""
function(key, values){
var pv = 0;
var uu = 0;
values.forEach(function(v){
pv += v.pv;
uu += v.uu;
});
return {"pv": pv, "uu": uu};
}
""")
2nd Map Reduce with PyMongo
The reduce return value must use the same keys as the emitted value (you get {"pv": NaN} if not)
# ( mongodb >= 1.7.4 )
result = db.user_pageview.map_reduce(map,
                                     reduce,
                                     merge_output="daily_pageview",
                                     full_response=True,
                                     query={"date": date})
> db.daily_pageview.find({
"_id.date": "2011-02-12",
"_id.path": /.*MyPage$/
}).limit(1).forEach(printjson)
{
"_id" : {
"date" : "2011-02-12",
"path" : "/avatar2-gree/MyPage",
},
"value" : {
"uu" : 53536,
"pv" : 539467
}
}
Collection: daily_pageview
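The pv/uu trick above is worth spelling out: each user_pageview document carries that user's view count as pv and the constant 1 as uu, so summing uu counts unique users. A plain-Python rendition of the second phase (for illustration, not the production job):

```python
from collections import defaultdict

def second_phase(user_pageview_docs):
    """Fold per-(path, date, user) counts into per-(path, date)
    page views (pv) and unique users (uu)."""
    totals = defaultdict(lambda: {"pv": 0, "uu": 0})
    for doc in user_pageview_docs:
        key = (doc["_id"]["path"], doc["_id"]["date"])
        totals[key]["pv"] += doc["value"]["count"]  # all views by one user
        totals[key]["uu"] += 1                      # one input doc per user
    return dict(totals)
```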
Current Map Reduce is Imperfect
• [Single Thread per Node]
• Doesn't scale map-reduce across multiple threads
• [Overwrites the Output Collection]
• Overwrites the old collection (no other options like "merge" or "reduce")
# map-reduce code to merge output manually (MongoDB < 1.7.4)
result = db.user_access.map_reduce(map,
                                   reduce,
                                   full_response=True,
                                   out="temp_collection",
                                   query={"date": date})
for doc in db.temp_collection.find():
    db.user_pageview.save(doc)
Useful Reference: Map Reduce
• http://www.mongodb.org/display/DOCS/MapReduce
• A Look At MongoDB 1.8's MapReduce Changes
• Map Reduce and Getting Under the Hood with Commands
• Map/reduce runs in parallel/distributed?
• Map/Reduce parallelism with Master/Slave
• mapReduce locks the whole server
• mapreduce vs find
How to Handle User Trace Logs
How to Handle User Trace Logs
User Registration / Charge
User Trace Logs / Access Logs / Game Save Data
Pretreatment: Trimming, Validation, Filtering, ...
As a Data Server
Back Up To S3
User Trace / Charge Data Flow
user_trace
user_charge
daily_charge
daily_trace
User Trace Logs
Pretreatment
User Registration / Charge
User Trace Log
Hadoop
• Using Hadoop: Pretreatment of Raw Records
• [Map / Reduce]
• Split each record by whitespace ('\s')
• Filter unnecessary records
• Check whether a user behaves dishonestly
• Unify the format so records can be summed up (raw records are written in a free format)
• Sum up records grouped by "userId" and "actionType"
• Insert (save) records into MongoDB
※ write operations won't yet fully utilize all cores
An Example of User Trace Log
UserId ActionType ActionDetail
-----Change-----
ActionLogger a{ChangeP} (Point,1371,1383)
ActionLogger a{ChangeP} (Point,2373,2423)
-----Get-----
ActionLogger a{GetMaterial} (syouhinnomoto,0,-1)
ActionLogger a{GetMaterial} usesyouhinnomoto
ActionLogger a{GetMaterial} (omotyanomotoPRO,1,6)
-----Trade-----
ActionLogger a{Trade} buy 3 itigoke-kis from gree.jp:00000 # seen from the other side, this is a sale
-----Make-----
ActionLogger a{Make} make item kuronekono_n
ActionLogger a{MakeSelect} make item syouhinnomoto
ActionLogger a{MakeSelect} (syouhinnomoto,0,1)
-----PutOn/Off-----
ActionLogger a{PutOff} put off 1 ksuteras
ActionLogger a{PutOn} put 1 burokkus @2500
-----Clear/Clean-----
ActionLogger a{ClearLuckyStar} Clear LuckyItem_1 4 times
-----Gacha-----
ActionLogger a{Gacha} Play gacha with first free play:わくわくおみせ服ガチャ
ActionLogger a{Gacha} Play gacha:わくわくおみせ服ガチャ
The value of "actionDetail" must be in a unified format
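One way this unification and summing could work, sketched in Python (the regex and document shape are my assumptions based on the samples above):

```python
import re
from collections import defaultdict

# Assumed shape of a normalized ActionLogger line:
ACTION_RE = re.compile(r"ActionLogger (?P<actionType>a\{\w+\}) (?P<detail>.+)")

def sum_up(lines, user_id, date):
    """Normalize free-format ActionLogger lines, then count occurrences
    of each actionDetail grouped by actionType, mirroring user_trace."""
    grouped = defaultdict(lambda: defaultdict(int))
    for line in lines:
        m = ACTION_RE.match(line)
        if m is None:
            continue                        # filter unparsable records
        grouped[m.group("actionType")][m.group("detail")] += 1
    return [{"_id": "%s+%s+%s" % (date, user_id, atype),
             "userId": user_id, "date": date, "actionType": atype,
             "actionDetail": dict(details)}
            for atype, details in grouped.items()]
```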
> db.user_trace.find({date: "2011-02-12",
                      actionType: "a{Make}",
                      userId: "7777"}).forEach(printjson)
{
"_id" : "2011-02-12+7777+a{Make}",
"date" : "2011-02-12",
"lastUpdate" : "2011-02-19",
"userId" : "7777",
"actionType" : "a{Make}",
"actionDetail" : {
"make item ksutera" : 3,
"make item makaron" : 1,
"make item huwahuwamimiate" : 1,
…
}
}
Collection: user_trace
Sum up values group by “userId” and “actionType”
> db.daily_trace.find({
date: {$gte: "2011-02-12", $lte: "2011-02-19"},
actionType:"a{Make}"}).forEach(printjson)
{
"_id" : "2011-02-12+group+a{Make}",
"date" : "2011-02-12",
"lastUpdate" : "2011-02-19",
"actionType" : "a{Make}",
"actionDetail" : {
"make item kinnokarakuridokei" : 615,
"make item banjo-" : 377,
"make item itigoke-ki" : 135904,
...
},
...
}...
Collection: daily_trace
User Charge Log
// Top 10 users by charge on 2011-02-12
> db.user_charge.find({date:"2011-02-12"})
.sort({totalCharge:-1}).limit(10).forEach(printjson)
{
"_id" : "2011-02-12+7777+Charge",
"date" : "2011-02-12",
"lastUpdate" : "2011-02-19",
"totalCharge" : 10000,
"userId" : "7777",
"actionType" : "Charge",
"boughtItem" : {
"アクセサリーの素EX" : 13,
"コネルギー+6000" : 3,
"アクセサリーの素PRO" : 20
}
}
{…
Collection: user_charge
Sum up values group by “userId” and “actionType”
> db.daily_charge.find({date:"2011-02-12",T:"all"})
.limit(10).forEach(printjson)
{
"_id" : "2011-02-12+group+Charge+all+all",
"date" : "2011-02-12",
"total" : 100000,
"UU" : 2000,
"group" : {
"わくわくポイント" : 1000000,
"アクセサリー" : 1000000, ...
},
"boughtItemNum" : {
"料理の素EX" : 8,
"アクセサリーの素" : 730, ...
},
"boughtItem" : {
"料理の素EX" : 10000,
"アクセサリーの素" : 100000, ...
}
}
Collection: daily_charge
Categorize Users
user_registration
user_category
• [Categorize Users]
• by play term
• by total amount of charge
• by registration date
• [Take a Snapshot of Each Category's Stats per Week]
Attribution
Categorize Users
user_trace
user_charge
user_savedata
user_pageview
> db.user_registration.find({userId: "7777"}).forEach(printjson)
{
"_id" : "2010-06-29+7777+Registration",
"userId" : "7777",
"actionType" : "Registration",
"category" : {
"R1" : "True", # categorized by whether the user has resigned
"T" : "ll"     # categorized by play term
…
},
"firstCharge" : "2010-07-07", # date of first charge
"lastLogin" : "2010-09-30",   # date of last access
"playTerm" : 94,
"totalCumlativeCharge" : 50000, # total amount charged
"totalMonthCharge" : 10000,     # amount charged in the last month
…
}
Collection: user_registration
Tagging User
> var cross = new Cross() # user-defined function
> MCResign = cross.calc("2011-02-12", "MC", 1)
# each value is the number of users
# Charge(yen)/Term(day)
0(z) ~¥1k(s) ~¥10k(m) ¥100k~(l) total
~1day(z) 50000 10 5 0 50015
~1week(s) 50000 100 50 3 50153
~1month(m) 100000 200 100 1 100301
~3month(l) 100000 300 50 6 100356
month~(ll) 0 0 0 0 0
Collection: user_category
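The cross table above could be computed from user_registration documents roughly as follows. The bucket labels follow the table, but the exact cut points are my assumptions (the table's own column labels skip the ¥10k–¥100k range):

```python
def charge_bucket(yen):
    """Column label: z (=0), s (~1k yen), m (~10k yen), l (above)."""
    if yen == 0:      return "z"
    if yen <= 1000:   return "s"
    if yen <= 10000:  return "m"
    return "l"

def term_bucket(days):
    """Row label: z (~1 day), s (~1 week), m (~1 month), l (~3 months), ll."""
    if days <= 1:   return "z"
    if days <= 7:   return "s"
    if days <= 30:  return "m"
    if days <= 90:  return "l"
    return "ll"

def cross(users):
    """Count users in each (term, charge) cell of the table."""
    table = {}
    for u in users:
        key = (term_bucket(u["playTerm"]),
               charge_bucket(u["totalCumlativeCharge"]))
        table[key] = table.get(key, 0) + 1
    return table
```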
How to Collaborate WithFront Analytic Tools
Front-end Architecture
Social Data Analysis Data Analysis
Web UI
sleepy.mongoose(REST Interface)
PyMongo
Web UI and Mongo
Data Table: jQuery.DataTables
• Want to Share Daily Summary
• Want to See Data from Many Viewpoint
• Want to Implement Easily
• jQuery.DataTables
1. Variable length pagination
2. On-the-fly filtering
3. Multi-column sorting with data type detection
4. Smart handling of column widths
5. Scrolling options for table viewport
6. ...
Graph: jQuery.HighCharts
• Want to Visualize Data
• Handle Time Series Data Mainly
• Want to Implement Easily
• jQuery.HighCharts
1. Numerous Chart Types
2. Simple Configuration Syntax
3. Multiple Axes
4. Tooltip Labels
5. Zooming
6. ...
sleepy.mongoose
• [REST Interface + Mongo]
• Get Data by HTTP GET/POST Request
• sleepy.mongoose
‣ request as "/db_name/collection_name/_command"
‣ made by a 10gen engineer: @kchodorow
‣ Sleepy.Mongoose: A MongoDB REST Interface
//start server
> python httpd.py
…listening for connections on http://localhost:27080
//connect to MongoDB
> curl --data server=localhost:27017 'http://localhost:27080/_connect'
//request example
> http://localhost:27080/playshop/daily_charge/_find?criteria={}&limit=10&batch_size=10
{"ok": 1, "results": [{"_id": "…", "date": … }, {"_id": …}], "id": 0}
sleepy.mongoose
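The same _find request can be composed from Python. This is a sketch of the URL shape only (sleepy.mongoose expects the criteria as URL-encoded JSON; the helper is my own):

```python
import json
from urllib.parse import urlencode

def find_url(base, db, collection, criteria, limit=10, batch_size=10):
    """Compose a sleepy.mongoose /db/collection/_find GET URL."""
    params = urlencode({"criteria": json.dumps(criteria),
                        "limit": limit, "batch_size": batch_size})
    return "%s/%s/%s/_find?%s" % (base, db, collection, params)

url = find_url("http://localhost:27080", "playshop", "daily_charge", {})
```

Fetching `url` with any HTTP client then returns the {"ok": 1, "results": [...]} JSON shown above.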
JSON: Mongo <---> Ajax
• jQuery libraries and MongoDB are naturally compatible (both speak JSON)
• It is not necessary to write HTML tags (such as <table>) by hand
sleepy.mongoose(REST Interface)
Example: Web UI
R and Mongo
> db.user_registration.find({userId: "7777"}).forEach(printjson)
{
"_id" : "2010-06-29+7777+Registration",
"userId" : "7777",
"actionType" : "Registration",
"category" : {
"R1" : "True", # categorized by whether the user has resigned
"T" : "ll"     # categorized by play term
…
},
"firstCharge" : "2010-07-07", # date of first charge
"lastLogin" : "2010-09-30",   # date of last access
"playTerm" : 94,
"totalCumlativeCharge" : 50000, # total amount charged
"totalMonthCharge" : 10000,     # amount charged in the last month
…
}
Collection: user_registration
Want to know the relation between user attributes
##### LOAD LIBRARY #####
library(RCurl)
library(rjson)
##### CONF #####
today.str <- format(Sys.time(), "%Y-%m-%d")
url.base <- "http://localhost:27080"
mongo.db <- "playshop"
mongo.col <- "user_registration"
mongo.base <- paste(url.base, mongo.db, mongo.col, sep="/")
mongo.sort <- ""
mongo.limit <- "limit=100000"
mongo.batch <- "batch_size=100000"
R Code: Access MongoDB Using sleepy.mongoose
##### FUNCTION #####
find <- function(url){
  mongo <- fromJSON(getURL(url))
  docs <- mongo$results
  makeTable(docs)  # my own function
}
# Example
# Using sleepy.mongoose https://github.com/kchodorow/sleepy.mongoose
mongo.criteria <- "_find?criteria={\"totalCumlativeCharge\":{\"$gt\":0,\"$lte\":1000}}"
mongo.query <- paste(mongo.criteria, mongo.sort,
                     mongo.limit, mongo.batch, sep="&")
url <- paste(mongo.base, mongo.query, sep="/")
user.charge.low <- find(url)
R Code: Access MongoDB Using sleepy.mongoose
# Result: 10th document
[[10]]
[[10]]$playTerm
[1] 31

[[10]]$lastUpdate
[1] "2011-02-24"

[[10]]$userId
[1] "7777"

[[10]]$totalCumlativeCharge
[1] 10000

[[10]]$lastLogin
[1] "2011-02-21"

[[10]]$date
[1] "2011-01-22"

[[10]]$`_id`
[1] "2011-02-12+18790376+Registration"
...
The Result
# Result: translate the documents into a table
      playTerm totalWinRate totalCumlativeCharge totalCommitNum totalWinNum
 [1,]       56           42                 1000            533         224
 [2,]       57           33                 1000            127          42
 [3,]       57           35                 1000            654         229
 [4,]       18           31                 1000             49          15
 [5,]       77           35                 1000            982         345
 [6,]       77           45                 1000            339         153
 [7,]       31           44                 1000             70          31
 [8,]       76           39                 1000            229          89
 [9,]       40           21                 1000            430          92
[10,]       26           40                 1000             25          10
...
Make a Data Table from The Result
Scatter Plot / Matrix
Each Category
(User Attribution)
# Run as a batch command
$ R --vanilla --quiet < mongo2R.R
Munin and MongoDB
Monitoring DB Stats
https://github.com/erh/mongo-munin
https://github.com/osinka/mongo-rs-munin
Munin configuration examples - MongoDB
My Future Analytic Architecture
user_access
user_trace
User Trace Logs
Access Logs
capped collection (per hour)
Trimming / Filtering / Sum Up
Real Time (hourly)
Flume
daily/hourly_access
daily/hourly_trace
capped collection (per hour)
Map Reduce / Modifier / Sum Up
Real Time (hourly)
Realtime Analysis with MongoDB
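A hypothetical sketch of that hourly path: each event read from the capped collection is folded into an hourly sum with a $inc modifier upsert. Collection and field names here are my assumptions, not the talk's code:

```python
def hourly_update(event):
    """Return the (query, modifier) pair for an upsert like
    db.hourly_pageview.update(query, modifier, upsert=True):
    the $inc modifier adds one page view to the event's hour bucket."""
    hour = event["time"].split(":")[0] + ":00"   # "05:39:31" -> "05:00"
    query = {"_id": {"path": event["splittedPath"],
                     "date": event["date"], "hour": hour}}
    modifier = {"$inc": {"value.pv": 1}}
    return query, modifier
```

Because $inc with upsert is atomic per document, many collectors can apply these updates concurrently without a separate aggregation pass.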
Flume
Server A
Server B
Server C
Server D
Server E
Server F
Collector MongoDB
Access Log / User Trace Log
Hourly / Realtime
Flume Plugin
> db.flume_capped_21.find().limit(1).forEach(printjson)
{
"_id" : ObjectId("4d658187de9bd9f24323e1b6"),
"timestamp" : "Wed Feb 23 2011 21:52:06 GMT+0000 (UTC)",
"nanoseconds" : NumberLong("562387389278959"),
"hostname" : "ip-10-131-27-115.ap-southeast-1.compute.internal",
"priority" : "INFO",
"message" : "202.32.107.42 - - [14/Feb/2011:04:30:32 +0900] "GET /avatar2-gree.4d537100/res/swf/avatar/18051727/5/useravatar1582476746.swf?opensocial_app_id=472&opensocial_viewer_id=36858644&o
pensocial_owner_id=36858644 HTTP/1.1" 200 33640 "-" "DoCoMo/2.0 SH01C(c500;TB;W24H16)"",
"metadata" : {}
}
An Output From Mongo-Flume Plugin
Mongo Flume Plugin: https://github.com/mongodb/mongo-hadoop/tree/master/flume_plugin
Summary
Summary
• Almighty as an Analytic Data Server
• schema-free: social game data are changeable
• rich queries: important for analyzing from many points of view
• powerful aggregation: map reduce
• mongo shell: analyzing from the mongo shell is speedy and handy
• More...
• Scalability: replication and sharding are very easy to use
• Node.js: enables server-side scripting with Mongo
My Presentations
• "Log Analysis of Social Apps Using MongoDB: Making Full Use of MongoDB, from Building the Analysis Platform to the Front-End UI":
http://www.slideshare.net/doryokujin/mongodb-uimongodb
• "An Analytics Front-End Built with MongoDB and Ajax & Social Data Analysis Using a GraphDB":
http://www.slideshare.net/doryokujin/mongodbajaxgraphdb-5774546
• "Log Analysis of Social Apps Using Hadoop and MongoDB":
http://www.slideshare.net/doryokujin/hadoopmongodb
• "A Complete Introduction to GraphDB: From Structure and Internals to Use Cases and a Comparison of Various GraphDBs":
http://www.slideshare.net/doryokujin/graphdbgraphdb
I ♥ MongoDB JP
• continue to be an organizer of MongoDB JP
• continue to propose many use cases of MongoDB
• ex: Social Data, Log Data, Medical Data, ...
• support MongoDB users
• by document translation, user-group, IRC, blog, book, twitter,...
• boosting services and products using MongoDB
[Contact me]
twitter: doryokujin
skype: doryokujin
mail: [email protected]
blog: http://d.hatena.ne.jp/doryokujin/
MongoDB JP: https://groups.google.com/group/mongodb-jp?hl=ja
Thank you for coming to Mongo Tokyo!!