Applications of Hadoop in Anti-Cheating
林述民 @ 人人游戏 / Big Data Research Group
April 2013
Economic Drivers of Online Ad Fraud
• Settlement models
  – CPD (Cost Per Day)
  – CPM (Cost Per Mille)
  – CPC (Cost Per Click)
  – CPA (Cost Per Action)
  – CPS (Cost Per Sales)
[Conversion funnel: ad slot → impressions (100%) → clicks (10%) → registrations (0.3%) → payments (0.3%)]
The Essence of Anti-Cheating: Anomaly Detection
Outlier detection based on some defined distance metric and a definition of the normal clusters.
Six Challenges of Anomaly Detection
• Defining normal behavior
• Malicious behavior adapts quickly in response to detection
• Normal behavior itself keeps evolving
• Requirements differ widely across domains (mainly in how they affect follow-up actions)
• Obtaining sufficient, balanced, and accurate labeled data
• Noisy data can mask anomalies and prevent their detection
Types of Anomaly
• Point anomalies
  – anomalous with respect to the rest of the data
• Contextual anomalies
  – defined by a combination of:
    • Contextual attributes
    • Behavioral attributes
• Collective anomalies
  – a group of instances that is anomalous with respect to the entire data set
Techniques Commonly Used in Anti-Cheating
1. Classification
2. Clustering
3. Nearest-neighbor (density) methods
4. Statistical methods (hypothesis testing)
5. Information-theoretic methods
6. Spectral methods (dimensionality reduction)
[Architecture diagram: the ad platform serves ads (show, click); show, register, and click logs flow into HDFS; MapReduce produces statistics, semi-results, and user profiles, feeding HBase, the ML model, reports, and the U-U (user-user) matrix.]
Case Study: Computing User Similarity with MapReduce
What are your favorite apps?
Apps as user features
User similarity matrix (U = User, F = Feature; blanks are zero)

      F1  F2  F3  F4
U1     1   1   1
U2     1   1       1
U3     1
U4     1
U5     1
…
Similarity computation (cosine)
U1 = {1,1,1,0}, U2 = {1,1,0,1}
U1×U2 = {1,1,1,0}·{1,1,0,1}ᵀ / (||U1|| ||U2||)
U1×U2 = (1×1 + 1×1 + 1×0 + 0×1) / (√3 × √3)
U1×U2 = 2/3 ≈ 0.67
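As a sanity check, the arithmetic above can be reproduced in a few lines of Python (the function name is illustrative, not from the slides):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u1 = [1, 1, 1, 0]  # U1's binary app features
u2 = [1, 1, 0, 1]  # U2's binary app features
print(round(cosine_similarity(u1, u2), 2))  # 2 / (sqrt(3) * sqrt(3)) ≈ 0.67
```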
Computation: input matrix A (U = User, F = Feature), 12 users × 20 features, sparse (unlisted entries are zero):

U1:  F1=0.2, F4=0.1, F7=0.3
U2:  F1=0.1, F4=0.9, F8=0.4, F13=0.4, F18=0.4
U3:  F2=0.3, F11=0.2, F15=0.7
U4:  F5=0.6
U5:  F6=0.3, F8=0.1, F13=0.1, F18=0.1
U6:  F5=0.1, F10=0.4, F15=0.3
U7:  F4=0.5
U8:  F8=0.8, F13=0.7, F14=0.3, F18=0.8
U9:  F5=0.2, F8=0.1, F11=0.3
U10: F2=0.5, F9=0.3, F17=0.2
U11: F3=0.2, F7=0.1, F9=0.1, F10=0.2, F14=0.1, F15=0.1, F19=0.1, F20=0.1
U12: F6=0.4, F11=0.1, F16=0.1
Matrix A extended to all N users (U = User, F = Feature; blanks are zero):

       F1   F2   …   Fm
U1    0.2           0.1
U2    0.1           0.9
U3         0.3
…
U(N)
Every user pair must be evaluated:
⟨U1×U2⟩ ⟨U1×U3⟩ ⟨U1×U4⟩ ⟨U1×U5⟩ ⟨U1×U6⟩ … ⟨U1×U(N)⟩
⟨U2×U3⟩ ⟨U2×U4⟩ ⟨U2×U5⟩ ⟨U2×U6⟩ ⟨U2×U7⟩ … ⟨U2×U(N)⟩
⟨U3×U4⟩ ⟨U3×U5⟩ ⟨U3×U6⟩ ⟨U3×U7⟩ ⟨U3×U8⟩ … ⟨U3×U(N)⟩
⟨U4×U5⟩ ⟨U4×U6⟩ ⟨U4×U7⟩ ⟨U4×U8⟩ … ⟨U4×U(N)⟩
…
⟨U(N−1)×U(N)⟩

C(N,2) = N(N−1)/2 pairs in total.
O(N²): every user row must be compared against every other row.
[Diagram: the user matrix joined against a full duplicated copy of itself.]

Direct distributed computation?
[Diagram: the user matrix split into Partition 1, Partition 2, Partition 3, with each partition shipped together with the complete duplicated user matrix.]
Transposing the user feature matrix (U → Uᵀ)

U = User, F = Feature
[Diagram: the sparse U (users × features) on the left becomes Uᵀ (features × users) on the right; each feature row of Uᵀ lists only the users with a non-zero value for that feature.]
Skip sparse features.
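In code, the transpose amounts to building an inverted index from the sparse user rows; a minimal Python sketch with illustrative data (not the slides' actual dataset):

```python
from collections import defaultdict

# Sparse user -> {feature: value} rows (illustrative data)
users = {
    "U1": {"F1": 0.2, "F4": 0.1, "F7": 0.3},
    "U2": {"F1": 0.1, "F4": 0.9},
    "U3": {"F2": 0.3},
}

def transpose(user_rows):
    """Invert user->feature rows into feature->user postings (U -> U^T)."""
    features = defaultdict(dict)
    for user, row in user_rows.items():
        for feat, val in row.items():
            features[feat][user] = val
    return features

ut = transpose(users)
print(ut["F4"])  # only the users that actually have F4: {'U1': 0.1, 'U2': 0.9}
```

Zero entries never appear in the postings, so sparse features cost nothing downstream.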
MapReduce computation over the transposed matrix (U → Uᵀ = F)

      U1   U2   U3   U4   U5
F1   0.2             0.1
F2   0.1             0.9
F3        0.3
F4                        0.6
F5
F6   0.2             0.1

Key: Feature ID; Value: that feature's user list. Sparse (all-zero) rows such as F5 are skipped.
For every Feature ID k, the Mapper emits one partial product per possible user pair {Ui, Uj}: w(k) = fUi(k) × fUj(k).
The Reducer then sums over all features:
Ui×Uj = Σ_{k=1..M} fUi(k) × fUj(k)
Pseudocode (MapReduce distributed matrix multiplication)

class Mapper
  method Map(feature, user list of the feature)
    for every possible <user pair> do {
      Emit(<user pair>, <feature value product>);
    }

class Reducer
  method Reduce(<user pair>, list of <feature value product>)
    similarity = 0;
    for each collected <feature value product> do {
      similarity += <feature value product>;
    }
    Emit(<user pair>, similarity);
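The pseudocode can be simulated in-memory with plain Python; a sketch of the pairs approach with illustrative data (a real job would run these functions as Hadoop tasks):

```python
from collections import defaultdict
from itertools import combinations

# feature -> {user: feature value}, i.e. one transposed row per Map call
transposed = {
    "F1": {"U1": 0.2, "U2": 0.1},
    "F4": {"U1": 0.1, "U2": 0.9, "U7": 0.5},
}

def map_pairs(feature, user_values):
    """Mapper: emit (<user pair>, fUi(k) * fUj(k)) for every pair in the row."""
    for (ui, vi), (uj, vj) in combinations(sorted(user_values.items()), 2):
        yield (ui, uj), vi * vj

def reduce_pairs(emitted):
    """Reducer: sum the partial products per user pair."""
    similarity = defaultdict(float)
    for key, value in emitted:
        similarity[key] += value
    return dict(similarity)

emitted = [kv for f, uv in transposed.items() for kv in map_pairs(f, uv)]
sims = reduce_pairs(emitted)
print(round(sims[("U1", "U2")], 2))  # 0.2*0.1 + 0.1*0.9 = 0.11
```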
Optimization: improve IO throughput with the Stripes pattern
• Emit an adjacency list per user instead of one record per pair
• Move the multiplication into the Reducer

class Mapper
  method Map(feature, user list of the feature)
    for every <user> in the list do {
      Emit(<user>, <its feature value plus the (user, value) list that follows it>);
    }

class Reducer
  method Reduce(<user>, list of <user feature value>List)
    for each <user pair> derivable from the collected stripes do {
      similarity[<user pair>] += <product of the two feature values>;
      Emit(<user pair>, similarity);
    }
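The stripes variant can be sketched the same way (again an in-memory stand-in for the real job, with illustrative data; the exact stripe record format is an assumption, since the slides do not spell it out):

```python
from collections import defaultdict

# feature -> {user: feature value} rows, as in the pairs version (illustrative)
transposed = {
    "F1": {"U1": 0.2, "U2": 0.1},
    "F4": {"U1": 0.1, "U2": 0.9, "U7": 0.5},
}

def map_stripes(feature, user_values):
    """Mapper: one stripe per user: its own value plus the users after it."""
    items = sorted(user_values.items())
    for i, (ui, vi) in enumerate(items):
        stripe = dict(items[i + 1:])
        if stripe:
            yield ui, (vi, stripe)

def reduce_stripes(ui, payloads):
    """Reducer: the multiplication happens here, summed per user pair."""
    similarity = defaultdict(float)
    for vi, stripe in payloads:
        for uj, vj in stripe.items():
            similarity[(ui, uj)] += vi * vj
    return dict(similarity)

# Shuffle phase: group mapper output by key (user)
grouped = defaultdict(list)
for feature, user_values in transposed.items():
    for user, payload in map_stripes(feature, user_values):
        grouped[user].append(payload)

sims = {}
for user, payloads in grouped.items():
    sims.update(reduce_stripes(user, payloads))
print(round(sims[("U1", "U2")], 2))  # same result as the pairs version: 0.11
```

Note there are far fewer distinct keys (one per user instead of one per pair), which is exactly what the next slide quantifies.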
What does Stripes bring? (1)
• Fewer keys, hence a more efficient merge sort in the shuffle phase

What does Stripes bring? (2)
Normal (pairs) <Key, Value> output:      Stripes <Key, Value> output:
<(u1,u2),V> <(u1,u3),V> …                <u1, (u2,v2 | u3,v3 | …)>
<(u2,u3),V> <(u2,u4),V> …                <u2, (u3,v3 | u4,v4 | …)>

If length(u) ≈ length(v), the output space shrinks by up to 1/3.
Optimization: splitting the load with a greedy algorithm
1. Incoming loads (transposed feature rows)
2. A greedy splitter assigns each load
3. to the currently least-loaded machine (Machine 1 … Machine N)
Load = N × (N−1) / 2, where N = the input user-list length
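A sketch of the greedy splitter using a min-heap of machine loads (function name and input data are illustrative):

```python
import heapq

def greedy_split(line_lengths, n_machines):
    """Assign each incoming line to the currently least-loaded machine.

    The pairwise work of a feature row with N users is N*(N-1)/2.
    """
    heap = [(0, m) for m in range(n_machines)]  # (current load, machine id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_machines)]
    for n in line_lengths:
        load, machine = heapq.heappop(heap)   # least-loaded machine so far
        assignment[machine].append(n)
        heapq.heappush(heap, (load + n * (n - 1) // 2, machine))
    return assignment

# Illustrative input: user-list lengths of transposed feature rows
print(greedy_split([100, 10, 10, 10, 90, 50], 2))
# -> [[100], [10, 10, 10, 90, 50]]: one hot row already saturates machine 0
```

The quadratic load formula is why one "big" row can dominate a machine, which motivates the Mirror & Mark step that follows.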
Optimization: eliminating hot spots with Mirror Mark - Mirror
1. A big one comes in: a hot feature row too large for any single machine
2. Mirror & Mark: the row is split into marked groups and mirrored, so that big = part + part + part

Optimization: eliminating hot spots with Mirror Mark - Mark & Copy
3. The marked copies spread the multiplied load across Machine 1 … Machine N instead of piling it on one machine
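A sketch of the group-splitting idea behind Mirror & Mark (the task encoding below is an assumption for illustration; the slides do not spell out the exact record format):

```python
from itertools import combinations

def mirror_mark(users, group_size):
    """Split a hot feature row into groups and emit independent pair tasks.

    Each task covers either one group's internal pairs or the cross pairs
    of two mirrored groups, so one hot row no longer lands on one machine.
    """
    groups = [users[i:i + group_size] for i in range(0, len(users), group_size)]
    tasks = []
    for i, g in enumerate(groups):
        tasks.append(("within", g))        # pairs inside one group
        for h in groups[i + 1:]:
            tasks.append(("cross", g, h))  # pairs across two mirrored copies
    return tasks

def task_pairs(task):
    """Expand a task into the user pairs it is responsible for."""
    if task[0] == "within":
        return list(combinations(task[1], 2))
    return [(a, b) for a in task[1] for b in task[2]]

users = [f"U{i}" for i in range(1, 7)]  # a "big one": 6 users, 15 pairs
tasks = mirror_mark(users, group_size=2)
assert sum(len(task_pairs(t)) for t in tasks) == 6 * 5 // 2  # no pair lost
print(len(tasks))  # 3 within-group + 3 cross-group tasks = 6
```

Smaller groups mean more mirrored copies of the row, which is the record and space inflation the next slide measures.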
Caveats when applying Mirror Mark
• The Mirror group size matters!
• Experimental data:
  – features: 2,819
  – users: 28,683,344
  – original space: 1.9 GB
  – original lines: 2,819

Split size         Input Space              Output Lines   Record Inflation   Space Inflation
(features dup.)
split 5,000        N/A (disk out of space)  N/A            N/A                N/A
split 20,000       35 GB                    7,032          2.5:1              18:1
split 50,000       15 GB                    4,285          1.5:1              8:1
split 100,000      7.7 GB                   3,399          1.2:1              4:1
Approach, steps, and optimization results

Approach            Map (h)   Reduce (h)   Total time elapsed (h)
Feature transpose   40+       3            43+
Greedy file split   30        3            33
Stripes             18        3            21
Mirror Mark         1+        3            4+

• 2 million users
• 2.8 TB of KV output
• 16 × (8 cores, 32 GB RAM) data nodes
Problem simplification: what did we do?
Let A = the user feature matrix; then the user similarity matrix = A × Aᵀ.
Naive time complexity of A × Aᵀ:
  rows of Uᵀ → number of users (N)
  columns of U → number of users (N)
  elements per row/column vector → number of features
  O(N³) = N × N × N (taking the feature count as roughly N as well)

What we exploited:
• Sparsity: never compute the zeros
• Distribution: shard the ⟨Ui, Uj⟩ work by feature (Shard-F⟨Ui:Uj⟩)
Recap: designing and optimizing the MapReduce similarity computation
• Parallelization:
  – transpose the user-feature matrix and exploit its sparsity
• IO throughput:
  – apply the Stripes pattern, turning the Mapper output from <K, V> into <K, VList>
• Load balancing:
  – unsorted greedy splits the input on the fly
  – sorted greedy (or a "Top N sort") splits it after ordering
• Hot-spot elimination:
  – apply Mirror Mark to replicate, mark, and split hot rows
Thank you!