Quick look over D4D Dataset
Quick Glance at D4D dataset
Leye Wang, Haoyi Xiong, Daqing Zhang
Agenda
} Descriptions of the 4 data sets from the OFFICIAL GUIDE
} Some pitfalls in these data sets
  } Missing data mentioned officially
  } More pitfalls
} Preliminary analysis
  } Antenna distribution
  } Subpref distribution [set3]
  } Location change patterns
    } Coarse: based on subpref [set3]
    } Fine: based on antenna [set2]
  } Day/Night call distribution in Abidjan [set2]
} Risks in the data
4 different data sets
1. Antenna-to-antenna traffic on an hourly basis
2. Individual Trajectories: High Spatial Resolution Data
3. Individual Trajectories: Long Term Data
4. Communication Subgraphs
} Time duration: 2011.12.1 – 2012.4.28, 150 days
SET1: Antenna-to-antenna traffic on an hourly basis
} Data } Time } Origin } Destination } Call number } Call duration
} One line one hour } 20 weeks from 12.5
Antenna Positions } 1238 antennas
SET2: Individual Trajectories: High Spatial Resolution Data } 50,000 randomly sampled individuals over two-week periods } 10 periods, new random identifiers are chosen in every time period
} Data Example:
SET3: Individual Trajectories: Long Term Data
} The biggest data set
} Location level: antenna → sub-prefecture
  } 255 sub-prefectures
} 50,000 randomly sampled individuals, 150 days
  } In fact, it contains almost 500,000 users!
SET4: Communication Subgraphs
} 5,000 randomly selected individuals (egos)
} Divided into periods of two weeks spanning the entire observation period
  } Every individual has 10 periods (ego-centered graphs)
} Ego-centered graph
  } Considers the first- and second-order neighbors of the ego and the communications between all these individuals
  } Does not include communications between second-order neighbors
  } A connection means that there was at least one call between the two individuals. The number, duration and direction of the calls are not provided.
SET4: Communication Subgraphs
} The anonymized identifiers assigned to the individuals are identical for all time slots but are unique for each subgraph.
  } For each ego, there are 10 graphs, which can be analyzed together.
  } For different egos, the same individual is represented by different ids.
SET4: Communication Subgraphs (cont.) } Data Sample
Data set sizes
Data set  #files  Size per file  Users    Date range
SET1      10      350~730 MB     /        12.5-4.22*
SET2      10      140~180 MB     50,000   12.5-4.22*
SET3      10      2~3 GB         500,000  12.1-4.28
SET4      10      3~5 MB         5,000    12.5-4.22#
* Much of the data for 12.05 and 12.06 is evidently missing in SET1 and SET2.
# Date range as described in the official guide; SET4 itself has no time column.
Pitfalls in data set
Missing data mentioned officially
} For technical reasons, the antenna identifiers are not always available.
  } Code −1 is given to an antenna with a missing identifier
  } This happens for a significant number of calls (about 25%)
} The data sets cover a total of 3600 hours.
  } For technical reasons, data is sometimes missing; missing data covers a total period of about 100 hours.
} Neither type of missing data can simply be neglected.
  } Neither percentage is small.
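A quick way to check the two figures above on an actual file is to count the share of records carrying the -1 code. This is only a minimal sketch: the function name and the assumption that antenna ids are already parsed into a list are illustrative, not part of the official guide.

```python
def missing_antenna_share(antenna_ids):
    """Return the fraction of records whose antenna id is the missing code -1."""
    if not antenna_ids:
        return 0.0
    missing = sum(1 for a in antenna_ids if a == -1)
    return missing / len(antenna_ids)


# Share of hours without any data, using the figures quoted on the slide.
missing_hours_share = 100 / 3600  # roughly 2.8% of the 3600 hours
```

Running `missing_antenna_share` per file or per period would show whether the -1 share is stable over time.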
More pitfalls
} 7 out of 1238 antennas have no GPS location
  } 573, 749, 1061, 1200, 1205, 1208, 1213
  } Fortunately, these 7 antennas have no calls over the whole period, so they can be omitted safely.
} Many antennas do not work during a specific period of time
  } Not working: no in-calls and no out-calls
  } Each period lasts 2 weeks
} How to deal with these antennas needs serious consideration.
Period          #Ants with no calls
0 (12.5-12.18)  122
1 (12.19-1.1)   138
2 (1.2-1.15)    133
3 (1.16-1.29)   144
4 (1.30-2.12)   156
5 (2.13-2.26)   206
6 (2.27-3.11)   238
7 (3.12-3.25)   308
8 (3.26-4.8)    31
9 (4.9-4.22)    25
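The per-period counts can be reproduced by collecting every antenna that appears with at least one call and subtracting that set from the full antenna list. A minimal sketch, assuming SET1 records for one period are already reduced to `(origin, destination, n_calls)` tuples (this layout and the function name are assumptions):

```python
def silent_antennas(all_antennas, records):
    """Return antennas with neither in-calls nor out-calls in the given records.

    records: iterable of (origin, destination, n_calls) tuples (assumed layout).
    """
    active = set()
    for origin, destination, n_calls in records:
        if n_calls > 0:
            active.add(origin)
            active.add(destination)
    return sorted(set(all_antennas) - active)
```

Intersecting the result over all 10 periods would give the antennas that are silent for the whole time scope.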
No call antennas in the whole time scope
} Total 23 antennas
  } 301, 340, 573, 691, 749, 777, 811, 934, 976, 1046, 1061, 1130, 1200, 1201, 1205, 1208, 1213, 1215, 1221, 1231, 1232, 1234, 1236
Preliminary analysis
Analysis 1 Antennas Distribution
Latitude = 7.4 is used to separate north and south
1231 antennas with a position
North: 204 (16.6%)
South: 1027 (83.4%)
Antenna Distribution(heat map) } Most antennas are in big cities.
Antenna Distribution (South)
Antenna Distribution (Abidjan)
Lon (-4.12, -3.86), Lat (5.23, 5.49)
Antennas: 389 (31.6%)
About 300 km^2 (Cote d'Ivoire: 322,463 km^2)
0.1% of the area holds about 30% of the antennas
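The latitude split and the Abidjan bounding box can be combined into one small classifier for antenna positions. This is a sketch only: the function name is hypothetical, and treating the box edges as inclusive and latitude exactly 7.4 as "south" are assumptions the slides do not settle.

```python
ABIDJAN_LON = (-4.12, -3.86)  # bounding box from the slide
ABIDJAN_LAT = (5.23, 5.49)
NORTH_SOUTH_LAT = 7.4         # separating latitude from the slide


def region_of(lon, lat):
    """Classify an antenna position as 'abidjan', 'north' or 'south'."""
    in_box = (ABIDJAN_LON[0] <= lon <= ABIDJAN_LON[1]
              and ABIDJAN_LAT[0] <= lat <= ABIDJAN_LAT[1])
    if in_box:
        return "abidjan"
    return "north" if lat > NORTH_SOUTH_LAT else "south"
```

Counting the labels over all 1231 positioned antennas should give numbers comparable to the ones quoted above.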
Antennas without Calls
} Between 12-05 and 12-18
} No out-calls
  } 145 in total (138 excluding the 7 antennas without positions)
  } An abnormal area
All antennas
Antennas with no calls
} No in-calls
  } 123 antennas, similar to those with no out-calls
Antennas Distribution: some conclusions
} The distribution is extremely uneven
  } In a very few cities there are many antennas, which can be used to locate positions precisely
  } Best example: Abidjan
} Non-active antennas are also distributed unevenly
  } Little is known about those non-active antennas (why, when)
  } How to deal with these antennas may be a challenge
Analysis 2 Subpref distribution
Subpref distribution: heat map } SET3(12.1~12.15): how many times a subpref is present
} 18 subpref without any data } Yellow: <5,000 } Green: 5,000~50,000 } Blue: 50,000~100,000 } Red: >100,000
} TOP 5
   Subpref       Id   Count
1  Abidjan       60   2,260,353
2  San-Pedro     122  183,124
3  Yamoussoukro  58   155,956
4  Bouake        39   126,533
5  Daloa         144  114,003
Subpref distribution: some conclusions
} Some subprefs do not have any antennas, so they have no data in SET3.
} Only a few subprefs have enough data for more detailed analysis.
  } The biggest subpref, Abidjan, far exceeds the others
Analysis 3 Subpref movement map
Subpref Movement
} SET3: user_id 1~50,000; date: 12.1~12.15
} Subpref movement:
  } The pair of subprefs in which two consecutive calls of a user happened
  } Pairs within the same subpref are not excluded
} Total subpref movement pairs: 6,213,412
  } With -1: 660,935 (including 515,571 <-1,-1> pairs), 10.6%
  } Same sub-prefecture: 5,392,950, 86.8%
  } Different sub-prefecture: 159,527, 2.6%
} Distinct movement pairs <o, d>: about 4,900
  } Total possible pairs: 255*255
  } 4900/(255*255) = 7.5%
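The movement definition above (locations of two consecutive calls of the same user) can be sketched as follows. The input is assumed to be one user's time-ordered list of subpref ids; the function names and the three-way split into "unknown", "same" and "different" mirror the slide but are otherwise illustrative.

```python
from collections import Counter


def movement_pairs(locations):
    """Count <o, d> pairs formed by consecutive records of one user."""
    return Counter(zip(locations, locations[1:]))


def classify(pairs):
    """Split pair counts into unknown (-1 involved), same and different."""
    summary = {"unknown": 0, "same": 0, "different": 0}
    for (o, d), n in pairs.items():
        if o == -1 or d == -1:
            summary["unknown"] += n
        elif o == d:
            summary["same"] += n
        else:
            summary["different"] += n
    return summary
```

Summing the `Counter` objects of all users gives the global pair table; its number of keys is the count of distinct movement pairs.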
Different Subpref Movements } 4,628 different movement pairs, total 159,527 movements
} In average, every person only have <0.1 movements between different subprefs in 2 weeks
} Mean: 159527/4628 = 34.5 } Med: 3 } Top10 (0.2%): 28,236 changes (17.7%) } Top50 (1%): 56,638 changes (35.5%)
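The mean, median and top-k shares quoted above summarize how strongly movements concentrate on a few pairs. A minimal sketch of that summary, assuming one movement count per distinct <o, d> pair as input (the function name is hypothetical):

```python
from statistics import mean, median


def pair_summary(counts, top_k):
    """Summarize how movements concentrate on the most frequent pairs.

    counts: one movement count per distinct <o, d> pair (assumed input).
    """
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "top_share": sum(ordered[:top_k]) / total,
    }
```

A mean far above the median, as on the slide (34.5 vs. 3), is the usual sign of a heavy-tailed distribution.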
Subpref movement map (all)
Subpref change map (count > 5)
Subpref movement map (count > 100)
Analysis 4 Antenna movement
} SET2: 12.05-12.18 } Users: 50,000 } Total movements: 5,031,117 } With -1: 504,325 (10.0%) } Same antennas: 3,513,504 (69.8%) } Different antennas: 1,013,288 (20.2%) } Movement pairs: 68,000 (4.7% of possible pairs 1200^2)
} Ant change(SET2) vs. Subpref change(SET3)
         Different  Same   Unknown  Change pairs
Antenna  20.2%      69.8%  10.0%    4.7%
Subpref  2.6%       86.8%  10.6%    7.5%
Antenna movement map (all)
Antenna vs. Subpref
ANT SUBPREF
Antenna movement map (>5)
Detail Movements around Abidjan (>50)
Movements: some conclusions
} Few movements between different locations, especially for subprefs
  } Subpref: 2.6%
  } Antenna: 20.2%
} Subpref data [set3] is useful for high-level statistics
  } It is also a very large and fine data set
    } 500,000 users, each for 150 days
} Antenna data [set2] is more useful for detailed maps of some big cities.
Analysis 5 Day and Night call distribution
} Analyze the whole of SET2
} Different types of days:
  } Weekday/Weekend
  } Holidays in the data set
    ¨ Christmas: 12.25
    ¨ New Year: 1.1
    ¨ Easter Monday: 4.9
  } Exclude those holidays
    ¨ Christmas: 2011.12.24 (Sat.) - 2012.1.8 (Sun.)
    ¨ Easter Monday: 2012.4.9
} Day/night
  } Day: 10:00-18:00, Night: 20:00-8:00
  } (calls between 8:00-10:00 and 18:00-20:00 are neglected)
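The day-type and day/night rules above can be written as a single labeling function. This is a sketch under stated assumptions: the function name is hypothetical, the interval ends are treated as half-open, and a night call after midnight is attributed to its own calendar date, which the slides do not specify.

```python
from datetime import date, datetime

HOLIDAY_START = date(2011, 12, 24)  # Christmas/New Year break from the slide
HOLIDAY_END = date(2012, 1, 8)
EASTER_MONDAY = date(2012, 4, 9)


def slot(ts):
    """Return (day_type, period) for a call time, or None if it is excluded."""
    d = ts.date()
    if HOLIDAY_START <= d <= HOLIDAY_END or d == EASTER_MONDAY:
        return None  # holidays are left out of the analysis
    day_type = "weekend" if ts.weekday() >= 5 else "weekday"
    if 10 <= ts.hour < 18:
        return day_type, "day"
    if ts.hour >= 20 or ts.hour < 8:
        return day_type, "night"
    return None  # 8:00-10:00 and 18:00-20:00 are neglected
```

For example, `slot(datetime(2011, 12, 7, 11, 0))` labels a Wednesday late-morning call as a weekday daytime call.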
Day and Night call distribution Abidjan: weekday-day } yellow < green < pink < red
Day and Night call distribution Abidjan: weekday-night
Day and Night call distribution Abidjan weekday: day vs. night
day night
Day and Night call distribution Abidjan weekday } Use (day_calls/night_calls) as metric
} Total_day_calls/total_night_calls = 1.6 } Yellow(<1.1), green(1.1,1.4),blue(1.4,1.8),pink(1.8,2.4),red(>2.4)
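The color scale above maps the day/night ratio of each antenna to one of five bins. A minimal sketch of that mapping; the function name is hypothetical, and how values exactly on a bin edge are assigned is an assumption, since the slide only gives open ranges.

```python
def ratio_color(day_calls, night_calls):
    """Map the day/night call ratio of an antenna to the slide's color bins."""
    if night_calls == 0:
        return None  # ratio undefined; handling of this case is an assumption
    ratio = day_calls / night_calls
    if ratio < 1.1:
        return "yellow"
    if ratio < 1.4:
        return "green"
    if ratio < 1.8:
        return "blue"
    if ratio <= 2.4:
        return "pink"
    return "red"
```

With the overall ratio of 1.6 quoted above, an average antenna would fall into the blue bin.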
Day Calls
hierarchy
Night Calls
Day and Night call distribution Yamoussoukro weekday
Conclusions
} Challenges in the data set
  } Missing data
    } -1 for unknown antenna
    } 100 out of 3600 hours without data
    } In fact, many antennas have no data at all for some period of time
  } Big data
    } Especially SET3 (up to 30 GB)
    } Think carefully about performance and efficiency before carrying out an actual experiment.
    } Take care to avoid bugs in the experiment.
} High spatial resolution data (SET2) can be very useful in the area around Abidjan. Elsewhere, it may differ little from the coarse data (SET3).
Data set summary } SET1: Antenna to Antenna calls on an hourly basis } SET2: Individual antenna trace for two weeks } SET3: Individual subpref trace for 150 days } SET4: Ego call graph
  } The anonymized identifiers assigned to the individuals are identical for all time slots but are unique for each subgraph.
  } For each ego, there are 10 graphs, which can be analyzed together.
  } For different egos, the same individual is represented by different ids.
Data set  Size per file  Users    Date range
SET1      350~730 MB     /        12.5-4.22
SET2      140~180 MB     50,000   12.5-4.22 (cut to 10 two-week periods)
SET3      2~3 GB         500,000  12.1-4.28
SET4      3~5 MB         5,000    12.5-4.22
} Sort SET3 users } by phone calls } by subprefs visited
} Sort SET1 antennas
} Something about missing data
SET3: sort users by calls } 12.1-12.15 vs. 12.16-12.30
top users sorted by calls } Eg.
} If one record means one call or one SMS
  } Calls per day: 22000/15 = 1467 calls/day
  } Calls per hour: 1467/24 = 61 calls/hour
  } This must be an abnormal user
    } May provide some SMS service
    } May send spam SMS
} Need a threshold to eliminate those abnormal users. } Top 0.05%: >4000 } Top 0.1%: >3000 } Top 0.3%: >2000 } Top 0.5%: >1500 } Top 50%(median): 80~100 (7~8 calls/day)
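Applying such a threshold is a one-line filter once calls are counted per user. A minimal sketch, assuming a dictionary from user id to the number of records in one 15-day period; the function names and the choice of threshold are illustrative only.

```python
def calls_per_day(total_calls, days=15):
    """Average number of calls per day over one period."""
    return total_calls / days


def drop_abnormal(user_calls, threshold):
    """Keep only users whose call count does not exceed the threshold."""
    return {user: n for user, n in user_calls.items() if n <= threshold}
```

For instance, `drop_abnormal(counts, 4000)` would remove roughly the top 0.05% of users according to the figures above.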
Users distribution by calls } 12.1-12.15: total 500,000 users
SET3: sort users by subpref visited } 12.1-12.15 vs. 12.16-12.30
SET3: sort users by subpref visited
} High-mobility users often make more calls than average, but not extremely many more.
  } Several hundred: mostly 200 ~ 500 (top 20% - 4%)
User distribution by subpref visited
SET3: Subpref movement pattern
} (subpref_visited_count, call_count)
  } Each tuple covers one 15-day period, starting from 12.1
User    Period 1  Period 2  Period 3  Period 4  Period 5
492174  (35,859)  (1,1442)  (1,830)   (0,0)     (1,3106)
436776  (29,364)  (12,104)  (1,134)   (1,43)    (1,19)
234871  (32,329)  (31,288)  (18,188)  (15,270)  (13,193)
336137  (30,493)  (19,236)  (19,283)  (19,285)  (10,335)
128386  (28,471)  (19,438)  (21,271)  (31,435)  (32,426)
64659   (29,480)  (21,440)  (18,338)  (22,313)  (18,453)
80582   (30,367)  (10,334)  (6,314)   (17,432)  (4,261)
365444  (29,292)  (9,240)   (11,321)  (15,305)  (17,307)
439046  (27,675)  (11,367)  (6,84)    (11,156)  (6,353)
Instant peak
Always high mobility
Movement Case Study user: 492174
} High mobility in 12.1-12.15; after that the user only appears in Abidjan
Movement Case Study user: 336137
12.1-12.15 12.16-12.30
12.31-1.14 1.15-1.29
Movement Case Study user: 128386
12.31-1.14 1.15-1.29
12.1-12.15 12.16-12.30
SET3: subpref sorted by users } Count different users in each subpref during a period.
Data missing } Subpref 109
Antenna:356
SET2: users sorted by calls
} Fewer spam-like users than seen in SET3
  } >10000 calls: only 1 out of 50000*10 users
} About the top 1% of users have >1000 calls over a two-week period (similar to SET3)
12.5~12.18 users on -1 } Total users: 50000
} Users without -1: 77% } Users with -1:
About 1/3 ‘-1’ occurs on 90%
above users
SET1: sort <o,d> by number of calls
} Top 100 <o, d> pairs where o = d
} 12.5-12.18
Because its antennas are distributed much more densely than in other areas, Abidjan does not stand out in these results.
TOP 60 self call antennas
12.5-12.18 12.19-1.1
1.2-1.15 1.16-1.29
Detailed Analysis ANT 956
Self calls each day [south-west] ANT 956
} calls: sum of calls for each day
} hours: number of hours with data for each day
[south-west] ANT 973
[south-west] ANT 999
South West } Abnormal
} 12.15-1.25
} 2.17 } 3.14 } 3.24 } 4.10 } 4.15 } 4.19
[east]ANT 44
[east]ANT 5
East } Abnormal
} 2.15 } 3.24 } 4.10 } 4.15 } 4.19
[north] ANT 611
[north]ANT 855
[north]ANT 257
[north]ANT 717
[Abidjan]ANT 27 } Fewer calls on Sundays
[Abidjan]ANT 114
[Abidjan]ANT 418
[Abidjan]ANT 919
Daily total calls
Daily valid hours over all antennas } 2.15-2.17, 3.14, 3.24, 3.29, 4.10, 4.15, 4.19
-1 related
} -1 → -1
} -1 → other
} Other → -1
SET1: Top 4000 <o,d> pairs where o != d
} https://www.google.com/fusiontables/DataSource?docid=1pD4t0bzl9aH3rZE0xl-xVkEQJ2YAw-hpkSsAZsY
-1 vs. detected antenna
} Before 2012.4.1, calls on detected antennas and calls on -1 follow a similar trend.
o, d different <o, d> } Select Top 100 <o, d> where o != d }
} Detailed location data (SET2) only covers a two-week period for each user, which may not be sufficient for prediction.
  } Perhaps only habits that repeat every day are usable
} For a single device, it is not good for every task to be forwarded to it.
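TODO
} Clarify the relationship between the ids in the SET4 graphs across the time periods
} SET1
  } Sort antenna pairs by the number of calls between them and observe the characteristics
    } O = D
    } O != D
} SET3
  } Sort the 500,000 users as follows
    } By number of calls, descending
    } By number of areas visited, descending
      ¨ Analyze the trajectories of these high-mobility users
  } Sort the areas by how many users have ever appeared in each of them
} SET2
  } Analysis similar to SET3 can be done
} Analyze the characteristics of the -1 occurrences
  } Separately for SET1, SET2 and SET3
} If possible, run the experiments above over multiple time periods, especially for SET3, because the SET3 ids correspond one-to-one across all time periods.
} The intermediate results of these experiments should be stored in a database in a well-organized form, to ease further analysis.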
} SET 1,2,4 } 0: 12.5 – 12.18 } 1: 12.19 – 1.1 } 2: 1.2 – 1.15 } 3: 1.16 – 1.29 } 4: 1.30 – 2.12 } 5: 2.13 – 2.26 } 6: 2.27 – 3.11 } 7: 3.12 – 3.25 } 8: 3.26 – 4.8 } 9: 4.9 – 4.22
} SET 3 } A: 12.1 – 12.15 } B: 12.16 – 12.30 } C: 12.31 – 1.14 } D: 1.15 – 1.29 } E: 1.30 – 2.13 } F: 2.14 – 2.28 } G: 2.29 – 3.14 } H: 3.15 – 3.29 } I: 3.30 – 4.13 } J: 4.14 – 4.28
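Both period tables are regular (14-day periods from 12.5 for SET 1, 2, 4; 15-day periods from 12.1 for SET 3), so the period of any date follows from integer division. A minimal sketch; the function and constant names are hypothetical, and SET 3 periods A to J are returned as indices 0 to 9.

```python
from datetime import date

SET124_START = date(2011, 12, 5)  # first day of period 0 for SET 1, 2, 4
SET3_START = date(2011, 12, 1)    # first day of period A for SET 3


def period_index(day, start, length, n_periods=10):
    """Return the 0-based period containing `day`, or None if out of range."""
    offset = (day - start).days
    if offset < 0 or offset >= length * n_periods:
        return None
    return offset // length
```

Usage: `period_index(date(2012, 3, 12), SET124_START, 14)` for the two-week sets, and `period_index(some_day, SET3_START, 15)` for SET 3.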
Focus on Abidjan
Select ants with >50% valid data } Total 376 antennas:
} >50%: 277 antennas
Select ants with >70% valid data } >70%: 255 antennas
Select ants with >80% valid data } >80%: 221 antennas
Select ants with >90% valid data } >90%: 191 antennas
Calls per hour } Choose 12.8-12.10, while 12.10 is Saturday
Abidjan area 1
Abidjan area 2
Energy saving
} Nr: number of received results
} Nt: number of assigned tasks
} Cconn: energy consumption of one connection
} Csens: energy consumption of sensing one task
} Naïve method
  } Assign the tasks to Nr workers and then receive the results
  } $\mathit{Energy}_{nai} = N_r (2 C_{conn} + C_{sens})$
} First call assigned: more energy-saving
  } Assign the tasks only to the first Nr workers who make calls, which saves the first connection; then open a connection to upload the results as soon as sensing finishes
  } $\mathit{Energy}_{fr} = N_r (C_{conn} + C_{sens})$
Our method
} Use our current method
  } Assign a task when the worker makes a call and receive the result at the worker's next call
  } $\mathit{Energy}_{cur} = N_t C_{sens}$
} Energy difference between the naïve method and our method
  } $\mathit{diff} = \mathit{Energy}_{nai} - \mathit{Energy}_{cur} = 2 N_r C_{conn} - (N_t - N_r) C_{sens}$
} If diff > 0, i.e. our method actually saves energy, then
  } $\frac{C_{conn}}{C_{sens}} > \frac{N_t - N_r}{2 N_r}$
} The same derivation for our method vs. the first call assigned method ($\mathit{Energy}_{fr} - \mathit{Energy}_{cur} > 0$) gives
  } $\frac{C_{conn}}{C_{sens}} > \frac{N_t - N_r}{N_r}$
} $\frac{C_{conn}}{C_{sens}} > \frac{N_t - N_r}{2 N_r}$
} Let $p_{suc}$ be the fraction of tasks that return results.
  } $N_r = p_{suc} N_t$
  } $\frac{C_{conn}}{C_{sens}} > \frac{N_t - N_r}{2 N_r} = \frac{N_t - p_{suc} N_t}{2 p_{suc} N_t} = \frac{1 - p_{suc}}{2 p_{suc}}$, denoted $k_{conn \sim sens}$
  } $p_{suc} = 0.5$ (flooding, seen as worst case) → $k_{conn \sim sens} = 1/2$
  } $p_{suc} = 0.9$ → $k_{conn \sim sens} = 1/18$
} Compared with the first assigned method:
  } $k'_{conn \sim sens} = \frac{1 - p_{suc}}{p_{suc}} = 2 k_{conn \sim sens}$
} Since Cconn is a more fixed value:
  } $C_{sens} < \frac{1}{k_{conn \sim sens}} C_{conn}$
} So, to actually save energy, Csens cannot be too big
  } Even at a high $p_{suc} = 0.9$:
    } $C_{sens} < 18 C_{conn}$ (vs. naïve method)
    } $C_{sens} < 9 C_{conn}$ (vs. first assigned method)
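The three energy formulas and the break-even ratio can be checked numerically. The sketch below simply encodes the expressions from the slides; the function names and the example values (90 results out of 100 tasks, Cconn = 1, Csens = 10) are illustrative assumptions.

```python
def energies(n_r, n_t, c_conn, c_sens):
    """Energy of the three strategies, following the formulas on the slides."""
    return {
        "naive": n_r * (2 * c_conn + c_sens),
        "first_call": n_r * (c_conn + c_sens),
        "current": n_t * c_sens,
    }


def break_even_ratio(p_suc, vs_first_call=False):
    """Smallest Cconn/Csens ratio above which the current method saves energy."""
    k = (1 - p_suc) / (2 * p_suc)
    return 2 * k if vs_first_call else k


# Example with p_suc = 0.9 and Csens = 10 * Cconn (assumed values):
example = energies(n_r=90, n_t=100, c_conn=1.0, c_sens=10.0)
```

In this example Csens lies between 9 Cconn and 18 Cconn, so the current method is cheaper than the naïve method but more expensive than the first call assigned method, in line with the two bounds above.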
Another method framework: guarantee that energy could be saved
} A modification of the first call assigned method
  } Still assign only Nr tasks to workers
} Differences
  } Do not simply choose the first Nr workers who make a call
    } A prediction algorithm can be used here to decide whether to assign
  } Do not upload results as soon as sensing finishes
    } Wait and upload at the next call, if
      ¨ T(next_call) is within the acceptable delay
    } Otherwise, actively create a connection to send the sensing result
  } A problem: in the antenna-phone case, a phone may move out of the antenna's coverage or out of the area. How should we deal with that?
    ¨ Solution 1: add one more condition: the worker has not moved out of the area
    ¨ Solution 2: do not care where the worker makes the next phone call
    ¨ The assumptions behind these two solutions are different
} Solution 1: actively upload the result before leaving the area
  } Different areas have different data collecting centers, and communication between these centers is difficult or very energy-consuming.
} Solution 2: do not care where the next call is made
  } All areas share the same data collecting center, so it does not matter from which area the sensing result is uploaded.
} The difference between Solution 2 and Solution 1 is that a Solution 2 result must contain data showing where the sensing took place
  } But in most applications this data is recorded anyway, even if it is redundant, so this may not make much difference.
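The upload rule described above (wait for the next call if it comes within the acceptable delay, otherwise upload actively) can be sketched as a small decision function. Everything here is an assumption for illustration: the function name, the use of plain numbers for times, and the `require_same_area` flag that switches between Solution 1 and Solution 2.

```python
def upload_plan(next_call, deadline, stays_in_area=True, require_same_area=False):
    """Decide how a worker should return a sensing result.

    next_call: predicted time of the next call, or None if unknown.
    deadline: latest acceptable upload time.
    require_same_area: True models Solution 1, False models Solution 2.
    """
    within_delay = next_call is not None and next_call <= deadline
    area_ok = stays_in_area or not require_same_area
    if within_delay and area_ok:
        return "piggyback"  # upload during the next call, no extra connection
    return "active"         # open a dedicated connection to upload
```

Under Solution 2 the `stays_in_area` argument has no effect, which reflects the assumption of a single shared data collecting center.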
} Some intermediate
} Problems
  } If an area contains too many antennas, it may lead to uneven sensing.
Antennas for each region
} Region 1: 196, 909, 1000, 739, 1030, 425, 744, 542, 892
} Region 2: 279, 994, 40, 124, 394, 742, 908
} Region 3: 292, 746, 307, 344, 143, 738, 821, 245, 509, 839, 1231, 735, 998