Analysis Patterns

Transcript of Analysis Patterns

Page 1: Analysis Patterns

ANALYSIS PATTERNS

Page 2: Analysis Patterns

FASTEST SCORERS

CRICKET

“I’ve always been curious… who among India’s prolific one-day run-getters had the best strike rate?

Sachin?

Sehwag?

What about the rest of the world?”

Page 3: Analysis Patterns

LET’S TAKE ONE DAY CRICKET DATA

Country | Player | Runs | ScoreRate | MatchDate | Ground | Versus
Australia | Michael J Clarke | 99* | 93.39 | 30-06-2010 | The Oval | England
Australia | Dean M Jones | 99* | 128.57 | 28-01-1985 | Adelaide Oval | Sri Lanka
Australia | Bradley J Hodge | 99* | 115.11 | 04-02-2007 | Melbourne Cricket Ground | New Zealand
India | Virender Sehwag | 99* | 99 | 16-08-2010 | Rangiri Dambulla International Stad. | Sri Lanka
New Zealand | Bruce A Edgar | 99* | 72.79 | 14-02-1981 | Eden Park | India
Pakistan | Mohammad Yousuf | 99* | 95.19 | 15-11-2007 | Captain Roop Singh Stadium | India
West Indies | Richard B Richardson | 99* | 70.21 | 15-11-1985 | Sharjah CA Stadium | Pakistan
West Indies | Ramnaresh R Sarwan | 99* | 95.19 | 15-11-2002 | Sardar Patel Stadium | India
Zimbabwe | Andrew Flower | 99* | 89.18 | 24-10-1999 | Harare Sports Club | Australia
Zimbabwe | Alistair D R Campbell | 99* | 79.83 | 01-10-2000 | Queens Sports Club | New Zealand
Zimbabwe | Malcolm N Waller | 99* | 133.78 | 25-10-2011 | Queens Sports Club | New Zealand
Australia | David C Boon | 98* | 82.35 | 08-12-1994 | Bellerive Oval | Zimbabwe
Australia | Graeme M Wood | 98* | 63.22 | 11-01-1981 | Melbourne Cricket Ground | India
England | Ian J L Trott | 98* | 84.48 | 20-10-2011 | Punjab Cricket Association Stadium | India
India | Yuvraj Singh | 98* | 89.09 | 01-08-2001 | Sinhalese Sports Club Ground | Sri Lanka
Ireland | Kevin J O'Brien | 98* | 94.23 | 10-07-2010 | VRA Ground | Scotland
Kenya | Collins O Obuya | 98* | 75.96 | 13-03-2011 | M.Chinnaswamy Stadium | Australia
Netherlands | Ryan N ten Doeschate | 98* | 73.68 | 01-09-2009 | VRA Ground | Afghanistan
New Zealand | James E C Franklin | 98* | 142.02 | 07-12-2010 | M.Chinnaswamy Stadium | India
Pakistan | Ijaz Ahmed | 98* | 112.64 | 28-10-1994 | Iqbal Stadium | South Africa
South Africa | Jacques H Kallis | 98* | 74.24 | 06-02-2000 | St George's Park | Zimbabwe

Page 4: Analysis Patterns

Against which countries are higher averages scored?

Which countries’ players score more per match?

Page 5: Analysis Patterns

Which player scores the most per ball?

The player with the highest strike rate is an obscure South African whose name most of us have never heard of.

In fact, this list is filled with players we have never heard of.

Page 6: Analysis Patterns

ODI STRIKE RATES OF THE WORLD

We want to see the prioritised performance. That is, what is the strike rate of the established players?

Page 7: Analysis Patterns

Most analysis answers the question “Which are the top 10 X?”

Which are my top products?

Which are my top branches?

Who are my best sales people?

Which vendors have the highest cost per unit?

Which divisions are spending the most money?

In which hours does the under 12 segment watch TV most?

Which customer segment has the highest revenue per user?

Page 8: Analysis Patterns

THIS QUESTION CAN BE ANSWERED SYSTEMATICALLY

[Table: the same one-day cricket data as on Page 3, with columns Country, Player, Runs, ScoreRate, MatchDate, Ground and Versus]

Take every column in the data

Find the top value by that column

Country: South Africa has the highest strike rate of 76%
Player: Johann Louw has the highest strike rate of 329%
Runs: 164 runs has the highest strike rate of 156%
MatchDate: 12-03-2006 has the highest strike rate of 136%
Ground: AC-VDCA Stadium has the highest strike rate of 98%
Versus: United States has the highest strike rate of 104%
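The two steps above ("take every column", "find the top value by that column") can be sketched in a few lines of Python. This is only an illustration: the function name `top_value_per_column` is hypothetical, the rows are four entries from the table on Page 3, and averaging ScoreRate within each group is an assumption, since the slides do not say how the strike rate is aggregated.

```python
from collections import defaultdict

def top_value_per_column(rows, metric, columns):
    """For each column, return the value whose rows have the highest mean metric."""
    result = {}
    for col in columns:
        totals = defaultdict(lambda: [0.0, 0])  # value -> [sum of metric, row count]
        for row in rows:
            acc = totals[row[col]]
            acc[0] += row[metric]
            acc[1] += 1
        # Assumption: groups are compared by their mean; another aggregate may suit better.
        best = max(totals, key=lambda v: totals[v][0] / totals[v][1])
        result[col] = (best, totals[best][0] / totals[best][1])
    return result

# Four rows taken from the one-day cricket table on Page 3.
rows = [
    {"Country": "Australia", "Player": "Michael J Clarke", "Versus": "England", "ScoreRate": 93.39},
    {"Country": "Australia", "Player": "Dean M Jones", "Versus": "Sri Lanka", "ScoreRate": 128.57},
    {"Country": "India", "Player": "Virender Sehwag", "Versus": "Sri Lanka", "ScoreRate": 99.0},
    {"Country": "New Zealand", "Player": "James E C Franklin", "Versus": "India", "ScoreRate": 142.02},
]

top = top_value_per_column(rows, "ScoreRate", ["Country", "Player", "Versus"])
for column, (value, rate) in top.items():
    print(f"{column}: {value} has the highest strike rate of {rate:.2f}")
```

The same loop applies to any table: swap the metric and the list of columns, and each column yields its own "top X" answer.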

Page 9: Analysis Patterns

AUTOLYSIS

A PRODUCT THAT ENCAPSULATES BUSINESS ANALYSIS PATTERNS

Page 10: Analysis Patterns

SPATIAL FREQUENCY ANALYSIS

Page 11: Analysis Patterns
Page 12: Analysis Patterns

100 YEARS OF INDIA’S WEATHER

[Figure: grid of years (axis labelled 1901 to 2001) against months (Jan to Dec)]

Page 13: Analysis Patterns

TEMPORAL FREQUENCY ANALYSIS

Page 14: Analysis Patterns

IMPACT OF THE BUDGET ON STOCK PRICES

Page 15: Analysis Patterns

RESTAURANT FOUND AN UNUSUAL DIP IN SALES

A restaurant chain had data for every single transaction made over a few years. Plotting this as a time series showed them nothing unusual.

However, the same data on a calendar map reveals a very different story.

Specifically, at the bottom-left point-of-sale terminal, sales dip every Wednesday. At the bottom-right point-of-sale terminal, sales rise every Wednesday (almost as if to compensate for the loss).

It turns out that the manager closes the bottom-left counter every Wednesday afternoon due to a shortage of staff, assuming that it results in no loss of sales. There is, however, a net loss every Wednesday.
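A weekday pattern like this one comes from grouping daily sales by terminal and day of the week, which is the aggregation behind a calendar map. The sketch below shows that grouping only; the function name, the tuple layout and the sales figures are all made up for illustration and are not the restaurant's data.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def mean_sales_by_weekday(transactions):
    """Average daily sales per (terminal, weekday).

    `transactions` is an iterable of (date, terminal, amount) tuples.
    """
    daily = defaultdict(float)  # (terminal, date) -> total sales on that day
    for day, terminal, amount in transactions:
        daily[(terminal, day)] += amount

    sums = defaultdict(float)   # (terminal, weekday) -> sum of daily totals
    counts = defaultdict(int)   # (terminal, weekday) -> number of days seen
    for (terminal, day), total in daily.items():
        key = (terminal, WEEKDAYS[day.weekday()])
        sums[key] += total
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Made-up transactions; 3 and 10 January 2024 are Wednesdays.
transactions = [
    (date(2024, 1, 2), "bottom-left", 100.0),
    (date(2024, 1, 3), "bottom-left", 60.0),
    (date(2024, 1, 9), "bottom-left", 110.0),
    (date(2024, 1, 10), "bottom-left", 50.0),
    (date(2024, 1, 2), "bottom-right", 100.0),
    (date(2024, 1, 3), "bottom-right", 120.0),
]

averages = mean_sales_by_weekday(transactions)
print(averages[("bottom-left", "Tue")], averages[("bottom-left", "Wed")])
```

Summing to daily totals first keeps a busy day with many small transactions from being weighted more heavily than a quiet day.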

Page 16: Analysis Patterns

HOW BIRTHDAYS AFFECT MARKS

Page 17: Analysis Patterns

BANK FOUND ALL LOANS BEFORE 20TH POOR

The personal finance division of a bank, focusing on retail loans, drove its sales through a branch sales team.

A study of the non-performing assets of loans generated over the course of one year shows a strange pattern.

Every loan disbursed after the 20th of the month, i.e. from the 21st to the end of the month, shows consistently lower non-performing assets (i.e. better quality) than any loan disbursed up to the 20th.

The bank mapped this back to their incentive scheme. The sales team’s commission is based only on loans disbursed until the 20th. Hence new loans are squeezed into this period without regard for their quality.

This representation, known as a calendar map, can show some interesting patterns, particularly weekday-based patterns, as the next example will show.

A similar visual helped a telecom company identify specific days on which their competitors’ market share rose significantly, enabling them to negate the strategy.
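The loan comparison reduces to splitting loans by the day of the month on which they were disbursed and comparing the share that went bad. Here is a minimal sketch of that split. The function name and the sample loans are hypothetical, and it counts loans rather than loan value, which is an assumption because the slides do not say how non-performing assets were measured.

```python
from datetime import date

def npa_rate_by_period(loans, cutoff_day=20):
    """Share of non-performing loans disbursed up to vs after `cutoff_day`.

    `loans` is an iterable of (disbursal_date, is_non_performing) pairs.
    """
    counts = {"up_to_cutoff": [0, 0], "after_cutoff": [0, 0]}  # [bad loans, all loans]
    for disbursed, is_npa in loans:
        bucket = "up_to_cutoff" if disbursed.day <= cutoff_day else "after_cutoff"
        counts[bucket][0] += int(is_npa)
        counts[bucket][1] += 1
    return {b: (bad / total if total else None) for b, (bad, total) in counts.items()}

# Made-up loans, for illustration only.
loans = [
    (date(2023, 5, 12), True),
    (date(2023, 5, 18), False),
    (date(2023, 5, 20), True),
    (date(2023, 6, 3), False),
    (date(2023, 5, 24), False),
    (date(2023, 5, 29), False),
    (date(2023, 6, 27), True),
    (date(2023, 6, 30), False),
]

rates = npa_rate_by_period(loans)
print(rates)
```

Keeping one bucket per day of the month, instead of two buckets, would give the per-day view that a calendar map displays.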

Communicating data visually is the most effective way to a shared understanding

Page 18: Analysis Patterns

A brief aside on this distribution...

Page 19: Analysis Patterns

Based on the results of the 20 lakh students taking the Class XII exams in Tamil Nadu over the last 3 years, it appears that the month you were born in can make a difference of as much as 120 marks out of 1,200.

June-borns score the lowest

The marks shoot up for Aug-borns

… and peak for Sep-borns

120 marks out of 1,200 explainable by month of birth

An identical pattern was observed in 2009 and 2010…

… and across districts, gender, subjects, and class X & XII.

“It’s simply that in Canada the eligibility cut-off for age-class hockey is January 1. A boy who turns ten on January 2, then, could be playing alongside someone who doesn’t turn ten until the end of the year—and at that age, in preadolescence, a twelve-month gap in age represents an enormous difference in physical maturity.”

-- Malcolm Gladwell, Outliers

Page 20: Analysis Patterns

PATTERN OF “BIRTHS” IN INDIA IS SKEWED

This is a birth date dataset that’s obtained from school admission data for over 10 million children. When we compare this with births in the US, we see none of the same patterns.

For example,
• Is there an aversion to the 13th or is there a local cultural nuance?
• Are holidays avoided for births?
• Which months have a higher propensity for births, and why?
• Are there any patterns not found in the US data?

Very few children are born from August onwards. Most births are concentrated in the first half of the year.

We see a large number of children born on the 5th, 10th, 15th, 20th and 25th of each month – that is, round numbered dates.

Such round numbered patterns are a typical indication of fraud. Here, birthdates are brought forward to aid early school admission.

More births Fewer births … on average, for each day of the year (from 2007 to 2013)
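One simple way to quantify the pile-up on round numbered dates is to compare the average number of births on the 5th, 10th, 15th, 20th and 25th with the average on all other days. The sketch below does this under stated assumptions: the function name is hypothetical, only days 1 to 28 are used so that every month contributes equally, and the sample dates are constructed for illustration.

```python
from collections import Counter
from datetime import date

ROUND_DAYS = {5, 10, 15, 20, 25}

def round_day_excess(birth_dates):
    """Ratio of average births on round days to average births on other days.

    Only days 1-28 are compared, since every month has them.
    Assumes at least one birth falls on a non-round day.
    """
    per_day = Counter(d.day for d in birth_dates)
    round_counts = [per_day[d] for d in range(1, 29) if d in ROUND_DAYS]
    other_counts = [per_day[d] for d in range(1, 29) if d not in ROUND_DAYS]
    round_avg = sum(round_counts) / len(round_counts)
    other_avg = sum(other_counts) / len(other_counts)
    return round_avg / other_avg

# Constructed sample: one birth on every day, plus two extra on each round day.
births = [date(2010, 1, d) for d in range(1, 29)]
births += [date(2010, 1, d) for d in ROUND_DAYS for _ in range(2)]

ratio = round_day_excess(births)
print(ratio)
```

A ratio close to 1 means round dates are no more common than others; a ratio well above 1 is the kind of signal described above.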

Page 21: Analysis Patterns

THIS ADVERSELY IMPACTS CHILDREN’S MARKS

It’s a well-established fact that older children tend to do better at school in most activities. Since many children have had their birth dates brought forward, these younger children suffer.

Children “born” on the 1st, 5th, 10th, 15th etc. of the month tend to score lower marks on average.

Higher marks / Lower marks … on average, for children born on a given day of the year (from 2007 to 2013)

Children “born” on round numbered days score lower marks on average, due to a higher proportion of younger children

Page 22: Analysis Patterns

RANK SCALE DISTRIBUTIONS

Page 23: Analysis Patterns

AN ENERGY UTILITY DETECTED BILLING FRAUD

An energy utility (with over 50 million subscribers) had 10 years’ worth of customer billing data available.

Most fraud detection software failed to load the data, and sampled data revealed little or no insight.

Below is a simple histogram (or frequency distribution) of usage levels. Each bar represents the number of customers with a specific bill amount (in units, or kWh).

This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the slab boundaries.

Tariffs are based on the usage slab. Someone with 101 units is billed in full at a higher tariff than someone with 100 units. So people have a strong incentive to stay at or within a slab boundary.

This can happen in one of two ways.

First, people may be monitoring their usage very carefully, and turn off their lights and fans the instant their usage hits the slab boundary.

Or, more realistically, there’s probably some level of corruption involved, where customers pay a small sum to the meter reading staff to ensure that the reading stays exactly at the slab boundary, giving them the advantage of a lower price.
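The pile-up at slab boundaries can be measured by comparing the number of readings exactly at a boundary with the typical count at nearby usage levels. This is a rough sketch, not the utility's method: the function name, the window of 5 units and the sample readings are assumptions, and only the 100-unit boundary mentioned above is used.

```python
from collections import Counter

def boundary_spikes(readings, boundaries, window=5):
    """Compare the count at each slab boundary with the average count nearby.

    Returns {boundary: ratio}; a ratio well above 1 marks a pile-up at the boundary.
    """
    freq = Counter(readings)  # usage level -> number of readings (the histogram)
    spikes = {}
    for b in boundaries:
        neighbours = [freq[u] for u in range(b - window, b + window + 1) if u != b]
        baseline = sum(neighbours) / len(neighbours)
        spikes[b] = freq[b] / baseline if baseline else float("inf")
    return spikes

# Constructed readings: one at every level from 90 to 110, plus nine extra at 100.
readings = list(range(90, 111)) + [100] * 9

spikes = boundary_spikes(readings, boundaries=[100])
print(spikes)
```

Because the check needs only a frequency count per usage level, it can run over the full data set in a single pass, with no sampling.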

Page 24: Analysis Patterns

TN CLASS X: ENGLISH

[Figure: frequency chart; horizontal axis ticks 0 to 99, vertical axis ticks 0 to 40,000]

Page 25: Analysis Patterns

TN CLASS X: SOCIAL SCIENCE

[Figure: frequency chart; horizontal axis ticks 0 to 99, vertical axis ticks 0 to 40,000]

Page 26: Analysis Patterns

TN CLASS X: MATHEMATICS

[Figure: frequency chart; horizontal axis ticks 0 to 99, vertical axis ticks 0 to 40,000]

Page 27: Analysis Patterns

CBSE 2013 CLASS XII: ENGLISH MARKS

Page 28: Analysis Patterns

CLUSTERED CORRELATIONS

Page 29: Analysis Patterns

68% correlation between AUD & EUR

Plot of 6 month daily AUD - EUR values

Block of correlated currencies

… clustered hierarchically
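A clustered correlation view rests on two steps: compute the pairwise correlation between series, then repeatedly merge the most correlated groups. The sketch below uses Pearson correlation and a greedy average-linkage merge, which is an assumption since the slides do not name the clustering method. The series are labelled A to D and filled with made-up numbers; they are not currency data.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_by_correlation(series):
    """Greedy average-linkage clustering; returns the merge history."""
    corr = {frozenset(p): pearson(series[p[0]], series[p[1]])
            for p in combinations(series, 2)}

    def avg_corr(pair):
        # Mean correlation over all cross-cluster pairs of series.
        a, b = pair
        return sum(corr[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    clusters = [(name,) for name in series]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=avg_corr)
        merges.append((a, b, avg_corr((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Made-up series: A and B rise together, C and D fall together.
series = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 11],
    "C": [5, 4, 3, 2, 1],
    "D": [9, 8, 6, 5, 3],
}

merges = cluster_by_correlation(series)
for a, b, c in merges:
    print(a, "+", b, f"(average correlation {c:.2f})")
```

Ordering the rows and columns of the correlation matrix by the merge history is what brings correlated series next to each other as visible blocks.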

Page 30: Analysis Patterns

RESTAURANT: PRODUCT SALES CORRELATION

Page 31: Analysis Patterns

RESTAURANT: PRODUCT SALES CORRELATION

Page 32: Analysis Patterns

MAXIMAL TEXTUAL SEGMENTATION

Page 33: Analysis Patterns

WHAT TOPICS DID THE YOUNG & OLD FOCUS ON?

[Figure: department topics arranged along an axis from Young to Old. Topics shown: P.W.D.; Health and family welfare; Revenue; Rural Development and Panchayat Raj; Social Welfare; Urban Development; Water Resources; Minor Irrigation; Fuel; Housing; Agriculture; Primary Education; Primary and Secondary Education; Woman & Child Development; Higher Education; Home; Cooperative; Forest; Administrative Reforms; Labour; Food & Civil Supplies; Tourism; Finance; Animal Husbandry; Transportation; Horticulture; Muzrai; Haz & Wakf; Transport; Medical Education; Medium and Large Industries; Excise; Major & Medium Industries; Kannada & Culture; Textile; Fisheries; Parliamentary Affairs and Human Rights; Adult Education; Rural Water Supply and Sanitation; Mines & Geology; Small Industries; Youth and Sports; Sugar; Planning and Statistics; Agricultural Marketing; Rural Water Supply; Fisheries & Inland water transport; Small Scale Industries; Youth Service & Sports; Sericulture; Law & Human Rights; Prison; Planning; Information & Technology; Public Library]

Based on assembly session questions, Karnataka, 2008-2012

Page 34: Analysis Patterns

THE LANGUAGE OF TWEETS

Based on 1 week of geo-coded tweets from India, this visual shows words sized by frequency. Words on the left (in red) are used by people with few followers, while those on the right (in green) are used by people with many followers.

People with many followers use significantly more hash-tags and are perhaps more polite with ‘good morning’s and ‘thank you’s

People with few followers tend to talk more about ‘know’, ‘traffic’, ‘high’ etc
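The left-versus-right placement of words can be approximated by comparing how large a share of each group's words a given word takes up. The sketch below scores each word from +1 (used only by the first group) to -1 (used only by the second). The function name, the scoring formula and the sample texts are illustrative assumptions, not the method behind the visual.

```python
from collections import Counter
import re

def word_contrast(texts_a, texts_b, min_count=2):
    """Score each word by how much more group A uses it than group B.

    Scores lie in [-1, 1]: +1 means used only by group A, -1 only by group B.
    Words seen fewer than `min_count` times overall are dropped.
    """
    def word_share(texts):
        counts = Counter(w for t in texts for w in re.findall(r"[#\w']+", t.lower()))
        total = sum(counts.values()) or 1
        return counts, {w: c / total for w, c in counts.items()}

    counts_a, share_a = word_share(texts_a)
    counts_b, share_b = word_share(texts_b)
    scores = {}
    for word in set(share_a) | set(share_b):
        if counts_a[word] + counts_b[word] < min_count:
            continue
        a, b = share_a.get(word, 0.0), share_b.get(word, 0.0)
        scores[word] = (a - b) / (a + b)
    return scores

# Made-up texts standing in for the two groups of tweets.
few_followers = ["stuck in traffic again", "traffic is so bad today"]
many_followers = ["good morning everyone", "good morning and thank you"]

scores = word_contrast(few_followers, many_followers)
print(sorted(scores.items(), key=lambda kv: kv[1]))
```

Using shares rather than raw counts keeps a group that simply tweets more from dominating the comparison.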

Page 35: Analysis Patterns

PARLIAMENT DECISIONS

[Figure: words appearing in the decisions: promotion, scheme, project, approved, development, agreement, amendment, central, act, section, limited, bill, laning, plan, government, new, ltd, phase, approval, sector, state, setting, investment, pradesh, policy, four, programme, amendments, indian, extension, institute, commission, nhdp, technology, proposal, iii, implementation, fund, establishment, equity, assistance, cooperation, transfer, infrastructure, corporation, international, mou, cabinet, company, public, year, revised, construction, services, continuation, approves, states, education, additional, financial, revision, sponsored, port, mission, centrally, basis, signing, protection, management, capital, bank, two, projects, research, upgradation, rural, special, land, delhi, employees, existing, committee, relief, convention, six, crore, payment, power, health, cost, package, institutions, acquisition, control, restructuring, air, grant, field, university, scheduled]

PRE-2009 | 2009 AND AFTER

Decisions related to intervention, assistance and relief were almost entirely concentrated in pre-2009

The number of international agreements has declined dramatically between pre-2009 and post-2009

A significant rise in the number of decisions related to the States is seen post 2009 – in contrast with the focus on “Central” pre-2009

Decisions to increase the number of lanes on highways grew significantly post-2009, especially as part of the CCI (Cabinet Committee on Infrastructure) decisions

Page 36: Analysis Patterns

WHAT DO FINANCIAL ANALYSTS ASK IBM VS MSFT?

Page 37: Analysis Patterns

BIPARTITE NETWORK CLUSTERING

Page 38: Analysis Patterns

VISUALISING THE MAHABHARATA

How does the Mahabharata, one of the largest epics with 1.8 million words, lend itself to text analytics?

Can this ‘unstructured data’ be processed to extract analytical insights?

What does sentiment analysis of this tome convey?

Is there a better way to explore relations between characters?

How can closeness of characters be analysed & visualized?

Page 39: Analysis Patterns

DIRECTORSHIPS AT THE TATAS

Every person who was a Director at the Tata Group is shown here as an orange circle. The size of the circle is based on the number of directorship positions held over their lifetime.

Every company in the Tata Group is shown here as a blue circle. The size of the circle is based on the number of directors the company has had over time.

Every directorship relation is shown by a line. If a person has held a directorship position at a company, the two are connected by a line.

The group appears to be divided into two clusters based on the network of directorship roles.

Companies shown: Tata Teleservices; Tata Consultancy Services; Tata Business Support Services; Tata Global Beverages; Tata Infotech (merged); Tata Toyo Radiator; Honeywell Automation India; Tata Communications; A G C Networks; Tata Technologies; Tata Projects; Tata Power; Tata Finance; Idea Cellular; Tata Motors; Tata Sons; Tata Steel; Tayo Rolls; Tata Securities; Tata Coffee; Tata Investment Corp

Directors shown: A J Engineer; H H Malgham; H K Sethna; Keshub Mahindra; Ravi Kant; Russi Mody; Sujit Gupta; A S Bam; Amal Ganguli; D B Engineer; D N Ghosh; M N Bhagwat; N N Kampani; U M Rao; B Muthuraman; Ishaat Hussain; J J Irani; N A Palkhivala; N A Soonawala; R Gopalakrishnan; Ratan Tata; S Ramadorai; S Ramakrishnan

Prominent leaders bridge the groups

Second group of companies

First group of companies

Some directors are mainly associated with the first group of companies

Some directors are mainly associated with the second group of companies
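The structure of this picture is a bipartite network: directors on one side, companies on the other, and a line for each directorship. A minimal sketch of how such a network can be built and how bridging directors can be found is given below. All names are placeholders, no real directorship is encoded, and treating a "bridge" as a director whose removal splits the companies into more groups is an assumption about how the clusters were identified.

```python
from collections import defaultdict

def build_links(directorships):
    """Index (director, company) pairs in both directions."""
    companies_of = defaultdict(set)  # director -> companies (circle size = len)
    directors_of = defaultdict(set)  # company -> directors (circle size = len)
    for director, company in directorships:
        companies_of[director].add(company)
        directors_of[company].add(director)
    return companies_of, directors_of

def company_clusters(companies_of, directors_of, ignore=()):
    """Groups of companies connected through shared directors not in `ignore`."""
    seen, clusters = set(), []
    for start in directors_of:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            company = stack.pop()
            if company in group:
                continue
            group.add(company)
            for director in directors_of[company]:
                if director not in ignore:
                    stack.extend(companies_of[director] - group)
        seen |= group
        clusters.append(group)
    return clusters

def bridging_directors(companies_of, directors_of):
    """Directors whose removal splits the companies into more clusters."""
    base = len(company_clusters(companies_of, directors_of))
    return sorted(d for d in companies_of
                  if len(company_clusters(companies_of, directors_of, ignore={d})) > base)

# Placeholder data: two groups of companies joined by a single director.
links = [
    ("Director 1", "Company A"), ("Director 1", "Company B"),
    ("Director 2", "Company A"), ("Director 2", "Company B"),
    ("Director 3", "Company C"), ("Director 3", "Company D"),
    ("Director 4", "Company C"), ("Director 4", "Company D"),
    ("Bridge director", "Company B"), ("Bridge director", "Company C"),
]

companies_of, directors_of = build_links(links)
bridges = bridging_directors(companies_of, directors_of)
print(bridges)
```

In this placeholder data the companies form one connected network until the bridging director is ignored, at which point they fall into two groups.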

Page 40: Analysis Patterns

[Figure: analysis tools placed on a chart labelled Manual exploration, Automated insights, More problems, Tougher problems and Deep insights. Tools shown: EXCEL, TABLEAU, QLIK, R, SAS, SPSS, TENSORFLOW, THEANO, SPOTFIRE, MICROSTRATEGY, COGNOS, CAFFE, TORCH]

AUTOLYSIS

This fills a gap in the pattern-based analysis space