College Data Mining

1. Briefly discuss how the Lerner College of Business could use data mining in each of the

following situations:

a. When deciding which undergrads to admit to the College of Business as internal transfers from other colleges:

A data mining tool can help the Lerner College of Business select the best candidates from the internal transfer applications in its admissions database. It can also predict, with some degree of accuracy, whether a candidate will graduate:

Classification: On a training data set, a rule induction algorithm learns to separate applicants according to attributes such as “undergraduate”, “transfer”, “same major”, “different major”, “current college” and “cumulative grade point average”. Once the model produces relevant classifications on the training data, it can be applied to a validation data set.

Another technique for separating the best candidates from the rest is a decision tree built from if-then statements. A sample decision tree is represented on the next page. Compared with neural networks or rule induction algorithms, this technique is most useful when the data labels or variables are finite and hierarchical.
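
The if-then structure of such a tree can be sketched in a few lines of Python. This is only an illustration: the `Applicant` fields, the `prerequisites_done` flag and every threshold below are assumptions made for the example, not actual Lerner admission criteria.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    gpa: float                # cumulative grade point average
    credits_completed: int    # credits earned so far
    prerequisites_done: bool  # hypothetical flag: required intro courses finished


def transfer_decision(a: Applicant) -> str:
    """Walk a hand-written decision tree; all thresholds are illustrative only."""
    if not a.prerequisites_done:
        return "reject"
    if a.gpa >= 3.5:
        return "admit"
    if a.gpa >= 3.0:
        # Borderline GPA: use credits completed as a tie-breaker.
        return "admit" if a.credits_completed >= 30 else "review"
    return "reject"


# Hypothetical applicants, used only to exercise each branch.
applicants = [
    Applicant(gpa=3.7, credits_completed=24, prerequisites_done=True),
    Applicant(gpa=3.2, credits_completed=18, prerequisites_done=True),
    Applicant(gpa=3.9, credits_completed=45, prerequisites_done=False),
]
decisions = [transfer_decision(a) for a in applicants]
print(decisions)
```

In practice the splits would be learned from historical admissions data rather than written by hand; the sketch only shows why finite, hierarchical variables map naturally onto nested if-then rules.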

Clustering: This can be a useful technique if the data labels are unknown. Algorithms such as k-means or TwoStep can group the data points in a large data set according to their similarity. With the TwoStep clustering algorithm, we can differentiate between the “transfers” and “other applicants”. The result can then be confirmed with the k-means algorithm.
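
As a rough sketch of the k-means step only (TwoStep is not shown), the following plain-Python version alternates between assigning points to the nearest centroid and recomputing the centroids. The two features (GPA, semesters enrolled), the sample points and the choice of starting centroids are all assumptions made for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then update centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Hypothetical applicants as (GPA, semesters enrolled so far).
points = [(3.6, 2), (3.4, 3), (3.8, 2), (2.9, 7), (3.0, 8), (2.7, 6)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

The cluster numbers carry no meaning by themselves; an analyst still has to inspect each group to decide whether it corresponds to “transfers” or “other applicants”.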

Prediction: The National Student Clearinghouse now allows community colleges and universities to match their data. This means that data miners and decision makers can compare the academic behavior of a student at a community college who is applying to Lerner, especially one transferring from another major, to predict the transfer outcome: “dropout”, “speeder” or “laggard”. The “transfers” cluster can be split further into “speeders”, who complete their degrees quickly because of their privileged socio-economic backgrounds; “laggards”, who take their time in completing them; and “dropouts”, who will never complete the course. Other variables to compare are student demographics, courses taken, units accumulated and financial aid. Supervised data mining with neural networks (Neural Net) and rule induction algorithms (C5.0 or C&RT) run simultaneously can then give the tool a prediction accuracy of between 72 and 80%.
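
A prediction accuracy figure of this kind is simply the share of held-out students whose predicted outcome matches the actual one. The sketch below shows that calculation for two models side by side; the outcome lists are invented for illustration and do not reproduce any real result.

```python
def accuracy(predicted, actual):
    """Fraction of records whose predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical held-out outcomes and the labels two models assigned to them.
actual = ["speeder", "laggard", "dropout", "laggard", "speeder"]
neural_net = ["speeder", "laggard", "laggard", "laggard", "speeder"]
rule_model = ["speeder", "dropout", "dropout", "laggard", "laggard"]

print(f"neural net accuracy: {accuracy(neural_net, actual):.2f}")
print(f"rule model accuracy: {accuracy(rule_model, actual):.2f}")
```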

b. When scheduling classes for MBA students in the part-time program, i.e., which classes to

offer each semester and which night to offer them on.

The data mining techniques explained earlier can also be used to schedule classes for part-time MBA students at Lerner. These techniques can also help decide where each class should be placed in the curriculum and how classes should be spread across the week.

Pattern recognition can reveal hidden patterns in which courses a part-time MBA student takes per semester and how frequently classes for each course are scheduled per week at other universities. These patterns can help formulate general rules for the credits required of part-time students.

Association rules derived from the course patterns found above can be compared with the availability of the lecturers to help build the academic calendar. Another rule can help determine which students go on vacation frequently and which do not. This can give the college insight into which semester the core courses should be placed in, and how electives should be scheduled in semesters where lecturers and students are more likely to take vacations.
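
An association rule is usually judged by its support (how often the items occur together) and its confidence (how often the consequent appears when the antecedent does). A minimal sketch of both measures follows; the course names and enrollment records are hypothetical.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from co-occurrence counts."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical per-student course selections for one semester.
enrollments = [
    {"Finance", "Accounting"},
    {"Finance", "Accounting", "Marketing"},
    {"Finance", "Marketing"},
    {"Accounting", "Operations"},
]

rule_support = support(enrollments, {"Finance", "Accounting"})
rule_confidence = confidence(enrollments, {"Finance"}, {"Accounting"})
print(f"Finance -> Accounting: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule with high support and confidence suggests that the two courses should not be scheduled on the same night, since many students want both.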

A MapReduce implementation, though primarily a big data tool, can help the university understand traffic conditions around the campus on different weekdays. This can provide further insight into which days the university should schedule its classes.
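
The idea can be imitated on one machine with Python's built-in `map` and `functools.reduce`. This is a single-process sketch of the map and reduce steps, not a distributed job, and the `"weekday,vehicles"` record format and the readings themselves are assumptions made for the example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Parse one 'weekday,vehicles' reading into a (weekday, count) pair."""
    weekday, vehicles = line.split(",")
    return weekday, int(vehicles)


def reduce_phase(totals, pair):
    """Fold one (weekday, count) pair into running [sum, n] totals per weekday."""
    weekday, vehicles = pair
    totals[weekday][0] += vehicles
    totals[weekday][1] += 1
    return totals


# Hypothetical evening traffic readings near campus (vehicles per hour).
readings = ["Mon,900", "Mon,1100", "Tue,600", "Tue,800", "Wed,1200"]

totals = reduce(reduce_phase, map(map_phase, readings), defaultdict(lambda: [0, 0]))
average = {day: total / n for day, (total, n) in totals.items()}
quietest = min(average, key=average.get)
print(average, quietest)
```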

Classification: Attributes such as “pre-requisite courses taken”, “professional profile”, “majors selected”, “credits completed” and “years of experience” can be used to identify the part-time students who are interested in or eligible for the available courses, on the nights when the lecturers are free and the students can also come to the college.

Clustering: Skills that help students complete each course can be clustered together. Association rules can then be applied to these clusters to identify relationships among different skills and how common they are across courses. The most popular clustered skills can be grouped into a pre-requisite course and placed at the beginning of the academic year for the MBA student.

On the basis of historical data, sequential relationships can be determined for which classrooms will be available each night and their respective capacities. A prediction algorithm such as C4.5, which builds a decision tree, can be used to predict which applicants will enroll and therefore to estimate the number of enrollments in the classes scheduled each semester.

2. Accenture is an international consulting firm. Go to the URL below and listen to the mp3

audio file entitled “Analytics Panel during the Tribeca Film Festival”. Then list the three most

important points that you feel it makes.

This talk at the Tribeca Film Festival (2009) revolves around three themes: Analytics, Decisions and Execution.

Analytics: Each definition suggested by the experts in this talk holds true. Analytics is the application of statistically rigorous techniques to data, but you need to define the right attributes and ask the right questions. Each of these attributes should be correctly prioritized and weighted, so that you can make the right decisions based on the analysis and then execute them in a timely manner.

It is not only about technology, but about people and processes as well. It needs to be ingrained in the business processes and supported by heuristics. Companies that use analytics are innovative in a rigorous way. They do it consistently. It is not a one-off gimmick.

Decisions: Decisions can be about anything – small or big. They can be about predicting injuries

or optimizing expenditure or even which movie will be the biggest hit of the year or who is

statistically best to star in it!

You cannot expect analytics to give you 100 percent of the data. The point is how much better you can decide with data than without it. The aim is not to get the best solution ever, but a better one than any competitor's. Studies show that managers can make better decisions with the help of data. The intent is to strike a balance between art and science. If you don’t ask the right questions at the right time, analytics cannot help you. This is where knowledge and experience come into the picture.

Execution: The last major point this group makes is that making the right decisions based on facts is not enough. It is imperative to make sure that these decisions are executed and implemented all the way through. There should be parameters, metrics and ways to measure the effectiveness of the experiment being conducted. That is how a company can tell whether the decision is producing the desired results. If not, it has to tweak its questions, redo the analytics and re-run the implementation until the result is closer to the desired goal. This has to be a consistent effort too. Starting bottom-up may give analytics an edge to develop over time: it can be easier to start, deliver some early wins, cost less and offer cleaner performance metrics to measure against.
