Text Mining SAS-L Topics

Hoyle paper 019-31 SUGI 31 Text Mining SAS-L Topics Larry Hoyle, Policy Research Institute, University of Kansas

Page 1: Text Mining SAS-L Topics

Text Mining SAS-L Topics

Larry Hoyle, Policy Research Institute, University of Kansas

Page 2: Text Mining SAS-L Topics

SAS-L topics

• Read each weekly topic list from http://www.listserv.uga.edu/archives/sas-l.html

• Parse topic, HTMLdecode

• Strip "Re: "

/* strip variations of re: (any case, optional spaces around the colon) */
topicRE = prxparse('/^ *re *: *(.*)/i');
if prxmatch(topicRE, topic) then do;
  topic = prxposn(topicRE, 1, topic);
end;

• Proc SQL to aggregate topic counts across weeks
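The slides do not show the aggregation step itself. The following is a minimal sketch of how the cleaning and the Proc SQL step might fit together, not the author's actual code. The dataset names (work.weekly_topics, work.clean_topics, work.topic_counts), the columns week and n_msgs, and the sample rows are all assumptions made for illustration. The loop goes slightly beyond the single pass shown above so that stacked prefixes such as "Re: Re: " are also removed.

```sas
/* Hypothetical input: one row per topic line per weekly index page. */
data work.weekly_topics;
  infile datalines dsd truncover;
  length week $5 topic $200 n_msgs 8;
  input week $ topic $ n_msgs;
  datalines;
0501a,Re: proc sql question,4
0501b,RE : Re: proc sql question,3
0501b,macro quoting,2
;
run;

data work.clean_topics;
  set work.weekly_topics;
  /* A constant pattern is compiled only once by PRXPARSE. */
  topicRE = prxparse('/^ *re *: *(.*)/i');
  /* Repeat until no leading "re:" variant remains. */
  do while (prxmatch(topicRE, topic));
    topic = prxposn(topicRE, 1, topic);
  end;
  drop topicRE;
run;

/* Merge the same topic across weeks and total its message count. */
proc sql;
  create table work.topic_counts as
  select topic,
         count(distinct week) as n_weeks,
         sum(n_msgs)          as total_msgs
  from work.clean_topics
  group by topic
  order by total_msgs desc;
quit;
```

With the sample rows above, both "proc sql question" lines collapse into a single thread spanning two weeks, which is the kind of merging across weeks the bullet describes.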

Page 3: Text Mining SAS-L Topics

SAS-L 2005

• 35324 thread/topic lines in the HTML files

• 7081 threads after merging across weeks and a little cleaning

Page 4: Text Mining SAS-L Topics

SAS-L Top Threads in Number of Messages

Page 5: Text Mining SAS-L Topics

Text Miner on the SAS-L topics

Page 10: Text Mining SAS-L Topics

Largest clusters

Page 11: Text Mining SAS-L Topics

Smaller Clusters

Page 12: Text Mining SAS-L Topics

Message Content

Page 13: Text Mining SAS-L Topics

Web scraping with tmfilter

options noxwait;

%macro aweek(week=0501a);
  x "md C:\ddrive\projects\sugs\sugi31\SASLBOF\posts\&week";
  x "md C:\ddrive\projects\sugs\sugi31\SASLBOF\filteredposts\&week";

  libname sugi31 'C:\ddrive\projects\sugs\sugi31\SASLBOF\datasets';

  %tmfilter(dataset=sugi31.SL&week.,
            dir=C:\ddrive\projects\sugs\sugi31\SASLBOF\posts\&week,
            destdir=C:\ddrive\projects\sugs\sugi31\SASLBOF\filteredPosts\&week,
            URL=http://listserv.uga.edu/cgi-bin/wa?A1=ind&week.%NRSTR(&L=sas-l),
            depth=1,
            links=sugi31.SL&week.L,
            norestrict=' ',
            numchars=2000)
%mend aweek;

%aweek(week=0501a);
%aweek(week=0501b);
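The two calls above cover only the first two weekly archives of January 2005. One way the remaining calls could be generated is with a small driver macro. This is a sketch under assumptions: the macro name %allweeks is hypothetical, and it assumes the weekly archive names follow the pattern yymm plus a letter from a to e, of which only 0501a and 0501b appear in the slides. The call to %aweek is left commented out, so the sketch only writes the generated week names to the log.

```sas
/* Hypothetical driver: builds names such as 0501a ... 0512e. */
%macro allweeks(year=05);
  %local m mm i letters wk;
  %let letters = abcde;
  %do m = 1 %to 12;
    /* Zero-pad the month to two digits. */
    %let mm = %sysfunc(putn(&m, z2.));
    %do i = 1 %to 5;
      %let wk = &year.&mm.%substr(&letters, &i, 1);
      %put week=&wk;
      /* %aweek(week=&wk) */
    %end;
  %end;
%mend allweeks;

%allweeks()
```

Not every month necessarily has a fifth weekly archive, so a real run would need to tolerate or skip week names that do not exist on the server.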

Page 14: Text Mining SAS-L Topics

Parse date and sender

Page 15: Text Mining SAS-L Topics

Using a 10% sample of message text

Page 16: Text Mining SAS-L Topics

Using a 10% sample of message text
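The slides do not say how the 10% sample was drawn. One common approach is a simple random sample with PROC SURVEYSELECT (part of SAS/STAT), sketched below. The dataset names work.messages and work.messages_sample, the msg_id column, and the seed value are assumptions for illustration only.

```sas
/* Hypothetical stand-in for the dataset of filtered messages. */
data work.messages;
  do msg_id = 1 to 1000;
    output;
  end;
run;

/* Simple random sample of 10% of the rows; the seed makes it repeatable. */
proc surveyselect data=work.messages
                  out=work.messages_sample
                  method=srs
                  samprate=0.10
                  seed=31;
run;
```

Fixing the seed keeps the sample stable between runs, which helps when comparing Text Miner results before and after changing term filters.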

Page 17: Text Mining SAS-L Topics

Filter out too common terms, listserv

Page 18: Text Mining SAS-L Topics

Filter out too common terms, listserv