EXPANDING IDENTIFIERS TO NORMALIZE SOURCE CODE VOCABULARY
PRESENTED BY DAWN LAWRIE
LOYOLA UNIVERSITY MARYLAND
IN COLLABORATION WITH DAVE BINKLEY
Friday, October 7, 11
VOCABULARY MISMATCH
DIFFERENT VOCABULARY IN SOURCE CODE AND OTHER SOFTWARE ARTIFACTS
EXAMPLE
REQUIREMENT - “FEATURE LOCATION”
SOURCE CODE - “FEATURELOCATION”
OR WORSE “FLOC”
PURPOSE OF NORMALIZE
COPE WITH VOCABULARY MISMATCH
SOURCE CODE
OTHER SOFTWARE DOCUMENTS
EXAMPLE PROBLEMS
CONSIDER IDENTIFIERS
FEATURELOCATION -> FEATURE LOCATION (SPLITTING PROBLEM)
FLOC -> F LOC -> FEATURE LOCATION (SPLITTING AND EXPANSION PROBLEM)
WHY NORMALIZE?
MANY SE PROBLEMS CAN BE ADDRESSED USING INFORMATION RETRIEVAL (IR) TECHNIQUES
UN-NORMALIZED CODE LEADS TO AN UNDERESTIMATE OF THE IMPORTANCE OF CRUCIAL WORDS
NORMALIZE PROBLEM STATEMENT
FIND THE BEST EXPANSION OVER ALL POSSIBLE SPLITS
FLOC -> FEATURE LOCATION
NORMALIZE ALGORITHM
TERMINOLOGY
HARD-WORDS - WHITEHOUSE_LAWN (2)
SOFT-WORDS - WHITE-HOUSE_LAWN (3)
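The terminology above can be illustrated with a minimal sketch. It assumes hard words are the pieces separated by explicit markers (underscores or camelCase boundaries); the function name `hard_words` is hypothetical and is not part of the Normalize tool. Soft words such as WHITE-HOUSE need a real splitter, since they have no marker between them.

```python
import re

def hard_words(identifier):
    """Split an identifier at explicit markers: underscores and case changes."""
    words = []
    for part in identifier.split("_"):
        # Break "fooBar" or "HTTPServer" style runs at camelCase boundaries.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]

print(hard_words("whitehouse_lawn"))   # ['whitehouse', 'lawn']
print(hard_words("featureLocation"))   # ['feature', 'location']
```

Note that `whitehouse` stays a single hard word here; recovering `white` and `house` is the soft-word splitting problem the rest of the talk addresses.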
NORMALIZE ALGORITHM
STRLEN -> STRING LENGTH
MACHINE TRANSLATION APPROACH
EL PAPA VISITA LA IGLESIA
EL: THE
PAPA: FATHER, POTATO, POPE
VISITA: VISITS, VISITOR, HIT
LA IGLESIA: THE CHURCH
STRONG COHESION
NORMALIZE ALGORITHM
STRLEN
S-TRLEN ST-RLEN STR-LEN STRL_EN STRLE_N
S_T_RLEN S-TR-LEN S_TRL_EN S_TRLE_N ST_R_LEN ST_RL_EN ST_RLE_N STR_L_EN STR_LE_N STRL_E_N
S_T_R_LEN S_T_RL_EN S_T_RLE_N S_TR_L_EN S_TR_LE_N S_TRL_E_N ST_R_L_EN ST_R_LE_N ST_RL_E_N STR_L_E_N
S_T_R_L_EN S_T_R_LE_N S_TR_L_E_N ST_R_L_E_N
S-T-R-L-E-N
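The list of candidate splits for STRLEN can be generated mechanically. The sketch below is only an illustration of exhaustive enumeration (the function name `all_splits` is hypothetical); a six-letter word has five possible cut points and therefore 2^5 = 32 splits, including the unsplit word.

```python
from itertools import combinations

def all_splits(word):
    """Return every way to cut `word` into contiguous non-empty pieces."""
    n = len(word)
    splits = []
    for k in range(n):  # k = number of cut points, from 0 to n - 1
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            splits.append([word[a:b] for a, b in zip(bounds, bounds[1:])])
    return splits

splits = all_splits("strlen")
print(len(splits))           # 32
print("-".join(splits[3]))   # str-len
```

Because the count doubles with every extra character, a practical splitter would return only a few promising splits rather than all of them.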
NORMALIZE ALGORITHM
ST-RLEN
E(RLEN) = {RIFLEMEN}
WILDCARD EXPANSION
R*L*E*N*
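The wildcard pattern R*L*E*N* can be sketched as a regular expression in which each letter of the soft word is followed by any run of letters. The word list and function names below are hypothetical stand-ins; the dictionary actually used by Normalize is not given on the slides.

```python
import re

def wildcard_pattern(abbrev):
    """Turn 'rlen' into the regex r[a-z]*l[a-z]*e[a-z]*n[a-z]*."""
    return re.compile("".join(re.escape(c) + "[a-z]*" for c in abbrev.lower()))

def expansions(abbrev, dictionary):
    """Return dictionary words that match the wildcard pattern in full."""
    pattern = wildcard_pattern(abbrev)
    return [w for w in dictionary if pattern.fullmatch(w)]

# Hypothetical dictionary, chosen to mirror the words on the slides.
WORDS = ["riflemen", "length", "lender", "string", "steer", "set", "stop"]

print(expansions("rlen", WORDS))  # ['riflemen']
print(expansions("len", WORDS))   # ['length', 'lender']
print(expansions("str", WORDS))   # ['string', 'steer']
```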
NORMALIZE ALGORITHM
ST-RLEN: E(ST) = {SET, STOP, STRING}, E(RLEN) = {RIFLEMEN}
STR-LEN: E(STR) = {STEER, STRING}, E(LEN) = {LENDER, LENGTH}
NORMALIZE ALGORITHM PART I
STR: STRING VS STEER
COHESION A = (STRING, LENDER) + (STRING, LENGTH)
COHESION B = (STEER, LENDER) + (STEER, LENGTH)
1. FIND COHESION BY SUMMING LOG OF PROBABILITIES OF WORD PAIRS
2. SELECT EXPANSION THAT MAXIMIZES COHESION
STR -> STRING
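The two steps of Part I can be sketched as follows. The co-occurrence probabilities are invented for illustration (the talk draws them from a corpus), and the floor value for unseen pairs is an assumed smoothing choice, not something stated on the slides.

```python
import math

# Hypothetical co-occurrence probabilities for word pairs.
PAIR_PROB = {
    ("string", "length"): 1e-4,
    ("string", "lender"): 1e-8,
    ("steer", "length"): 1e-7,
    ("steer", "lender"): 1e-8,
}
FLOOR = 1e-12  # assumed probability for pairs never seen together

def pair_prob(a, b):
    return PAIR_PROB.get((a, b), PAIR_PROB.get((b, a), FLOOR))

def cohesion(candidate, other_words):
    """Step 1: sum the log probabilities of the candidate paired with each other word."""
    return sum(math.log(pair_prob(candidate, w)) for w in other_words)

def best_expansion(candidates, other_words):
    """Step 2: pick the candidate whose cohesion is largest."""
    return max(candidates, key=lambda c: cohesion(c, other_words))

print(best_expansion(["string", "steer"], ["lender", "length"]))  # string
```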
NORMALIZE ALGORITHM PART II
STR-LEN VS ST-RLEN
STRING LENGTH VS STOP RIFLEMEN
1. FIND COHESION OVER EXPANSIONS
2. SELECT EXPANSION OF THE SPLIT THAT MAXIMIZES COHESION
STRLEN -> STRING LENGTH
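Part II compares whole splits rather than single soft words. A minimal sketch, assuming cohesion is summed over every pair of expanded words and using invented probabilities; the names `EXPANSIONS`, `cohesion`, and `normalize` are hypothetical.

```python
import math
from itertools import product

# Hypothetical pair probabilities and expansion sets taken from the STRLEN example.
PAIR_PROB = {("string", "length"): 1e-4, ("stop", "riflemen"): 1e-10}
FLOOR = 1e-12  # assumed probability for unseen pairs
EXPANSIONS = {
    "str": ["steer", "string"],
    "len": ["lender", "length"],
    "st": ["set", "stop", "string"],
    "rlen": ["riflemen"],
}

def cohesion(words):
    """Sum log probabilities over every unordered pair of expanded words."""
    total = 0.0
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            total += math.log(PAIR_PROB.get((a, b), PAIR_PROB.get((b, a), FLOOR)))
    return total

def normalize(splits):
    """Return the (split, expansion) with the highest cohesion across all splits."""
    best = None
    for split in splits:
        for expansion in product(*(EXPANSIONS[s] for s in split)):
            score = cohesion(expansion)
            if best is None or score > best[0]:
                best = (score, split, expansion)
    return best[1], best[2]

print(normalize([("str", "len"), ("st", "rlen")]))
# (('str', 'len'), ('string', 'length'))
```

One caveat of this simplified scoring: splits with more soft words contribute more (negative) log terms, so a real implementation would likely need some normalization when comparing splits of different sizes.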
ADDING CONTEXT
DIR
E(DIR) = {DIRECTION, DIRECTORY}
CONTEXT = {FORWARD, BACKWARD}
FIND COHESION WITH CONTEXT WORDS IN ADDITION TO EXPANSIONS OF OTHER SOFT WORDS
USED IN BOTH PART I AND PART II
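The effect of context can be seen by scoring the same soft word against two different context sets. This is a sketch with invented probabilities; the second context, {FILE, PATH}, is a hypothetical addition used only to show the choice flipping.

```python
import math

# Hypothetical co-occurrence probabilities between expansions and context words.
PAIR_PROB = {
    ("direction", "forward"): 1e-5,
    ("direction", "backward"): 1e-5,
    ("directory", "file"): 1e-4,
    ("directory", "path"): 1e-4,
}
FLOOR = 1e-12  # assumed probability for unseen pairs

def cohesion_with_context(candidate, context, other_expansions=()):
    """Sum log probabilities against context words and other soft-word expansions."""
    words = list(context) + list(other_expansions)
    return sum(math.log(PAIR_PROB.get((candidate, w), FLOOR)) for w in words)

def expand(candidates, context):
    return max(candidates, key=lambda c: cohesion_with_context(c, context))

candidates = ["direction", "directory"]
print(expand(candidates, ["forward", "backward"]))  # direction
print(expand(candidates, ["file", "path"]))         # directory
```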
NORMALIZE IMPLEMENTATION
USES GenTest TO SPLIT IDENTIFIERS
RETURNS MULTIPLE SPLITS
GOOGLE 5-GRAM DATASET
EVALUATION
Program      LoC      SLoC     Unique Ids
which-2.20   3,670    2,293    487
a2ps-4.14    62,347   38,436   4,393

Program      Selected Ids   Hard Words   Soft Words
which-2.20   487            903          1,214
a2ps-4.14    211            459          618
EVALUATION
THREE GROUPS OF IDENTIFIERS
STANDARD LIBRARY CALLS
NAMES FROM STANDARD HEADER FILES / KEYWORDS
DOMAIN NAMES

Program      Filtered Ids   Reported Ids
which-2.20   152            335
a2ps-4.14    46             166
EXAMPLE EXPANSIONS
Id         Top 10 Expansion   Top Expansion
nextchar   next_character     next_character
indfound   index_found_need   index_found
optarg     option_are_g       optarg
itemno     i_them_not         itemno
RESEARCH QUESTIONS
WHAT IS THE OVERALL ACCURACY OF NORMALIZE?
DOES THE VOCABULARY USED HAVE A SIGNIFICANT IMPACT ON THE EXPANSION’S ACCURACY?
CAN THE EXPANDER INFORM THE SPLITTER?
CAN THE SPLITTER INFORM THE EXPANDER?
ACCURACY ON DOMAIN IDS
SOURCE OF EXPANSION WORDS
SOURCE CODE
INTERNAL DOCUMENTATION
MANUAL
BEST VOCABULARY SOURCE?
FUTURE WORK
EXPLORING DIFFERENT SOURCES OF CO-OCCURRENCE DATA
EXPLORING DIFFERENT WAYS OF CALCULATING PROBABILITIES
EXAMINING NORMALIZATION IN CONTEXT OF AN INFORMATION RETRIEVAL TASK
SUMMARY
IDENTIFIERS ARE WRITTEN DIFFERENTLY THAN OTHER SOFTWARE DOCUMENTS
DEGRADES PERFORMANCE OF IR TECHNIQUES
NORMALIZE CURRENTLY EXPANDS ABOUT HALF OF SOFT WORDS CORRECTLY
QUESTIONS?
Need an identifier split? GenTest Splitter available at
splitit.cs.loyola.edu