
1. Introduction

1.1 Problem Statement

Chrobis Web Designs (CWD) is a small, independent web design company run by IT consultant Rob Moriarty, which produces custom web sites and underlying supporting systems for a range of clients. Given the bespoke nature of their work, CWD uses a small team of in-house programmers and a number of external contractors to produce its products. The products themselves range widely in terms of technology: PHP, XHTML, CSS, Java, JavaScript, Python, Perl, XML and Ruby are some of the many languages used to create often complex systems, with tens of thousands of lines of code typically found in even the relatively smaller projects.

It is a requirement of CWD that all code on their projects is produced from scratch or from pre-existing CWD code; on occasions where this is not practical, non-native CWD code must be clearly marked as such to the project manager (typically Mr Moriarty himself). Mr Moriarty feels that this ensures his programmers are fully conversant with every system and can thus more easily diagnose problems and bugs, and that producing their own code gives them the ability to improve on commonly-used code libraries, interfaces and widgets, giving them a commercial edge. In addition, CWD frequently releases older components under the GNU General Public Licence (GPL) as part of the company's ethos of giving back to the community. It is therefore essential on all of these counts that CWD can be certain the code is the original work of its employees.

Recently the company was threatened with legal action when code belonging to a programmer who was not an employee of CWD was found in one of its products. A code review conducted shortly thereafter revealed a number of instances where this had occurred. CWD is anxious to ensure that this does not happen again. The company currently uses an existing commercial web-based anti-plagiarism system, but this has failed to highlight any of the aforementioned cases.

This problem is in itself part of a much larger problem – proving that any given file of source code was indeed produced by the claimed author. Further research (see section 3.3) has concluded that present systems and approaches are largely inadequate, relying principally on comparing a specimen against other matching code (which may not be locatable or even exist). An approach is needed whereby authorship of a file can be inferred by close examination of its characteristics.

1.2 Project Aim and Objectives

The aim of this project is to design, develop and implement a system which establishes a “digital fingerprint” for a given programmer – a collection of regularly-occurring idiosyncrasies in the style and design choices made by the author in the creation of his/her source code files – which can be used to verify future work by that author.
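To illustrate the idea, a first approximation of such a fingerprint could be a vector of simple lexical measurements taken from a source file. The concrete metrics will only be settled during the design stage, so the features below (indentation habits, identifier casing, line length) are purely hypothetical examples, sketched here in Python:

```python
import re

def style_features(source: str) -> dict:
    """Measure a few stylistic habits in a source file's text."""
    code_lines = [l for l in source.splitlines() if l.strip()]
    return {
        # Indentation habit: does the author indent with tabs or spaces?
        "tab_indents": sum(1 for l in code_lines if l.startswith("\t")),
        "space_indents": sum(1 for l in code_lines if l.startswith(" ")),
        # Identifier casing: camelCase versus snake_case occurrences
        "camelCase_ids": len(re.findall(r"\b[a-z]+[A-Z]\w*\b", source)),
        "snake_case_ids": len(re.findall(r"\b[a-z]+_\w+\b", source)),
        # Average length of non-blank lines
        "avg_line_length": sum(len(l) for l in code_lines) / max(len(code_lines), 1),
    }
```

A fingerprint would then be a recurring pattern across many such vectors drawn from one author's files, rather than any single measurement.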

This can be broken into a number of smaller objectives:

● To research any previous work in the field of digital fingerprinting, systems for establishing authorship and related anti-plagiarism work.

● To determine why existing systems and past efforts are insufficient.

● To catalogue a full list of known design and stylistic choices a programmer can make in the creation of source code in a particular language, which do not affect the functionality of the end program.

● To develop a system which recognises sufficient numbers of these to establish a regular pattern (“digital fingerprint”) common to every file from a given author written in a particular programming language.

● To be able to use the system to precisely support or refute an author's claim to a given file of source code.

● To compare, where data is available, the success rate of the system against other systems designed for similar purposes.

1.3 Minimum Requirements of Project

The minimum requirements for the successful completion of the project are:

1. Produce a comprehensive list of individual decisions a programmer may make in writing source code that do not affect the overall functionality;

2. Create a system which identifies a sufficient number of these decisions within a given set of source code files by a particular author to be able to stereotype the author's style, distinct from that of others;

3. Create a system which can compare a source code file asserted to belong to a given author with the author's established previous work, and accurately predict whether the source code file was written by the author in question.
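To sketch how the third requirement might operate, suppose each file is reduced to a numeric feature vector and an author's established work is averaged into a profile. A simple comparison (purely illustrative; the real acceptance test and threshold will be determined during design and evaluation) would then accept or reject the claimed file:

```python
def verify_author(profile: dict, sample: dict, threshold: float = 0.25) -> bool:
    """Accept the sample as the profiled author's work if its features
    deviate from the profile by less than `threshold` on average.
    The relative-deviation measure and the 0.25 cut-off are illustrative
    placeholders, not the project's final design."""
    deviations = []
    for feature, expected in profile.items():
        observed = sample.get(feature, 0.0)
        scale = max(abs(expected), 1.0)  # guard against division by zero
        deviations.append(abs(observed - expected) / scale)
    return sum(deviations) / len(deviations) < threshold
```

In practice the threshold would need calibrating against files of known authorship, so that rates of false acceptance and false rejection can be measured.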

1.4 Potential Extensions to Project

There are a variety of potential extensions to the minimum requirements which could be considered. Some or all of these may potentially be incorporated if the project successfully delivers on its minimum requirements.

● The ability to read and establish authorship in multiple programming languages

● The ability to spot programming patterns carried across different languages by the same programmer

● A scripting/plug-in system such that support for additional languages or tests can be easily added

● A graphical user interface

● A submission system for authors to submit their own code for checking

● The ability to deactivate tests for particular idiosyncrasies that are overruled by corporate style guidelines

● The ability to detect if only certain sections of a programmer's code deviate from his/her digital fingerprint

1.5 Relevance to Degree Programme

This project is the specialist project for a BSc Computing with Industry, and reflects the following objectives for that programme as outlined in Fores (2007):

● Demonstrates the potential of technological developments in hardware and software for supporting and improving the efficiency and effectiveness of systems.

● Utilises good engineering practice in the design and construction of software and information systems.

● Demonstrates an understanding of the mathematical, theoretical and conceptual foundations of information and software engineering.

● Demonstrates analytic, design, engineering and evaluation skills necessary to specify, develop, test and implement information and software systems.

● Shows high competency in computer programming.

The following course modules are deemed in particular to be essential prerequisites for the work being undertaken:

● COMP2680 – Application of practical software development

● COMP3470 – An understanding of the people issues involved in the implementation of such a system

● COMP3410 – Technologies for knowledge management will undoubtedly come into play in terms of data mining from the original source

1.6 Deliverables

● A list detailing the style and design choices a programmer can make which do not affect the overall functionality of the program, some of which can be used to create a “digital fingerprint”

● A computer program that can infer a programmer's style given a set of source code files written by that programmer, and either verify or refute the authorship of the programmer when given an additional file.

1.7 Changes from Mid-Project Report

The project as proposed in the mid-project report (having replaced a previously unworkable project idea) was deemed ultimately not workable, mainly due to the lack of a direct problem owner and the great number of other papers already produced on the suggested topic. Whilst some of the background reading and ideas regarding plagiarism from the mid-project report are still applicable, the project is radically different in aim, objectives and scheduling from what was presented there. The final project idea was confirmed on 13/02/08, and the project schedule thus reflects both a late start and a much-compressed research period, mitigated by the fact that much of the background reading and research from the previous project idea was still applicable.


2. Project Management

2.1 Initial Project Schedule

A number of factors must be taken into consideration when drafting the initial project schedule:

● Given that the start date is February 13th (see section 1.7) and the due date is March 23rd, a highly compressed project schedule must be put together. It has to be assumed that weekends will be included to allow sufficient time for tasks to be completed.

● Some items are still tentative: the progress meeting has yet to be scheduled, and given the delayed start to the project an extension to the final deadline may ultimately be appropriate.

● The project methodology has yet to be finalised; hence design and implementation tasks and their projected lengths may be subject to change. This final report will however be written in parts as each stage progresses, reducing the overall workload after the software has been produced.

Based on the above, an initial project schedule Gantt chart has been put together and can be seen in Appendix B; an HTML-browsable version of the schedule is available on the accompanying project CD. A brief summary of the primary milestones follows:

Date           Milestone
February 13th  Submit revised aim and minimum requirements
February 17th  Project methodology selection made
February 20th  Programming languages (to be read, and to be used in system construction) are selected
February 29th  Completion of research; design commences
March 6th      Completion of design; implementation commences
March 14th     Progress meeting
March 27th     Completion of implementation; evaluation and testing commences
April 5th      Completion of evaluation and testing; possible extensions commence
April 12th     Extensions completed; final report collation and writing commences
April 23rd     Report produced and submitted
April 25th     Electronic copy of report submitted


2.2 Researching Project Methodologies

Any software project methodology contains two key elements: a development process, and a modelling language to articulate the software being produced [2]. Before the project can commence, an appropriate development process must be selected and tailored to this project's specific needs, along with the modelling language most relevant and useful to the task at hand.

2.2.1 – Researching a development process

The key criteria in choosing a development process for this project are that it is known to be fast (given the compressed schedule, a major consideration) and suitable for software engineering (given the main focus on the software deliverable). It will need to encompass the whole project lifecycle (on this basis, the Structured Systems Analysis and Design Method (SSADM) was discounted as a framework, as it only covers system analysis and design), and take into account that there is only one person running the project (on this basis, XP (Extreme Programming) was discounted as it is typically team- and pair-programming focussed).

Four methodologies met these criteria and were sufficiently well-documented, with numerous examples of successful usage. The following sections detail the relevant development processes, and the merits of utilising each in this project.

2.2.1.1 – The Waterfall model

The Waterfall model was the first known software life-cycle model, expressly articulated in 1970 by Dr Winston Royce – who in fact presented it as a flawed method for development – but its simple, linear and modular nature ensured it gained a huge following in systems development (Royce, 1970). In Royce's original model, seven basic stages were proposed:

i. Requirements specification: the capturing and statement of system requirements.

ii. Design: including modelling, interface design etc.

iii. Construction: of the proposed system in full.

iv. Integration: of the components of the system.

v. Testing and debugging: validating the system against the requirements specification.

vi. Installation: of the proposed system into the target environment.

vii. Maintenance.

At each stage a final, discrete product is created which should be sufficiently articulated that a different team could start the next stage with no prior knowledge. A diagram showing this can be seen in Appendix C. The emphasis is placed on tying up each stage concretely and producing large quantities of documentation charting the progress.

The model has been criticised for its failure to account for changing needs and has been supplemented by a variety of modified waterfall models that advocate part iteration or slightly different stages. These offshoots include a further-refined waterfall from Dr Royce himself; the Sashimi Model (DeGrace, 1990) and the Iterfall Model; and indirectly the Spiral Model (Boehm, 1988). Most of these simply adapt the existing model with some added iteration or overlap.

Advantages: The waterfall's simplicity and linear nature make for a very straightforward run through the project, and as the most widely-used methodology to date there exist innumerable examples of its use and documentation. The principle of a single run through the development is appealing in terms of speed and reduced complication – if it is done correctly.

Disadvantages: The high focus on discrete standalone products at the end of each stage may be far too bureaucratic and unnecessary for a project of this size. Whilst there is always the temptation to make a single, ostensibly infallible run through the lifecycle, even the most experienced developers will note that in this model there are nearly always flaws, changes in requirements and extra considerations which occur late in development.

2.2.1.2 – Rapid Application Development (RAD) methodology (including DSDM)

Rapid Application Development (RAD) was developed at IBM in the 1980s as an alternative to the traditional waterfall model (Martin, 1991). First formally published by Dr James Martin in 1991, this methodology seeks to vastly accelerate the speed and reduce the costs of a project, with a much higher emphasis on the software deliverables rather than extensive paperwork. The mantra is “there is nothing so wasteful as doing with great efficiency that which doesn’t need to be done” – instead of spending a long time documenting and implementing features which may be wrong or unnecessary (or which have changed since the specification), RAD attempts to capture and verify requirements as early as possible.

Sequentially more-functional prototypes are produced iteratively, on the 80:20 principle that 80% of the functionality can often be produced in 20% of the development time (Arthur, 1997). As a prototype meeting minimum requirements is produced early, the likelihood of the end-user changing the requirements by the time implementation starts is reduced. Changed requirements are often a symptom of the protracted design and development phases in the waterfall model, where a working product is not seen by the end-user; if the resulting implementation then does not meet requirements, a large amount of time has been wasted. RAD has been criticised as “Rough And Dirty” by some industry users, as the trade-off of a high-speed schedule has sometimes seen lower-quality end products (Howard, 2002); but it has notably produced large-scale applications in periods of 60 to 90 days. A diagram showing RAD strategy can be seen in Appendix C.

RAD lacked a formal specification at the time of its creation, so an offshoot named Dynamic Systems Development Method (DSDM) was developed by a non-profit consortium of organisations with a vested interest in speedy project delivery, often under tight financial and time constraints (Stapleton, 1999). DSDM consists of three phases: Pre-Project (projects identified, funding sourced and commitment made), Project Life Cycle (the development itself) and Post-Project (ongoing maintenance and improvement of the project, often returning to phase two). The first and third phases are fairly self-explanatory, but the meat of the method is contained in the middle phase, which has five stages:

i. Feasibility study: is the project likely to meet the end-user needs? Is DSDM suitable to the task? What risks are inherent?

ii. Business study: requirements of the system are discussed and produced, usually under MoSCoW principles, and a suitable timeboxed project plan is produced.

iii. Functional model iteration: priority functionalities are used to generate a functional prototype and model for review. Ongoing testing occurs.

iv. Design and build iteration: the functional model is combined with feedback and non-functional requirements into a working design prototype; ongoing testing continues.

v. Implementation: end-users are given the full design prototype and documentation; if successful it forms the delivered system; if not, the process loops back to previous stages.

Advantages: RAD in the form of DSDM would greatly benefit the project, given the tight schedule, by accelerating development and ensuring that mostly-functional software is produced early, so that in the event of a delay there is at least a prototype with 80% of the functionality.

Disadvantages: The problem owner for this project (CWD) desires minimal input into the system, seeing the primary goal as largely self-explanatory; as a technical audience, the problem owner is more concerned with quickly seeing a system that works than with contributing input to it. Many of the early stages may therefore be unnecessary.

2.2.1.3 – The Spiral model

Despite being officially published prior to RAD, the Spiral model is essentially a combination of the waterfall method with prototyping methods (precursors to RAD) (Boehm, 1988). Created in 1988 by Dr Barry Boehm, the Spiral model was the first model to advocate the importance of an iterative approach to development. Four stages are cyclically repeated in the model:

i. Determine objectives: both initially, and following each cycle of the model.

ii. Identify and resolve risks: problems in achieving the objectives are mitigated, and a working prototype (of either documentation or software) is created.

iii. Development and test: the prototype is expanded into a working system and thoroughly tested. If there is further work to be done, the cycle moves to stage four (or else finishes).

iv. Plan the next iteration.

This approach produces regular prototypes and documentation, and constantly feeds back risks and results into the process. The incremental development catches risks at an earlier stage. Most notably, it tends to be used in the development of very large and complicated systems.

Advantages: The core of the project is produced early, as in RAD, again helping if the project is cut short; the constant focus on risks and problems is comprehensive and thorough, and this in turn helps estimates of scheduling, which is vital to the project.

Disadvantages: The frequent use for larger projects is not accidental; the constant emphasis on prototyping everything (including the requirements and development plan) is undoubtedly paperwork overkill for a project of this size and scope.

2.2.1.4 – Unified Processes

The Unified Process is best described as a repeating series of mini-waterfalls, with the components of each waterfall growing or shrinking in importance with each iteration. The first incarnation of this process put forward was the Rational Unified Process (RUP), created by Rational Software (Kruchten, 2003). It was released under its current name in 1998 but its precursors were developed over the preceding decade (Jacobson, 1999). Aside from RUP, common offshoots include the Agile Unified Process (AUP) and the Open Unified Process (OpenUP) amongst many others.

Each model comprises a succession of waterfalls with identical stages, e.g.:

RUP: Business Modelling, Requirements, Analysis and Design, Implementation, Test, Deployment

AUP: Model, Implementation, Test, Deployment, Configuration Management, Project Management, Environment

There are four phases (Inception, Elaboration, Construction, Transition) which may each contain one or more waterfalls, and which lend different emphasis to each of the stages; full charts showing the relevant stages and cycles can be found in Appendix C. As UML sprang from the same background as RUP, the original Unified Process, most forms of the process are associated with UML and adhere closely to use cases and modelling, much of which is done using UML.

Advantages: A widely popular set of processes, the various Unified Processes harness the simplicity and power of use cases to keep development focussed; various incarnations also place great emphasis on agility, which could keep the project short. A clear mantra on the relative lengths of each stage and on iteratively capturing information should help avoid problems with the final product and ensure stages do not overrun.

Disadvantages: Six or seven stages across four phases makes for between 24 and 28 discrete stage instances to keep track of. The heavily use case-driven nature may be unnecessary for a project tackling this particular challenge of software engineering, which has far less focus on a user experience.

2.2.2 – Researching a modelling language

The modelling language is the format in which the concept and design of the system will be expressed. An appropriate modelling language will convey the pertinent details in a simple, easily-understandable format; be appropriate to the discipline of software engineering; and, in keeping with the necessary speed of the project as outlined in the project schedule (Appendix B), be easy to grasp and use as well as to portray the stages of the project.

2.2.2.1 – Fundamental Modelling Concepts

Fundamental Modelling Concepts (FMC) is a specification and doctrine of modelling oriented towards more software-intensive systems; accordingly, a significant number of symbols and entities are used to represent the low-level fundamentals of a software system such as read/write access, processes, communication channels etc. It can be abstracted to show higher- or lower-level views, but is clearly geared towards lower-level representation (Knoepfel, 2005).

FMC comprises three types of diagram:

1. Block Diagrams: depict the composition structure of static systems, showing the connections, interactions and relationships between the agents in a system, whether human, hardware or software servers, or processes.

2. Petri Nets: show dynamic systems in transition, whether this be changes in running processes or values passed through a program at low levels.

3. Entity Relationships: show the composition of attributes into entities within the system and the relationships between each.

Advantages: The extremely software-oriented nature of these diagrams lends itself to a very detailed description of an overall system, presumably much reducing implementation time by being explicit at very low levels.

Disadvantages: The diagrams seem complex and would undoubtedly present a steep learning curve. This approach may be incompatible with the more agile and iterative development processes listed in the previous section, as they tend to emphasise high-level detail first which is then fleshed out; FMC seems to encourage mapping systems out extremely explicitly at a lower level first, which may be too time-consuming an approach. There also appears to be little support for areas of system analysis (such as requirements capture).

2.2.2.2 – Unified Modelling Language

The Unified Modelling Language (UML) was first published in 1996 by employees of Rational Software. It was an amalgamation of previous modelling strategies, and since release 1.0 has been specified by the not-for-profit Object Management Group, a consortium of interested parties. Relatively recently it reached revision 2.0, and it is widely used across a number of industries due to its great flexibility. Its history is closely linked to the Unified Process (Quatrani, 2005).

UML specifies thirteen different types of diagram, which can be loosely grouped into three classes:

1. Structure diagrams (class, component, composite structure, deployment, object, package): items that must be present within the system and the relationships between them – for example, classes with their attributes/methods and inheritance, or the components of software and hardware interacting.

2. Behaviour diagrams (use case, activity, state machine): what must actually happen in the overall system, from a user perspective, a task-oriented perspective or a process perspective.

3. Interaction diagrams (communication, interaction overview, sequence, timing): these show the sequence and flow of information, communication, message passing and control through the system at various levels.

Advantages: A highly-adopted industry standard, UML is widely recognised and understood. Many of the diagrams are extremely simple and quick to produce yet convey useful information succinctly, such as the famous Use Case diagrams. The flexibility of the UML schema makes it easily adaptable to the project, and consistent symbols throughout all the diagrams aid quick production.

Disadvantages: Care must be taken to ensure that only those diagrams which are needed are used; there is a very wide field to choose from. Upon investigation, UML is not quite as well suited as FMC to producing detailed descriptions of low-level software processes – it is far less succinct, though that same nature makes it more easily comprehensible.

2.2.2.3 – Systems Modelling Language

A highly-specialised and adapted form of UML, the Systems Modelling Language (SysML) aims to be more specifically tailored to systems engineering than the more generic UML [14]. It has a much reduced number of diagrams (seven to UML's thirteen). Upon further examination of the language, however, it appears to remove most of the software-centric features of UML (which its proponents criticise as having a “software bias”) whilst using the same notation and introducing non-software-related diagrams such as the Parametric diagram, which is geared towards examining the constraints of structural elements in a physical system.

Whilst the methodology purports to be applicable to all manner of systems engineering purposes, it seems immediately obvious that an approach with a notation learning curve similar to UML's but with fewer software-oriented features (and more redundant physical-systems diagrams) is inferior to UML for this purpose. The other major diagram addition (a requirements diagram) overlaps with the Use Case diagram but seems designed for more detailed cases; this in itself would seem to defeat the objective of having simple, easily-constructed diagrams in UML. This modelling language can therefore be appropriately disregarded as forming, in terms of this project, a less useful version of UML.

2.3 Chosen Project Methodology

The selected project methodology will follow a modified DSDM methodology (in the RAD tradition), using UML to model the system.

The Waterfall model was described as flawed by its own creator upon its inception; and whilst there is a temptation to breeze through the stages sequentially without revisiting any of them for the sake of speed, this has almost universally proved fatal to even small projects. The excessive documentation and defined specification produced at the end of each stage negate the usefulness of even the more iterative incarnations of this methodology.

The Spiral model, as noted, again seems geared towards constant reassurance and mitigation of the numerous risks found in very large-scale projects, which this is not. Again, it seems to rely on extensive production of documentation, verification of that documentation and consultation on it with the end-users; the problem owner in this case desires minimal input, and only that documentation which is fundamentally necessary (i.e. for this report) must therefore be produced.

The Unified Process family seems to cut down on much of this; however, its very user-driven nature, combined with the administrative complexity of the process (multiple phases of multiple stages), makes it less appealing than the final choice.

SysML was discarded early for the reasons stated: being fundamentally similar to UML but less software-focussed and more physical-system/business-focussed, it was always an inferior choice to UML in this regard. Between UML and FMC, UML offered the greatest and most flexible range of diagrams to cover this project; FMC has little support for dealing with requirements and is geared towards a level of low-level system detail which could not feasibly be produced in the allotted timespan for the project.

UML's flexibility and simplicity makes it the perfect choice- quick to produce and

master, easy to understand. While the first and final phases of DSDM (pre-project

and post-project) as well as the first stage of the middle phase (the “feasibility

study”) seem redundant to this particular project, the mentality of producing a

prototype with the majority of essential functionality early in the project makes for

an excellent methodological choice. If the project overruns the eye-wateringly tight

schedule, development can be cut short with hopefully a still mostly-usable and

useful product already produced. The consistent prototyping involved will also

14

expose programming difficulties and required modifications that will undoubtedly

be thrown up by a programming challenge of this nature. Most languages

(programming or human) have numerous exceptions, irregularities and esoteric

terminology and syntax alongside their more mainstream uses; given the number of

problems thus likely to arise in testing, an approach which provides early

prototypes with frequent revision is the most rational choice. The project schedule

has thus been updated to better reflect this adapted DSDM methodology.


3. Background Research

3.1 Searching Relevant Literature

To better understand the nature of the problem, examine the merits of any

previously-attempted solutions and devise a method of solving the problem, a

literature search will be conducted.

The topics of search will be authorship as a concept, and the methods of

establishing it both manually and with the aid of computer technology; the concept of

a unique “fingerprint” in text or source code; the history of attempts to produce

automated tools to prove or disprove authorship; and what programming language

and tools would best be suited to producing a tool that could determine authorship.

In addition, a comparison of programming languages in use by the project owner

will be made to determine which would be the most suitable to analyse in this

project; the chosen language will then be researched to determine what individual

choices programmers utilising it can make which might uniquely identify their

“style”.

Sources to search will include the University of Leeds Library; electronic resources

from other libraries and repositories; previous dissertation or final year project

work on similar areas of research; holdings of the British Library via their document

supply centre in the case of rare papers or extracts; and use of public Internet

resources and information for providing general overviews, definitions and to

supplement other more established sources where appropriate.

Main keywords: stylometrics, authorship, authorship attribution, digital

fingerprinting, textual fingerprinting, identifying authors.

3.2 Defining Authorship

Authorship, as defined by the Chambers 21st Century Dictionary, is “the origin or

originator of a particular piece of writing” (Harrap, 2003). This is by no means a

new field of study – work on determining the author of a piece of unknown origin has been carried out for at least as long as historians and religious academics have attempted to attribute portions of the Bible (Baayen et al., 2002). An author by their nature lends credence to their work; many authors have attempted


to remain anonymous to avoid personal criticism and shunning (White, 1994),

whereas in modern academia, the citing of recognised authors and works forms the

basis for academic acceptance of a work.

Applications for the process are wide; everything from counter-terrorism (Chen,

2005) to attribution of ancient Biblical texts (Person, 1997), which can be of huge

cultural and religious importance. More recently the process of determining an

author has been referred to as “stylometrics” or “authorship attribution studies”

(Somers, 2003), and forms a key branch in the field of linguistics.

There is good reason to examine the historical development and use of authorship

attribution – as a science that has been refined over hundreds of years, the

techniques it employs are mature and well-established, due to the importance of

accuracy when authorship is used in legal cases or criminal detection. Whilst this

project deals with more formulaic works of source code rather than freely-written

prose, there are undoubtedly techniques used on prose which can be readily

adapted to work on source code.

3.2.1 Learning from Modern Principles of Authorship Attribution

The most basic principle of authorship attribution is that each author

unconsciously develops a style; habitual ways of writing that are consistent through

different works and can potentially be measured. Schommer and Uhde refer to a

“text fingerprint” or “linguistic fingerprint” left by each and every author, even

when that author consciously attempts to modify their style (Schommer, 2008).

Laan observes that the ideas that style can both change and remain the same are

not necessarily conflicting; and that style may be said to have two components –

one which remains consistent, and one which varies over time and with the

nature of the work (Laan, 1995). Filtering out the differences is the primary work of

authorship identification.

Over the course of time the process of author attribution has moved from the more

subjective field of literary criticism and commentary to the more objective fields of

natural language processing and statistical testing. The first landmark example of

this was Mosteller and Wallace's paper analysing the Federalist Papers,

documents written at the time of the creation of the American Constitution and


supporting its ratification (Mosteller, 1963). This approach used the fairly simple

measure of statistical testing for the frequency of a small set of function words

such as “the”, “of”, “upon”, “about” etc. The results differentiated between authors

significantly – papers known to be written by Founding Father James Madison used

the word “upon” an average of 0.23 times per 1000 words; but by sharp contrast,

Founding Father Alexander Hamilton averaged 3.24 uses per 1000 words. By

showing statistical patterns of word usage unique to each man, Mosteller and

Wallace could infer the authorship of works whose author was not conclusively

known. This idea of counting frequently used function words translates readily to programming – decisions regarding preferred control or loop structures could, for example, reveal a preference for one particular type (even though many can be used interchangeably), analogous to the word frequencies of Mosteller and Wallace.
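As a minimal illustration of this analogy, the Python sketch below counts occurrences of selected Java keywords per 1,000 tokens. The tokeniser is a deliberately naive regular expression rather than a real Java lexer, and the keyword set is an illustrative assumption, not an exhaustive list:

```python
import re
from collections import Counter

def keyword_rates(source, keywords=("for", "while", "do", "switch", "if")):
    """Occurrences of selected Java keywords per 1,000 word-like tokens.

    Naive tokeniser: word-like tokens only; a real tool would use a
    proper Java lexer."""
    tokens = re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", source)
    counts = Counter(t for t in tokens if t in keywords)
    total = max(len(tokens), 1)
    return {k: 1000 * counts[k] / total for k in keywords}

sample = "for (int i = 0; i < n; i++) { while (busy) { wait(); } }"
rates = keyword_rates(sample)
```

On the sample above, 'for' and 'while' each account for one of the nine word-like tokens, so both score roughly 111 per 1,000; comparing such rates across files would mirror the function-word comparison.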

Subsequent approaches in identifying the authorship of the Federalist Papers as

well as many other works built on this success by analysing multiple metrics of

style (Hoover, 2003). Frequently occurring metrics in works include the extent of

vocabulary (or number of unique words in the text) (Madigan, 2005); the

aforementioned frequency of words or “lexical repetition” (Stamatatos, 2001);

frequency of categories of words (e.g. frequency of adjectives vs frequency of

nouns, for example) (Argamon, 2003); analysis in choices of syntax and sentence

construction, including grammatical errors (Madigan, 2005); “collocations” of

words, that is, groups of words whose members indicate the nearby presence of

other group members when found in text (Smadja, 1989); chosen lengths of words,

clauses, sentences and paragraphs; and use of more esoteric and rare words seldom

found elsewhere.

Some of these techniques when properly examined suggest analogous statistical

tests that could be performed on source code. Stamatatos et. al (Stamatatos, 2001)

note the simplest measure of vocabulary, V/N – where V is the size of the

vocabulary, and N is the number of total words. Given that the Longman Dictionary

of Contemporary English can define the majority of everyday English using only

2000 words [29], it is clear that the vocabulary of any language often features a

degree of redundancy. The presumed unconscious preferences and habits of the

author therefore determine this to an extent. As aforementioned, choices in control


or loop structure, data structure or numbers of unique variable names could be an

analogous programming concept.
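A code analogue of the V/N measure can be sketched in a few lines of Python; treating every non-keyword word token as part of the "vocabulary" is a simplifying assumption, and the keyword list below is deliberately abbreviated:

```python
import re

# Abbreviated keyword list for illustration; a real tool would use the full set.
JAVA_KEYWORDS = {
    "abstract", "boolean", "break", "case", "catch", "class", "do", "double",
    "else", "final", "float", "for", "if", "import", "int", "long", "new",
    "private", "public", "return", "static", "switch", "this", "void", "while",
}

def vocabulary_richness(source):
    """V/N: unique non-keyword word tokens over their total occurrences."""
    words = re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", source)
    idents = [w for w in words if w not in JAVA_KEYWORDS]
    return len(set(idents)) / len(idents) if idents else 0.0
```

A file that reuses a handful of names heavily scores low; one full of unique identifiers approaches 1.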

Argamon et. al (Argamon, 2003) note that word class frequencies (e.g. pronouns vs.

noun specifiers) successfully showed a distinction between male and female

authors. In programming, authors often have a choice in determining the numbers

of methods, variables or parameters used amongst others; better programmers can

often express things more succinctly, and personal preference may affect the

decision of whether to pass a data structure as a single parameter, or else declare

separate parameters for each item of data. Contrastingly however, Smadja's concept

of word collocation (Smadja, 1989) may not be as germane to the domain of

computer programming – most languages strictly prescribe the components of a

programming statement. There may still however be instances – some programmers

will declare variables up front but only initialise when needed, whereas others will

declare and initialise simultaneously. In this case, a limited amount of collocation

might be observed in that early assignment statements in one style incorporate a

value; whereas in another, this is absent.
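That limited form of collocation – whether a declaration carries an initialising value – could be estimated with a rough Python sketch; the regular expression here covers only a few primitive types and would miss many real declarations, so it is a sketch of the statistic rather than a robust detector:

```python
import re

# Matches "int x", "double y =", etc.; a handful of primitive types only.
DECLARATION = re.compile(r"\b(?:int|long|float|double|boolean|char)\s+\w+\s*(=)?")

def initialised_fraction(source):
    """Fraction of (crudely matched) declarations that assign at declaration."""
    matches = DECLARATION.findall(source)
    if not matches:
        return 0.0
    return sum(1 for eq in matches if eq) / len(matches)
```

A programmer who declares up front and initialises later would score near zero; one who declares and initialises simultaneously would score near one.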

Word, sentence, clause and paragraph lengths have featured almost as standard in

most of the papers reviewed so far, with some success; and these have

counterparts in source code (length of identifiers, statements, statement blocks).

Testing for occurrences of esoteric and idiosyncratic terms may also have a direct

counterpart in software; indeed one experienced Java programmer contacted for

source code later in the project observed that “older generations of programmers

tend to use the dollar sign in variables and underscores” more frequently, which are

considered more esoteric in modern programming languages. Less regularly-used

features of programming languages may therefore act as very clear signposts in

testing for authorship.
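A test for that particular signpost – dollar signs and underscores in identifiers – is straightforward to sketch; the notion that these characters mark an "older" style is an assumption carried over from the quoted observation, not an established fact:

```python
import re

def marked_identifier_fraction(source):
    """Fraction of word-like tokens containing '$' or '_', the characters
    the quoted programmer associated with older generations of Java style."""
    idents = re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", source)
    if not idents:
        return 0.0
    return sum(1 for i in idents if "$" in i or "_" in i) / len(idents)
```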

One area that may contrastingly have little application to this project is the

identification of grammatical errors. Whilst grammatical errors can be idiosyncratic markers of style in themselves, it must be borne in mind that this

project will examine completed source code files, which therefore would be unlikely

to contain idiosyncratic errors that would have been removed during the debugging

process. In this regard, source code is much stricter in prescribing a set syntax than


the freer choices available in written prose; there may however exist some limited

choices in syntax which could be measured.

In conclusion, there are a number of ideas that have formed the backbone of

authorship identification in written text that can potentially be translated to the

identification of authorship in source code. The next step is to find any notable

previous attempts at identifying the authorship of source code itself.

3.3 Verifying Authorship with Digital Fingerprinting

A notable piece of work on establishing the authorship of source code by Sallis et al. (Sallis, 1997) made reference to an emerging scientific field of “software forensics”.

It was proposed that with an increasing rise in computer security threats, there

would be an increasing need for establishing an author's identity, particularly in

criminal cases. This turned out to be particularly prescient – in the intervening

years, examples have abounded where the work of virus-writers

has been tested for authorship (Longstaff, 1993), or where authorship is disputed in

cases of alleged plagiarism (Vitoria, 2006).

The paper drew on previous studies of anti-plagiarism which had attempted to categorise programming style as a means of positively identifying a

work as belonging to a known author other than the claimed creator, and studies

into identifying the authorship of malicious code. Notably, however, it failed to

make any reference to the well-established tradition of detecting authorship in text

and mentioned an authorship detection tool in development, though did not showcase it. A later paper by MacDonell et al. (MacDonell, 1997) showed a

screenshot of the tool and identified it as working, but provided no data to

corroborate this, and the tool does not appear to have gained any widespread

acceptance and usage.

By contrast, Spafford and Weeber (Spafford, 1993) in their piece, although making a

fairly specious argument about handwriting styles (which seem to have little

relevance to the subject matter), did acknowledge and discuss the use of written

authorship techniques and noted that many of the statistical tests and feature-

spotting could be applied to source code. They noted the following characteristics


of source code which could vary in style:

● Choice of language

● Code formatting

● Special features that might indicate a choice of compiler

● Commenting style

● Variable naming conventions

● Spelling and grammar

● Choices of language features

● Preference in scope of variables (global, local etc)

● Errors (reproduction of similar errors across work)

This work well encapsulated the basis for authorship testing and possible metrics,

whilst not producing a direct tool. One of the most notable first attempts at using

such metrics (albeit in what appears to be a fairly casual observational fashion

rather than by empirical research) was in the work of Longstaff and Schultz

(Longstaff, 1993). They analysed the work of two related computer worms: WANK

and OILZ, which propagated in 1989 and were the first examples of worms

spreading with a political agenda. In particular, they noted that, judging from variations in style, there appeared to be three different authors:

1. Author A, who wrote the parts of the code used in proliferation.

Characterised by descriptive and lower-case variable names, use of variables

and GOTO sequences for flow control, convoluted execution flow and

evidence of a deep understanding of networking concepts.

2. Author B, who wrote the malicious parts of the code. Characterised by

frequent use of profanity, capitalisation in variable names, a straightforward

execution flow, and frequent misuse of a language routine.

3. Author C, who assembled the code. Characterised by mixed-use of upper

and lower case characters in method and variable names, single character

variable names, similarities to BASIC programming and attempts to unite

code from the previous two authors.

Such wide variations in style, errors and flow seem almost certainly to be the work of

different individuals and highlight how radically different code can look; whilst the

individuals responsible were never positively identified, the paper makes a very

convincing argument for the technique used.


Whilst any number of attempts have been made to develop anti-plagiarism

software, there appears to have been little progress made in the way of positive

authorship confirmation. It is important here to draw a distinction between the two

– detecting plagiarism frequently involves attempting to prove that an author did

not create a work asserted to be theirs because the work bears strong resemblance

to other existing work. By contrast, authorship identification is concerned with

establishing that a piece of work fits an individual's established style and is indeed

theirs.

Previous efforts in anti-plagiarism have been mainly intra-corpal (comparing items

within a submitted set), or extra-corpal (comparing items with a large external catalogue) (Helmore, 2005). Both approaches are limited in that no database can ever hold all other works in existence; and indeed the work itself may be

entirely novel, but not developed by the author in question. Detecting plagiarism

has its place where wholesale reproduction is concerned; but has considerable

limitations. The ability to actively associate code produced with the author by

testing for a defined style, however, is limited only by the accuracy of the test – increased refinements should therefore be able to prove authorship or else disprove

it, without needing an incomplete database of other works.

Whilst detection of authorship using statistical tests of style has been termed “stylometrics”

in the linguistic domain, in the domain of source code this project will use the term

“digital fingerprint” to refer to the unique style detectable in code, and

subsequently used to prove or disprove authorship. The research material covered

here, together with the reasons cited, makes a clear case for the need for a

digital fingerprinting tool.

3.4 Choosing a Programming Language to Analyse

The languages most frequently used by the problem owner (CWD) referenced in

the introduction to this report were PHP, XHTML, CSS, Java, JavaScript, Python, Perl,

XML and Ruby. In choosing which to analyse, considerations should be made as to

the degree of flexibility of the language (i.e. how many stylistic choices can be

made); the ease of reading it and parsing the contents; and how well-supported it is

with documentation to facilitate the understanding of its workings necessary for this


project.

3.4.1 Markup languages (CSS, XHTML, XML)

Markup languages specify how to structure, format and display text. CWD utilises a

large number of XHTML samples, and ample examples of this technology are freely available on the web. The advantages of analysing a markup language

are:

● Whitespace-agnostic, so indents and whitespace are programmer decisions

● Large numbers of samples are easily found on the Internet

● Deprecated tags make for potential esoteric styles that can be analysed.

There appear to be considerable disadvantages however:

● Many files are automatically generated via WYSIWYG tools, removing

programmer variance

● A very limited number of tags, and none of the usual programming constructs (e.g. variables, methods) to examine as per previous work in this field.

● CSS is far too basic; XHTML is too prescribed; and XML schemas vary too

much.

3.4.2 Scripting languages (PHP, JavaScript, Python, Perl, Ruby)

These languages are far more representative of normal programming constructs

than markup languages. Scripting languages by and large (although some of the

above can also run as executable programs) do not execute independently but work

to control or modify other applications. The advantages are:

● Again owing to web technologies, samples of many of these are freely found

on the Internet

● The syntax in some cases is slightly simpler than more established

programming languages designed for independent execution, making for an

easier learning curve when parsing them.

Conversely however:


● The lack of standalone execution means that other programming languages

or accompanying applications may be needed (e.g. HTML in the case of PHP,

Ruby and JavaScript)

● The fact that most of them have object-oriented components as an optional

paradigm will add confusion and possibly duplicate the work (the system

would have to treat files written in different paradigms separately).

● Little formal documentation is available on some (such as Ruby); others, such as Perl, have considerably less documentation than more modern programming languages, having fallen out of favour.

3.4.3 Standalone Programming Language (Java)

Java is the only “standalone” language used by CWD. Developed by Sun

Microsystems, it is ranked the most used programming language in the world

according to the TIOBE Programming Community Index [36]. Advantages are:

● Enforcing of a single paradigm in the code will make analysis much simpler

● Very extensive official and unofficial documentation available

● The prevalence of Java will make a software tool that detects authorship in

Java automatically more useful than that of another language

Disadvantages:

● The enforced structures (e.g. every file must declare a class, and file names must match the class name) put some limits on creative freedom and hence style

● Strict Java runtime and compile-time checking removes most serious bugs,

so detecting frequently recurring errors will not form part of the authorship

profiling.

In the end, Java has been chosen as the best candidate to analyse. Its prevalence, its wealth of documentation, and even its relatively verbose nature (compared to more

succinct languages such as Python and Ruby particularly) and complexity will be

useful in producing overall longer files with more constructs that will be easier to

analyse for style.


3.5 Choosing a Programming Language to Perform the Analysis

Having chosen to analyse Java source code files, a programming language in which

to write the project had to be selected. Python, Perl, Java and C all have significant

text-processing features, which would form the core of this work. The potential

extension of a GUI would be simplest to implement in Java or C if it were

standalone; in Python or Perl if it were web-based. However the instant deciding

factor was Python's minimalist approach – for example, the “Hello, world” program

can be written in 1 line, compared to C and Java's 7 [37]. The very easy-to-learn,

simple and easy-to-debug syntax of Python, combined with extensive

documentation and inbuilt text-processing capacity, made this ultimately an easy

winner. As a result, the programme will be implemented in Python.

The next step is to obtain a selection of source code files to perform the initial

analysis, and to identify the markers of style that can be found in Java. The next

chapter will address the collection of source code and the style measures that will

be analysed.


4. Data Collection and Analysis Overview

This section is mainly concerned with the collection of Java source code files

required for the project; and a summary of all the potential criteria that may be

used in the context of the Java language to differentiate programmer styles,

drawing on ideas from the previous section in the fields of textual authorship,

plagiarism detection and the limited forays made into source-code authorship –

the establishment of the “digital fingerprint”.

4.1 Individual Programmer Choices in Java

Having chosen Java to analyse, the official documentation for Java [38] was

examined to determine what variations in style were possible. The goal was

to break down the language into its main components, and see how those

components could be used or omitted, and where they were interchangeable. The

results will be used to specify the broad criteria that may be tested for in individual

Java files which could be indicative of a given programmer's style.

4.1.1 Broad categories of Java Tokens

Tokens in every Java program can be broken down into seven broad categories:

whitespace, identifiers, operators, keywords, values, symbols and comments. Each

of these has unique properties which allow for individual style.

Whitespace is almost completely at the programmer's discretion in Java. Whilst it may not break up tokens of five of the other six categories (comments being the exception), it can appear between them in any amount and in any fashion. Both of the following are legal Java:

int c =0;int h=1;int r=2;

int
c
=
0;
int h = 1;
int r = 2;

Applications to examine include the placing of whitespace at different locations in

the program; choices in indentation length; and indentation of specific tokens.
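As an illustration, one way to profile indentation is sketched below in Python; it assumes space-based indents and ignores tabs entirely, which a full tool would need to handle:

```python
from collections import Counter

def indentation_profile(source):
    """Histogram of leading-space widths per non-blank line. A strong peak
    at multiples of one value hints at the author's indent width."""
    widths = Counter()
    for line in source.splitlines():
        if line.strip():
            widths[len(line) - len(line.lstrip(" "))] += 1
    return widths

sample = "class A {\n    int x;\n    void f() {\n        x = 1;\n    }\n}\n"
profile = indentation_profile(sample)
```

On the sample, the profile peaks at widths of four and eight spaces, suggesting a four-space indent habit.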


Identifiers are the names assigned to memory spaces in which Java holds data; they

can identify primitive types (such as integers or boolean values), or complex objects

(classes, interfaces). The rules governing identifiers are:

● The identifier's name must start with a dollar sign ($), underscore (_) or an

alphabetic character (A-Z, a-z).

● The remainder of the identifier may be made up of any of the above three

items and additionally, numbers.

● The identifier may be delineated at the start and end by whitespace, symbols,

operators or comments, but not values or keywords.

● The identifier of the primary class or interface must match the name of the

file.

Applications to examine are choices in length of identifiers; choices of naming in

identifiers; and whether identifiers assigned to different items (e.g. primitive types

vs parameters vs instances) sport different styles of name.
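Such choices could be checked mechanically; the Python sketch below tallies identifiers by a few common naming conventions. The convention names and patterns are illustrative assumptions (Java itself mandates none of them), and plain single-word lower-case names fall into "other" here:

```python
import re

def naming_styles(identifiers):
    """Tally identifiers by a few assumed naming conventions."""
    styles = {"camelCase": 0, "UpperCamelCase": 0, "snake_case": 0, "other": 0}
    for name in identifiers:
        # camelCase: starts lower-case and contains at least one capital
        if re.fullmatch(r"[a-z][a-zA-Z0-9]*", name) and any(c.isupper() for c in name):
            styles["camelCase"] += 1
        elif re.fullmatch(r"[A-Z][a-zA-Z0-9]*", name):
            styles["UpperCamelCase"] += 1
        elif re.fullmatch(r"[a-z0-9]+(_[a-z0-9]+)+", name):
            styles["snake_case"] += 1
        else:
            styles["other"] += 1
    return styles
```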

Operators are a subset of symbols used for performing some kind of operation on

the data, such as mathematical or boolean operators. Many are required in different

places (assignment statements require the “=” operator; control structures require

one of the many boolean operators to check conditions), but broadly speaking they

may appear anywhere. A list of operators and the conditions in which they must

appear as well as the order in which they take precedence can be found in some

appendix.

Linking back to the authorship tests looking at the frequency of function words

such as “the”, “of”, “about” etc., it could be said that operators are as close an analogue of these connecting words as any other component. In this vein, applications could include testing the frequency of operators overall, or of certain types. It

should be noted however that in textual linguistics this was shown only to be stable

above a certain threshold – in the case of text, 1000 words or more. Since

programming is much more prescribed than free prose, this may be considerably

higher.

Keywords are numerous in Java and perform many different functions. Some notable highlights:


● Loop structures: there are four types of loop – 'for', the enhanced 'for' (for-each), 'while' and 'do while'. Each can be rewritten in the style of the others with minor

modifications, so greater preference for one or more over the others might

be reflected in style. The 'break' keyword is in itself also a choice as a

method of exit from a loop.

● Decision structures: the most prevalent of these is the “if” keyword, which

can be optionally extended with “else”, and instances of “else if”. Since only

the 'if' keyword itself is truly required for decisions, the choice of partnering

it with 'else' statements, or adding in 'else... if' is entirely at the

programmer's discretion. The 'switch' keyword is a less commonly-used

alternative to this, and again frequency may show style.

● Import: utilising the power of other code is generally managed through

import statements. The writing of these is flexible however; a programmer

may import a whole library, packages from within a library or specific classes

from within the package. Any of these would put the required classes at the

programmer's disposal, so this is another area of style.

● Class machinery: a number of keywords expose choices in object-orientation

– 'static' marks class variables and methods, the numbers of which when

compared with their instance counterparts could reveal a limited preference

(i.e. the decision to place common data and methods in a single location).

The keywords 'this' and 'super' also act as optional shortcuts for referencing

objects.

Values in themselves are less revealing, though the type of variable or structure they are held in can be telling. Java's primitive types can sometimes be used

interchangeably for certain things - indiscriminate use of 'float' and 'double'; 'long'

and 'int' for example, to represent the same numbers. Again when looking at

frequency of occurrence of tokens in authorship, it may simply be the numbers of

values held and their types that reveal the most about style; does an author prefer a

single array or multidimensional arrays, for example.

Symbols can mostly be used for determining other things – since a semi-colon denotes the end of a statement, for example, they could be used in measuring statement lengths. Parentheses can be used to make the intent of mathematical expressions clearer, so redundant parentheses are probably also indicative of a


choice.

Comments are entirely optional, making them completely programmer-determined.

Two types of comments (multi-line and single-line) can be used interchangeably

(single-line comments can be stacked on successive lines; so-called 'multi-line'

comments can be opened and terminated within a statement). The positioning,

length, relative length to the program and verbosity of these items are just a few of

the measures of style that could be applied.
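The comment-to-file ratio, for instance, might be sketched as follows; stripping block comments before scanning for single-line comments avoids double counting, though comment markers inside string literals would still fool this naive version:

```python
import re

BLOCK = re.compile(r"/\*.*?\*/", re.S)   # /* ... */, possibly spanning lines
LINE = re.compile(r"//[^\n]*")           # // to end of line

def comment_ratio(source):
    """Characters of comment text as a fraction of all characters."""
    block_chars = sum(len(m) for m in BLOCK.findall(source))
    line_chars = sum(len(m) for m in LINE.findall(BLOCK.sub("", source)))
    return (block_chars + line_chars) / max(len(source), 1)
```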

4.1.2 Table of choices

The following is the final table of choices a programmer can employ when writing a

single file of Java source code. Multi-file programs add additional complexity which

will not be tackled in the minimum requirements of this project (such as

proportional sizes of files, numbers of files per program etc) but which may be

revisited later on as an optional extension. This fulfils the first minimum

requirement set out in the project introduction.

ID | Category | Item | Choice

C1 | Whitespace | Indentation | Indentation is optional in Java; its presence and any regularly repeating pattern (indents of three spaces, four spaces etc.)

C2 | Whitespace | Symbols | Amounts of whitespace separating different types of symbol in Java are optional: "( int x = 5 ) ;" is equivalent to "(int x=5);"

C3 | Whitespace | Blank lines | Usage of blank lines between statements and statement blocks is a programmer choice

C4 | Whitespace | New lines | The decision to put certain elements on the next line (e.g. the right-hand side of an assignment, or the opening brace of a statement block)

C5 | Whitespace | Characters | Whitespace "characters" (\t, \r, \n) can be determined by the programming editor or IDE; tabs can be used for indentation rather than spaces, and new-line and carriage-return use may vary

C6 | Comments | Type | Either multi-line comments or single-line comments may be used

C7 | Comments | Frequency | Can be measured in total occurrences, or as a ratio against code

C8 | Comments | Length | Either length in lines in the case of multi-line comments, or number of characters

C9 | Comments | Position | Can be positioned at different places in a file: before a statement, after it, at the start, at the end, on a line of code etc.

C10 | File | Length | File lengths will naturally vary, but overall some programmers will produce smaller files (due to a desire for smaller files, lack of extraneous whitespace and comments, or simply efficient programming) and others generally longer files

C11 | Code | Lines | The actual lines of code in a file as a ratio against whitespace and comments

C12 | Identifiers | Length | The character length of all identifiers, or of categories of identifier (parameter, local, class, instance etc.)

C13 | Identifiers | Characters | Use of dollar signs, underscores, digits or alphabetic characters

C14 | Statements | Import | As aforementioned, programmers can choose to import whole packages, subsets, classes etc.

C15 | Statements | Length | The average length of statements

C16 | Statements | Dot notation | A user can choose to use lengthy dot notation frequently (this.that.the.other), or else assign such imported methods and variables to new, shorter names; the prevalence of dot notation usage is thus a programming choice

C17 | Keywords | Overall | The presence and choice of all keywords can be measured as proportions of the file's tokens

C18 | Symbols | Overall | The presence and choice of all symbols can be measured as proportions of the file's tokens

C19 | Keywords | Loops | The proportions of loops categorised as one of the four options is a programmer choice (as indeed is their presence; see C17)

C20 | Variables | Declaration | Proportions of variables initialised at point of declaration

C21 | Variables | Not assigned | Proportions of variables that are not assigned values at any point

C22 | Variables | Reassignments | Numbers of variable reassignments after initial use

C23 | Parameters | Number | Average numbers of parameters passed to a method

C24 | Methods | Overloading | Incidence and numbers of overloaded methods and constructors

C25 | Scope | Braces | Average number of scope levels, i.e. the highest number of containing braces for any given item; may be compared with file length

This covers all of the known fundamental choices a programmer can make. Many of

these criteria lend themselves to multiple tests – for example, C17 would


presumably incorporate tests for each and every known keyword. The level to which

these tests could be broken down is unknown – combinations and derivatives of

these tests could stretch into hundreds or even thousands, in an attempt to catch

the most esoteric of styles. For example – what if a programmer typically puts no

spaces between operators and operands, except around mathematical signs ( +, -, /,

*)? Example like this are extreme, but the sheer flexibility of Java and the choices it

permits allow for such styles. The more detailed and numerous the tests created,

the more conclusive the results would be (assuming tests with inconclusive results

are discarded from the analysis process); such a case as mentioned would

presumably, in combination with other tests, be so conclusive as to make the faith

in the authorship identification very high.

The key to creating an initial system therefore is to create those tests which the

simplest and most generic and yet manage to differentiate between most styles.

Failure to differentiate accurately will unlikely disprove the notion that authorship

can be attributed via source code; rather suggest only that the level of testing and

the specificity of the tests may need to be increased.

4.2 Collection of Data for Analysis

4.2.1 Concerns and Choices in Data Needed

Having established the choice of language, development methodology and broad

criteria to look for, it is necessary to gather the material to be analysed. When doing

this, the following must be considered:

● How many known authors need provide work whose style can be analysed

for authorship to prove the system works?

● How many source code files from each author would the system need to

“learn the style”, before it could be effective?

● What is the minimum length of a file required to be useful to stylistic

analysis?

● Are there types of source code file that are unusable for reasons other than

length?

One further issue needs resolution: in the previous section, reference was made to the fact that levels of testing could go to great depths in an attempt to establish esoteric styles, with each test searching for very specific style data on one of the programmer choices (C1 to C25 as in Table 4.1.2; the quoted example mentioned spacing around operators). The scope of this project is not great enough to examine every potential style, so a sensible number of tests that differentiate most styles is needed.

To resolve these questions, the most logical approach is to use two data sets:

1. A “training” data set, made up of as many individual source code files from

different authors as possible. This can be used to try and spot the major

differences in style, which can be used to determine which tests are the most

applicable and useful. The eclectic range of authors and styles contained

should ensure that when the program is trained on this data, problems with

item recognition, production of statistics and tokenising are all ironed out.

2. A more detailed “proof of concept” data set. This would be composed of a

much smaller number of authors, each of whom would be required to submit

multiple files. After the first set had been used to determine what tests

should be run, and that the tokenising and item recognition processes work

correctly, this set would be used to actually “learn” the style from files by

each author, identify a digital fingerprint, and then accurately attribute

authorship to a further source code file from each author.

This approach avoids the need to collect multiple files from many different authors

(which may be problematic) but ensures that there is still a small selection of styles

to identify, once the initial construction is complete and fully tested on the training

data.

Given this, data will be collected to fill the sets as follows:

Set 1 – “Training”: Unlimited requests will be made for a one-week period for

individual files from authors. These will be anonymised and no tracking of the

authorship of each file will be made. The files will be pooled into the training data

set and analysed to spot style differences that often occur. Length and content of

these, provided they are Java files, is unimportant.

Set 2 – “Proof of concept”: Having briefly analysed the data received for Set 1, a small

number of requests will be made for willing programmers to donate multiple files.


The number of files needed from each and any other criteria they must meet will

need to be determined after a working prototype has analysed the training set, and

concrete methods for style detection have been produced that inform these

minimum criteria.

4.2.2 Initial Data Collected

Requests were sent out to known colleagues, students, lecturers and professional

programmers for source code files for the Training dataset. A few publicly available

files of Java code were added into the mix; whilst these will never be used for

production purposes, care was taken to ensure that no rights were infringed. Each

file was anonymised and can be found in the initial data set folder on the

accompanying CD. A total of 72 source code files were obtained for the Training

dataset.

Clearly, the call for the proof-of-concept dataset cannot be made until the

requirements of the digital fingerprint detection process are known. These requirements will be detailed in the next section and a prototype implemented, after which the requests for the data will be made and detailed in a later section.


5. Modelling and Design

The selected methodology (a modified DSDM development process, utilising UML

2.0 notation) involves the production of prototypes iteratively and incrementally. In

modifying the DSDM, the pre- and post-project phases were removed, as was the

first stage of the remaining phase (Feasibility; which is by and large covered by

background research). A partial stage two will be followed, establishing the user needs and the prioritised requirements of the system. Following this, successive

iterations of design will be shown in this section, with successive iterations of

implementation shown in Section 6.

5.1 User Needs

In figure 5.1 below, the system's top-level user requirements are shown using a Use

Case Diagram:

The diagram clearly shows that in terms of interaction with the system, users must

be able to submit source code files, which are checked for authorship and/or used

to build up the digital fingerprint. The owner of CWD must be able to see the

results – a report confirming or questioning the authorship of each file.


Given these clear needs and taking into account the project's minimum

requirements, we can use the MoSCoW analysis (a frequent feature of DSDM) to

produce a list of requirements. MoSCoW categorises the system into four sets of

requirements – must have (definitely), should have (preferably), could have (optionally), would have (in ideal conditions):

Must Have

● The ability to input Java files

● The capacity to identify tokens within the Java file

● To be able to produce simple counts of tokens, and know their location

● To produce the minimal needed statistics on each file

● To compare these against previous, known files from that author

● To establish a “digital fingerprint” - commonalities, the combination of

which can be reasonably believed to be very likely indicative of the author

● To report whether authorship is established or disproved

Should Have

● A way of expressing confidence in the digital fingerprint (how reliable is it?)

and subsequent authorship predictions.

● To adaptively choose which tests are generating useful data for any given

user, and discard unimportant results

● To improve the digital fingerprint with successive data.

● A basic submission system and graphical interface

Could Have

● The ability to turn tests on and off to reflect potential corporate style

guidelines

● The ability to accept multiple files as a single “program”

● A more detailed level of statistical data to capture esoteric styles and

increase digital fingerprint accuracy

● Flag up significant variations in the fingerprint in sections of a file

Would Have

● A scripting/plug-in system to allow extra languages and tests to be added

● In-built support for additional programming languages

● The ability to build a digital fingerprint that identified style over multiple

languages


The first iteration will tackle the Should Haves. Following the 80:20 principle in RAD (that 80% of the functionality will be achieved by 20% of the work), it is likely that the majority of the Should Haves can be completed relatively easily for the first iteration – the Functional Model.

5.2 First Iteration: Functional Model

In creating the first functional model, it is necessary to tackle the biggest “black

boxes” in the system – how do the use cases marked “Verify Authorship” and “Build

Digital Fingerprint” in Figure 5.1 work?

A set of reasonable lower-level steps is hinted at in the MoSCoW analysis and in Figure 5.1 itself (shown as “tokenise”, “produce statistics” and “compare against digital fingerprint”). More usefully, an Activity diagram can be produced to show the flow of events:

The overall process of the system becomes more apparent. The two biggest

questions remaining are what statistics are to be produced for each grouped list of

tokenised file items; and how the digital fingerprint is generated.

An initial tokeniser was created (see Section 6.1) with the ability to identify items, classify them and remove them. The “training” data set obtained (see accompanying CD) was scanned, with the number of items not recognised in the tokenisation proving to be around 2% of the file's items, which was considered to be acceptable for the

first iteration. Output was printed to text files as a list of items organised by

position in file, and then shown again organised by type. The input files and the

lists of tokenised items were then examined and compared against the choices table

(4.1.2), to see which variations were most prevalent and produce appropriate tests.

The most notable style variations, categorised by the choices table, were observed and recorded as follows:

C1: Indentation lengths vary widely and are not always consistent

C2: Significant variations in spacing around operators; spacing choices seem to vary

between types of operator (e.g. assignment, mathematics etc) or even whether it was

before or after the operator.

C5: Positioning of braces seems mainly to determine new lines – different styles show opening and closing braces either on lines of code or on their own separate lines (or mixed).

C6: Preference of multi-line and single-line comments vary greatly

C7: Numbers of types of comments and overall numbers vary

C8: Line-span of multi-line comments, length in characters vary

C9: Main preference seems to be choice of positioning comments on/in lines of

code or on their own lines

C10: File lengths naturally vary

C11: Number of lines of code as a proportion of the file seems wildly different

C12: Length of identifiers varies, particularly depending on type (e.g. many choose shorter names for parameters)

C13: Some styles use underscores; others capitalised letters to start words.

Occasional use of all caps, all lower and some number/dollar usage; seems

idiosyncratic and relatively consistent.

C15: Length of statement blocks most notable

C16: Number of dot-separated names more prevalent and lengthy in some files

C17: Most notable variations in proportions of variable, parameter and method usage (some files have many methods; some have lots of parameters; some have very few recycled variable names)

C19: Loop choices vary considerably

C25: Number of levels of brace seem indicative in given files.


One particularly noticeable variation that did not come properly under the headings

above was the distribution and use of braces. Some programmers seemed to like to

cram them on one line; some seemed to try and match up the column index of

matching braces; some insisted on putting braces in when they were not required

(e.g. an “if” or “else” with one statement) and others omitted them in this case. This information and the list above were combined to generate a set of token statistics

(discussed in Section 6.1) that attempted to capture these style variations.

During the production of the statistics it was determined that given time

constraints, it was important to tie up the system and deliver the tokenising and

statistic-producing system as a first functional prototype, as considerable extra

design and implementation would be needed to generate the digital fingerprint.

This process comprised the second iteration.

5.3 Second Iteration: Increased Functionality Prototype

The first iteration did not fully meet the functional requirements of the system and

required extra design work, owing to the decision to complete only the prototype's tokenising and statistic-producing functionalities. One of the rationales behind

choosing DSDM as a development process was the knowledge that there were

severe limitations on project time (10 weeks in total) and thus the production of

increasingly functional prototypes in quick succession ensured that if development

had to be cut short, some software model representative of the project aims was

still produced. At this point in the project, implementation of the tokenising system

(the complexities of which produced a number of obvious bugs and exceptions

which needed to be fixed or accounted for) was protracted, leaving very little time. Given the need to finalise the project report, an extension of

seven days to the deadline was negotiated. This left time to produce the report,

plus a few extra days. This extra time would be used for a second iteration, which

would aim to complete the digital fingerprinting functionality and thus the final

project minimum requirement remaining.

When examining solely the question of adding the digital fingerprint functionality to the system – both the ability to “learn” a fingerprint from a set of files by a known author, and the ability to verify future files – it seems that the final three Must Have objectives and the first three Should Haves can be collectively approached as one overall problem, which will be the subject of this second iteration.


5.3.1 Normalising the existing results

The first problem is to “normalise” the statistics, which are measured on a variety

of scales. Putting some uniformity into the scales so that each file can be compared

like-for-like is a necessary step before any comparisons between files can be made.

The decision was taken to attempt where possible to put this on a scale of 100 – for

example, items found per 100 lines, proportions of one item versus another in

terms of percentage, items found per 100 words, etc.
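As a minimal sketch of the kind of normalisation described here (the helper names are hypothetical, not taken from the project code):

```python
def per_100_lines(count, total_lines):
    # hypothetical helper: express an item count as items per 100 lines
    return count * 100.0 / total_lines if total_lines else 0.0

def proportion_pct(part, whole):
    # hypothetical helper: one item versus another, as a percentage
    return part * 100.0 / whole if whole else 0.0

# e.g. 12 'for' loops in a 300-line file -> 4.0 per 100 lines
```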

There are some statistics which cannot be normalised within reason – things like

file length and word count must be judged in terms of their original numbers.

Clearly, therefore, methods of analysing the statistics will be needed which compensate for the fact that some items are compared on a scale of 0 to 100, and

others have no set scale or boundaries. The final list of statistics, and which were

normalised to a scale of 100, can be seen in Appendix D. Details on the

normalisation can be found in Section 6.2.

5.3.2 Weighting and Confidence

The activity diagram shown in section 5.2 illustrates that a set of known

submissions must be received for an author whose identity is known before a

digital fingerprint can be created for that author; once created, submissions can be

verified against the original. Looking at the first three “Should Have” criteria, it

seems logical to suggest that the work of creating a digital fingerprint that

conforms to these ideas can only be facilitated by some kind of weighting system

that gives greater or lesser importance to each set of tests performed on the initial,

verified submissions.

This could only be determined by a formula that analysed the initial set and

determined which tests produced the most uniform data, and which were more

disparate; thus, for any given author, making tests that show consistent data more important and removing emphasis from tests where the author's work was shown to

have little consistency. This would form the initial digital fingerprint. When

comparing against a new file to establish authorship, each relevance-weighted test

would then be performed on the new file's data; some kind of confidence value

would be needed that expressed how closely the new file's data conformed to


previous data.

Thus, for the purposes of this project, the digital fingerprint is expressed as “the

average values produced by a series of weighted tests, where weights make some

averages more important than others”. Conformance to a digital fingerprint can be

expressed as “values which most closely conform to a given set of tests, with the

importance of each match determined by the weight of the test and the proximity

to the average set of results”.

Methods of weighting and confidence are shown below:

Weighting: 100 − (standard deviation of results × 100 / mean of results)

Confidence: if within the standard deviation range, 100 minus 1% for each 1% of difference from the average; beyond that range, minus 2% for each 1% of difference

The weighting gives higher confidence to those tests for which the standard

deviation is a smaller percentage of the mean; for a test where the standard

deviation was zero (i.e. all observed items were identical in result) this would result

in 100% weighting; tests where the standard deviation exceeded the average result

would be given 0% weighting.

In terms of the confidence of whether a tested item matches the established

fingerprint, the confidence is lowered by 1% for each percentage distance it is from

the average – for example, if the average is “100”, a value of 93 would be 7%

distance away, and thus have a 93% confidence rating as matching the fingerprint.

This increases to 2% if it crosses the standard deviation range; in the above

example, if the standard deviation was 3, then the confidence lost would be 1% x 3

(100 to 97) and then 2% x 4 (97 to 93), making for 89% confidence. The effectiveness

of both of these sums will be tested upon the system's implementation.
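These two sums can be sketched in Python as follows (the function names are illustrative, and the population standard deviation via statistics.pstdev is an assumption):

```python
import statistics

def weighting(results):
    # 100 minus the standard deviation as a percentage of the mean,
    # floored at 0 where the deviation exceeds the mean
    mean = statistics.mean(results)
    if mean == 0:
        return 0.0
    sd = statistics.pstdev(results)
    return max(0.0, 100 - sd * 100 / mean)

def confidence(value, mean, sd):
    # lose 1% per 1% of distance from the average within the standard
    # deviation range, and 2% per 1% beyond it
    if mean == 0:
        return 0.0
    diff_pct = abs(value - mean) / mean * 100
    sd_pct = sd / mean * 100
    if diff_pct <= sd_pct:
        return max(0.0, 100 - diff_pct)
    return max(0.0, 100 - sd_pct - 2 * (diff_pct - sd_pct))

# the worked example above: mean 100, standard deviation 3, value 93
# -> 100 - 3 - 2*4 = 89% confidence
```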

Finally, a “proof-of-concept” data set was needed. The criteria set for these

datasets (the importance of which is noted in Section 4.2.1) was defined as:


● Minimum of four files needed. This assumes that the bare minimum needed

to establish a digital fingerprint is three, so that in each test, if an anomalous

result is found, the other two results will ensure the test is still kept

relatively applicable. The fourth file would be tested against the established

fingerprint.

● No abstract classes or interfaces. These are Java constructs where the file structure differs considerably from standard class files; support for these can be added later, but for now they may confuse the results.

● Minimum of 50 lines – this allows for huge variations in length, but ensures

we have a fairly minimal sample that has room for a couple of methods and

control structures as needed.
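These criteria could be enforced with a simple pre-filter such as the following sketch (the function and its crude textual checks are hypothetical, not part of the project code):

```python
def meets_criteria(filename, source):
    # hypothetical pre-filter for proof-of-concept submissions
    if not filename.endswith(".java"):
        return False
    if source.count("\n") + 1 < 50:          # minimum of 50 lines
        return False
    # crude textual check: reject interfaces and abstract classes
    if "interface " in source or "abstract class" in source:
        return False
    return True
```

A fuller version would of course tokenise the file rather than match raw text, since the words could appear inside comments or strings.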

This answers most of the concerns raised in 4.2.1. Ultimately, seven authors donated code which met the above requirements; this is believed sufficient to establish a small set of differing styles and to test for authorship. Authorship testing, using the above confidence and weighting, will be expressed as a percentage probability: an average of the confidences for each test, skewed by each test's weight. Section 6.2 of the implementation documents the work done to implement

these design ideas.
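The percentage probability described here amounts to a weight-skewed average of the per-test confidences; a minimal sketch (names hypothetical):

```python
def authorship_probability(confidences, weights):
    # weighted mean of the per-test confidence values, so that tests
    # with higher weights pull the overall probability harder
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(c * w for c, w in zip(confidences, weights)) / total

# two tests, equal weight: (100 + 50) / 2 -> 75.0
```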


6. Implementation

Using the DSDM development process means multiple iterations of rapid design

and implementation of prototypes. Where sections here are linked to a design

iteration it will be clearly marked.

6.1 First Iteration: Functional Model

The design objectives for iteration one were to tackle most of the Should Haves

from the MoSCoW analysis. The first steps were accepting Java files as input;

tokenising them and producing output lists of tokens. From the beginning Python

demonstrated its worth; the code to take all filenames from a directory is as

follows:

home_dir = os.getcwd()  # program directory
results_dir = os.getcwd() + "/results"
os.chdir(home_dir + "/input")
input_dir = os.getcwd()  # location of files to be processed
input_files = os.listdir(input_dir)

This short snippet takes the current program directory; sets a directory to print

results to and one to read input from; and produces a list of input files (input_files).

The simplicity of Python's 'for' loop structure then allows iteration over the files by

writing:

for selected_file in input_files:

Some method of verification was needed to ensure the input files were standard

Java class files. To remove complications, it was decided that the simplest way to

achieve this was to check for the presence of the file extension “.java” at the end.

Here Python's text comprehension and manipulation capabilities came into effect

with the use of 'slices', a construct which easily allows substrings to be produced.

The following code, placed immediately after the opening 'for' loop, effectively

vetoes non-.java files using this method:

if selected_file[-5:] == ".java":
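To illustrate the slice used here: selected_file[-5:] takes the final five characters of the filename, so the veto reduces to a simple string comparison (the filenames below are illustrative):

```python
# illustrative filenames; only the first ends in ".java"
assert "Tokeniser.java"[-5:] == ".java"
assert "notes.txt"[-5:] != ".java"

# note that the check as written is case-sensitive
assert "Main.JAVA"[-5:] != ".java"
```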

The next task was considerably more challenging – the identification of tokens in a

Java file. One idea was to use and adapt an existing Java compiler; however this was

quickly dismissed for two reasons. First, there are no known Java compilers written in Python, and translation into Python would take longer than the time allotted for this project. Secondly, a quick examination of the standard javac compiler (presumed representative of Java compilers) showed a program composed of hundreds of

files of source code, with considerable complexity. Compilers are designed to

understand each reference and word, and perform a level of checking,

comprehension and memory management (not to mention actual compilation). As a

result the effort required to remove any useful tokenising code seemed more

extensive than writing a simple tokeniser from scratch.

Context matters heavily in Java as in all programming languages – the word

“automobile” might be an identifier; but if immediately preceded by “//” it is a

comment; if encased in quotes, a string. The following three items were identified

as “blocking” elements – the state of being “inside” them overrode all other

considerations:

Multi-line comments: delineated by “/*” and “*/”
Single-line comments: delineated by “//” and the end of the line
Strings: delineated by single or double quotation marks

Therefore it seemed sensible to remove these items first, so a tokeniser could more

positively identify the remaining items. What is important to note is that these

“blocking” items, which “block” the normal use of other words, also block each

other. A sentence surrounded by quotation marks is NOT a valid Java string if it is

inside a comment; if a single-line comment has been declared, any number of

multi-line delineators written on the rest of that line do not comprise valid

comments in their own right. Furthermore it was necessary to be able to identify

these blocking items if they stretched across multiple lines, ensuring that their full

length was identified and removed from the parsing process.

The simplest solution is to test for the presence of all three opening delineators

(quotation marks, “//” or “/*”) - the one with the lowest index on the line is

accepted to take precedence, and initiates its blocking item until it is closed off. The

indices inside the blocking element are removed from the checking process which

continues until there are no blocking elements left on the line. In the code, this was

implemented as follows:

string_found = current_modifieds[line_number].find("\"")
mlc_found = current_modifieds[line_number].find("/*")
slc_found = current_modifieds[line_number].find("//")

while (string_found > -1) | (mlc_found > -1) | (slc_found > -1):
    lowest_result = tests.lowest_index([string_found, mlc_found, slc_found])
    if lowest_result == mlc_found:
        pass  # some code
    elif lowest_result == slc_found:
        pass  # some code
    elif lowest_result == string_found:
        pass  # some code

This basic algorithm demonstrates the premise. A separate module “tests” was set

up in order to perform basic tests on the input data (such as the above, “find me

the lowest item of these three, and report its name”), including the tests for

particular token types.
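The “tests” module itself is not reproduced here, but a helper like lowest_index might plausibly be written as follows (a sketch, not the project's actual code; the project's version also reports which test matched, whereas this returns only the winning index):

```python
def lowest_index(indices):
    # return the smallest index that is not -1 ("not found"),
    # i.e. the delineator appearing earliest on the line; -1 if none
    found = [i for i in indices if i > -1]
    return min(found) if found else -1
```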

Because of the importance of context in Java, it was decided three copies of a

source code file's data should be kept, to track which items had been tokenised and

where they were placed. The three types and their uses are outlined below:

Name: current_lines
Contents: Exact replicas of each file's lines
Rationale: Kept completely unmodified to verify the original state of lines

Name: current_nonblocking
Contents: Exact replica of file lines with blocking items (strings and comments) removed
Rationale: When tokenising anything other than strings and comments, the surrounding context of a word or symbol can be checked without accidentally including anything blocked by a comment or string

Name: current_modifieds
Contents: Replica of file lines with currently-recognised items removed
Rationale: Used to determine what still needs to be tokenised in a file

In order to meet the rest of the Should Have criteria, a number of tests were then

added in to systematically find recognisable tokens, remove them from the

“current_modifieds” set and collate them. A “data_store” array was set up to keep

track of the various items discovered, and arrays of normal Java keywords and

symbols were created to be scanned for in the file. Context-sensitive tests were

created which took the known keywords and symbols and used their position to

infer the position of other file items. Overall, the following tests were added:

Test Function

whitespace_check Locate tabs, indents, whitespace characters


keyword_finder Used to locate initial Java keywords; and once it has identified user-defined keywords (such as class names), locates these also

dot_notation Identifies classes, packages etc referred to via dots (e.g. this.that.theother().method)

constructor_finder Locates constructors – primarily used to identify names of external classes used, to then locate other references to these classes

class_finder Given a list of class names identified, scans file for other references to these class names

name_finder Locates the first-time use of an identifier (e.g. a variable or method declaration, or a parameter)

repeat_identifiers Locates subsequent references to identifiers

symbol_finder Locates all symbols including operators

number_finder Locates literal numerical values in the file

Collectively, these tests recognised around 98% of file items (varying with which files were input). The full coding and structure of the tests can be seen in the source code files on the accompanying project CD. Finally, the results

were output to files (corresponding results for each file can be found in the

“results” folder of the project). An early work in progress of the tokenisation and

results-production components was successfully demonstrated at the project

progress meeting (shown in the project schedule).

As discussed in section 5.2, the results files and their corresponding input files

were then examined to find the most obvious style differences, which were found

and listed in the aforementioned section. Using these as guidelines, a

“stats_generator” module was produced to generate appropriate statistics and print

the results to results files. The “generate” method in the module does just this –

items are counted from the results lists and a second corresponding “stats” file is

produced in the “stats” directory of the source tree.

The first four Should Haves were thus completed. The system produced raw data

necessary for the detection of style and creation of a digital fingerprint, but it was

as yet unclear how the data would be manipulated to do this. The decision was

made to consider this prototype a proof-of-concept of the tokenisation and

statistic-producing parts of the system, and to go back to modelling and design to plan a second iteration with the capacity to manipulate the statistics and produce a

detectable digital fingerprint. This early functional prototype fulfilled most of the

second minimum requirement for the project, in that it produced enough data to

identify a distinctive style for a given author; the actual identification process

though (mostly covered by the final minimum requirement) will be covered by the

next iteration.

6.2 Second Iteration: Increased Functionality Prototype

The first task was to perform the statistic normalisation. An extra function in the

stats_generator module was added to normalise the existing statistics on to scales

where possible, and produce the results as CSV files. As a sample, sets of five files

from five of the authors were put through this process. The results were collected

and averages and standard deviations added, ready for the weighting and

confidence. The collated results from this can be seen in the “Stats test”

spreadsheet on the project CD.
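A sketch of producing such a CSV, with averages and standard deviations appended (the function name, the row shape, and the use of the population standard deviation are assumptions for illustration):

```python
import csv
import statistics

def write_stats_csv(path, stats):
    # stats: mapping of statistic name -> list of normalised values,
    # one value per file (a hypothetical shape for illustration)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["statistic", "mean", "stdev"])
        for name, values in stats.items():
            writer.writerow([name, statistics.mean(values),
                             statistics.pstdev(values)])
```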

Unfortunately at this stage, the implementation had begun to overrun its timebox,

and emphasis had to be placed on a completed project report. Observations on this

and whether or not the inability to produce the digital fingerprint from the collated

statistics constitutes a failure of the project can be read in Appendix A.


Bibliography

1. Fores, S. (2007). School Taught Student Handbook for Returning Students

2007-2008, School of Computing, University of Leeds, page 18.

2. Jimack, P., Efford, N., Fores, S., Clark, M. (2007). Project Topics, School of

Computing, University of Leeds,

http://www.comp.leeds.ac.uk/tsinfo/projects/software_g.html [April 2008]

3. Royce, W. (1970). Managing the Development of Large Software Systems,

Proceedings of IEEE WESCON vol. 26, pp1-9.

4. DeGrace, P., Hulet Stahl, L. (1990). Wicked Problems, Righteous Solutions: A

Catalog of Modern Software Engineering Paradigms, Prentice Hall

5. Boehm, B. (1988). A Spiral Model of Software Development and Enhancement,

IEEE Computer magazine vol. 21, issue 5

6. Martin, J. (1991). Rapid application development, Macmillan Publishing Co.

7. Arthur, L. J., (1997). Quantum improvements in software system quality,

Communications of the ACM vol. 40, issue 6

8. Howard, A. (2002). Rapid application development: rough and dirty or value-

for-money engineering?, Communications of the ACM vol. 45, issue 10

9. Stapleton, J. (1999). DSDM: Dynamic Systems Development Method,

Proceedings of Technology of Object-Oriented Languages and Systems, 1999,

British Computer Society.

10. Kruchten, P. (2003). The Rational Unified Process: An Introduction, Addison

Wesley

11. Jacobson, I. (1999). The unified software development process, Addison

Wesley

12. Knoepfel, A., Groene, B., Tabeling, P. (2005). Fundamental modeling concepts

– effective communication of IT systems, Wiley

13. Quatrani, T. (2005). Introduction to UML 2.0, Object Management Group,

http://www.omg.org/news/meetings/workshops/MDA-SOA-WS_Manual/00-

T4_Matthews.pdf [April 2008]

14. www.sysml.org [April 2008]

15. Harrap, C. (2003). Chamber 21st Century Dictionary, Harrap Publishers

16. Baayen, H., Neijt, A., Tweedie, J. (2002). An experiment in author attribution,

Proceedings of the JADT, University of Paris


17. White, J.W., Kowalski, R.M. (1994). A Feminist Analysis, Psychology of Women

Quarterly, Blackwell Synergy

18. Chen, H. (2005). Artificial Intelligence for Homeland Security, Intelligent

Systems, IEEE vol 20. issue 5

19. Person, R. F. (1997). The Kings-Isaiah and Kings-Jeremiah, Walter de Gruyter

20. Somers, H., Tweedie, F. (2003). Authorship Attribution and Pastiche, Springer

21. Schommer, C., Uhde, C. (2008). Textual Fingerprinting with Texts from Parkin,

Bassewitz, and Leander, Arxiv Pre-Print

22. Laan, N. (1995). Stylometry and Method. The Case of Euripides, Literary and

Linguistic Computing 1995 10(4): 271-278

23. Mosteller, F., Wallace, D.L. (1963). Inference in an Authorship Program,

Journal of the American Statistical Association

24. Hoover, D.L. (2003). Multivariate Analysis and the Study of Style Variation, Literary and Linguistic Computing 2003 18(4):341-360

25.Madigan, D., Genkin, A., Lewis, D.D., Argamon, S., Fradkin, D., Ye, L. (2005). Authorship Identification on the Large Scale, Rutgers University Press

26. Stamatatos, E., Fakotakis, N., Kokkinakis, G. (2000). Automatic Text Categorization in Terms of Genre and Author, Computational Linguistics 2000, vol 26. issue 4, pp 471-495

27. Argamon, S., Koppel, M., Fine, J., Shimoni, A.R. (2003). Gender, genre, and

writing style in formal written texts. Text-Interdisciplinary Journal for the

Study of Discourse, Walter de Gruyter

28. Smadja, F. A. (1989). Lexical Co-occurence: The Missing Link, Oxford

University Press

29. www.ldoce.com [April 2008]

30. Longstaff, T.A., Schultz, E.E. (1993). Beyond preliminary analysis of the WANK

and OILZ worms: a case study of malicious code, Computers and Security, vol.

12 issue 1

31. Spafford, E.H., Weeber, S.A. (1993). Software forensics: Can we track code to

its authors?, Computers and Security, vol 12, pp 585-595

32. Vitoria, M. (2006). Navitare Inc. v Easy Jet Airline Co Ltd., Crown Publishings

33. Sallis, P., Gray, A., MacDonnell, S. (1997). Software Forensics: Extending

Authorship Analysis Techniques to Computer Programs, Proceedings of the

3rd Biannual Conference of the IAFL, 1997 1-8

34. MacDonnell, S., MacLennan, G., Kilgour, R., Gray, A., Sallis, P. (1997).

48

IDENTIFIED: Software Authorship Analysis with Case-Based Reasoning,

Proceedings of the Addendum Session of the 1997 International Conference

on Neural Information Processing and Intelligent Information Systems.

35. Helmore, S. (2005). Detecting Bought Programming Coursework Solutions,

School of Computing, University of Leeds

36. http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html [April

2008]

37. http://en.wikibooks.org/wiki/List_of_hello_world_programs [April 2008]

38. http://java.sun.com/reference/docs/ [April 2008]

49

Appendix A

Reflections on Project Experience

In summary, the project was a worthwhile experience and one I would happily undertake again, incorporating the lessons I learned along the way.

One of the great joys of the project was the research. The opportunity to read into a very specific area in depth was a fantastic learning experience: a chance to become “an expert” in a narrow field, such that one can write knowledgeably about it. To make a germane contribution to that field it is necessary to read decades' worth of papers on the subject, which gives an insight into both the development of the technology and the thought processes of some of computing's leading thinkers. The resulting solution feels very much as though its creator has stood on the shoulders of giants and worked on problems that have been grappled with by brilliant minds; this is in itself very fulfilling.

I would strongly recommend that students attempting projects spend a great deal of time at the start familiarising themselves with the School's expectations of such projects and what they are expected to deliver. Whilst projects posted on News are always valid, the student still needs to make them fit the framework of a final-year project. In my own case, failure to properly grasp these expectations led to a number of poor choices of project and/or proposed solutions, and the choice of project consequently went through many revisions. This left a very short timescale in which to produce the solution and report, and the project undoubtedly suffered on both counts: implementation had to stop before the solution could be fully realised, and there was little time to finalise the report. This emphasises the need to resolve such teething problems early; the project schedule is designed to allow adequate time for completion, and delays are likely to hinder progress and deliver poorer final results.

This project does not, however, constitute a failure to my mind. The substantial research and theoretical reasoning behind the proposed system, along with a partial proof of concept, demonstrate sound work and a platform for future work to build upon. Completion of the implementation could be taken up by anyone with access to the source code and this paper; and whilst the implementation is incomplete, it is my judgement that the theories are sound and backed by extensive research.

Students attempting this area of research in future should consider the theories put forward here and examine the source code of the implementation as it stands as a possible way forward. The bibliography, in conjunction with the research section, sets out a wide range of resources from the fields of linguistics, natural language processing, plagiarism detection and computing in general; future work in this field should continue to draw on all of these areas to gain the maximum advantage each offers.


In conclusion, my thanks go to my long-suffering project supervisor. I would again exhort participating students to fully utilise the rich mine of information and resources that supervisors offer, and I hope that this paper has contributed, at least in a small way, to the thinking behind and design of digital fingerprinting systems.


Appendix B


Appendix C

The Waterfall Model

RAD


DSDM

The Spiral Model


Unified Process


Appendix D

The following table is a sample; the full spreadsheet (at 76 columns) can be seen on the attached project CD.


Rows are grouped by programmer (the Label column); each programmer has five sample files, followed by the per-programmer average (Av) and standard deviation (Std). The metric columns, in order, are:

comments_mlc_count, comments_slc_count, comments_mlc_linespan, comments_slc_chars, comments_mlc_chars, comments_mlc_total_char, comments_slc_total_char, comments_total_char, comments_mlc_proportion, comments_slc_proportion, comments_num_lines, comments_mixed_lines, code_all_lines, code_lines, code_exclusive_lines

Programmer A
6.85 2.74 4.8 27 143 717 55 772 92.88 7.12 75.34 1.82 73 53.42 97.44
4.92 4.92 8.67 57 313 939 171 1110 84.59 15.41 280.33 0 61 47.54 100
6.74 6.74 5.67 47 169 1015 282 1297 78.26 21.74 316.85 0.35 89 47.19 97.62
4.15 5.7 4.13 35 106 1696 790 2486 68.22 31.78 204.66 0.38 386 63.21 98.77
8.77 0 5.6 0 175 878 0 878 100 0 0 0 57 43.86 100
Av 6.29 4.02 5.77 33.2 181.2 1049 259.6 1308.6 84.79 15.21 175.44 0.51 133.2 51.05 98.77
Std 1.81 2.68 1.74 21.8 78.54 377.93 315.8 689.06 12.39 12.39 134.77 0.75 141.87 7.62 1.24

Programmer B
0.22 6.67 2 31 75 75 931 1006 7.46 92.54 206.89 1.18 450 77.11 96.83
3.64 3.64 6 52 151 303 104 407 74.45 25.55 189.09 0 55 50.91 100
1.72 0 6 0 153 153 0 153 100 0 0 0 58 65.52 100
1.49 0 6 0 156 312 0 312 100 0 0 0 134 69.4 100
0.77 7.69 7 40 150 150 403 553 27.12 72.88 310 0.25 130 65.38 98.82
Av 1.57 3.6 5.4 24.6 137 198.6 287.6 486.2 61.81 38.19 141.2 0.29 165.4 65.66 99.13
Std 1.3 3.61 1.95 23.66 34.73 104.26 395.73 324.92 42.53 42.53 136.91 0.51 163.52 9.52 1.38

Programmer C
8.24 2.35 5.43 2 169 1184 4 1188 99.66 0.34 4.71 0 85 40 100
1.37 12.33 6 3 110 110 27 137 80.29 19.71 36.99 0 73 75.34 100
3.85 6.41 5.33 5 175 527 29 556 94.78 5.22 37.18 0 78 60.26 100
4.47 3.35 7.13 6 285 2282 37 2319 98.4 1.6 20.67 0 179 54.75 100
3.97 0.32 6.32 45 261 6530 91 6621 98.63 1.37 14.44 1.1 630 57.94 99.73
Av 4.38 4.95 6.04 12.2 200 2126.6 37.6 2164.2 94.35 5.65 22.8 0.22 209 57.66 99.95
Std 2.47 4.67 0.73 18.4 71.82 2594.61 32.28 2623.73 8.07 8.07 14.23 0.49 239.36 12.65 0.12

Programmer D
10.71 3.57 3.33 45.5 92.5 555 91 646 85.91 14.09 162.5 0 56 48.21 100
6.25 10.94 3.75 39.43 69.25 277 276 553 50.09 49.91 431.25 0.36 64 51.56 96.97
3.45 4.6 4 23 109 327 92 419 78.04 21.96 105.75 0 87 64.37 100
6.67 0 7 0 268 1072 0 1072 100 0 0 0 60 46.67 100
5.04 3.36 3.33 42 93.67 562 168 730 76.99 23.01 141.18 0 119 62.18 100
Av 6.42 4.49 4.28 29.99 126.48 558.6 125.4 684 78.21 21.79 168.13 0.07 77.2 54.6 99.39
Std 2.7 4 1.55 18.86 80.37 314.86 103.1 245.71 18.21 18.21 159.8 0.16 26.28 8.15 1.36

Programmer E
0 8.09 0 17.64 0 0 247 247 0 100 142.77 5.67 173 73.41 88.98
0 3.39 0 24.25 0 0 97 97 0 100 82.2 4.12 118 79.66 95.74
0 3.54 0 23 0 0 92 92 0 100 81.42 4.35 113 74.34 95.24
0 2.61 0 29 0 0 116 116 0 100 75.82 3.45 153 72.55 96.4
0 3.88 0 23.25 0 0 93 93 0 100 90.29 4.3 103 70.87 94.52
Av 0 4.3 0 23.43 0 0 129 129 0 100 94.5 4.38 132 74.17 94.18
Std 0 2.17 0 4.04 0 0 66.67 66.67 0 0 27.48 0.81 29.66 3.33 2.99
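The Av and Std rows can be reproduced directly from the five sample values in each programmer's block. As a minimal sketch, assuming Std is the sample (n-1) standard deviation, which matches the figures in the spreadsheet, here is the calculation for Programmer A's comments_mlc_count column:

```python
import statistics

# comments_mlc_count values for Programmer A's five sample files (Appendix D)
mlc_count = [6.85, 4.92, 6.74, 4.15, 8.77]

avg = statistics.mean(mlc_count)   # Av row: arithmetic mean
std = statistics.stdev(mlc_count)  # Std row: sample standard deviation (n-1)

print(round(avg, 2), round(std, 2))  # 6.29 1.81, matching the table
```

The same calculation applied column by column yields each programmer's full Av and Std rows.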