UNECA ACS ASSD

UNECA ACS ASSD: African Handbook on Census Data Processing, Analysis and Dissemination. St. Georges Hotel, Pretoria, 15 November 2009


Transcript of UNECA ACS ASSD

Page 1: UNECA ACS ASSD

UNECA ACS ASSD

African Handbook on Census Data Processing, Analysis and Dissemination

St. Georges Hotel, Pretoria, 15 November 2009

Page 2: UNECA ACS ASSD

Data Capture Methods

Traditional: Key from Paper (KFP)

Scanning model:
Key from Image (KFI)
Optical Mark Recognition (OMR)
Optical Character Recognition (OCR)
Intelligent Character Recognition (ICR)
Intelligent Recognition (IR)

Internet (IRS)

Handheld (PDA, laptop, netbook, etc.)

Page 3: UNECA ACS ASSD

Forms Type (Source): Structured, Semi-Structured, Unstructured

Page 4: UNECA ACS ASSD

Scanning Models

Type | Acronym | Description
Key from Image | KFI | Identical to the Key from Paper method, except that data is entered from a scanned image
Optical Mark Recognition | OMR | Data is produced from response marks on the instrument during or after scanning
Optical Character Recognition | OCR | OCR technology recognizes machine-printed characters on an instrument
Intelligent Character Recognition | ICR | ICR technology recognizes handwritten characters on an instrument
Intelligent Recognition | IR | IR technology recognizes handwritten and cursive characters on an instrument

Page 5: UNECA ACS ASSD

OMR

Page 6: UNECA ACS ASSD

OMR

OMR is a technology that allows an input device (e.g. imaging scanner) to read hand-drawn marks such as small circles or squares on specially designed paper. OMR is captured by contrasting reflectivity at predetermined positions on a page.
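The idea of reading marks by contrasting reflectivity at predetermined positions can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the page is modelled as a small grid of grey levels, the zone names and coordinates are hypothetical, and the 0.4 darkness threshold is arbitrary. Real OMR scanners and software work differently in detail.

```python
from typing import Dict, List, Tuple

# A page is modelled as a grid of grey levels: 0 = black ink, 255 = white paper.
Page = List[List[int]]
# A zone is (top, left, height, width); its position is fixed by the form design.
Zone = Tuple[int, int, int, int]


def mean_darkness(page: Page, zone: Zone) -> float:
    """Return the average darkness (0.0 = blank, 1.0 = fully black) inside a zone."""
    top, left, height, width = zone
    total = 0
    for row in page[top:top + height]:
        for pixel in row[left:left + width]:
            total += 255 - pixel
    return total / (255 * height * width)


def read_marks(page: Page, zones: Dict[str, Zone], threshold: float = 0.4) -> Dict[str, bool]:
    """Report, for each named zone, whether it is dark enough to count as marked."""
    return {name: mean_darkness(page, zone) >= threshold for name, zone in zones.items()}


# Hypothetical 4x8 page with one filled response box in the left half.
page = [[255] * 8 for _ in range(4)]
for r in range(1, 3):
    for c in range(1, 3):
        page[r][c] = 0

# Hypothetical response zones for a single question.
zones = {"sex_male": (1, 1, 2, 2), "sex_female": (1, 5, 2, 2)}
marks = read_marks(page, zones)
print(marks)
```

Only the predetermined zones are inspected, which is why the position and print quality of the response boxes matter so much.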

Page 7: UNECA ACS ASSD

OMR

OMR converts marks into numbers or letters and enters them into the computer.

There are two known methods of applying OMR technology in data processing, namely

Form based OMR, and Image based OMR

In form based OMR, one works with a specialized document that contains timing tracks along one edge of the form. These tracks, which look like black boxes on the top or bottom of a form, indicate to the scanner where to read for marks.

In image based OMR, the scanned image is run through processing or interpretation engines so that a computer can electronically determine the marks made on the form.

In effect, form based OMR does the ‘reading’ of data at scan time, whilst image based OMR can create the data during any subsequent process.

Key difference: with form based OMR one cannot add fields for interpretation after scanning, whilst with image based OMR these can be added as and when required. However, with form based OMR, images can be saved during the scanning process; any further verification or exceptions management would then require a KFI process.

Page 8: UNECA ACS ASSD

KFP, KFI, OMR

Page 9: UNECA ACS ASSD

OMR Advantages and Disadvantages

Advantages

Form based OMR is a data collection technology that does not require a recognition engine. It is therefore fast, uses minimal processing power to process forms, and its costs are predictable and defined.

OMR capture speeds are around 4000 forms per hour, so a large volume can be processed within a short period of time.

Disadvantages

OMR cannot recognize hand-printed or machine-printed characters.

With OMR, images of forms are not captured by scanners, so electronic retrieval is not possible.

Tick boxes may not be suitable for all types of questions.

If a user wants to gather large amounts of text, OMR can complicate data collection.

Data may also be missed in the scanning process: incorrectly numbered or unnumbered pages can be scanned in the wrong order.
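For planning purposes, the quoted capture speed of about 4000 forms per hour can be turned into a rough scanning-time estimate. The sketch below is only arithmetic. The number of forms, the number of scanners, and the eight-hour working day are hypothetical inputs, and it ignores downtime, form preparation, and re-scans.

```python
import math


def scanning_hours(total_forms: int, forms_per_hour: int = 4000, scanners: int = 1) -> float:
    """Hours of pure scanning time, assuming every scanner sustains the same speed."""
    return total_forms / (forms_per_hour * scanners)


def working_days(total_forms: int, forms_per_hour: int = 4000, scanners: int = 1,
                 hours_per_day: float = 8.0) -> int:
    """Whole working days needed, rounding any partial day up."""
    return math.ceil(scanning_hours(total_forms, forms_per_hour, scanners) / hours_per_day)


# Hypothetical workload: 2 million forms on 5 scanners.
hours = scanning_hours(2_000_000, 4000, 5)
days = working_days(2_000_000, 4000, 5, 8.0)
print(hours, days)
```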

Page 10: UNECA ACS ASSD

OMR Best Practices

The entire process must be tested: information capture, recognition, and verification of results.

Questionnaire design and preparation is a critical aspect.

Forms must be easily scannable and in good condition at scan time; otherwise transcription will be required.

Enumerators must take particular care in filling out questionnaires.

Completeness and consistency checks must be in place.

Care must be taken over the condition of the questionnaire (dust, humidity, transportation, etc.).

Page 11: UNECA ACS ASSD

OMR Lessons Learnt

OMR, in any form, can be an extremely powerful tool for data processing of large surveys and censuses; however, it needs to be carefully controlled and managed.

To achieve high accuracy, well structured design and good quality printing of forms are critical. This brings to the fore the issue of costs, as this printing can be extremely costly and limited geographically because service providers are few and far between.

Although OMR data is relatively accurate, it is important to do detailed testing and constant review of data being produced to ensure that the right fields are being read. One can do this via various methods like an independent comparison of OCR read values versus KFI based values from the same images.
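An independent comparison of this kind can be sketched as a field-by-field check between the recognized values and the values keyed from the same images. The record layout and field names below are hypothetical, and the function is an illustrative assumption rather than part of any particular capture system.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]


def field_disagreements(recognized: List[Record], keyed: List[Record]) -> Counter:
    """Count, per field, how often the recognized value differs from the keyed value."""
    if len(recognized) != len(keyed):
        raise ValueError("both captures must cover the same forms, in the same order")
    diffs: Counter = Counter()
    for rec, key in zip(recognized, keyed):
        for field, keyed_value in key.items():
            if rec.get(field) != keyed_value:
                diffs[field] += 1
    return diffs


# Hypothetical values for three forms, captured twice from the same images.
recognized = [
    {"sex": "1", "age_group": "03"},
    {"sex": "2", "age_group": "07"},
    {"sex": "1", "age_group": "11"},
]
keyed = [
    {"sex": "1", "age_group": "03"},
    {"sex": "2", "age_group": "01"},
    {"sex": "1", "age_group": "17"},
]
diffs = field_disagreements(recognized, keyed)
print(diffs)
```

A field that disagrees far more often than the others is a hint that its read zone or its printing should be reviewed.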

Exceptions can also be easily corrected with images available on hand for correction.

Page 12: UNECA ACS ASSD

KFI

Page 13: UNECA ACS ASSD

KFI

The actual process of KFI is quite similar to that of KFP in that the data capturer still enters data manually; however, instead of capturing from a paper form, he/she captures data directly from an image.

Page 14: UNECA ACS ASSD

KFI Advantages and Disadvantages

Advantages

Preparatory time: Minimal time is required to implement changes and modifications.

Online verification: A major advantage is that verification of instruments occurs at the time of data entry, so errors and discrepancies can be picked up easily. However, this can be negated if data entry clerks independently change content on the instrument because constant error messages from the system hamper their performance.

Disadvantages

Production time: As with KFP, no computer aided recognition occurs. The data capturer therefore types each and every character as displayed on the questionnaire.

Keying errors: Keying errors are bound to occur as each and every character of information is being captured manually. As capturers try to reach their targets and increase performance, errors will start to creep in.

Entry clerk changes data due to tight validation: If tight validation is put into place, allowing the clerk only a set number of values for entry, any inconsistent information will be changed to the easiest value the clerk can select. In this way, invalid and out of range data is not consistently edited and corrected, which results in data problems downstream.

Page 15: UNECA ACS ASSD

Example of multi-type form

[Figure: example form with its OCR, OMR and ICR fields labelled]

Page 16: UNECA ACS ASSD

Example of Census Form

Page 17: UNECA ACS ASSD

OCR/ICR

Page 18: UNECA ACS ASSD

OCR/ICR

With scanning technology steadily becoming cheaper and more accessible, and with advances in recognition algorithms, character recognition has become the foundation of image and forms processing around the world. It takes two primary forms, OCR and ICR.

OCR technology recognizes machine-printed characters on a form, whilst ICR technology recognizes handwritten characters on a form. The problem of reading machine-printed characters with OCR has largely been solved, with accuracy mainly between 99 and 100%.

The key difference between OCR and ICR is that OCR is more accurate than ICR, due to the large amount of variation that occurs in handwriting. Nevertheless, ICR is a great advancement in character recognition, as there is virtually no limit on the types of data that can be collected and converted. However, this needs to be done with great care and attention to editing and data confrontation to avoid problems.

Page 19: UNECA ACS ASSD

OCR/ICR: Segmentation of text

Page 20: UNECA ACS ASSD

OCR/ICR: Segmentation of text

3 1 2 2 4 3 0 8 9 1

Engine A + Engine B

Page 21: UNECA ACS ASSD

Types of Recognition Engines

Different types of OCR/ICR/OMR engines are used to recognize characters (numeric or alpha-numeric):

NESTOR
EXPERVISION
KADMOS
TISICR
LIGATURE
Clear Image
JustICR
RecoStar
ParaScript
AEG
A2iA

Page 22: UNECA ACS ASSD

Majority Voting Rules: Engines

Readings from four engines (ICR 1, ICR 2, ICR 3, ICR 4): 3 3 8 3

Unanimous = ?

Majority = 3
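The two voting rules on this slide can be sketched as follows, using the slide's own readings (3, 3, 8, 3). The function name and the choice to return None for an unresolved vote (the "?" on the slide) are assumptions made for illustration; commercial engines may combine results quite differently, for example by weighting confidence scores.

```python
from collections import Counter
from typing import List, Optional


def vote(readings: List[str], rule: str = "majority") -> Optional[str]:
    """Combine one reading per engine; return None when the rule is not satisfied."""
    if not readings:
        return None
    value, count = Counter(readings).most_common(1)[0]
    if rule == "unanimous":
        # Every engine must agree.
        return value if count == len(readings) else None
    if rule == "majority":
        # More than half of the engines must agree.
        return value if count > len(readings) / 2 else None
    raise ValueError(f"unknown rule: {rule}")


readings = ["3", "3", "8", "3"]  # one reading per engine, as on the slide
print(vote(readings, "unanimous"))  # no unanimous result, so the character goes to an operator
print(vote(readings, "majority"))
```

The stricter the rule, the more characters are sent for manual correction, so the choice of rule trades operator workload against the risk of accepting a wrong value.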

Page 23: UNECA ACS ASSD

Alpha Recognition - Voting

ICR A: *oshua

ICR B: Jo*hu*

ICR C: J*sh*a

VOTING

Result: Joshua
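The alpha voting example can be sketched character by character, treating "*" as a position an engine could not recognize. This is a minimal sketch under stated assumptions: all engines are assumed to return strings of equal length (segmentation already agreed), and ties are left unresolved. The function name is hypothetical.

```python
from collections import Counter
from typing import List


def vote_word(readings: List[str], unknown: str = "*") -> str:
    """Vote position by position, ignoring characters an engine could not read."""
    if len({len(r) for r in readings}) != 1:
        raise ValueError("all readings must be non-empty and of equal length")
    result = []
    for chars in zip(*readings):
        candidates = [c for c in chars if c != unknown]
        if not candidates:
            result.append(unknown)  # no engine recognized this position
            continue
        (best, best_count), *rest = Counter(candidates).most_common(2)
        if rest and rest[0][1] == best_count:
            result.append(unknown)  # tie between engines: leave for an operator
        else:
            result.append(best)
    return "".join(result)


word = vote_word(["*oshua", "Jo*hu*", "J*sh*a"])
print(word)
```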

Page 24: UNECA ACS ASSD

False Positive Marking

Page 25: UNECA ACS ASSD

OCR/ICR Advantages and Disadvantages

Advantages

Recognition engines used with imaging can capture highly specialized data sets.

Engines can be made to learn regional characteristics and their effects on handwriting.

Large saving on resources (human and machine) due to computer assistance in 80% of keying processes.

OCR/ICR recognizes machine-printed or hand-printed characters.

Scanning and recognition allow efficient management and planning for the rest of the processing workload.

Quick retrieval of images for editing and reprocessing.

Disadvantages

Technology is costly.

May require significant manual intervention if not implemented properly.

Additional workload for enumerators: ICR has severe limitations when it comes to human handwriting.

Characters must be hand-printed or machine-printed as separate characters in boxes.

Ineffective when dealing with cursive characters.

Page 26: UNECA ACS ASSD

OCR/ICR Lessons Learnt

ICR/OCR is a technology that can benefit data processing immensely. However, it must be carefully designed and implemented to avoid problems creeping into the production cycle.

Algorithms have improved over time and continue to get better; however, if handwriting is poor, more data will be sent for correction, resulting in a greater workload for operators.

Forms design and proper printing are key to the success of the process.

Barcodes can play a vital part in providing a unique description of the form, and instruments should be treated as forms before being treated as households.

Page 27: UNECA ACS ASSD

OCR/ICR QA/Exceptions

One of the major issues with ICR/OCR is that one places trust in the processing engine to provide data that is of excellent quality and a direct reproduction of the instrument.

It is therefore vital to undertake QA processes on any OCR/ICR data to ensure that the conversion process was of adequate quality. This can be done by a sample based recapture of data in an independent system to ascertain a data quality rate or, as its inverse, the error rate. This rate can be used as a measure of quality, with the further option of rejection, to ensure that only data of an acceptable level is sent through the system.
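The sample based recapture described above can be sketched as a small acceptance check. The sample size, the 1% acceptance threshold, the fixed random seed, and the record layout are all hypothetical choices for illustration; an actual QA plan would set them from the census quality targets.

```python
import random
from typing import Dict, List

Record = Dict[str, str]


def draw_sample(form_ids: List[int], sample_size: int, seed: int = 0) -> List[int]:
    """Pick the forms to be recaptured independently (fixed seed for repeatability)."""
    rng = random.Random(seed)
    return sorted(rng.sample(form_ids, min(sample_size, len(form_ids))))


def error_rate(recognized: List[Record], recaptured: List[Record]) -> float:
    """Share of compared field values where the two captures differ."""
    compared = 0
    errors = 0
    for rec, recap in zip(recognized, recaptured):
        for field, value in recap.items():
            compared += 1
            if rec.get(field) != value:
                errors += 1
    return errors / compared if compared else 0.0


def accept_batch(recognized: List[Record], recaptured: List[Record],
                 max_error_rate: float = 0.01) -> bool:
    """Accept the batch only if the sampled error rate is within the threshold."""
    return error_rate(recognized, recaptured) <= max_error_rate


# Hypothetical sample of two forms, two fields each.
recognized = [{"age": "34", "sex": "1"}, {"age": "71", "sex": "2"}]
recaptured = [{"age": "34", "sex": "1"}, {"age": "77", "sex": "2"}]
rate = error_rate(recognized, recaptured)
print(rate, accept_batch(recognized, recaptured))
```

The quality rate is simply one minus the error rate, so either figure can be reported.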

For exceptions, it has been found that tracking and correcting small cases through a bulk system can prove to be problematic and it would be more advantageous to follow a KFP solution for all exceptions. In this way, the bulk production system runs and is not hampered by exceptions.

Page 28: UNECA ACS ASSD

Internet Data Collection

Page 29: UNECA ACS ASSD

Internet Data Collection

The most common methods of data collection for surveys and censuses are personal interviewing and self enumeration. The growing number of respondents with access to the Internet introduces a new data collection alternative that is likely to become increasingly important in the future.

Like computer assisted telephone and personal interviewing, computer assisted self interviewing using the Internet permits an interactive exchange with the respondent through intelligence built into the computer application.

While promising, Internet surveys also face a variety of challenges in survey coverage, in survey design, in security of confidential information, and in mastery of new and rapidly changing technologies.

Page 30: UNECA ACS ASSD

Internet Data Collection

The most important factor in deciding whether internet data collection is a viable alternative is the rate of internet penetration in the respective country.

Some countries have high penetration rates, as in Europe, where some countries boast penetration rates of between 80 and 90 percent. In Africa, however, where recent statistics indicate average internet penetration of around 6.7%, the internet can still play an important part in a multi channel data collection system in censuses and surveys.

Page 31: UNECA ACS ASSD

Internet Data Collection

The functional requirements for Internet questionnaires describe an interactive application where interview questions are presented to the respondent and actions are taken based on the responses.
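The notion of taking actions based on responses can be illustrated with a small routing (skip logic) sketch. The questions, answer codes, and class names below are hypothetical and only show the general idea; a real Internet questionnaire would also need validation, persistence, and a browser interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Question:
    key: str
    text: str
    # Given the answer, return the key of the next question, or None to finish.
    next_key: Callable[[str], Optional[str]]


def run_interview(questions: Dict[str, Question], start: str,
                  answers: Dict[str, str]) -> List[str]:
    """Return, in order, the keys of the questions presented for these answers."""
    asked: List[str] = []
    current: Optional[str] = start
    while current is not None:
        if current in asked:
            raise ValueError(f"routing loop at question {current!r}")
        asked.append(current)
        current = questions[current].next_key(answers[current])
    return asked


# Hypothetical three-question module with one skip.
questions = {
    "employed": Question("employed", "Did you work in the last 7 days?",
                         lambda a: "occupation" if a == "yes" else "seeking_work"),
    "occupation": Question("occupation", "What is your main occupation?", lambda a: None),
    "seeking_work": Question("seeking_work", "Are you looking for work?", lambda a: None),
}

path = run_interview(questions, "employed", {"employed": "yes", "occupation": "teacher"})
print(path)
```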

The Internet consists of heterogeneous client hardware and software. The software, or browser, supports published and de facto standards which allow Web pages to be displayed and executed on the client computer. One needs to be careful to design an interface that is as simple and adaptable as possible, so that it displays correctly in any browser or web interface.

Page 32: UNECA ACS ASSD

Internet Data Collection

Since the Internet is a public network, security vulnerabilities exist. They include the following:

Eavesdropping, i.e., intermediaries can listen in on private conversations;

Theft, i.e., data stolen during the course of transmission or from a computer or network; and

Impersonation, i.e., a sender or receiver using a false identity for communication.

The NSO needs to address these issues to provide respondents with a secure and private method to use the Internet for data collection.

Page 33: UNECA ACS ASSD

Internet Data Collection

Security for Internet data collection has to be addressed at three levels:

(1) the security of communication between the respondent and the NSO;

(2) the security of respondent data at the NSO; and

(3) the security of the NSO network.

Page 34: UNECA ACS ASSD

Internet Data Collection

Since Web data collection is in its infancy, this is only the beginning.

As Web technology matures, guidelines for Web questionnaire design will be further tested, standardized, and documented.

With these advances and increasing Web skills in the general public, respondents will find Web questionnaires increasingly easy to use.

The ease of use and intuitiveness of a Web questionnaire are important, since we do not have the luxury of training the respondent.

The Web also offers the opportunity to use graphics, audio, and video to improve the overall interview experience for the respondent.

Page 35: UNECA ACS ASSD

Thank you…

I reiterate… We still need your valuable inputs to make this document better.