A Universal Approach to Building QA Models

(C) 2014 Logrus International

A UNIVERSAL APPROACH TO BUILDING QA MODELS

Leonid Glazychev, Logrus International Corporation


QA MODEL: GENERAL CONSIDERATIONS

Reflecting perception and priorities of the target audience
  Concentrating on factors producing the strongest impression
  Separating global and local factors/issues

Universal applicability
  Covering the whole spectrum of materials
    From slightly post-edited MT to ultra-polished manual translations
    Same approach for knowledge bases and marketing leaflets
  Common approach
    Only adjusting acceptance criteria/thresholds based on expectations

Viability
  Clear, not overly complicated
  Process-oriented, i.e. applicable in the real world

Flexibility
  Concentrating on methodology
  Particular criteria/issue classification can be taken from elsewhere, for instance:
    Based on MQM or other public source
    Based on legacy client-sourced criteria…


REAL-LIFE SCENARIO: NO REFERENCE TRANSLATIONS AVAILABLE

Two major criteria for any translation are always
  Adequacy (correctly conveys the meaning) and
  Fluency (readability)
  NEITHER of these depends on translation origin, target audience, brand impact, etc.

No need to delve into technical details or error counts if the text is
  Unreadable (incomprehensible) or
  Inadequate (inaccurate)

Acceptance thresholds depend on a number of parameters
  Goals
  Target audience
  Speed
  Expected longevity and brand impact, etc.

Assessment is relatively quick
  Often scanning through the text is sufficient
    Especially so when quality is really low
  One needs to be bilingual or have a bilingual expert ready just in case


MAKING REAL-LIFE LQA AS OBJECTIVE AS POSSIBLE

NEITHER of the two major criteria is completely objective
  An expert panel would produce a normal opinion curve around the average value
  In real life there is no expert panel, but a single evaluator!
    The grade assigned by this particular person will NOT be arbitrary, but…
    It might fall anywhere within the standard ±2σ range
    It depends on the individual’s taste, background, etc.
  That is why both criteria can be called SEMI-OBJECTIVE or EXPERT OPINION-BASED
  Both criteria are NOT too accurate by design!

Consequences
  EACH of these two major criteria should be evaluated SEPARATELY
    Accurate but incomprehensible texts are as useless as fluent but inadequate ones
    Two independent “coordinates”, can’t be combined mechanically
  EACH should be evaluated on a threshold-based PASS/FAIL basis
    Acceptance range needs to accommodate the whole spectrum of potential expert opinions
      Marketing text: between 8 and 10 (10-point scale)
      Knowledge base: between 5 and 8 (10-point scale)
    The minimal scale to be used is a 10-point one, to accommodate the normal curve properly
      Smaller scales just do not provide sufficient granularity
    Acceptance threshold is defined by the area, visibility of materials, time constraints, target audience, etc.
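A minimal Python sketch of the threshold-based PASS/FAIL evaluation described on this slide. The two acceptance ranges are taken from the slide; the names `ACCEPTANCE_RANGES`, `passes` and `evaluate`, and the reading that a grade passes once it reaches the lower bound of the range, are assumptions made for illustration only.

```python
# Acceptance ranges on a 10-point scale, taken from the slide.
# Dictionary keys and function names are hypothetical.
ACCEPTANCE_RANGES = {
    "marketing": (8, 10),
    "knowledge_base": (5, 8),
}


def passes(grade: float, content_type: str) -> bool:
    """Return True if a single reviewer's grade meets the acceptance range.

    Assumption: the lower bound of the range acts as the PASS threshold,
    so a grade above the upper bound still passes.
    """
    if not 0 <= grade <= 10:
        raise ValueError("grade must be on the 0-10 scale")
    lower, _upper = ACCEPTANCE_RANGES[content_type]
    return grade >= lower


def evaluate(adequacy: float, fluency: float, content_type: str) -> dict:
    """Evaluate the two criteria separately; they are never averaged."""
    return {
        "adequacy": "PASS" if passes(adequacy, content_type) else "FAIL",
        "fluency": "PASS" if passes(fluency, content_type) else "FAIL",
    }


if __name__ == "__main__":
    # An accurate but hard-to-read text fails as marketing copy ...
    print(evaluate(adequacy=9, fluency=6, content_type="marketing"))
    # ... while the same grades are acceptable for a knowledge base.
    print(evaluate(adequacy=9, fluency=6, content_type="knowledge_base"))
```

The same pair of grades yields different verdicts depending only on the content type, which mirrors the idea that the method stays fixed and only the thresholds move.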


THE TECHNICAL FACTOR

Only content that passes on both accounts is further analyzed for technical imperfections
  Terminology inconsistency or deviations
  Style guides, country standards
  Tags, placeholders
  Formatting

Technical issues are OBJECTIVE
  Grades expected to be similar irrespective of the reviewer’s personality
    A typo is still a typo
    An error in country standards is still an error anyway

Issue categories can be based on
  MQM or other public source
  Legacy client-sourced criteria

Error weights and acceptance thresholds depend on multiple factors
  Expectations, target audience, time, brand impact, etc.
  Each “quality vector” contains error weights for each category and acceptance levels
  A limited number of “quality vectors” cover the whole spectrum

The resulting technical (objective) quality grade is the third apex of the quality triangle
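One way to picture a “quality vector” is as a small data structure holding per-category error weights and an acceptance level. The sketch below is an assumption-laden illustration: the class `QualityVector`, the category names, every weight, and the scoring formula (10 minus the weighted penalty per 1,000 words) are hypothetical and not prescribed by the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityVector:
    """Error weights per issue category plus an acceptance level.

    All field names and numbers used with this class are hypothetical.
    """
    weights: dict            # issue category -> penalty per error
    min_acceptable: float    # lowest technical grade that still passes


def technical_grade(error_counts: dict, word_count: int,
                    vector: QualityVector) -> float:
    """Turn weighted error counts into a 0-10 technical grade.

    Assumed formula: 10 minus the weighted penalty per 1,000 words,
    clamped to the 0-10 scale.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    penalty = sum(vector.weights[category] * count
                  for category, count in error_counts.items())
    grade = 10.0 - penalty * 1000.0 / word_count
    return max(0.0, min(10.0, grade))


# Two illustrative presets ("content profiles") with made-up numbers.
MARKETING = QualityVector(
    weights={"terminology": 1.0, "style_guide": 1.0,
             "country_standards": 1.0, "tags": 2.0, "formatting": 0.5},
    min_acceptable=8.0,
)
KNOWLEDGE_BASE = QualityVector(
    weights={"terminology": 0.5, "style_guide": 0.25,
             "country_standards": 0.5, "tags": 1.0, "formatting": 0.25},
    min_acceptable=5.0,
)

if __name__ == "__main__":
    errors = {"terminology": 2, "tags": 1, "formatting": 2}
    for name, vector in (("marketing", MARKETING),
                         ("knowledge base", KNOWLEDGE_BASE)):
        grade = technical_grade(errors, word_count=2000, vector=vector)
        verdict = "PASS" if grade >= vector.min_acceptable else "FAIL"
        print(f"{name}: {grade:.2f} -> {verdict}")
```

Keeping the weights and the acceptance level together in one object makes it straightforward to swap presets per content type without touching the grading logic.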


THE QUALITY TRIANGLE (OR SQUARE)

[Diagram labels: ADEQUACY, FLUENCY, TECHNICAL, MAJOR ERRORS; Acceptance Range Filters]


CASE-STUDY: US ACA SPANISH WEBSITE REVIEW

Organized by GALA (Globalization and Localization Association, www.gala-global.org)
  Logrus developed and provided methodology
  Logrus organized the review and provided analytics

Volunteer effort, crowdsourcing-based approach
  Complicated special rules, strict definitions, lengthy training, etc. out of the question
  Contributors chosen among language professionals only

Simplified “quality square” methodology applied
  Major errors (10 = None, 0 = More than 2)
  Readability (fluency, 0 - 10)
  Adequacy (accuracy, 0 - 10)
  Technical (0 - 10)

18 language pros reviewing the website: www.CuidadoDeSalud.gov

http://www.gala-global.org/

http://www.cuidadodesalud.gov/


CASE-STUDY: US ACA SPANISH WEBSITE REVIEW (II)

Major errors: None (11), More than 2 (7), 1 grade ignored

Takeaways
  Not too objective!
  YOUR reviewer could contribute to ANY of the bars
  Only threshold-based criteria really work

[Figure: two histograms of reviewer ratings, each plotting actual results against a normal distribution. X axis: Rating (0 - 10). Y axis: rating popularity.]

Readability / Intelligibility: mean value 6.2, std. deviation 2.1

Adequacy / Accuracy: mean value 6.6, std. deviation 1.9
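The reported means and standard deviations can be plugged into the ±2σ range mentioned earlier in the deck. The sketch below is plain arithmetic on the two results above; the function name `two_sigma_range` and the clipping to the 0-10 scale are choices made for this illustration.

```python
def two_sigma_range(mean: float, std_dev: float,
                    scale_min: float = 0.0, scale_max: float = 10.0):
    """Return the mean ± 2σ interval, clipped to the rating scale."""
    low = max(scale_min, mean - 2 * std_dev)
    high = min(scale_max, mean + 2 * std_dev)
    return round(low, 1), round(high, 1)


if __name__ == "__main__":
    # Mean and standard deviation as reported for the review.
    results = {
        "Readability / Intelligibility": (6.2, 2.1),
        "Adequacy / Accuracy": (6.6, 1.9),
    }
    for criterion, (mean, std_dev) in results.items():
        low, high = two_sigma_range(mean, std_dev)
        print(f"{criterion}: a single grade may fall in {low} - {high}")
```

With these inputs the lower ends of the ranges are 2.0 and 2.8, and both upper ends hit the top of the scale, so one reviewer could plausibly land almost anywhere on it. That is the practical argument for wide, threshold-based acceptance ranges.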


CASE-STUDY: US ACA SPANISH WEBSITE REVIEW (III)

Biggest opinion spread for technical errors
  Illustrates the gap between professional and crowdsourced work
  No detailed criteria or training applied
  Should be the most objective factor

[Figure: histogram of technical ratings. X axis: Rating (0 - 10). Y axis: rating popularity.]

Technical Issues: mean value 4.8, std. deviation 2.3

Overall review results still quite reliable/convincing
  Not a big surprise given the website’s initial quality…
  “Obamacare’s poorly translated Spanish website frustrates users”, AP, January 12, 2014

[Figure: histogram. X axis: Rating (0 - 10). Y axis: rating popularity.]


WHY SEMI-OBJECTIVE AND OBJECTIVE FACTORS SHOULD NOT BE COMBINED

Scope and nature
  Objective factors are “local”, each applies to a particular small segment (sentence)
  Semi-objective factors typically apply to the text as a whole or its large chunks
  Semi-objective evaluations are imprecise by definition, can’t be used in formulas
    Natural variation might affect the summary score dramatically

Importance/weight
  Adequacy and fluency issues are far more important than most others
  Their relative weight will exceed everything else by orders of magnitude
  Combined summary result too dependent on adequacy/fluency
    Almost no sensitivity to other factors

Cost, Time, Viability
  No reason to waste time on counting/grading technical errors for an incomprehensible or incorrect text
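The sensitivity argument can be illustrated with a toy calculation. Everything in the sketch below is hypothetical: the weights (adequacy and fluency set two orders of magnitude above the technical weight), the grades, and the weighted-average formula. It shows the kind of combined score the slide argues against.

```python
# Hypothetical weights: adequacy and fluency outweigh technical issues
# by two orders of magnitude.
WEIGHTS = {"adequacy": 100.0, "fluency": 100.0, "technical": 1.0}


def combined_score(grades: dict) -> float:
    """Weighted average of 0-10 grades (the approach argued against)."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * grades[name] for name in WEIGHTS) / total_weight


if __name__ == "__main__":
    base = {"adequacy": 7.0, "fluency": 7.0, "technical": 10.0}
    # Same text, but a stricter reviewer grades adequacy one point lower.
    stricter = dict(base, adequacy=6.0)
    # Same reviewer, but the technical grade collapses from 10 to 0.
    sloppy = dict(base, technical=0.0)

    print(f"baseline:           {combined_score(base):.2f}")
    print(f"stricter reviewer:  {combined_score(stricter):.2f}")
    print(f"technical collapse: {combined_score(sloppy):.2f}")
```

Under these assumed weights, a one-point difference in a single reviewer's adequacy grade moves the combined score by 100/201 (about 0.5 points), while the technical grade dropping all the way from 10 to 0 moves it by only 10/201 (about 0.05 points).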


THE “QUALITY TRIANGLE/SQUARE” APPROACH RECIPE

Preparation
  Select/build the appropriate issue classification for objective errors
  Select/set the acceptance thresholds and error weights vector
  Define show-stoppers

Process
  Apply expert opinion-based (semi-objective) criteria with a PASS/FAIL result
    Adequacy (Accuracy)
    Fluency (Readability)
  Apply objective criteria based on error classification/typology (acceptable docs only)
    Language (spelling & grammar)
    References, lack of (over-/under-)translations
    Country and other standards
    Terminology, Style Guide and explicit client’s guidelines
    Tags, placeholders, formatting, etc.
  Ignore Subjective Complaints

Obtain 3 or 4 resulting ratings for each reasonably translated document
  Adequacy (Accuracy)
  Fluency (Readability)
  Objective (Technical) error rating
  [Major problems]
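The recipe above can be sketched as a short gating function. This is an illustration under stated assumptions, not the author's implementation: the function name `review_document`, the threshold dictionary, and the callback `count_technical` are hypothetical, and treating a failed show-stopper check as a reason to skip technical grading is an assumed reading of “acceptable docs only”.

```python
from typing import Callable, Optional


def review_document(adequacy: float,
                    fluency: float,
                    thresholds: dict,
                    count_technical: Callable[[], float],
                    major_errors: Optional[int] = None) -> dict:
    """Apply the recipe: semi-objective gates first, technical grading last.

    adequacy, fluency : expert-opinion grades on a 0-10 scale
    thresholds        : minimum acceptable grade per criterion (hypothetical)
    count_technical   : callback producing the 0-10 technical grade; it is
                        only invoked for documents that pass the gates
    major_errors      : optional show-stopper count (the fourth rating)
    """
    report = {
        "adequacy": "PASS" if adequacy >= thresholds["adequacy"] else "FAIL",
        "fluency": "PASS" if fluency >= thresholds["fluency"] else "FAIL",
        "technical": None,  # stays None when the document is rejected early
    }
    if major_errors is not None:
        report["major_problems"] = "PASS" if major_errors == 0 else "FAIL"

    # Only acceptable documents are worth the cost of error counting.
    acceptable = all(verdict == "PASS"
                     for name, verdict in report.items()
                     if name != "technical")
    if acceptable:
        grade = count_technical()
        report["technical"] = (
            "PASS" if grade >= thresholds["technical"] else "FAIL"
        )
    return report


if __name__ == "__main__":
    thresholds = {"adequacy": 5, "fluency": 5, "technical": 6}  # made-up

    def expensive_error_count() -> float:
        print("  (counting technical errors...)")
        return 7.5

    print(review_document(8, 3, thresholds, expensive_error_count))
    print(review_document(8, 7, thresholds, expensive_error_count,
                          major_errors=0))
```

Passing the technical grading as a callback keeps the expensive step lazy: for a text that fails on fluency or adequacy it is never run, which matches the cost argument made earlier in the deck.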


SUMMARY

QA approach equally applicable to almost all real-life translations (without an existing reference)
  Works for MT post-editing or even raw MT output
  Complements the MQM back-end providing the methodology for quality assurance

The only things that need to be chosen or fine-tuned are
  Issue catalogue (for objective issues/errors)
  The vector comprising all acceptance thresholds and error weights
    Can be chosen from a limited number of preset templates (content profiles)

See concept details in tcworld, February 2012, “Of power adapters and language quality assurance”:
http://www.tcworld.info/tcworld/translation-and-localization/article/of-power-adapters-and-language-quality-assurance/




SEPARATE CASE: REFERENCE TRANSLATIONS AVAILABLE

There are plenty of time/money-saving, automated methods to get a ballpark quality evaluation

Applicability area is narrowed dramatically:
  Comparing different MTs or
  Different versions of the same MT
  Evaluating test translations

Results might be quick and cheap, but
  Not directly related to quality of the translation
  Rather illustrating translation’s closeness to the benchmark one

Can be used for developing/improving MTs or quickly evaluating new translators/students
Very limited usability for real-life translation scenarios
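To make the “closeness to the benchmark” point concrete, here is a small sketch using Python's standard `difflib.SequenceMatcher` as a stand-in for an automated reference-based metric. The deck does not name a specific metric, so this choice, the function name `closeness_to_reference`, and the sample sentences are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def closeness_to_reference(candidate: str, reference: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    return SequenceMatcher(None, candidate.split(), reference.split()).ratio()


if __name__ == "__main__":
    reference = "Press the button to start the device"
    candidates = {
        "identical": "Press the button to start the device",
        "valid paraphrase": "Push the switch to turn the unit on",
        "wrong meaning": "Press the button to stop the device",
    }
    for label, text in candidates.items():
        score = closeness_to_reference(text, reference)
        print(f"{label}: {score:.2f}")
```

In this example the sentence with the opposite meaning scores closer to the reference than the acceptable paraphrase, which is the limitation the slide points out: such scores reflect similarity to the benchmark, not adequacy or fluency as such.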
