MAMA: W3C Validator research

Author: Brian Wilson, Opera Software ASA

© copyright 2008. All rights reserved.

MAMA: W3C Validator research

Index:

1. About markup validation - an introduction
2. Previous validation studies
3. Sources and tools: The URL set and the validator
4. What use is markup validation to an author?
5. How many pages validated?
6. Interesting views of validation rates, Part 1: W3C Member companies
7. Interesting views of validation rates, Part 2: Alexa Global Top 500
8. Validation badge/icons: An interesting diversion?
9. Doctypes
10. Character sets
11. Validator failures
12. Validator warnings
13. Validator errors
14. Summing up...
15. Appendix: Validation methodology

1. About markup validation - an introduction

MAMA is an in-house Opera research project developed to create a repeatable and cross-referenceable analysis of a significant population of web pages that represent real-world markup. Of course, part of that examination must also cover markup validation - an important measure of a page's adherence to a specific standard. The W3C markup validation tool produces useful metrics that add to the rest of MAMA's breakdown of its URL set. We'll look at what validation reveals about these URLs, what it means to validate a document and what benefits or drawbacks are derived from the process.

The readership of this section of MAMA's research is expected to be the casual Web page author out for a relaxing weekend browse as well as those developing the W3C Validator tool itself, looking for incisive statistics about the validation "State Of The Union". As a result of this diverse audience, it is expected that many readers will find that some sections are redundant or mystifying (possibly both at the same time, even!). Feel free to skip around the article as needed, but the best first-time reading flow is definitely a linear read-through. Some of the data presented may need some prerequisite knowledge, but it is hoped that even the most detailed examinations here may be of interest to all readers in some way. There are some positive trends, some surprises, and some disappointments in the figures to follow.

A quick summary:

The good news: Markup validation pass rates are definitely improving over time.
The bad news: The overall validation pass rate is still miserably low and is not increasing as fast as one would hope.

2. Previous validation studies

There are two previous, large-scale studies of markup validation that we can compare MAMA's results to regarding markup validation trends. Direct correlation with these previous studies was not an original goal of MAMA, but it is a happy accident, given that many of MAMA's design choices happen to coincide.

• Dec. 2001: Dagfinn Parnas' "How to cope with incorrect HTML" thesis; University of Bergen, Norway

• Jun. 2006: Renee Saarsoo's "Coding practices of web pages" bachelor thesis [PDF, in Estonian] [English summary]


The analysis tools and target URL group were roughly the same between MAMA and these other projects. Both Parnas and Saarsoo's studies used the WDG validator (see next section), which shares much of the same back-end mechanics with the W3C validator. Both studies also used the DMoz URL set (see next section). The main difference between the URL sets used lies in the amount of DMoz analyzed; where MAMA's research overlaps with Parnas' and Saarsoo's studies, we will attempt to compare results.

Study     Date       URL Set   Full DMoz Size   Study Set Size
Parnas    Dec. 2001  DMoz      ~2.5 million     ~2.4 million
Saarsoo   Jun. 2006  DMoz      ~4.4 million     ~1.0 million
MAMA      Jan. 2008  DMoz      ~4.7 million     ~3.5 million
Fig 2-1: URL Set Sizes of Validation Studies

3. Sources and tools: the URL set and the validator

[For more details about the URLs and tools used in this study, take a look at the Methodology Appendix section of this document.]

Treading on familiar ground: The Open Directory Project (DMoz)

There is a lot of coverage [elsewhere] about the DMoz URL set and the decision to use it as the basis of MAMA's research. MAMA did not analyze ALL of the DMoz URLs, though. Transient network issues, dead URLs and other problems inevitably kept the final total of URLs analyzed to about 3.5 million. The number of URLs from any given domain was limited in order to decrease per-domain bias in the results. This was an important design decision, because DMoz has a big problem with domain bias (~5% of all URLs in it are solely from cnn.com, for example). Parnas and Saarsoo did not do this, but it has proven to be a useful strategy to employ. I set an arbitrary per-domain limit of 30 URLs, and this seems to be a fair limitation. This restriction policy also helps track per-domain trends - if any are noticeable they will be presented where they seem interesting.
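As a rough illustration of that per-domain cap: the 30-URL limit is from the study itself, but the helper below and its sample URLs are invented for illustration, not MAMA's actual code.

```python
# Sketch of MAMA's per-domain URL cap: keep at most 30 URLs from any
# one domain to reduce per-domain bias. Sample URLs are invented.
from collections import Counter
from urllib.parse import urlparse

PER_DOMAIN_LIMIT = 30  # the limit described in the article

def cap_per_domain(urls, limit=PER_DOMAIN_LIMIT):
    """Return urls filtered so no domain contributes more than `limit`."""
    seen = Counter()
    kept = []
    for url in urls:
        domain = urlparse(url).hostname or ""
        if seen[domain] < limit:
            seen[domain] += 1
            kept.append(url)
    return kept

# A domain with 100 URLs is trimmed to 30; small domains are untouched.
sample = [f"http://cnn.com/story/{i}" for i in range(100)] + ["http://example.org/"]
print(len(cap_per_domain(sample)))  # 31
```

A first-come-first-kept policy like this is the simplest choice; random sampling per domain would be another reasonable option.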

Any comparison of MAMA's data to other similar studies, even if they also use DMoz, must take into account that DMoz grows and changes over time as editors add, freshen or delete URLs from its roster. URLs can grow stale or obsolete through removal, and domains can and do die on a distressingly regular basis. The aggregation source of these URLs remains the same, but the set itself is an evolving, dynamic entity.

The W3C validator

To test the URL set, MAMA used the W3C Markup Validator tool (http://validator.w3.org/, v. 0.8.2, released Oct. 2007), which uses the OpenSP parser for its main validation engine. The W3C Markup Validator is a free service from the W3C that helps authors improve the quality of their documents by checking adherence to standards via DTDs. The Parnas and Saarsoo studies both used the WDG validator, but for MAMA's analysis, the W3C validator was the validation tool of choice. As stated on the WDG's web site, there are many similarities between these two validators:

"Most of the previous differences between the two validators have disappeared with recent development of the W3C validator".

So, even though the validators used are different, there is significant overlap between MAMA's validation study data and the other previous studies. The W3C Quality Assurance group has produced many excellent tools and processes over the years and that hard work definitely deserves to be showcased in a study like this. Kudos to the W3C validator team!


4. What use is markup validation to an author?

Why would an author validate a document at all? A validator does not write a web page for you - the inspiration and perspiration must still come completely from the author. There don't appear to be any real negative consequences to omitting this step. Using a validator on a page and correcting any problems it brings to light doesn't guarantee that the result will look right on one browser, let alone some or all of them. Sticking rigorously to a standard does not necessarily spell success... a valid page may render poorly, with overlaps and illegible content, or not at all in one or more browsers. On the other hand, an invalid page may render exactly the way an author was expecting.

Both authors and readers have come to expect that all browsers perform impeccable error recovery in the face of the worst tag soups the Web can throw at them. Forgiveness is perhaps the most under-appreciated yet important feature we expect from a browser. But that is asking a lot, especially for the increasingly lightweight devices that are being used to browse the Web. If there are any consequences for sloppy authoring practices, it would be here.

Henri Sivonen properly framed the role of the markup validator in an author's toolkit:

"[A] validator is just a spell checker for the benefit of markup writers so that they can identify typos and typo-like mistakes instead of having to figure out why a counter-intuitive error handling mechanism kicks in when they test in browsers."

Continuing with the spell checker analogy, there are no dire consequences for a page failing to validate, just as there is seldom a serious consequence of having spelling typos in a document - the overall full meaning is still conveyed well enough to get the point across.

Using the spell checker analogy also helps dispel a practice that the W3C encourages, something that we'll talk more about in a later section - proclaiming that a page has been validated. This is a pointless exercise and means nothing (W3C tool evangelism aside). It is like saying a document has been spell-checked at some time during its history. Any subsequent change to a document can introduce errors - both spelling- and syntax-wise - and make the claim superfluous code baggage. As we will show in later sections, pages that have passed validation in the past often do not STAY validated!

Markup validation is a useful tool to help ensure that a page conforms to a target you are aiming for. The most obvious thing to take away from the entirety of the MAMA research is that people are BAD at this "HTML thing". Improper tag nesting is rampant, and misspelled or misplaced element and attribute names happen all the time. It is very easy to make silly, casual mistakes - we all make them. Validation of Web pages would expose all these types of simple (and avoidable) errors in moments.
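To make the "improper tag nesting" point concrete, here is a minimal sketch of a misnesting detector built on Python's standard html.parser. It is purely illustrative - not MAMA's tooling and nothing like a full validator, which checks far more than end-tag order.

```python
# Minimal misnesting detector (illustrative sketch only).
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input", "area", "base", "col"}

class NestingChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements
        self.errors = []   # misnested end tags (later errors may cascade)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append(f"misnested </{tag}>")

checker = NestingChecker()
checker.feed("<p><b>bold <i>both</b> italic?</i></p>")
print(checker.errors)  # the </b> is misnested; a cascade error follows
```

Like a real validator, even this toy produces cascading errors once the element stack is out of sync - one reason validator error counts can look inflated.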

For even more (and probably better) reasons to validate your documents, have a look at the W3C's excellent document on the subject: "Why Validate?".

5. How many pages validated?

The raw validation numbers

The validator's SOAP response has an <m:validity> element with Boolean content values of "true" and "false". A "true" value is considered a successful validation. MAMA found that 145,009 out of 3,509,170 URLs passed validation.
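Extracting that flag takes only a few lines of XML handling. The envelope below is a hand-made minimal sample, and the "m" namespace URI is an assumption based on the validator's SOAP API, so treat this as a sketch rather than a drop-in client:

```python
# Sketch: pulling the <m:validity> flag out of a W3C validator SOAP
# response. SAMPLE is a hand-made minimal envelope; the namespace URI
# is assumed, not copied from a live response.
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
 <env:Body>
  <m:markupvalidationresponse
      xmlns:m="http://www.w3.org/2005/10/markup-validator">
   <m:validity>true</m:validity>
  </m:markupvalidationresponse>
 </env:Body>
</env:Envelope>"""

def is_valid(soap_xml):
    """Return True when the response reports <m:validity>true</m:validity>."""
    ns = {"m": "http://www.w3.org/2005/10/markup-validator"}
    node = ET.fromstring(soap_xml).find(".//m:validity", ns)
    return node is not None and node.text.strip().lower() == "true"

print(is_valid(SAMPLE))  # True
```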

Study     Date       Passed Validation   Total Validated   Percentage
Parnas    Dec. 2001  14,563              2,034,788         0.71%
Saarsoo   Jun. 2006  25,890              1,002,350         2.58%
MAMA      Jan. 2008  145,009             3,509,170         4.13%
Fig 5-1: Validation pass rate studies


Another interesting view of MAMA's URL validation study is how many domains in MAMA contained ANY page that validated: 130,398 (of 3,011,661 distinct domains validated) [4.33%].

Validation rates where select web page authoring features are also involved

Let's now ask the same basic "does it validate?" question multiple ways, keeping our main variable (validation rate) constant while varying other criteria. This has the potential to say some interesting things about the validation rates as a whole while also providing insight into biases that can arise when mixing popular factors and technologies found in web pages. Note: instead of listing overall URL totals, the totals mentioned are only for the URLs that use each technology.

Authoring Feature Used: Script/Javascript
Criteria used to match:
• Any "javascript:" URL
• Any external script pointed to by SCRIPT element
• Any script embedded in a SCRIPT element
• Any known event handler content (for attributes beginning with "on")
Quantity Validating: 99,299 [90,233]  Total Quantity Using Technology: 2,617,828 [2,306,921]  Percentage: 3.79% [3.91%]

Authoring Feature Used: CSS
Criteria used to match:
• Any STYLE attribute content
• Any content of STYLE element
• Any external stylesheet pointed to by LINK element (rel=stylesheet)
Quantity Validating: 129,893 [117,361]  Total Quantity Using Technology: 2,821,141 [2,487,898]  Percentage: 4.64% [4.72%]

Authoring Feature Used: Adobe Flash
Criteria used to match:
• EMBED: MIME type of the SRC attribute contains "flash"
• PARAM: Element contains the string ".swf" or "flash"
• OBJECT: MIME type of the object contains "flash"
• Script: Any mention of "flash" or ".swf"
Quantity Validating: 44,491 [41,058]  Total Quantity Using Technology: 1,176,227 [1,050,121]  Percentage: 3.78% [3.91%]

Authoring Feature Used: Frames
Criteria used to match:
• Usage of the FRAMESET element
Quantity Validating: 5,905 [5,741]  Total Quantity Using Technology: 378,033 [354,321]  Percentage: 1.56% [1.62%]

Authoring Feature Used: FONT
Criteria used to match:
• Usage of the FONT element (common, CSS-obsoleted formatting markup)
Quantity Validating: 29,723 [27,491]  Total Quantity Using Technology: 2,061,422 [1,762,528]  Percentage: 1.44% [1.56%]

Authoring Feature Used: IIS Web Server
Criteria used to match:
• Detection of "iis" string in HTTP header "Server" field
Quantity Validating: 24,743 [22,227]  Total Quantity Using Technology: 883,854 [769,375]  Percentage: 2.80% [2.89%]

Authoring Feature Used: Apache Web Server
Criteria used to match:
• Detection of "apache" string in HTTP header "Server" field
Quantity Validating: 110,834 [99,866]  Total Quantity Using Technology: 2,347,328 [2,011,088]  Percentage: 5.38% [4.97%]

Fig 5-2: Validation pass rates relating to various features
Quantities are per-URL. Numbers in "[]" brackets indicate per-domain quantities.
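The criteria in Fig 5-2 boil down to pattern checks against each document's markup and its HTTP "Server" header. A simplified sketch follows; the patterns are loose approximations of the listed criteria, not MAMA's actual detection code.

```python
# Rough feature classifier in the spirit of Fig 5-2's criteria.
# Patterns are simplified approximations, not MAMA's real detection.
import re

def detect_features(html, server_header=""):
    found = set()
    if re.search(r'javascript:|<script\b|\son\w+\s*=', html, re.I):
        found.add("script")
    if re.search(r'\sstyle\s*=|<style\b|rel=["\']?stylesheet', html, re.I):
        found.add("css")
    if re.search(r'\.swf|flash', html, re.I):
        found.add("flash")
    if re.search(r'<frameset\b', html, re.I):
        found.add("frames")
    if re.search(r'<font\b', html, re.I):
        found.add("font")
    server = server_header.lower()
    if "iis" in server:
        found.add("iis")
    if "apache" in server:
        found.add("apache")
    return found

page = '<frameset><frame src="a.html"></frameset>'
print(sorted(detect_features(page, "Apache/1.3.27")))  # ['apache', 'frames']
```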


Validation, Content Management Systems (CMS) and Editors

MAMA looked at the META "Generator" value to find popular CMSes and editors in use, looking for any noticeable trends in validation rates in the following tables. One might expect per-domain numbers to be more interesting in this case than per-URL, because sites are often developed using a single platform, but there is very little difference between the two views. In general, CMS systems generate valid pages at markedly higher rates than the overall average, with "Typo" variants leading at almost 13%. On the other hand, the editor situation has some wild differences. Microsoft's FrontPage has a VERY wide deployment rate, but a depressingly low validation pass rate of ~0.5%. Apple's iWeb editor, however, has a freakishly high validation rate. Kudos to iWeb for this happy discovery.

Editor                    Quantity Passing Validation   Total Occurrences   Percentage
Apple iWeb                2,051 [2,016]                 2,504 [2,465]       81.91% [81.78%]
Microsoft FrontPage       1,923 [1,846]                 347,095 [305,220]   0.55% [0.60%]
Adobe GoLive              1,086 [1,057]                 41,865 [39,035]     2.59% [2.71%]
NetObjects Fusion         802 [793]                     26,355 [25,466]     3.04% [3.11%]
IBM WebSphere             626 [585]                     32,218 [24,460]     1.94% [2.39%]
Microsoft MSHTML          518 [502]                     40,030 [38,328]     1.29% [1.31%]
Microsoft Visual Studio   272 [245]                     22,936 [21,051]     1.19% [1.16%]
Adobe Dreamweaver         205 [198]                     5,954 [5,647]       3.44% [3.51%]
Microsoft Word            154 [153]                     24,892 [22,503]     0.62% [0.68%]
Adobe PageMill            100 [92]                      15,148 [12,142]     0.66% [0.76%]
Claris Home Page          48 [41]                       6,259 [4,798]       0.77% [0.85%]
Fig 5-3: Validation pass rates relating to editors
Quantities are per-URL. Numbers in "[]" brackets indicate per-domain quantities.


CMS        Quantity Passing Validation   Total Occurrences Of CMS   Percentage
Typo       2,301 [2,170]                 18,067 [16,930]            12.74% [12.82%]
Joomla     2,248 [2,233]                 34,852 [34,237]            6.45% [6.52%]
WordPress  1,494 [1,472]                 16,594 [16,046]            9.00% [9.17%]
Blogger    30 [30]                       9,907 [9,808]              0.30% [0.31%]
Fig 5-4: Validation pass rates relating to CMS
Quantities are per-URL. Numbers in "[]" brackets indicate per-domain quantities.

6. Interesting views of validation rates, Part 1: W3C Member companies

The W3C is the organization that creates the markup standards, and the markup validator used in this study. One would hope that the individual companies that support and comprise the W3C would spearhead the effort to follow the standards that the W3C creates. Well, it turns out that is indeed the case. The top pages of W3C member companies definitely adhere to markup standards at much higher rates than the rest of the Web. However, these "standard-bearers" (pun intended) could definitely do better at this than they currently do.

In Feb. 2002, Marko Karppinen validated 506 URLs of all the W3C member companies at that time. Only 18 of these pages passed validation. Compared to Parnas' validation study of the DMoz URLs just two months before, the W3C member company validation rate of 3.56% was considerably better than the 0.7% rate for URLs "in the wild", but it is nothing for the paragons of web standards to brag about. Such a low validation pass rate could easily be perturbed by any number of transient conditions or other factors.

Saarsoo also did a study of W3C member company validation rates in Jun. 2006. By that point, the validation situation had improved nicely for the member companies, to 17.00%. Fast-forwarding now to Jan. 2008 [W3C member company list snapshot], we see that the general Web-at-large has caught up to, and even exceeded, the previous validation pass rate of W3C member companies from Karppinen's study era. The general validation pass rate in the DMoz population is now running at ~4.13%, and the W3C member company pass rate is a strong 20.15%, with more member companies than ever claiming the validation crown.

W3C Member List Study   Date       Total In Member List   Total Validated   Passed Validation   Percentage
Marko Karppinen         Feb. 2002  506                    506               18                  3.56%
Saarsoo                 Jun. 2006  401                    352               61                  17.00%
MAMA                    Jan. 2008  429                    412               83                  20.15%
Fig 6-1: W3C Member Company List Validation Studies

Just showcasing the increased validation rate does not tell the whole story. Saarsoo left an excellent data trail to compare the present validation pass rate to. It is interesting to note that although the overall pass rate has increased, many of the sites that passed validation previously no longer do so at the time of writing. Achieving a passing validation status does not seem to be as hard as maintaining that status over time. Compared to Saarsoo's study, there are just as many URLs that previously validated but currently do not as there are URLs that maintained their passing validation status.


Validation comparison                                                          Quantity
URLs that validated before and do now                                          25
URLs that validated before but do not now, and are still in W3C company list   25
URLs that validated before but are no longer in W3C company list               11
Fig 6-2: Validation comparison to Saarsoo W3C Member Company study

Saarsoo commented in 2006 on the dynamic nature of the W3C company roster. It dipped from 506 member companies in early 2002 down to 401 in mid-2006, and at the present time (early 2008) we find the list back up to 429. To put the change in some perspective, the net loss of companies in the list over this time-frame is 77, which is almost as many companies as the number that currently pass validation. Put simply, a pessimist might say that a company on this list is just about as likely to drop out of the W3C as it is to achieve a successful validation.

The W3C Member List successful validation Honor Roll

In his 2002 study, Karppinen prominently listed the W3C member companies whose main URLs passed validation in order to,

"highlight the effort that goes into making an interoperable web site".

This is an excellent idea, and it is becoming a bit of a time-honored tradition that both the Saarsoo study and this one have followed. The first list from Karppinen was easy to keep inline with the rest of the study, because it was (unfortunately) short and sweet. As the pass rate has improved over time, this list becomes progressively longer. That is the goal, though; everyone wants the list to be too long to easily display. [See the Honor Roll list here]

And the crown goes to...

Two companies' URLs have maintained valid sites throughout all three studies from 2002-2008. These companies deserve extra congratulations for this feat.

• Joint Info. Systems Comm. of the UK Higher Ed. Funding Council (JISC)
• Opera Software (the company the author works for)

Many sites are constantly changing, but being a member of an organization that creates standards should be compulsion enough to attain a recognized level of excellence in those standards. Saarsoo ended his 2006 look at the W3C member list with an optimistic wish for the future,

"Maybe at 2008 we have 50% of valid W3C member sites."

Unfortunately, that number is nowhere close to the current reality. It may be too much for the W3C to require its member companies' sites to pass validation, but they should definitely try to push for higher levels than they currently attain, to serve as a good example if nothing else.

7. Interesting views of validation rates, Part 2: Alexa Global Top 500

About the Alexa Global Top 500

Let's look at another "interesting" small URL set, the Alexa service from Amazon. Alexa utilizes web crawling and user-installed browser toolbars to track "important sites". It maintains, among many other useful measures, a global "Top 500" list of URLs considered popular on the Web. The Alexa list was chosen primarily because it was similar in size to the W3C list - so even though MAMA might be comparing apples to oranges, at least it compared a fairly equal number of apples and oranges. The W3C company list skews toward academic and "big money" commercial computer sites.


The Alexa list is representative of what people actually use/experience on the Web on a day-to-day basis.

While few would dispute that Alexa's "Top 500" list is relevant and popular, there are some definite biases in its list:

• It is prejudiced toward big/popular sites with many country-specific variants, such as Google, Yahoo! and eBay. This ends up reducing the breadth of the list. Google is the most extreme example of this, with 63 of the 487 URLs in the analyzed set being various regional Google sites.

• It includes the top pages of domain aggregators with varied user content, such as LiveJournal, Facebook and fc2.com. These top pages are not representative of the wide variety of the user-created content they contain.

• The list consists entirely of top-level, entrance, or "surface" pages of a site. There is no intentional "deep" URL representation.

Validating the Alexa Top 500

On 28 Jan. 2008, the then-latest Alexa Top 500 list was inserted into MAMA [Jan. 2008 snapshot list, latest live version]. About half of these URLs were already in MAMA, having been part of other sources. Of the 500 URLs in this list, 487 were successfully analyzed and validated. Only 32 of these URLs passed validation (6.57%). This is a slightly higher percentage rate than the much larger overall MAMA population, but the quantity and difference are still too small to declare any trends.

Alexa Top 500 List Study   Date       Passed Validation   Total Set Size   Percentage
MAMA                       Jan. 2008  32                  487              6.57%
Fig 7-1: Alexa Top 500 Validation Studies

For future Alexa studies

OK, so the Alexa Top 500 does have some drawbacks... should the URL set be tossed out entirely? Can this set be improved? Aside from the Top 500, Alexa has a very deep catalog and categorization of URLs, some of them available freely, but most available only for a fee. Some categories of URLs include division by country and by language. Alexa currently has publicly available lists of the top 100 URLs each for 21 different languages (2,100 URLs) and 117 countries (11,700 URLs). Note: the per-country list represents popularity among users in a country, not sites hosted in the country. An undoubtedly interesting expanded list of the Alexa Global Top 500 could be created by aggregating all of these sources, which would probably yield 5,000-10,000 URLs (if duplicates were eliminated).

If the validation rates of the Alexa Global Top 500 are studied in the future, the current version of the Top 500 list of URLs will likely be quite different than it is at this time of writing. The topicality of the list - a strength that promotes the relevance of the analysis - also makes cross-comparisons over time difficult. Documenting the list that was used in each analysis will be helpful in doing that.

8. Validation badge/icons: An interesting diversion?

Before MAMA had validated even a single URL, the author discovered this page at the W3C's site: http://www.w3.org/QA/Tools/Icons. This page lists icons that,

"may be used on documents that successfully passed validation for a specific technology,using the W3C validation services"

It seemed like an interesting idea to compare the pages that use these images claiming validation versus how they actually validate. This can only be a crude measure for a number of reasons, but by far the main one is: an author can easily host the validation icon/badge on their own server and name it anything they want.


For those gearheads in the audience that have some "regexp savvy", the following Perl regular expression was used to identify validation icon/badges utilizing the W3C naming scheme. This pattern match was used against the SRC attribute of the IMG elements of URLs analyzed:

Regexp: /valid-((css|html|mathml|svg|xhtml|xml).*?)(-blue)?(\.png|\.gif|-v\.svg|-v\.eps)?$/i || /(wcag1.*?)(\.png|\.gif|-v\.svg|-v\.eps)?$/i

This seems to fully capture all the variations of the W3C's established naming conventions (any corrections are very welcome if it doesn't). Note that the regexp errs on the cautious side and can also capture unintended matches, like JPEG files matching the naming scheme. One might think this an error, but it turns out it is not. JPEG versions of the validation icons are not (currently) listed on the W3C's website, but a random spot-check of JPEG images thus detected by MAMA ARE validation badge icons! In this case, what appear to be false positives are actually valid after all.

ex: http://www.w3.org/Icons/valid-html401-blue.png is stored as 'html401-blue'
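For illustration, the first alternation of that Perl pattern translates to Python as follows. The badge_key helper name is invented, the wcag1 alternation is omitted, and the SRC values are samples:

```python
# Python translation of the article's first badge-matching alternation
# (the wcag1 alternation is omitted). badge_key is an invented helper.
import re

BADGE_RE = re.compile(
    r"valid-((css|html|mathml|svg|xhtml|xml).*?)(-blue)?"
    r"(\.png|\.gif|-v\.svg|-v\.eps)?$",
    re.IGNORECASE,
)

def badge_key(img_src):
    """Return the stored badge key (e.g. 'html401-blue'), or None."""
    m = BADGE_RE.search(img_src)
    if not m:
        return None
    return m.group(1) + (m.group(3) or "")

print(badge_key("http://www.w3.org/Icons/valid-html401-blue.png"))  # html401-blue
print(badge_key("logo.png"))  # None
```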

Validation rates of URLs having validation badge/icons

Let's look now at the list of W3C Validation Image Badges found in MAMA by URL [also by domain]. Even with the various pitfalls that could occur with MAMA's pattern matching, there is still a comparison that is interesting to look at: how many pages that use a badge actually validate? If we consider that the only type of badge of real interest in our sample is an HTML variant (html, xhtml), looking for the substrings "html" and "xhtml" within this field in MAMA gives us:

Type Of Badge Identified   Actually Validated   Total
xhtml                      5,480                11,657
html                       10,995               22,033
Fig 8-1: Validation rates of URLs with Validation icons

This is just under 50% in each case - which is frankly a rather miserable hit ratio. If these URLs do not validate, do they bear ANY resemblance to the badge they are claiming?

Comparison of stated validation badge/icon type versus actual detected Doctype

Next, let's try comparing the actual Doctypes detected to the badges claiming compliance with those respective Doctypes. Doctypes detected in both the validator and MAMA analyses are listed for comparison. The situation definitely improves here over the previous figures. Note: fatal validation errors cause the validator to under-report Doctypes, by reporting no Doctype at all in such cases.

Type Of Badge Identified   Validator Detected Doctype   MAMA Detected Doctype   Total According To Badge/Icon
xhtml                      10,553                       11,054                  11,657
html                       20,570                       21,475                  22,033
Fig 8-2: Reported Validation icon type versus MAMA-detected Doctype

The validation badges certainly increase public awareness of validation as something for authors to strive for, but they do not appear to be the best measure of reality. For the half of badged URLs that claim validation compliance but currently do not validate, one has to wonder whether they ever did validate in the past. Pages definitely tend to change over time, and removing or updating an icon badge may not be high on most authors' list of "Things To Do". The next time you see such an icon, consider its current state with a grain of salt.


For future W3C badge studies

After this survey was completed, the following rather prominent quote was noticed on the W3C's Validation Icons page,

"The image should be used as a link to re-validate the document."

It may be useful to incorporate this fact to further identify validation badges in the future.

9. Doctypes

What are we looking at?

First up is the Doctype. The Doctype statement tells the validator which DTD to use when validating - it is the basic evaluation metric for the document. MAMA used its own methods to divine the Doctype for every document, but the validator actually detects the Doctype in two slightly different ways: one by the validator itself, and the other by the SGML parser at the core of the validator.

Source of Doctype   Information being used
MAMA                Detected Doctype statement
Validator           SOAP <m:doctype> content
Validator           'W09'/'W09x' warning messages
Fig 9-1: Detected Doctype factors used in this study

This is a good time to dissect a Doctype and see what makes it tick. Let's look at a typical Doctype statement, and then examine all of its parts:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

"<!DOCTYPE": The beginning of the Doctype.

"html": The name of the root element for the markup type.

"PUBLIC": Indicates the availability of the DTD resource. It can be a publicly accessible object ("PUBLIC") or a system resource ("SYSTEM"), such as a local file or URL. HTML/XHTML DTDs are specified by "PUBLIC" identifiers.

"-//W3C//DTD XHTML 1.0 Transitional//EN": The Formal Public Identifier (FPI). This compact, quoted string gives a lot of information about the DTD, such as its registration, organization, type, label, and encoding language. For HTML/XHTML DTDs, the most interesting part is the label portion (the "XHTML 1.0 Transitional" part). If the processing entity does not already have local access to this DTD, it can get it from the System Identifier (next component).

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd": The System Identifier (SI) - the URL location of the DTD specified in the FPI.

">": The end of the Doctype.

Fig 9-2: Components of a DTD
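As a rough illustration of these components, a small script can pull a Doctype apart. The regex and field names below are this article's invention for demonstration purposes, not the validator's actual parsing code:

```python
import re

# Loose sketch: extract the root element name, FPI, and System Identifier
# from a Doctype statement. Real SGML parsing is more permissive than this.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>\S+)'            # root element name ("html")
    r'(?:\s+PUBLIC\s+"(?P<fpi>[^"]*)"'      # Formal Public Identifier
    r'(?:\s+"(?P<si>[^"]*)")?)?',           # optional System Identifier
    re.IGNORECASE)

doctype = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
           '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">')
m = DOCTYPE_RE.match(doctype)

# FPI fields are separated by "//": registration, organization,
# "<type> <label>", and language.
fields = m.group('fpi').split('//')
dtd_type, label = fields[2].split(' ', 1)
print(m.group('root'), '|', label, '|', m.group('si'))
```

Running this on the example Doctype above yields the root element `html`, the label `XHTML 1.0 Transitional`, and the SI URL.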


MAMA's analysis stores the entire DOCTYPE statement, but the validator's SOAP response only returns a portion of it - generally the FPI, but some situations may return the SI instead, or even nothing at all if an error condition is detected. These situations are infrequent, though; only 70 URLs analyzed by the validator returned the Doctype's SI, for example.
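For readers curious what this looks like in practice, here is a sketch of extracting the <m:doctype> value from a validator SOAP response. The sample XML is trimmed, and the element names and namespace reflect the validator's SOAP API as I understand it; treat the exact structure as illustrative:

```python
import xml.etree.ElementTree as ET

# Trimmed sample of a validator SOAP 1.2 response (illustrative structure).
SOAP = """<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
 <env:Body>
  <m:markupvalidationresponse
     xmlns:m="http://www.w3.org/2005/10/markup-validator">
   <m:doctype>-//W3C//DTD XHTML 1.0 Transitional//EN</m:doctype>
   <m:charset>utf-8</m:charset>
   <m:validity>false</m:validity>
  </m:markupvalidationresponse>
 </env:Body>
</env:Envelope>"""

M = '{http://www.w3.org/2005/10/markup-validator}'
root = ET.fromstring(SOAP)
fpi = (root.find('.//' + M + 'doctype').text or '').strip()
# An empty <m:doctype> is how "no Doctype detected" shows up in this data.
print(fpi if fpi else '[no Doctype]')
```

Here the extracted value is the FPI; an empty element would signal the "no Doctype" case discussed below.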

No Doctypes!

The validator examined 3,509,170 URLs overall. Of those, the validator says that 1,474,974 (42.03%) "definitely" did not use a DOCTYPE (indicated by an empty content for the "<m:doctype>" element in the SOAP response). In addition to the empty "<m:doctype>" element, the validator also returns explicit warnings for the instances where it does not encounter a Doctype statement: specifically, warning codes 'W09' and 'W09x' are generated by the SGML parser layer of the validator. Is there any correlation between these warning codes and the "official" empty Doctype mentioned in the SOAP response? The quick answer is yes. 1,373,352 URLs have either the 'W09' or 'W09x' warning. Looking closer for a direct correlation, 1,371,899 URLs were issued a 'W09'/'W09x' warning AND do not have a Doctype listed in the SOAP response. This leaves 1,453 URLs that had some sort of validator-detectable Doctype but were issued a "No Doctype" warning anyway. Sampling several URLs from this set showed that in every case the Doctype statement was not at the very beginning of the document. So it appears that the OpenSP parser does not like this, but the validator itself is OK with this scenario.

MAMA also looked at Doctypes in its main analysis. Let's compare cases where both tools found no Doctype. MAMA found 1,720,886 URLs without a Doctype. This is a rather large discrepancy compared to the validator's numbers above. We must alter this figure further, because the SOAP response for a validation failure error returns empty "<m:doctype>" and "<m:charset>" elements. To improve the quality of our comparison, we must exclude all URLs with a positive failure count. After this minor adjustment, the numbers are much more in line with each other. To the numbers:

Situation                                      Qty
MAMA detected no Doctype                       1,465,367
Validator detected no Doctype                  1,474,974
MAMA and Validator both detected no Doctype    1,423,478
MAMA detected no Doctype but Validator did     41,889
Validator detected no Doctype but MAMA did     51,496
Fig 9-3: Scenarios where Doctype is not present

The final two numbers are the most interesting. These discrepancies are still quite large (~3% of the overall 'no Doctype detected' count). What could account for this? Some reasons noticed for the differences (there could be others):

• MAMA did not look for a Doctype in the destination document of a META refresh/redirect. The validator appears to do this.

ex: http://disneyworld.disney.go.com/wdw/parks/parkLanding?id=TLLandingPage

• MAMA does not handle gzipped content, but it was occasionally served to it anyway. The validator appears to handle this.

ex: http://nds.gamezone.com/gamesell/p29690.htm

• MAMA looked anywhere in the document for a Doctype, but the validator only looks near the beginning of the document. A rather large set of URLs unfortunately fits this description.

ex: http://www.ruready.com/

• URL content can change over time, including the addition or deletion of Doctypes. MAMA's analysis occurred in November 2007, and the validation of those same URLs happened in January 2008 - over two months later. In sampling random parts of the URL set where MAMA did not initially detect a Doctype, a current, live analysis by MAMA does indeed detect a Doctype in most cases tried. Other than a bug existing in MAMA (unfortunately, always possible in any software), this is the best explanation to put forth.

Doctype statement present details

What about URLs that had validator-detectable Doctypes? We'll linger on the comparison between MAMA's Doctype detection and the validator's before looking in depth at what those Doctypes were.

Situation                                                             Qty
MAMA detected a Doctype                                               1,788,294
Validator detected a Doctype                                          1,625,509
MAMA and Validator both detected a Doctype, and they were the same    1,583,620
MAMA and Validator both detected a Doctype, and they were different   36,119
Fig 9-4: Scenarios where Doctype is present

Where MAMA and the Validator both found a Doctype, they disagree 2.28% of the time. Other than the aforementioned time delay between the MAMA and Validator analyses, could there be other reasons to account for this difference? Scanning a list of results for MAMA/Validator Doctypes that differed, there may indeed be a trend - a positive one at that. Of the 36,119 URLs that changed Doctype, 23,390 of them (64.76%) changed from an HTML Doctype to an XHTML Doctype. There are a few reasons mentioned above that could be affecting these results, and the above numbers could be a coincidence, but this looks like a data point supporting the gradual shift from HTML to XHTML.

To summarize the per-URL and per-domain frequency tables for validator Doctype: Transitional FPI flavors have a lock on the top three most popular positions. The other variants trail far behind. If a document has a Doctype, it is likely to be a Transitional flavor of XHTML 1.0 or (even more likely) HTML 4.0x. XHTML 1.0 Strict dominates over any other Strict variant (98% of all Strict types).

Totals for common substrings found in the Validator Doctype field

A survey of the FPIs the validator exposed is like a microcosm of the evolution of HTML - there are documents claiming to adhere to "ancient" versions from the early days, all the way through to the language's present XHTML incarnations. Searching for a few well-chosen substrings demonstrates this variety well, and we can see how an author's choice of Doctype FPI affects the chances of actually passing validation. Out of the 1,625,509 URLs exposing a Doctype to the validator, Strict Doctypes pass validation twice as often as the other flavors, and XHTML Doctypes are much more heavily favored to pass validation than other Doctypes. More could be said about the final two items in the table below (to say the least), but that is left for a future discussion.


Doctype Flavor    Qty         Percentage   Passing      Percentage
                              Of Total     Validation   Of Flavor
"Transitional"    1,341,024   82.50%       112,348      8.38%
"Strict"          100,002     6.15%        17,502       17.50%
"Frameset"        57,225      3.52%        4,133        7.22%

Doctype Markup Language       Qty       Percentage   Passing      Percentage Of
                                        Of Total     Validation   Markup Language
" html 4" (HTML 4 variants)   987,701   60.76%       66,535       6.74%
" xhtml 1.0"                  544,622   33.50%       71,537       13.14%
" html 3.2"                   44,642    2.75%        1,753        3.93%
" xhtml 1.1"                  19,984    1.23%        4,074        20.39%
" html 2"                     4,792     0.29%        176          3.67%
" html 3.0"                   884       0.05%        44           4.98%
"WAP"                         789       0.05%        468          59.32%
" xhtml 2"                    11        0.00%        0            0.00%
Fig 9-5: Detection of substrings in the Doctype field

The studies from Parnas and Saarsoo did not use the W3C validator, and as a consequence there was not such an extreme focus on Doctype usage. Generally, the validator they used only tracked whether a Doctype was used at all. The main reported error type in Parnas' study was a missing Doctype, with only 18.8% of URLs having one present. By the time of Saarsoo's study, the number of URLs having a Doctype had moved up to 39.08%. Fast-forward to now, and that number has grown considerably yet again - to 57.7% according to the W3C validator. This is a very respectable increase over time. If few authors are actually creating valid documents, at least most of them seem to understand that there IS a standard they should be adhering to.

Doctypes for our small special interest URL sets

Backtracking just a little, the next two tables are a quick look at the Doctypes used for the W3C member company URLs and the Alexa Top 500 list. Almost 76% of the URLs passing validation are XHTML variants in the W3C company set; in the Alexa list it is almost 66%.


Doctype FPI                              Passed       Total   Percentage
                                         Validation           Of FPI Type
-//W3C//DTD XHTML 1.0 Transitional//EN   36           145     24.83%
-//W3C//DTD XHTML 1.0 Strict//EN         23           45      51.11%
-//W3C//DTD HTML 4.01 Transitional//EN   16           95      16.84%
-//W3C//DTD XHTML 1.1//EN                4            8       50.00%
-//W3C//DTD HTML 4.0 Transitional//EN    3            22      13.64%
-//W3C//DTD HTML 4.01//EN                1            7       14.29%
-//W3C//DTD HTML 3.2//EN                 0            1       0.00%
-//W3C//DTD HTML 4.01 Frameset//EN       0            1       0.00%
-//W3C//DTD HTML 3.2 Final//EN           0            1       0.00%
-//W3C//DTD XHTML 1.0 Strict//FI         0            1       0.00%
-//W3C//DTD XHTML 1.0 Frameset//EN       0            1       0.00%
[None]                                   0            85      0.00%
Fig 9-6: Doctype FPIs of W3C Member Company websites and validation rates

Doctype FPI                              Passed       Total   Percentage
                                         Validation           Of FPI Type
-//W3C//DTD XHTML 1.0 Strict//EN         10           37      27.03%
-//W3C//DTD XHTML 1.0 Transitional//EN   9            130     6.92%
-//W3C//DTD HTML 4.01 Transitional//EN   5            77      6.49%
-//W3C//DTD HTML 4.0 Transitional//EN    3            22      13.64%
-//W3C//DTD HTML 4.01//EN                2            12      16.67%
-//W3C//DTD XHTML 1.1//EN                2            5       40.00%
-//iDNES//DTD HTML 4//EN                 1            1       100.00%
-//W3C//DTD HTML 4.01 Frameset//EN       0            1       0.00%
-//W3C//DTD XHTML 1.1//EN
  http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd
                                         0            1       0.00%
-//W3C//DTD XHTML 1.0 Strict //EN        0            1       0.00%
-//W3C//DTD XHTML 1.0 Transitional//ES   0            1       0.00%
-//W3C//DTD HTML 4.0 Strict//EN          0            1       0.00%
[None]                                   0            193     0.00%
Fig 9-7: Doctype FPIs of Alexa Top 500 websites and validation rates

10. Character sets

In the previous section on Doctypes, there were many ways to look at just a single variable (presence or lack of a Doctype). Now, with character sets, it becomes even more complex. Even a simplistic view of character set determination can involve at least three aspects of a document. MAMA, the validator, and the validator's SGML parser ALL have something to say about the choice of a document's character set. To cover every permutation and difference between the many possible charset specification vectors would definitely exhaust the author and most likely bore the reader. Every effort will be made to present some of this data in a way that is not TOO overwhelming.

There are three main areas of interest when determining the character set to use when validating a document:

• The charset parameter of the "Content-Type" field in a document's HTTP header
• The charset parameter of the CONTENT attribute of a META "Content-Type" declaration
• The encoding attribute of the XML prologue

For brevity, these will be shortened to "HTTP", "META" and "XML" respectively.
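A rough sketch of collecting all three charset sources from a document might look like the following. The regexes are deliberate simplifications (a real analysis would use a proper HTML/XML parser), and the sample document is invented:

```python
import re

def sniff_charsets(http_content_type, html):
    """Collect charset declarations from the three sources this study
    compares: HTTP header, META element, and XML prologue."""
    found = {}
    m = re.search(r'charset=([\w-]+)', http_content_type or '', re.I)
    if m:
        found['HTTP'] = m.group(1).lower()
    m = re.search(
        r'<meta[^>]+content=["\'][^"\']*charset=([\w-]+)', html, re.I)
    if m:
        found['META'] = m.group(1).lower()
    m = re.search(r'<\?xml[^>]*encoding=["\']([\w-]+)', html, re.I)
    if m:
        found['XML'] = m.group(1).lower()
    return found

doc = ('<?xml version="1.0" encoding="utf-8"?>\n'
       '<html><head><meta http-equiv="Content-Type" '
       'content="text/html; charset=iso-8859-1"></head></html>')
print(sniff_charsets('text/html; charset=utf-8', doc))
# The HTTP and XML values agree here, but META disagrees - the kind of
# mismatch tallied later in Fig 10-3.
```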

Character set Differences between MAMA and the Validator

An important difference exists between MAMA and the validator when talking about character sets. There is an HTTP header that allows a request to specify which character sets it prefers. MAMA sent this "Accept-Charset" header with a value of "windows-1252, utf-8, utf-16, iso-8859-1;q=0.6, *;q=0.1". This header field value is used by Opera (9.10), and MAMA tried to emulate this browser as closely as possible. The character sets specified reflect the author's own particular language bias. The validator is another story: it does not send an "Accept-Charset" header field at all. This may cause differences between the two and affect the reported character set results.
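Replaying MAMA's request headers is straightforward. In this sketch, example.com stands in for a crawled URL; the Accept-Charset value is quoted from the study, and note that urllib stores header names capitalized (hence 'Accept-charset'):

```python
import urllib.request

# Sketch of a MAMA-style request; no network traffic happens until
# urllib.request.urlopen(req) is actually called.
ACCEPT_CHARSET = 'windows-1252, utf-8, utf-16, iso-8859-1;q=0.6, *;q=0.1'
req = urllib.request.Request(
    'http://example.com/',
    headers={'Accept-Charset': ACCEPT_CHARSET,
             'User-Agent': 'Opera/9.10'})  # MAMA emulated Opera 9.10
print(req.get_header('Accept-charset'))
```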

MAMA's view of character sets

First up is a look at what MAMA was able to determine about these three fields, and how they are used in combination with each other. The totals here account for all cases where a non-empty value was present for any of the HTTP/META/XML charset specification types. The following tables show the frequencies for the different ways that character sets are established and mixed. A document can have none, any, or all of these factors. Note: the XML level in Fig 10-1 appears to be very low in comparison to the other specification methods, but this is because the number of documents with an XML declaration is also rather low. Looked at in this way, that ratio is actually the highest, being even more favorable than the META case, at 96,264 of 104,722 URLs (91.92%). Fig 10-2 offers a breakdown of all the combinations of ways to specify a character set. By a large majority, authors do this using only the META element method. The final table, Fig 10-3, shows what happens when more than one source for a character set existed in a document, and whether these multiple values agreed with one another.

Charset   Number of     Total Where Any     Percentage Where Any
Source    Occurrences   Charset Specified   Charset Specified
HTTP      686,749       2,626,206           26.15%
META      2,361,221     2,626,206           89.91%
XML       96,264        2,626,206           3.67%
Fig 10-1: MAMA - How character sets are specified


Charset             Quantity    Total Where Any     Percentage Where Any
Specified In                    Charset Specified   Charset Specified
HTTP only           240,349     2,626,206           9.15%
META only           1,872,497   2,626,206           71.30%
XML only            17,858      2,626,206           0.68%
HTTP and META       417,109     2,626,206           15.88%
HTTP and XML        6,791       2,626,206           0.26%
META and XML        49,115      2,626,206           1.87%
All three sources   22,500      2,626,206           0.86%
Fig 10-2: MAMA - How character sets are specified in combination

Specified Charset   Disagree   Total     Percentage
Sources
HTTP and META       123,245    417,109   29.55%
HTTP and XML        2,238      6,791     32.96%
META and XML        4,086      49,115    8.32%
All three sources   4,399      22,500    19.55%
Fig 10-3: MAMA - How character sets disagree when specified in combination

The Validator's view of character sets

Now, a look at the way the markup validator views charset information. The validator generally looks for the same three document sources mentioned previously to determine charset information. Before looking at the actual charset values, it is useful to examine whether the validator's view of charset information is internally consistent or not. It can also be instructive to compare, where possible, the validator's view of charset information versus MAMA's view.

To directly compare validator and MAMA charset information, we must remove some URLs from consideration. The validator's SOAP response returns an empty charset value in all cases where there is a validator failure. It is useful to know whether the validator is returning a "truly" empty charset value, so all URLs with a failure error are removed from the examination set for this phase. This immediately reduces our URL group by 408,687 URLs.

The items of interest in the validator response are the contents of the <m:charset> element and the warnings issued for no detected charset, or for a charset value mismatch between differing sources. Let's take a look at how all these factors mesh (or not) when the validator is determining which charset to use.

Validator detected charsets versus MAMA detected charsets

The following table is mostly for sanity checking, to see whether the validator's results resemble MAMA's results. The first two entries have very low totals, but may involve some corner charset-detection cases worth taking a second look at. The third case is a definite indication that the validator has default fallback values for the character set when none is detected through the typical methods.


Validator Charset   Scenario                                                          Total
Detected
No                  No MAMA charsets detected                                         47
No                  MAMA charset detected                                             1,179
Yes                 No MAMA charsets detected                                         592,361
Yes                 Validator also issued "Warning! Conflicting charsets..." message  118,367
Yes                 Validator also issued "Warning! No charset found..." message      480,942
Fig 10-4: Validator versus MAMA charset detection

Validator Warning 04 issued: No character encoding found

This table might be a little confusing with some of the double negatives being tossed around. The presence of a Warning 04 means that the SGML parser portion of the validator did not detect a character set. This result may differ from what the validator ends up deciding should be used for the charset. [RE-CHECK THESE NUMBERS: rows 1+2 and rows 3+4 both sum to 2,619,541 (URLs without Warning 04), and rows 5+6 and rows 7+8 both sum to 480,942 (URLs with Warning 04). The zero in row 5 suggests that the validator falls back to a default charset value when its parser detects none.]

Warning 04   Charset State                   Total
No           No validator charset detected   1,226
No           Validator charset detected      2,618,315
No           No MAMA charset detected        137,286
No           MAMA charset detected           2,482,255
Yes          No validator charset detected   0
Yes          Validator charset detected      480,942
Yes          No MAMA charset detected        455,122
Yes          MAMA charset detected           25,820
Fig 10-5: Validator Warning 04 scenarios

Validator Warnings 18-20 issued: Character encoding mismatches

In these cases, the validator discovers more than one encoding source, and there is some disagreement between them. The validator does not say what the disagreement was, so for some idea, we can look at the data MAMA discovered about these sources. Note that the final row in each table is the expected scenario for the warning to be generated; naturally, those totals are the highest by a wide margin. URLs from the other rows may merit further testing, but there is one reason mentioned before that can explain at least some of these quantities: the two-month delta between MAMA's analysis and the validator's analysis of the URL set.


MAMA Detected   MAMA Detected   Additional Factor   Total
HTTP            XML
Yes             No              --                  483
No              Yes             --                  70
Yes             Yes             Both agree          80
Yes             Yes             Both different      2,517
Fig 10-6: Warning 18: Character encoding mismatch (HTTP header encoding/XML encoding)

MAMA Detected   MAMA Detected   Additional Factor   Total
HTTP            META
Yes             No              --                  6,712
No              Yes             --                  4,485
Yes             Yes             Both agree          4,153
Yes             Yes             Both different      97,028
Fig 10-7: Warning 19: Character encoding mismatch (HTTP header encoding/META encoding)

MAMA Detected   MAMA Detected   Additional Factor   Total
XML             META
Yes             No              --                  79
No              Yes             --                  50
Yes             Yes             Both agree          88
Yes             Yes             Both different      992
Fig 10-8: Warning 20: Character encoding mismatch (XML encoding/META element encoding)

Validator-detected charset values

We've saved the best of our character set discussion for last - what values are actually used by the validator for the character set? (We'll be looking at similar frequency tables for each of the MAMA-detected charset sources (HTTP header, META, XML) in another section of this study.) The full per-URL and per-domain frequency tables for validator charset show very little movement between the two - you have to go down to #17 before there is a difference! Below is an abbreviated per-URL frequency table for validator character set values (out of 243 unique values found for this field).


Popularity   Validator Charset Value   Frequency
1            iso-8859-1                1,510,827
2            utf-8                     943,326
3            windows-1252              293,595
4            shift_jis                 87,593
5            iso-8859-2                60,663
6            windows-1251              51,336
7            windows-1250              30,353
8            gb2312                    19,412
9            iso-8859-15               12,276
10           big5                      11,395
11           windows-1254              9,756
12           iso-8859-9                9,091
13           us-ascii                  8,134
14           euc-jp                    7,174
15           x-sjis                    5,564
Fig 10-9: Validator character set short frequency table

11. Validator failures

When the validator runs into a condition that does not allow it to validate a document, a failure notice is issued. The validator defines nine different conditions as fatal errors, but MAMA only encountered four of them amongst all the URLs it processed through the validator. It is certainly possible that MAMA's selection mechanism helped prevent some of these errors from occurring. 408,920 URLs out of the 3,509,170 URLs validated (11.65%) officially failed validation for various reasons.

Failure Type              Detected   Explanation
                          In MAMA
Transcode Error           No         Occurs when attempting to transcode the character
                                     encoding of the document
Byte Error                Yes        Bytes found that are not valid in the specified
                                     character encoding
URI Error                 No         The URL scheme/protocol is not supported by the validator
No Content Error          No         No content found to validate
IP Error                  No         IP address is not public
HTTP Error                Yes        Received unexpected HTTP response
MIME Error                Yes        Unsupported MIME type
Parse External ID Error   Yes        Reference made to a system-specific file instead of
                                     using a well-known public identifier
Referer Error             No         Referer check requested but 'Referer' HTTP header not sent
Fig 11-1: Validator failure modes


Frequencies of failure types in MAMA

By far, the "Fatal Byte Error" occurs the most of any failure error - 300,008 times (8.55% of all URLs validated). This error type occurs when characters in the document are not valid in the detected character encoding. This is an indication to the validator that it cannot trust the information it has about the document, so it chooses to quit rather than attempt to validate incorrectly.

An additional failure mode relating to MAMA's processing of the validator's activities should be mentioned. If MAMA did not receive a response back from the validator, or some other (possibly temporary) factor caused an interruption between MAMA and the validator, an "err" message code was generated. MAMA encountered this type of error 34,950 times out of the 3,509,170 URLs (1.00%) that were passed to the validator. Note that MAMA has not yet tried to re-validate any of these URLs. There are various pluses and minuses to dismissing the "err" state, or any other validator failure mode, from the overall grand total of URLs validated. These failed URLs remain in the final count, but if you disagree, there is enough numerical data to arrive at your own tweaked numbers and percentages.
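MAMA did not re-validate these URLs, but a simple retry wrapper is one way a future run could shrink the "err" count. This is a hypothetical sketch; `validate` stands in for the real validation call:

```python
import time

# Hypothetical retry wrapper for transient "err" cases (no response
# received from the validator).
def validate_with_retry(validate, url, attempts=3, delay=2.0):
    for attempt in range(attempts):
        result = validate(url)
        if result != 'err':                   # got a real validator response
            return result
        time.sleep(delay * (attempt + 1))     # back off before retrying
    return 'err'                              # still failing; record as "err"

# Simulated validator: fails once, then responds normally.
calls = []
def fake_validate(url):
    calls.append(url)
    return 'err' if len(calls) < 2 else 'passed'

result = validate_with_retry(fake_validate, 'http://example.com/', delay=0)
print(result)  # passed
```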

Failure Type              Number of Occurrences
Fatal Byte Error          300,008
Fatal HTTP Error          63,908
err                       34,950
Fatal Parse Extid Error   8,360
Fatal MIME Error          1,709
Fig 11-2: Validator failures in MAMA's URLs

Number of failures

A field was created in the MAMA database to store the number of failures encountered in a document. The expectation was that the validator could only experience one failure mode at a time, so this field would hold either a '0' or a '1'. Imagine the surprise when 248 URL cases registered as having two failure types at the same time! It turns out that in every one of these cases, it was the "Fatal Byte Error" and the "Fatal MIME Error" occurring together. [Note: 98 of the 248 URLs returning these double-failure modes are definitely text files (ending in ".txt") and should be removed from consideration]

Number of   Number of
Failures    Occurrences
0           3,100,484
1           408,439
2           248
Fig 11-3: Number of failures per URL

Example of "Fatal Byte Error" and "Fatal MIME Error" occurring at the same time: http://asdconferences.com/

12. Validator warnings

The validator issues a Warning if it detects missing or conflicting information important for the validation process. In such cases, the validator must make a "best guess"; if the validator has chosen wrong, it can negate the entire validation results. The validator suggests that all Warning issues be addressed so that it can produce results with the highest confidence.

The validator can produce 27 different types of Warnings, but MAMA only encountered 14 of them in its journeys through DMoz and friends. A specific Warning type will only be issued once for a URL, but multiple Warning types can be issued for the same URL.

Frequencies of Warning types

The most common Warning type in MAMA's URL set was W06/"Unable to determine parse mode", with W09/"No DOCTYPE found" coming a close second. Each of these two dwarfs all other Warning types combined by a factor of two. For full explanations of the Warning codes, see the validator CVS.

Warning Code   Explanation                                               Frequency
W06            Unable to determine parse mode (XML/SGML)                 1,585,029
W09            No DOCTYPE found                                          1,372,864
W04            No character encoding found                               480,942
W19            Character encoding mismatch (HTTP header/META element)    113,927
W11            Namespace found in non-XML document                       65,807
W23            Conflict between MIME type and document type              19,097
W21            Byte-order mark found in UTF-8 file                       17,148
W22            Character encoding suggestion: use XXX instead of YYY     8,237
W24            Rare or unregistered character encoding detected          7,149
W18            Character encoding mismatch (HTTP header/XML encoding)    3,220
W20            Character encoding mismatch (XML encoding/META element)   1,220
W09x           No DOCTYPE found; checking XML syntax only                488
W07            Contradictory parse modes detected (XML/SGML)             72
W01            Missing 'charset' attribute (HTTP header for XML)         21
Fig 12-1: Validator Warning type frequency table

Warnings in combination

MAMA never encountered more than five different Warning types at a time for any given URL. The most common scenario found was for a URL to have two types of Warnings at a time. There is a definite correlation between the two most frequent Warning types and the big "bump" in the Warning count list below. Of the 1,025,319 cases where exactly two different Warning types were encountered, 951,957 (92.84%) were the W06 and W09 types together.


Number Of   Frequency
Warnings
0           1,702,424
1           363,103
2           1,025,319
3           411,850
4           6,439
5           35
Fig 12-2: Number of Warnings per URL

Example of 5 Warning types in combination: http://www.hazenasvinov.cz

...And, er...those other types of warnings too

The truth is, the validator seems to define a warning somewhat loosely - hence the capitalized use of "Warning" in the previous section to keep the validator's two interpretations distinct. Firstly, it defines a "Warning" according to the warning codes and meanings in the section above, where MAMA encountered no more than five Warning types at a time. The validator additionally has a warnings section in its SOAP output, and a warning summary count. When the validator uses this latter interpretation of warning, it seems to have a more liberal meaning: it lumps other error types in with the strict Warnings measure as classified before. By this accounting, a number of URLs in DMoz have more than 10,000 of these warnings each.

The URL that contained the most "warnings" of this expanded type is a blog at http://club-aguada.blogspot.com/. In MAMA's initial analysis, it reported 19,602 warnings! When collecting together this research soon after, this URL was re-checked on 16 February 2008 through the validator, and it still had 14,838 warnings - plus an additional 14,949 errors. This URL only has about 10-20 paragraphs of text content and an additional 1,400 or so non-visible search-engine-spam hyperlinks. Such a big change in results seems somewhat suspect in a short amount of time, but content in blogs tends to change rather rapidly, which could account for the difference.

What IS of concern is how a page that is less than 250KB in size generates over 26MB of output from the validator's SOAP mode. The SOAP version is much more terse than the HTML output, so the validation results could have been even bigger. A validation result like this is just far too excessive. Perhaps the validator should offer a way (at least as an option) to truncate the warnings and/or errors after a certain amount to control this problem.

13. Validator errors

Any problem or issue that the validator can recognize that is not a failure or a warning is just a common "error". Errors have the most variety - 446 are currently defined in the error_messages.cfg file in the validator's code. The validator only encountered 134 of them through MAMA's URL set. The validation studies done by Parnas and Saarsoo kept track of far fewer error types - perhaps to decrease the studies' complexity. MAMA kept track of them all, in the hope that it might be useful to those developing or using the validator. First we will take a look at the various error types and error frequencies. To wrap things up, we will showcase URLs demonstrating some of the extreme error scenarios discovered (the URLs exhibited the error behavior at the time of writing, but can change over time).


Error type frequency

For each error type found in a URL, MAMA stored only the error code and the number of times that error type occurred. Shown below is a short "Top 10" list of the most frequent error types. The frequency ratios for the top errors generally agree with Saarsoo's research, with a few minor differences. The error that happens most often in the analyzed URL set is #108 (2,253,893 times), followed closely by #127 (2,013,162 times). Coming in third is an interesting document structural error, #344: "No document type declaration; implying X". This error appears to be a duplicate of Warning W09/W09x, "No DOCTYPE found" (see previous section) - notice that the occurrence numbers for the two types are almost identical. [HOW CAN THESE BE DIFFERENT?]

Popularity | Error Code | Error Description                                                        | Frequency
1          | 108        | There is no attribute X                                                  | 2,253,893
2          | 127        | Required attribute X not specified                                       | 2,013,162
3          | 344        | No document type declaration; implying X                                 | 1,371,836
4          | 79         | End tag for element X which is not open                                  | 1,232,169
5          | 64         | Document type does not allow element X here                              | 1,229,145
6          | 76         | Element X undefined                                                      | 1,114,796
7          | 325        | Reference to entity X for which no system identifier could be generated  | 859,846
8          | 25         | General entity X not defined and no default entity                       | 859,636
9          | 338        | Cannot generate system identifier for general entity X                   | 859,636
10         | 247        | NET-enabling start-tag requires SHORTTAG YES                             | 798,046

Fig 13-1: Validator error type frequency table
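Because MAMA stored only (error code, occurrence count) pairs per URL, building a global frequency table like Fig 13-1 is a simple summation over all URL records. A hypothetical Python sketch of that aggregation (MAMA's actual tooling was written in Perl, so this only illustrates the idea):

```python
from collections import Counter

# Hypothetical per-URL records, mirroring the compact
# (error code -> occurrence count) pairs MAMA stored per URL.
per_url_errors = [
    {"108": 3, "127": 1},           # hypothetical URL 1
    {"108": 2, "344": 1},           # hypothetical URL 2
    {"127": 4, "108": 1, "79": 2},  # hypothetical URL 3
]

def error_frequency(records):
    """Sum per-URL error counts into a global frequency table,
    sorted from most to least frequent error code."""
    totals = Counter()
    for rec in records:
        totals.update(rec)
    return totals.most_common()

print(error_frequency(per_url_errors))
# -> [('108', 6), ('127', 5), ('79', 2), ('344', 1)]
```

With 25+ million stored rows, the real aggregation would run inside the database rather than in memory, but the computation is the same.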

The full validator error type frequency table for MAMA's study is in a separate document. For brevity, only the error codes are listed there. The complete list of validator error codes and their explanations can be found on the W3C's site. Note: a few error message codes are not described in the aforementioned W3C document, and need a little extra exposition:

• "xmlw": XML well-formedness error
• "no-x": No-xmlns error (no XML namespace declared)
• "wron": Wrong-xmlns error (incorrect XML namespace declared)

Quantity of error types

There were 3,000,493 URLs where at least one validation error occurred. But among these URLs, there was a great variety in the types of errors encountered. The vast majority of URLs encountering errors found 10 or fewer error types. This distribution differs somewhat from those graphed by Parnas and Saarsoo, and can serve as a baseline for future studies.


Total Number of Error Types | Number of Occurrences | Total Number of Error Types | Number of Occurrences
 1 | 194,518 | 20 | 7,857
 2 | 249,997 | 21 | 5,619
 3 | 301,900 | 22 | 3,317
 4 | 315,367 | 23 | 2,132
 5 | 336,832 | 24 | 1,309
 6 | 312,103 | 25 | 840
 7 | 252,934 | 26 | 422
 8 | 208,563 | 27 | 227
 9 | 172,004 | 28 | 114
10 | 145,127 | 29 | 63
11 | 117,612 | 30 | 24
12 | 96,969  | 31 | 11
13 | 76,967  | 32 | 13
14 | 61,692  | 33 | 6
15 | 47,681  | 34 | 6
16 | 34,180  | 35 | 5
17 | 24,991  | 38 | 2
18 | 17,363  | 39 | 1
19 | 11,725  |    |

Fig 13-2: Validator error type variety per URL
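A per-URL tally like Fig 13-2 is just a histogram over the number of distinct error types each URL produced. A small illustrative sketch with hypothetical data (not MAMA's database or actual Perl code):

```python
from collections import Counter

# Hypothetical per-URL records: each maps error code -> occurrence count.
urls = [
    {"108": 5},                   # 1 distinct error type
    {"108": 2, "127": 1},         # 2 distinct error types
    {"64": 1, "76": 3, "79": 2},  # 3 distinct error types
    {"108": 1, "344": 1},         # 2 distinct error types
]

# Count how many URLs exhibit each number of distinct error types.
variety_histogram = Counter(len(rec) for rec in urls)

print(sorted(variety_histogram.items()))
# -> [(1, 1), (2, 2), (3, 1)]
```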

Error Extremes

DMoz has many URLs, and some are bound to have unbelievable numbers of errors. Believe it, though. The following three tables showcase the most extreme offenders in generating validator error messages.

The URLs in these lists are fairly diverse. Some of the documents are long, yet some are also fairly brief (considering the error quantity). Some use CSS or scripting, while others don't. IIS and Apache are both well represented. The only noticeable tendency is found in the last table (Fig 13-5), for the widest variety of error types: five of the eight worst offenders in this category use Microsoft IIS 6.0/ASP.NET servers (note the same URL pattern in four of them). There is no noticeable correlation other than this. One plausible explanation for the inflated error numbers could be that these IIS servers browser-sniff the User-Agent header string and deliver lower-quality content based on the validator's UA value, "W3C_Validator/1.575".
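This browser-sniffing hypothesis is checkable: fetch the same URL once with the validator's User-Agent string and once with a browser-like one, then compare what comes back. A rough sketch follows; the 10% size-difference threshold is an arbitrary heuristic for illustration, not part of MAMA:

```python
from urllib.request import Request, urlopen

VALIDATOR_UA = "W3C_Validator/1.575"  # the UA string the validator sent
BROWSER_UA = "Mozilla/5.0"            # a generic browser-like UA string

def fetch(url, user_agent):
    """Fetch a URL with a specific User-Agent header and return its body."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=30) as resp:
        return resp.read()

def size_differs(len_a, len_b, threshold=0.10):
    """Heuristic: do two body sizes differ by more than `threshold`?"""
    return abs(len_a - len_b) > threshold * max(len_a, len_b, 1)

def possibly_sniffed(url):
    """Flag a URL whose served content differs markedly by User-Agent."""
    body_validator = fetch(url, VALIDATOR_UA)
    body_browser = fetch(url, BROWSER_UA)
    return size_differs(len(body_validator), len(body_browser))
```

A size comparison is only a coarse signal; servers may also vary content for legitimate reasons (compression, ads, localization), so flagged URLs would still need manual inspection.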


URL                                                                          | Error Type | Error Qty
http://www.bloomington.in.us/~pen/mwcraft.html                               | 139        | 39,401
http://www.music-house.co.uk/                                                | 76         | 28,961
http://www.zughaid.com/TMP.htm                                               | 325        | 22,193
http://www.filosofico.net/virgilioeneide.htm                                 | 65         | 15,409
http://www.gencat.cat/diue/llengua/eines_per_a_lempresa/lexics/alimenta.htm  | 64         | 14,316
http://www.cwc.lsu.edu/cwc/projects/dbases/chase.htm                         | 82         | 12,211
http://www.dienanh.net/forums/                                               | xmlw       | 12,103

Fig 13-3: URLs with the most errors of a specific type

URL                                             | Total Errors
http://www.bloomington.in.us/~pen/mwcraft.html  | 39,409
http://club-aguada.blogspot.com/                | 37,370
http://www.first-jp.com/                        | 34,530
http://www.prezesm.kylos.pl/                    | 33,083
http://defensor-sporting.blogspot.com/          | 31,617
http://www.mlnh.zmva.ru                         | 29,184
http://www.music-house.co.uk/                   | 28,963

Fig 13-4: URLs with the most total errors in combination

URL                                                                                         | Error Type Qty
http://alumni.wsu.edu/site/c.llKYL9MQIsG/b.1860301/k.BCA0/Home.htm                          | 39
http://www.vincipro.com/cart/home.php                                                       | 38
http://www.c-sharpcorner.com/UploadFile/prasad_1/RegExpPSD12062005021717AM/RegExpPSD.aspx   | 38
http://www.sleepfoundation.org/site/c.huIXKjM0IxF/b.2417141/k.C60C/Welcome.htm              | 35
http://www.buckeyeranch.org/site/c.glKSLeMXIsG/b.1043121/k.BCC0/Home.htm                    | 35
http://www.ucmerced.edu/                                                                    | 35
http://www.girlscouts.ak.org/site/c.hsJSK0PDJpH/b.1806483/k.BE48/Home.htm                   | 35
http://kaltenkirchen.dlrg.de                                                                | 35

Fig 13-5: URLs with the widest variety of error types

14. Summing up...

Parnas' study presented an interesting statistic:

"In October 2001, the W3C validator validated approximately 80,000 documents per day"

Olivier Théreaux, who currently works on development of the W3C validator, provided an updated usage statistic in February 2008 of ~700,000-800,000 URLs per day - a ten-fold increase. Awareness of the process of validating documents definitely seems to be increasing over time, as this sharp rise in validator usage indicates. The perceived importance of having documents actually pass validation, though, still needs to improve. Yes, the pass rate in the general Web population has increased at a respectable rate - from 0.71% to 4.13% in "just" six years - and it has increased similarly for the W3C member companies in that time. But the W3C members appear to regress in their validation pass state about as often as they attain it. How can the Web-at-large strive to do better when these key companies don't seem to be trying harder? As we have seen with the (non-)success of the validation icon badge, it is one thing to say you support the standards, with validation as a means to that end...but saying so doesn't necessarily reflect reality.

If we relax our concentration on simply passing validation, we notice that support for other parts of this process is improving nicely over time. At least one aspect of the validation process has made great strides and definitely contributes to a perceived importance for document correctness - Doctype usage. Doctypes help concentrate author focus toward thinking about what standards their documents are trying to adhere to. This can only help the validation cause over time. The Web may be crossing an important threshold in this regard: the number of URLs in this study carrying a Doctype of some kind has just barely crossed the 50% boundary. In the U.S. political system this is called a "clear mandate" - so an avalanche of authors validating their documents must not be far behind...right? Joking aside, there is a clear and obvious connection between claiming to adhere to a standard and then actually doing so. Increased outreach by the standards community, helping developers draw the line between those two points, can only help matters here.

Appendix: Validation methodology

Markup validation was the last main phase of the research completed. MAMA only attempted to validate the URLs that were successfully analyzed in the other big analysis phase, so as to maximize the possibility for data cross-referencing.

The URL set

MAMA employed several strategies to refine and improve the analysis set of URLs. The full size of the DMoz URL set was ~4.5 million as of Nov. 2007, which was distilled down to ~3.5 million URLs. Saarsoo's study chose to follow, as closely as possible, the URL selection strategy that Parnas used in his study, to ensure maximum compatibility between the two. MAMA's URL selection methods do not directly match these other studies. Even with the set size reduction, this appears to be the largest URL sample of validation trends to date.

• URL sets analyzed:
  • DMoz (May 2007 initial snapshot, diff added Nov. 2007)
  • W3C member company home pages (429 listed URLs; 26 January 2008)
  • Alexa Global Top 500 (500 URLs; 28 January 2008)

• Basic filtering: domain limiting of the randomized URL set to no more than 30 URLs analyzed per domain
• Other filtering: excluded non-HTTP/HTTPS protocols
• Skipped analysis of URLs that hit any failure conditions
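The domain-limiting step above can be implemented as a single pass over the already-randomized list. A hypothetical sketch (not MAMA's actual Perl code):

```python
from collections import defaultdict
from urllib.parse import urlsplit

def limit_per_domain(urls, cap=30):
    """Keep at most `cap` URLs per host from an already-randomized list."""
    seen = defaultdict(int)
    kept = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if seen[host] < cap:
            seen[host] += 1
            kept.append(url)
    return kept

# 40 URLs on one host plus one on another: the first host is capped at 30.
urls = ["http://a.example/%d" % i for i in range(40)] + ["http://b.example/"]
print(len(limit_per_domain(urls)))  # -> 31
```

Randomizing the list before capping matters: it means the 30 retained URLs per domain are an arbitrary sample rather than whatever happened to be listed first.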

Various parts of the examined URL sets have definite bias. Alexa's top URL lists, for example, are the result of usage stats from voluntary installation of a Windows-only MSIE toolbar. The DMoz set has definite top-page-itis...it is skewed heavily toward the root/home pages of domains - as much as 80%!

The W3C Validator

MAMA was only able to employ two local copies of the W3C validator on separate machines. One of these machines was very "old" and weak by today's hardware standards, while the other was more of a "typical" modern system. The weak machine was simply not up to the task and could only handle about 1/10th of the load that the more powerful machine easily handled. MAMA would feed a URL to the validator, parse the output result, send it to the MAMA database for storage, and then move on to the next URL in the list to be analyzed. Rinse and repeat until complete. The big bottleneck was the validator itself; if MAMA had had more validators available to use, the processing time would have been drastically cut from weeks to days.

• Validator machine 1: CPU: Intel 2.4GHz dual core P4; RAM: 1GB


• Validator machine 2: CPU: AMD 800MHz; RAM: 768MB
• Driver script: Perl (using the LWP module for validator communication and the DBI module for database connectivity)
• Number of driver scripts: usually about 10 at a time
• Duration of validation: 8-29 January 2008 (~3 weeks), usually 24/7
• Processing rate: ~150,000 URLs per day
• How many URLs validated: 3,509,170 URLs from 3,011,661 domains
• URL list: randomized

The markup validator has a number of processing options, but a main goal for the validation process was to keep the analysis simple and direct. Each candidate URL was passed to the validator using the following options. The SOAP output was chosen for its brevity and ease of results parsing.

• Charset: detect automatically
• Doctype: detect automatically
• Output: SOAP
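Extracting the compact pass/fail and count fields from the validator's SOAP output is a small parsing job. The sketch below uses an abbreviated, hypothetical response body; the element names follow the validator's documented SOAP 1.2 output format, but should be verified against the validator version actually in use:

```python
import xml.etree.ElementTree as ET

# Namespace used by the W3C Markup Validator's SOAP 1.2 output.
NS = {"m": "http://www.w3.org/2005/10/markup-validator"}

# An abbreviated, hypothetical SOAP response body for illustration.
SAMPLE = """<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Body>
    <m:markupvalidationresponse
        xmlns:m="http://www.w3.org/2005/10/markup-validator">
      <m:validity>false</m:validity>
      <m:errorcount>12</m:errorcount>
      <m:warningcount>2</m:warningcount>
    </m:markupvalidationresponse>
  </env:Body>
</env:Envelope>"""

def parse_validation(soap_xml):
    """Extract validity plus error/warning counts from a SOAP response."""
    root = ET.fromstring(soap_xml)
    get = lambda tag: root.find(".//m:%s" % tag, NS).text
    return {
        "validity": get("validity") == "true",
        "errors": int(get("errorcount")),
        "warnings": int(get("warningcount")),
    }

print(parse_validation(SAMPLE))
# -> {'validity': False, 'errors': 12, 'warnings': 2}
```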

MAMA stored a compacted version of the results of each URL validation. In retrospect, it would also have been useful to store at least part of each error description (the unique arguments portion), but during this first time through there was no way to know just how much storage all that data would need. So, MAMA opted to store as little as possible. As it is, MAMA's abbreviated format stored over 25 million rows of data for the abbreviated error messages alone. A goal for "next time" is to store all the unique error arguments in addition to what MAMA currently stores.

• Did it validate? (pass/fail)
• Doctype FPI
• Character set
• Number of warnings
• Number of errors
• Number of failures
• Date the URL was validated
• An aggregated list of error types and the quantity of those errors for the URL
