3.2 XML/XHTML

3.2 XML/XHTML

What is XML?

HTML handles the needs of most web authors, but there are some kinds of information that it is not well suited for presenting (e.g., mathematical notations, chemical formulae, musical scores). The eXtensible Markup Language (XML) was developed to address this shortcoming. XML is a language used to define other languages, a meta-language. Just as HTML has its own syntax rules, you can create your own markup language with its own set of rules.

For example, here is an example of a markup language used to store information about CDs in a CD catalog. Unlike HTML, in which the root element is called <html>, this language has a root element called <CATALOG>. A <CATALOG> element is composed of one or more <CD> elements. Each <CD> element in turn has a <TITLE>, <ARTIST>, <COUNTRY>, <COMPANY>, <PRICE>, and <YEAR>.

So, like HTML, XML documents use tags to define data elements. The difference is that you create the tag set. It is possible (though not required) to specify the rules of your XML language in a document type definition (DTD) file or an XML schema definition (XSD) file. For example, you might decide that a <CD> element can have one and only one <ARTIST> element within it.

While XML is quite similar in syntax to HTML (i.e., its use of tags), there are some syntax differences that make XML a bit more strict. Among these differences are:

  • XML elements must have an end tag. Recall that the <img> element in HTML doesn't require an end tag.
  • XML tags are case sensitive. <CATALOG> is not the same as <catalog> in XML, whereas <EM> is the same as <em> in HTML.
  • XML elements must be nested properly. In HTML, an author can get away with code like <strong><em>Total</strong></em>, whereas in XML the </em> tag must come before the </strong> tag.

XML's Uses

XML has become an important format for the exchange of data across the Internet. And, as the CD catalog example discussed above implies, it can also be used as a means for storing data. We'll use XML later in the course for both of these purposes. Right now, we're going to focus on XML's use in creating a new flavor of HTML called XHTML.

An early development in the growth of web publishing was that desktop web browsers were designed to correct poorly written HTML. For example, in a bit of content containing paragraphs of text, it is possible to omit the </p> tag for paragraph elements and all of the popular desktop web browsers will render the content the same as if the </p> tags were present. This is not a surprising development if you think about it for a moment. Which browser would you prefer to use: one that displays a readable page the vast majority of the time or one that displays an error message when it encounters poorly written HTML?

Flash forward to the 2000s, the dawn of portable handheld devices like tablets and smartphones and wireless Internet. Slower data transfers and less computing power in these devices combined to produce lesser performance when rendering HTML content. This was one of the factors leading to the development of XHTML.

What is XHTML?

XHTML (eXtensible Hypertext Markup Language) is simply a version of HTML that follows the stricter syntax rules of XML. An XML/XHTML document that meets all of the syntax rules is said to be well-formed. Well-formed documents can be interpreted more quickly than documents containing syntax errors that must be corrected.

The introduction of this new XHTML language came after years of the development of plain HTML content. Thus, web authors interested in creating well-formed XHTML had to adjust their practices a bit, both because of the sloppy habits that forgiving browsers had enabled and the outright differences with HTML. These adjustments include:

  • Elements that don't require an end tag in HTML must have one in XHTML (e.g., <p> needs a matching </p>).
  • Empty elements in HTML must be terminated in XHTML (e.g., <br> must be changed to <br></br> or its shortcut <br />).
  • Elements must be properly nested.
    <strong><em>Some text</strong></em> INCORRECT
    <strong><em>Some text</em></strong> CORRECT
  • Attribute values must be quoted.
    <table rows=3> INCORRECT
    <table rows="3"> CORRECT
  • XHTML is case sensitive. <a> and <A> are different tags. The standard is to use lowercase.

Flavors of XHTML

HTML was originally developed such that the actual informational content of a document was mixed with presentational settings. For example, the <i> tag was used to tell the browser to display bits of text in italics and the <b> tag was used to produce bold text. An important aspect in the development of web publishing has been its push for the separation of content from presentation. This resulted in the creation of an <em> tag to define emphasized text and a <strong> tag to define strongly emphasized text. It turns out that the default behavior of browsers is to display text tagged with <em> in italics and text tagged with <strong> in bold, which may leave you wondering what purpose these new tags serve if they only replicate the behavior of <i> and <b>.

The answer lies in the use of Cascading Style Sheets (CSS). With CSS, web authors can override the browsers' default settings to display elements differently. Authors can also develop multiple style sheets for the same content (e.g., one for desktop browsers, one intended for printing, etc.). The usage of style sheets also makes it much easier to make sweeping changes to the look of a series of related pages. We'll talk more about CSS in the next section.

So, while it is possible to override the behavior of the <i> and <b> tags just as easily as any other tags, the <em> and <strong> tags were added and recommended over <i> and <b> to encourage web authors to move away from the practice of mixing content with presentation.

XHTML has two main dialects that differ from one another in terms of whether they allow the use of some presentational elements or not. As its name implies, the XHTML Strict dialect does not allow the usage of elements like <font> and <center>. It also does not allow the usage of element attributes like align and bgcolor. Presentation details like font size/type and alignment must be handled using CSS. The Strict dialect also requires that all text and images be embedded within either a <p> element or a <div> element (used to define divisions or sections of a document).

The XHTML Transitional dialect does not prohibit the use of presentational elements and attributes like those described in the previous paragraph. Generally speaking, XHTML Transitional was intended for developers who want to convert their old pages to a newer version, but would probably not bother to do it if they had to eliminate every presentational setting. XHTML Strict was intended for developers creating new pages.

XHTML developers specify which set of rules their page follows by adding a DOCTYPE line to the top of the page. For example, here are the DOCTYPE statements for XHTML Strict and XHTML Transitional, respectively:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

OR

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Browse this article, Transitional vs. Strict Markup, for more details on the differences between the two dialects.

The basic skeleton of an XHTML Strict document looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

    <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
            <meta http-equiv="content-type"
                content="text/html; charset=utf-8"/>
            <title>Your title here</title>
        </head>
        <body>
            Your content here
        </body>
    </html>

Page validation and conversion

XHTML and its dialects were developed by a standards organization called the World Wide Web Consortium (W3C). They also provide tools for web authors to validate that their pages follow the syntax rules of their selected DOCTYPE and to convert their pages from sloppy HTML to clean XHTML. This conversion tool is called HTML Tidy and can be run on the desktop or online (see links below).

HTML Tidy

HTML Tidy (online version)

Recent developments

Though XHTML was originally intended to be "the next step in the evolution of the Internet," it never gained a strong foothold in the web development community. A major factor that discouraged its adoption was that in order to reap the full benefit of an XML-based document, it needed to be served to browsers as "application/xhtml+xml" rather than "text/html." Most major browsers such as Firefox, Chrome, and Safari were built to handle the "application/xhtml+xml" content type. The notable exception was Internet Explorer, which until version 9 did not support "application/xhtml+xml" content — users were asked if they wanted to save the file when documents of that type were encountered.

In addition to the content type issue, technological advances have undercut the argument that small handheld devices cannot load web pages at an acceptable speed without the use of an XML-based parser. Today's smartphone browsers utilize the same HTML parsers as their desktop counterparts.

Thus, in recent years, the notion of a world in which browsers parse all web pages as XML and page developers must author well-formed documents appears less and less likely. The W3C halted their development of a new version of XHTML and shifted their focus towards a new version of HTML (HTML5). This led some to declare that "XHTML is dead." HTML5 requires browsers to continue correcting poorly written HTML. This has the effect of allowing sloppy page authors to continue in their sloppy habits. That said, browsers will continue to accept pages authored with XHTML-style coding, so developers who see value in serving their pages as XML may continue to do so. To learn more about HTML5, see the HTML 5 Introduction at the w3schools site.

So what language should we use?

HTML5 is the current standard for web publishing and you should certainly be looking to employ it in the web pages you develop. I have the content on XHTML in this lesson for a few reasons:

  1. to give you an appreciation for the evolution of web publishing standards,
  2. to encourage you to write clean HTML code. Doing so will enable you to achieve a higher level of consistency in the rendering of your pages than if you allowed yourself to pick up bad habits, and
  3. you may encounter XHTML in other people's source code.

At the end of this lesson, you'll be asked to write a web page from scratch using what you learned from the lesson. I'm going to require you to write that page in XHTML Strict. However, for subsequent projects, you will be able to use the less rigid HTML5.