What Is XML

by Dinesh 2012-07-22 22:20:54

XML stands for Extensible Markup Language, and it is used to describe documents and data in a standardized, text-based format that can be easily transported via standard Internet protocols. XML, like HTML, is based on the granddaddy of all markup languages, Standard Generalized Markup Language (SGML).

SGML is remarkable not just because it’s the inspiration and basis for all modern markup languages, but also because of its age: SGML was created in 1974 as part of an IBM document-sharing project, and officially became an International Organization for Standardization (ISO) standard in 1986, long before the World Wide Web existed. The ISO standard documentation for SGML (ISO 8879:1986) can be purchased online at http://www.iso.org.

The first popular adaptation of SGML was HTML, which was developed as part of a project to provide a common language for sharing technical documents. The Internet made it easy to exchange documents, but it did not standardize how those documents were displayed. The markup language developed to standardize the display format was called Hypertext Markup Language, or HTML. It provides a standardized way of describing document layout and display, and is an integral part of every Web browser and website.

Although SGML was a good format for document sharing, and HTML was a good language for describing the layout of documents in a standardized way, there was no standardized way to describe and share the data stored in a document. For example, an HTML page might have a body that lists today’s closing share price for every company in the Fortune 500. This data can be displayed using HTML in a myriad of ways. Prices can be bold if they have moved up or down by 10 percent or more, and prices that are up from yesterday’s closing price can be displayed in green, with prices that are down displayed in red. The information can be formatted in a table, and alternating rows of the table can be in different colors.

However, once the data is taken from its original source and rendered as HTML in a browser, the values are meaningful only as part of the markup on that page. They are no longer individual pieces of data, but are now simply pieces of “content” wedged between elements and attributes that specify how to display that content. For example, if a Web developer wanted to extract the top ten price movers from the daily closing prices displayed on the Web page, there was no standardized way to locate the top ten values, isolate them from the others, and relate each price to the associated Fortune 500 company.

Note that I say there was no standardized way to do this; that did not stop developers from trying. Many a Web developer in the mid- to late-1990s, including myself, devised very elaborate and clever ways of scraping the data they needed from between HTML tags, mostly by eyeballing the page and the HTML source code, then coding routines in various languages to read, parse, and locate the required values in the page. For example, a developer might read the HTML source code of the stock price page and discover that the prices were located in the only table on the page. With this knowledge, code could be written in the developer’s language of choice to locate the table in the page, extract the values nested in the table, calculate the top price movers for the day based on the values in the third column of the table, and relate the company name in the first column of the table to the top ten values.
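To make the scraping approach concrete, here is a minimal sketch in Python using only the standard library. The sample page, the company names, the figures, and the names `SAMPLE_HTML`, `TableScraper`, and `top_movers_from_html` are all made up for illustration. The sketch assumes what the developer in the example assumed: the page has exactly one table, the company name is in the first column, and the percentage change is in the third.

```python
from html.parser import HTMLParser

# Hypothetical page layout: company | closing price | percent change.
SAMPLE_HTML = """
<html><body>
<table>
<tr><th>Company</th><th>Close</th><th>Change %</th></tr>
<tr><td>Acme Corp</td><td>41.20</td><td>+12.5</td></tr>
<tr><td>Globex</td><td>18.75</td><td>-3.1</td></tr>
<tr><td>Initech</td><td>77.10</td><td>-10.4</td></tr>
<tr><td>Umbrella</td><td>9.95</td><td>+0.8</td></tr>
</table>
</body></html>
"""


class TableScraper(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # the header row has only <th> cells, so it is empty here
                self.rows.append(self._row)
            self._row = None


def top_movers_from_html(html, count=10):
    """Return (company, percent change) pairs with the largest absolute moves."""
    scraper = TableScraper()
    scraper.feed(html)
    # The layout assumption is hard-coded: column 1 is the name, column 3 the change.
    movers = [(row[0], float(row[2])) for row in scraper.rows]
    movers.sort(key=lambda mover: abs(mover[1]), reverse=True)
    return movers[:count]


print(top_movers_from_html(SAMPLE_HTML, count=2))
# [('Acme Corp', 12.5), ('Initech', -10.4)]
```

Nothing in the page itself says that `row[2]` is a percentage change; that knowledge lives only in the scraper, which is why the code depends so heavily on the page layout staying put.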

However, it’s fair to say that this approach represented a maintenance nightmare for developers. For example, if the original Web page developers suddenly decided to add a table before the stock price table on the page, or add an additional column to the table, or nest one table in another, it was back to the drawing board for the developer who was scraping the data from the HTML page, starting over to find the values in the page, extract the values into meaningful data, and so on. Most developers who struggled with this inefficient method of data exchange on the Web were looking for better ways to share data while still using the Web as a data delivery mechanism.

This is only one of many examples that illustrate the need for a tag-based markup language that could describe data more effectively than HTML. With the explosion of the Web, there was a growing need for a universal format that could function as a lowest common denominator for data exchange while still using the very popular and standardized HTTP delivery methods of the Internet.

In 1998, the World Wide Web Consortium (W3C) met this need with the first Extensible Markup Language (XML) Recommendation, which combined the basic SGML features that separate data from format with an extension of the HTML tag formats that had been adapted for the Web. The three pillars of XML are extensibility, structure, and validity.
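As a rough illustration of the difference, the stock price data discussed earlier could be described in XML with tags that name the data rather than its display. The element names (`closingPrices`, `company`, `name`, `close`, `changePercent`), the sample values, and the function `top_movers_from_xml` are hypothetical and chosen only for this sketch, which uses Python’s standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: each tag says what the value is, not how to show it.
SAMPLE_XML = """<closingPrices>
  <company>
    <name>Acme Corp</name>
    <close>41.20</close>
    <changePercent>12.5</changePercent>
  </company>
  <company>
    <name>Globex</name>
    <close>18.75</close>
    <changePercent>-3.1</changePercent>
  </company>
  <company>
    <name>Initech</name>
    <close>77.10</close>
    <changePercent>-10.4</changePercent>
  </company>
</closingPrices>
"""


def top_movers_from_xml(xml_text, count=10):
    """Return (company, percent change) pairs with the largest absolute moves."""
    root = ET.fromstring(xml_text)
    # Values are looked up by element name, not by their position on a page.
    movers = [
        (company.findtext("name"), float(company.findtext("changePercent")))
        for company in root.iter("company")
    ]
    movers.sort(key=lambda mover: abs(mover[1]), reverse=True)
    return movers[:count]


print(top_movers_from_xml(SAMPLE_XML, count=2))
# [('Acme Corp', 12.5), ('Initech', -10.4)]
```

Because each value is found by name, adding a new element to each `company` or reordering the existing ones would not change the result, unlike a scraper that counts table columns.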
