Reading XML with Ali

Version 1.3

Introduction

Ali is designed to make reading XML easy from ANSI C programs. Generally it should take only one line of code to read one line of XML. In fact, keeping the amount of code as small as possible has been the major goal of Ali. This tutorial shows how to write code that reads various XML data examples.

Reading A Simple Document

Reading an XML document with Ali is really easy. It is very much like reading a simple text file using the stdio.h commands like fopen and fscanf. First, you need to have a name for the document being read. For our example let's use a document named hello.xml that contains this data:

<?xml version="1.0"?>

This is about as simple as an XML document can be. Now let's write some code to read it.

ali_doc_info * hello;
ali_open(&hello, "hello.xml", ALI_OPTION_INPUT_XML_DECLARATION, NULL);
if (hello != NULL)
    ali_close(hello);

The ali_doc_info is like a stdio.h FILE reference. It contains information about the document being read, such as which element is currently being read. We pass ALI_OPTION_INPUT_XML_DECLARATION after the file name in ali_open to indicate that an XML declaration is expected. This is the <?xml?> in hello.xml. We would leave this option out when reading a fragment of XML that lacks such a declaration. The last argument is for passing a pointer to data needed to understand the document or to store what is read from it. Our hello.xml document is so simple that we don't need other data, so we pass NULL.

Now this XML document is really too simple to be interesting. Let's change it to have the greeting "Hello, world!". Hello.xml now looks like this:

<?xml version="1.0"?>
<greeting>Hello, world!</greeting>

To read the greeting element we use the ali_in call. ali_in reads all markup and content, and it needs some information to do this. The first argument is the document to read from. The second is the element we are reading from. Next comes a format string that controls what is read. The "^e" indicates that an element is being read, so ali_in expects a namespace argument followed by an element name argument. Namespaces are not supported yet, so pass 0 or NULL. The element name is a null-terminated C string. After the "^e" comes a "%as" in the format string. This scanf-like format says to read the element's content as a string into the next argument, greeting. The 'a' modifier before the 's' is a GNU scanf extension indicating that appropriately sized storage should be allocated by ali_in and freed by the caller. This safely avoids buffer overruns. Alternatively, an existing string can be supplied along with a maximum width that does not include the terminating nul char. Either method is safe, but at least one must be used!

The ali_close call finishes all reading and closes hello.xml. No further reading is allowed after it. This is the resulting code to read hello.xml. It reads the XML and copies "Hello, world!" to greeting.

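ali_doc_info * hello;
ali_element_ref root;
char * greeting = NULL;

root = ali_open(&hello, "hello.xml", ALI_OPTION_INPUT_XML_DECLARATION, NULL);
if (hello != NULL)
{
    ali_in(hello, root, "^e%as", 0, "greeting", &greeting);
    ali_close(hello);
}
/* Storage allocated by "%as" belongs to the caller; free it when done. */
free(greeting);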
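
The width rule mentioned above (the maximum width does not count the terminating nul char) is the same one standard scanf uses. The following is a minimal sketch in plain standard C, not Ali code: read_bounded is a hypothetical helper, and the 16-byte buffer and the "%15[^<]" format are assumptions chosen only for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper using plain stdio only (no Ali calls).
 * Copies text up to the next '<' into out, never more than 15 chars,
 * so the 16-byte buffer always has room for the terminating nul.
 * Returns 1 if some text was stored, 0 otherwise. */
static int read_bounded(const char *content, char out[16])
{
    out[0] = '\0';
    return sscanf(content, "%15[^<]", out) == 1;
}
```

Calling read_bounded with "Hello, world!</greeting>" stores the whole greeting, while longer content is cut off at 15 characters instead of overrunning the buffer.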

Some XML documents have more complicated XML declarations. You may have seen some where they look like this:

<?xml version="1.0" encoding="UTF-8" ?>

Ali works with UTF-8, ISO-8859-1 and US-ASCII encodings.

That's all it takes to read basic XML data. Not much was read but the work involved is comparable to reading other file types like text. Read on to see how to read files with more data in them!

Reading Other Markup

The XML specification allows other types of markup to be read. Attributes can be added to elements to add to their meaning. Comments can be present that aid human readers of the XML. And there are others. But before we can read them we need to learn how to read markup that belongs to an element. Since XML documents can have only one root element, data is added to it as attributes and nested elements. We must read this other markup from within that element.

Reading Attributes in Elements

An element's attributes are always found within the element's start tag. An example is this:

<news id="Ali1.0release">Ali 1.0 is released!</news>

Here an id attribute has been added so that other data can reference the news data. How can we read this element, its attribute, and the element's content? First read in the "news" element to get a reference to it. Next read the "id" attribute in the "news" element. Finally, read the "news" element's content, which is its description.

news = ali_in(doc, root, "^e", 0, "news");
ali_in(doc, news, "^a%as", 0, "id", &id);
ali_in(doc, news, "%as", &description);

Notice the news returned by ali_in when reading the "news" element is passed when the attribute is read. Also note that a "^a" is used instead of a "^e" to indicate that "id" is an attribute of news and not an element inside news. Obviously you can add as many attributes to news as you want by adding more ali_in calls. Finally see that the news content is read using "%as".

We can also shorten the C code to a single line if we want.

news = ali_in(doc, root, "^e^a%as^%as", 0, "news", 0, "id", &id, &description);

It is important that the attribute be read before the element's content (%as). If the "^a%as" is before or after the "^e%as" then the attribute is expected to come from the element passed to ali_in, which is root. Long format strings can be used to read lots of XML with one ali_in call, but this can make the code difficult to match to the XML. Generally reading one element or attribute per line of code helps to match the code and XML to each other and eases reading. Let's use a more complex news element and then use simple ali_in calls to read it.

<news id="Ali1.0release" priority="important">
    Ali 1.0 is released!
    <location>https://alo.sourceforge.net</location>
</news>

news = ali_in(doc, root, "^e", 0, "news");
ali_in(doc, news, "^a%as", 0, "id", &id);
ali_in(doc, news, "^a%as", 0, "priority", &priority);
ali_in(doc, news, "%as", &description);
ali_in(doc, news, "^e%as", 0, "location", &location);

This added another two lines of code to read another attribute and another element in the news element. We can continue to read more attributes or elements as desired. Markup not read is skipped. We can read sub elements as long as we remember and pass the element returned from ali_in. Now we can use our Ali knowledge to quickly write code to read many real life XML uses.

Reading Real Life XML

There are many XML data sources that we can access from C code with Ali. Let's try reading something more complicated than we have tried so far, but also practical. A good choice would be an XML-based RSS feed. There is an RSS feed at the bottom right of the Slashdot main web page which links to http://slashdot.org/index.rss. Here is some sample RSS data:

<item rdf:about="http://slashdot.org/article.pl?sid=04/03/24/2327229">
  <title>In-Depth Look At LinuxBIOS</title>
  <link>http://slashdot.org/article.pl?sid=04/03/24/2327229</link>
  <description>DrSkwid writes "With PheonixBIOS reading your email because of such inordinate boot up times for Windows and other OSs, it was remarked in #plan9 about our 5s ...</description>
  <dc:creator>timothy</dc:creator>
  <dc:subject>linux</dc:subject>
  <dc:date>2004-03-25T00:29:00+00:00</dc:date>
  <slash:department>quickly-quickly</slash:department>
  <slash:section>developers</slash:section>
  <slash:comments>101</slash:comments>
  <slash:hitparade>101,88,65,50,26,20,16</slash:hitparade>
</item>

To read this RSS item, three main pieces are needed. First we need some data structures to store the data in, so let's define some:

typedef struct
{
   char * about;
   char * title;
   char * link;
   char * description;
   char * creator;
   struct tm date;
   char * section;
   int comment_count;
} rss_item_type;

rss_item_type rss_item;

Second, we need to read the XML for the RSS item. Since it is composed of many pieces, we put the code into its own function. When elements are large or read multiple times, moving that code into its own function helps to maintain clarity. Here is how to do it:

static void
parse_rss_item(ali_doc_info *doc, ali_element_ref itemN, void * data, bool new_element)
{
   rss_item_type * item = (rss_item_type *) data;
   char * date_string;


   ali_in(doc, itemN, "^a%as", 0, "rdf:about", &item->about);
   ali_in(doc, itemN, "^e%as", 0, "title", &item->title);
   ali_in(doc, itemN, "^e%as", 0, "link", &item->link);
   ali_in(doc, itemN, "^e%as", 0, "description", &item->description);
   ali_in(doc, itemN, "^e%as", 0, "dc:creator", &item->creator);
   if (ali_in(doc, itemN, "^e%as", 0, "dc:date", &date_string))
   {
      memset(&item->date, 0, sizeof(item->date));
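      if (sscanf(date_string, "%d-%d-%dT%d:%d",
             &item->date.tm_year, &item->date.tm_mon, &item->date.tm_mday,
             &item->date.tm_hour, &item->date.tm_min) >= 3)
      {
         /* struct tm counts years from 1900 and months from 0 */
         item->date.tm_year -= 1900;
         item->date.tm_mon -= 1;
      }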
      free(date_string);
      date_string = NULL;
   }
   ali_in(doc, itemN, "^e%as", 0, "slash:section", &item->section);
   ali_in(doc, itemN, "^e%d", 0, "slash:comments", &item->comment_count);
}

Very little new appears in the function. The first two function parameters are what you expect, but the third needs an explanation. The data parameter is the pointer your application gave to ali_open, which Ali passes along to the function. The idea is to set it to a pointer to something so that the data read can be stored somewhere useful to the application.
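
The data parameter follows the common C idiom of handing a void pointer to a callback. The following is a minimal sketch of that idiom in plain standard C, without any Ali calls: visit_all, visit_fn, tally and tally_type are hypothetical names invented for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: receives one piece of text plus the
 * caller's own data pointer, untouched by the code that calls it. */
typedef void (*visit_fn)(const char *text, void *data);

/* Hypothetical driver: calls fn once per string, forwarding data. */
static void visit_all(const char *const *texts, size_t n, visit_fn fn, void *data)
{
    size_t i;

    for (i = 0; i < n; i++)
        fn(texts[i], data);
}

typedef struct
{
    size_t count;
    size_t total_length;
} tally_type;

/* The callback casts data back to the type the caller passed in. */
static void tally(const char *text, void *data)
{
    tally_type *t = (tally_type *) data;

    t->count++;
    t->total_length += strlen(text);
}
```

The driver never needs to know what data points to. Only the caller and the callback agree on the type, which is the same arrangement parse_rss_item relies on when it casts data.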

Note that sscanf is used above to read the date. The format string could be passed to ali_in as well. But the date string could be in several different formats. It makes more sense to read the date as a string and then interpret it based on its format.
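
As a sketch of interpreting the date after it has been read as a string, the helper below accepts either a full timestamp or a date alone. It is plain standard C with no Ali calls. The name parse_w3c_date and the choice of accepted formats are assumptions made for this illustration. Note that struct tm stores years counted from 1900 and months counted from 0, so the parsed values are adjusted.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: parses "YYYY-MM-DDThh:mm..." or "YYYY-MM-DD".
 * Returns 1 and fills *out on success, returns 0 on any other input. */
static int parse_w3c_date(const char *s, struct tm *out)
{
    int year, mon, day, hour = 0, min = 0;
    int n = sscanf(s, "%d-%d-%dT%d:%d", &year, &mon, &day, &hour, &min);

    /* 5 fields: date and time; 3 fields: date only */
    if (n != 5 && n != 3)
        return 0;

    memset(out, 0, sizeof(*out));
    out->tm_year = year - 1900;   /* years since 1900 */
    out->tm_mon = mon - 1;        /* months are 0 to 11 */
    out->tm_mday = day;
    out->tm_hour = hour;
    out->tm_min = min;
    return 1;
}
```

Checking the sscanf return value is what lets the caller tell a recognized format from an unrecognized one and fall back to another format if needed.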

The third and last piece is some code to read the document and call the item callback function when an item element is found.

static void read_rss()
{
   rss_item_type rss_item;
   ali_doc_info * doc;
   ali_element_ref root;


   root = ali_open(&doc, "rss_item.xml", ALI_OPTION_INPUT_XML_DECLARATION, &rss_item);
   if (doc != NULL)
   {
      ali_in(doc, root, "^e%F", 0, "item", parse_rss_item);
      ali_close(doc);
   }


   assert(strcmp(rss_item.about, "http://slashdot.org/article.pl?sid=04/03/24/2327229") == 0);
   assert(strcmp(rss_item.title, "In-Depth Look At LinuxBIOS") == 0);

   free(rss_item.about);
   free(rss_item.title);
   free(rss_item.link);
   free(rss_item.description);
   free(rss_item.creator);
   free(rss_item.section);
}

Reading the RSS item element is simple, but real RSS documents are more complicated. They have channel information that needs to be stored, and RSS feeds have multiple items. Here's an RSS feed with all but two items removed, but it is still much more complicated than the last example. You can see the channel, items, and other data.


<?xml version="1.0" encoding="ISO-8859-1" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
  xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:admin="http://webns.net/mvcb/"
  xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
  <channel rdf:about="http://slashdot.org/">
    <title>Slashdot</title>
    <link>http://slashdot.org/</link>
    <description>News for nerds, stuff that matters</description>
    <dc:language>en-us</dc:language>
    <dc:rights>Copyright 1997-2004, OSDN - Open Source Development Network, Inc. All Rights Reserved.</dc:rights>
    <dc:date>2004-04-21T05:43:48+00:00</dc:date>
    <dc:publisher>OSDN</dc:publisher>
    <dc:creator>pater@slashdot.org</dc:creator>
    <dc:subject>Technology</dc:subject>
    <syn:updatePeriod>hourly</syn:updatePeriod>
    <syn:updateFrequency>1</syn:updateFrequency>
    <syn:updateBase>1970-01-01T00:00+00:00</syn:updateBase>
    <items>
      <rdf:Seq>
      <rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/2229212" />
      <rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/2348214" />
      <rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/2229256" />
      <rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/225254" />
      <rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/2030212" />
      <rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/1954249" />
      <rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/1847206" />
      <rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/1827232" />
      <rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/186237" />
      <rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/1738217" />
      </rdf:Seq>
    </items>
    <image rdf:resource="http://images.slashdot.org/topics/topicslashdot.gif" />
    <textinput rdf:resource="http://slashdot.org/search.pl" />
  </channel>
  <image rdf:about="http://images.slashdot.org/topics/topicslashdot.gif">
    <title>Slashdot</title>
    <url>http://images.slashdot.org/topics/topicslashdot.gif</url>
    <link>http://slashdot.org/</link>
  </image>
  <item rdf:about="http://slashdot.org/article.pl?sid=04/04/20/2229212">
    <title>Why MySQL Grew So Fast</title>
    <link>http://slashdot.org/article.pl?sid=04/04/20/2229212</link>
    <description>jpkunst writes "Andy Oram, who attended the MySQL Users Conference which was held April 16-18 in Orlando, Florida, attempts to explain MySQL's popularity in ...</description>
    <dc:creator>timothy</dc:creator>
    <dc:subject>storage</dc:subject>
    <dc:date>2004-04-21T03:57:00+00:00</dc:date>
    <slash:department>should-be-has-grown</slash:department>
    <slash:section>developers</slash:section>
    <slash:comments>172</slash:comments>
    <slash:hitparade>172,157,100,77,32,18,10</slash:hitparade>
  </item>
  <item rdf:about="http://slashdot.org/article.pl?sid=04/04/20/2348214">
    <title>Linus Torvalds: Backporting Is A Good Thing</title>
    <link>http://slashdot.org/article.pl?sid=04/04/20/2348214</link>
    <description>darthcamaro writes "Looks like we don't need to speculate on what Linus' opinion is on backporting. Internetnews.com is running a story this morning that ...</description>
    <dc:creator>timothy</dc:creator>
    <dc:subject>linux</dc:subject>
    <dc:date>2004-04-21T01:47:00+00:00</dc:date>
    <slash:department>martha-agrees</slash:department>
    <slash:section>developers</slash:section>
    <slash:comments>143</slash:comments>
    <slash:hitparade>143,120,82,59,43,32,20</slash:hitparade>
  </item>
</rdf:RDF>

To read this feed a few more tasks need to be added to the last example. Data structures are needed to store the channel and multiple items. The function parse_rss_item() needs to keep track of more than a single item. The main reading function needs to read in the channel. Note that to read an element as complicated as the channel, I normally isolate the code into its own function. But for this example I wanted to show both reading with a callback function (the items) and reading without one (the channel). This is a lot of code. Because some people will want to copy and paste it, I'll keep the code all together.

typedef struct
{
   char * about;
   char * title;
   char * link;
   char * description;
   char * creator;
   struct tm date;
   char * section;
   int comment_count;
} rss_item_type;

typedef struct
{
   char * about;
   char * title;
   char * link;
   char * description;
   char * rights;
   struct tm date;
   char * publisher;
   char * creator;
   char * subject;
} channel_type;


typedef struct
{
   channel_type channel;
   int count;
   rss_item_type * item;
} rss_xml_data_type;


static void
parse_rss_item(ali_doc_info *doc, ali_element_ref itemN, void * data, bool new_element)
{
   rss_xml_data_type * rss = (rss_xml_data_type *) data;
   rss_item_type * item;
   char * date_string;


   if (ali_is_element_new(doc, itemN))
   {
      /* Grow the item array by one; realloc(NULL, size) acts like malloc(size). */
      rss_item_type * grown = (rss_item_type *) realloc(rss->item, (rss->count + 1) * sizeof(*(rss->item)));

      if (grown == NULL)
         return;
      rss->item = grown;
      /* Clear the new slot so unread fields are NULL and safe to free. */
      memset(&rss->item[rss->count], 0, sizeof(*(rss->item)));
      rss->count++;
   }

   item = &rss->item[rss->count - 1];

   ali_in(doc, itemN, "^a%as", 0, "rdf:about", &item->about);
   ali_in(doc, itemN, "^e%as", 0, "title", &item->title);
   ali_in(doc, itemN, "^e%as", 0, "link", &item->link);
   ali_in(doc, itemN, "^e%as", 0, "description", &item->description);
   ali_in(doc, itemN, "^e%as", 0, "dc:creator", &item->creator);
   if (ali_in(doc, itemN, "^e%as", 0, "dc:date", &date_string))
   {
      memset(&item->date, 0, sizeof(item->date));
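      if (sscanf(date_string, "%d-%d-%dT%d:%d",
             &item->date.tm_year, &item->date.tm_mon, &item->date.tm_mday,
             &item->date.tm_hour, &item->date.tm_min) >= 3)
      {
         /* struct tm counts years from 1900 and months from 0 */
         item->date.tm_year -= 1900;
         item->date.tm_mon -= 1;
      }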
      free(date_string);
      date_string = NULL;
   }
   ali_in(doc, itemN, "^e%as", 0, "slash:section", &item->section);
   ali_in(doc, itemN, "^e%d", 0, "slash:comments", &item->comment_count);
}

static void read_rss_feed()
{
   rss_xml_data_type rss;
   ali_doc_info * doc;
   ali_element_ref root;
   ali_element_ref rdf;
   ali_element_ref channel;
   char * date_string;


   /* Zero everything so fields that are never filled in are safe to free. */
   memset(&rss, 0, sizeof(rss));
   rss.item = NULL;


   root = ali_open(&doc, "rss_feed.xml", ALI_OPTION_INPUT_XML_DECLARATION, &rss);
   if (doc != NULL)
   {
      rdf = ali_in(doc, root, "^e", 0, "rdf:RDF");

      /* read the channel */
      if ((channel = ali_in(doc, rdf, "^e", 0, "channel")) != 0)
      {
         ali_in(doc, channel, "^a%as", 0, "rdf:about", &rss.channel.about);
         ali_in(doc, channel, "^e%as", 0, "title", &rss.channel.title);
         ali_in(doc, channel, "^e%as", 0, "link", &rss.channel.link);
         ali_in(doc, channel, "^e%as", 0, "description", &rss.channel.description);
         ali_in(doc, channel, "^e%as", 0, "dc:rights", &rss.channel.rights);
         if (ali_in(doc, channel, "^e%as", 0, "dc:date", &date_string))
         {
            memset(&rss.channel.date, 0, sizeof(rss.channel.date));
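            if (sscanf(date_string, "%d-%d-%dT%d:%d",
                   &rss.channel.date.tm_year, &rss.channel.date.tm_mon, &rss.channel.date.tm_mday,
                   &rss.channel.date.tm_hour, &rss.channel.date.tm_min) >= 3)
            {
               /* struct tm counts years from 1900 and months from 0 */
               rss.channel.date.tm_year -= 1900;
               rss.channel.date.tm_mon -= 1;
            }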
            free(date_string);
            date_string = NULL;
         }
         ali_in(doc, channel, "^e%as", 0, "dc:publisher", &rss.channel.publisher);
         ali_in(doc, channel, "^e%as", 0, "dc:creator", &rss.channel.creator);
         ali_in(doc, channel, "^e%as", 0, "dc:subject", &rss.channel.subject);
      }

      /* read all items until there are no more. */
      while (ali_in(doc, rdf, "^oe%F", 0, "item", parse_rss_item))
      {
         ;
      }

      ali_close(doc);
   }

   assert(strcmp(rss.channel.about, "http://slashdot.org/") == 0);
   assert(rss.count == 2);
   assert(strcmp(rss.item[0].title, "Why MySQL Grew So Fast") == 0);
   assert(strcmp(rss.item[1].title, "Linus Torvalds: Backporting Is A Good Thing") == 0);

   free(rss.item[0].about);
   free(rss.item[0].title);
   free(rss.item[0].link);
   free(rss.item[0].description);
   free(rss.item[0].creator);
   free(rss.item[0].section);

   free(rss.item[1].about);
   free(rss.item[1].title);
   free(rss.item[1].link);
   free(rss.item[1].description);
   free(rss.item[1].creator);
   free(rss.item[1].section);

   free(rss.item);

   free(rss.channel.about);
   free(rss.channel.title);
   free(rss.channel.link);
   free(rss.channel.description);
   free(rss.channel.rights);
   free(rss.channel.publisher);
   free(rss.channel.creator);
   free(rss.channel.subject);
}
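
The cleanup above frees exactly two items by index, which matches the sample feed. For a feed of unknown length, growing and freeing the array in helpers may be easier to maintain. The following is a minimal sketch in plain standard C under stated assumptions: entry_type and feed_type are hypothetical, simplified stand-ins for rss_item_type and rss_xml_data_type (only a title is kept), and feed_append and feed_free are names invented for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the tutorial's types. */
typedef struct
{
    char *title;
} entry_type;

typedef struct
{
    int count;
    entry_type *item;
} feed_type;

/* Grows the array by one zeroed entry and returns it, or NULL when
 * out of memory (the existing entries are left intact in that case). */
static entry_type *feed_append(feed_type *feed)
{
    size_t new_count = (size_t) feed->count + 1;
    entry_type *grown = (entry_type *) realloc(feed->item, new_count * sizeof(*grown));

    if (grown == NULL)
        return NULL;
    feed->item = grown;
    memset(&grown[feed->count], 0, sizeof(*grown));
    return &feed->item[feed->count++];
}

/* Frees every entry's strings, then the array itself. */
static void feed_free(feed_type *feed)
{
    int i;

    for (i = 0; i < feed->count; i++)
        free(feed->item[i].title);
    free(feed->item);
    feed->item = NULL;
    feed->count = 0;
}
```

Zeroing each new entry means a field that was never read stays NULL, so the loop in feed_free can call free on it safely.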

Whew! That is quite a bit to read! There is the RSS XML feed, the data structures to store the information, and the Ali calls to map the XML to the data structures. Reading the RSS feed takes all of the knowledge that we've learned to read the different data types. The example shows how to read elements either inline or with a callback function. You've shown that you can read in documents with arbitrary data and you can deal with large and complicated XML documents by decomposing the complicated elements into callback functions. Nothing can stop you now!