Reading XML with Ali
Version 1.3
Introduction
Ali is designed to make reading XML easy from ANSI C programs. Generally it should take only one line of code to read one line of XML. In fact, keeping the code as short as possible has been the major goal of Ali. This tutorial shows how to write code that reads various XML data examples.
Reading A Simple Document
Reading an XML document with Ali is really easy. It is very much like reading a simple text file using the stdio.h commands like fopen and fscanf. First, you need to have a name for the document being read. For our example let's use a document named hello.xml that contains this data:
<?xml version="1.0"?>
This is about as simple as an XML document can be. Now let's write some code to read it.
ali_doc_info * hello;
ali_open(&hello, "hello.xml", ALI_OPTION_INPUT_XML_DECLARATION, NULL);
if (hello != NULL)
ali_close(hello);
The ali_doc_info is like a stdio.h FILE pointer. It contains information about the document being read, such as which element is currently being read. We pass ALI_OPTION_INPUT_XML_DECLARATION after the file name in ali_open to indicate that an XML declaration is expected. This is the <?xml?> in hello.xml. We would leave this option out when reading a fragment of XML that lacks such a declaration. The last argument is for passing a pointer to data needed to understand or store the document being read. Our hello.xml document is so simple that we don't need other data, and so we pass NULL.
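For comparison, here is a minimal sketch of the stdio.h pattern that the Ali calls mirror: open, read with a format string, close. It uses only standard C. The helper read_first_word is a hypothetical name made up for this illustration and is not part of Ali.

```c
#include <stdio.h>

/* Hypothetical helper, not part of Ali: reads the first
 * whitespace-delimited word of a text file into buf (at most 63
 * characters plus the terminating nul).
 * Returns 1 on success, 0 on failure. */
static int read_first_word(const char *path, char buf[64])
{
    FILE *fp = fopen(path, "r");          /* plays the role of ali_open */
    int ok;

    if (fp == NULL)
        return 0;
    ok = (fscanf(fp, "%63s", buf) == 1);  /* plays the role of ali_in */
    fclose(fp);                           /* plays the role of ali_close */
    return ok;
}
```

The shape is the same as the Ali code above: a handle obtained from an open call, a NULL check, and a close call once reading is done.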
Now this XML document is really too simple to be interesting. Let's change it to hold the greeting "Hello, world!". Now hello.xml looks like this:
<?xml version="1.0"?>
<greeting>Hello, world!</greeting>
To read the greeting element we need to use the ali_in call. ali_in is used to read all markup and content, and it needs some information to do this. First, we specify which document to read from. Next is the element we are reading from. Then comes a format string that controls what content is read. The "^e" indicates that an element is being read. Because of this, ali_in expects a namespace argument followed by an element name argument. Namespaces are not supported yet, so pass 0 or NULL. The element name is a null-terminated C string. After the "^e" comes a "%as" in the format string. This scanf-like format says to read the content of the element as a string into the next argument, greeting. The 'a' modifier before the 's' is a GNU scanf extension indicating that appropriately sized storage should be allocated by ali_in and freed by the caller. This safely avoids buffer overruns. Alternatively, an existing string can be passed along with a maximum width that does not include the terminating nul char. Either method is safe, but at least one of them must be used!
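The maximum-width alternative works the same way in plain scanf, so it can be sketched with standard C alone. The helper read_bounded below is hypothetical and does not call Ali. It only assumes that a width in an Ali format string behaves like a scanf width, as the description above suggests. The "%15[^\n]" conversion stores at most 15 characters followed by the terminating nul, so a 16-byte buffer cannot overflow.

```c
#include <stdio.h>

/* Hypothetical helper, not part of Ali: copies at most 15 characters
 * of text (up to the first newline) into a 16-byte buffer.
 * Returns 1 if something was stored, 0 otherwise. */
static int read_bounded(const char *text, char out[16])
{
    /* The width (15) excludes the terminating nul that sscanf appends,
     * which is why the buffer needs one extra byte. */
    return sscanf(text, "%15[^\n]", out) == 1;
}
```

Input longer than the width is truncated rather than written past the end of the buffer, which is the trade-off against letting the library allocate the storage.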
The ali_close call finishes all reading and closes hello.xml. No further reading is allowed after it. This is the resulting code to read hello.xml. It reads the XML and copies "Hello, world!" into greeting.
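ali_doc_info * hello;
ali_element_ref root;
char * greeting = NULL;
root = ali_open(&hello, "hello.xml", ALI_OPTION_INPUT_XML_DECLARATION, NULL);
if (hello != NULL)
{
/* read the content of the greeting element into newly allocated storage */
ali_in(hello, root, "^e%as", 0, "greeting", &greeting);
ali_close(hello);
/* storage allocated for "%as" is freed by the caller */
free(greeting);
}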
Some XML documents have more complicated XML declarations. You may have seen some that look like this:
<?xml version="1.0" encoding="UTF-8" ?>
Ali works with UTF-8, ISO-8859-1 and US-ASCII encodings.
That's all it takes to read basic XML data. Not much was read but the work involved is comparable to reading other file types like text. Read on to see how to read files with more data in them!
Reading Other Markup
The XML specification allows other types of markup besides elements. Attributes can be added to elements to extend their meaning. Comments can be present to aid human readers of the XML. And there are others. But before we can read them we need to learn how to read markup that belongs to an element. Since an XML document can have only one root element, data is added to it as attributes and nested elements. We must read this other markup relative to that element.
Reading Attributes in Elements
An element's attributes are always found within the element's start tag. An example is this:
<news id="Ali1.0release">Ali 1.0 is released!</news>
Here an id attribute has been added. This attribute was added so that other data can reference the news data. How can we read this element, its attribute, and the element's content? First read in the "news" element to get a reference to it. Next read the "id" attribute in the "news" element. Finally, read the "news" element's content, which is its description.
news = ali_in(doc, root, "^e", 0, "news");
ali_in(doc, news, "^a%as", 0, "id", &id);
ali_in(doc, news, "%as", &description);
Notice that the news reference returned by ali_in when reading the "news" element is passed when the attribute is read. Also note that a "^a" is used instead of a "^e" to indicate that "id" is an attribute of news and not an element inside news. Obviously you can read as many attributes of news as you want by adding more ali_in calls. Finally, see that the news content is read using "%as".
We can also shorten the C code to a single line if we want.
news = ali_in(doc, root, "^e^a%as^%as", 0, "news", 0, "id", &id, &description);
It is important that the attribute be read before the element's content (%as). If the "^a%as" is before or after the "^e%as" then the attribute is expected to come from the element passed to ali_in, which is root. Long format strings can be used to read lots of XML with one ali_in call, but this can make the code difficult to match to the XML. Generally, reading one element or attribute per line of code makes it easier to match the code to the XML and eases reading. Let's use a more complex news element and then use simple ali_in calls to read it.
<news id="Ali1.0release" priority="important">
Ali 1.0 is released!
<location>https://alo.sourceforge.net</location>
</news>
news = ali_in(doc, root, "^e", 0, "news");
ali_in(doc, news, "^a%as", 0, "id", &id);
ali_in(doc, news, "^a%as", 0, "priority", &priority);
ali_in(doc, news, "%as", &description);
ali_in(doc, news, "^e%as", 0, "location", &location);
This added two more lines of code to read another attribute and another element in the news element. We can continue to read more attributes or elements as desired. Markup that is not read is skipped. We can read sub-elements as long as we remember and pass the element returned from ali_in. Now we can use our Ali knowledge to quickly write code to read many real-life uses of XML.
Reading Real Life XML
There are many XML data sources that we can access from C code with Ali. Let's try reading something more complicated than we have tried so far, but also practical. A good choice would be an XML-based RSS feed. There is an RSS feed link at the bottom right of the Slashdot main web page which points to http://slashdot.org/index.rss. Here is some sample RSS data:
<item rdf:about="http://slashdot.org/article.pl?sid=04/03/24/2327229">
<title>In-Depth Look At LinuxBIOS</title>
<link>http://slashdot.org/article.pl?sid=04/03/24/2327229</link>
<description>DrSkwid writes "With PheonixBIOS reading your email because of such inordinate boot up times for Windows and other OSs, it was remarked in #plan9 about our 5s ...</description>
<dc:creator>timothy</dc:creator>
<dc:subject>linux</dc:subject>
<dc:date>2004-03-25T00:29:00+00:00</dc:date>
<slash:department>quickly-quickly</slash:department>
<slash:section>developers</slash:section>
<slash:comments>101</slash:comments>
<slash:hitparade>101,88,65,50,26,20,16</slash:hitparade>
</item>
To read this RSS item, three main pieces are needed. First we need some data structures to store the data in, so let's define some:
typedef struct
{
char * about;
char * title;
char * link;
char * description;
char * creator;
struct tm date;
char * section;
int comment_count;
} rss_item_type;
rss_item_type rss_item;
Second, we need to read the XML for the RSS item. Since it is composed of many pieces, we put the code into its own function. When elements are large or read multiple times, moving that code into its own function helps to maintain clarity. Here is how to do it:
static void
parse_rss_item(ali_doc_info *doc, ali_element_ref itemN, void * data, bool new_element)
{
rss_item_type * item = (rss_item_type *) data;
char * date_string;
ali_in(doc, itemN, "^a%as", 0, "rdf:about", &item->about);
ali_in(doc, itemN, "^e%as", 0, "title", &item->title);
ali_in(doc, itemN, "^e%as", 0, "link", &item->link);
ali_in(doc, itemN, "^e%as", 0, "description", &item->description);
ali_in(doc, itemN, "^e%as", 0, "dc:creator", &item->creator);
if (ali_in(doc, itemN, "^e%as", 0, "dc:date", &date_string))
{
memset(&item->date, 0, sizeof(item->date));
sscanf(date_string, "%d-%d-%dT%d:%d",
&item->date.tm_year, &item->date.tm_mon, &item->date.tm_mday,
&item->date.tm_hour, &item->date.tm_min);
free(date_string);
date_string = NULL;
}
ali_in(doc, itemN, "^e%as", 0, "slash:section", &item->section);
ali_in(doc, itemN, "^e%d", 0, "slash:comments", &item->comment_count);
}
Very little in the function is new. The first two function parameters are what you would expect, but the third needs an explanation. The data parameter is passed through by Ali from your app. The idea is to set it to a pointer to wherever the app wants the data stored, so that the callback can put what it reads somewhere useful.
Note that sscanf is used above to read the date. The format string could be passed to ali_in as well. But the date string could be in several different formats. It makes more sense to read the date as a string and then interpret it based on its format.
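One thing to watch when interpreting the date: the standard struct tm stores tm_year as years since 1900 and tm_mon as months since January (0 to 11). The code above stores the numbers exactly as they appear in the text, which is fine if you only read them back yourself, but functions like mktime and strftime expect the adjusted values. Here is a minimal sketch of an adjusted parser using only standard C. The name parse_feed_date is a hypothetical helper for this illustration, not part of Ali.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper, not part of Ali: parses the leading
 * "YYYY-MM-DDTHH:MM" of a date string into *out.
 * Returns 1 on success, 0 if the text does not match. */
static int parse_feed_date(const char *text, struct tm *out)
{
    int year, month, day, hour, minute;

    if (sscanf(text, "%d-%d-%dT%d:%d",
               &year, &month, &day, &hour, &minute) != 5)
        return 0;

    memset(out, 0, sizeof(*out));
    out->tm_year = year - 1900;  /* struct tm counts years from 1900 */
    out->tm_mon  = month - 1;    /* struct tm months run from 0 to 11 */
    out->tm_mday = day;
    out->tm_hour = hour;
    out->tm_min  = minute;
    return 1;
}
```

Checking the sscanf return value also gives a natural place to try a second format when the first one does not match.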
The third and last piece is some code to read the document and call the item callback function when an item element is found.
static void read_rss()
{
rss_item_type rss_item;
ali_doc_info * doc;
ali_element_ref root;
root = ali_open(&doc, "rss_item.xml", ALI_OPTION_INPUT_XML_DECLARATION, &rss_item);
if (doc != NULL)
{
ali_in(doc, root, "^e%F", 0, "item", parse_rss_item);
ali_close(doc);
}
assert(strcmp(rss_item.about, "http://slashdot.org/article.pl?sid=04/03/24/2327229") == 0);
assert(strcmp(rss_item.title, "In-Depth Look At LinuxBIOS") == 0);
free(rss_item.about);
free(rss_item.title);
free(rss_item.link);
free(rss_item.description);
free(rss_item.creator);
free(rss_item.section);
}
Reading the RSS item element is simple, but real RSS documents are more complicated. They have channel information that needs to be stored, and RSS feeds have multiple items. Here's an RSS feed with all but two items removed, but it is still much more complicated than the last example. You can see the channel, items, and other data.
<?xml version="1.0" encoding="ISO-8859-1" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:admin="http://webns.net/mvcb/"
xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="http://slashdot.org/">
<title>Slashdot</title>
<link>http://slashdot.org/</link>
<description>News for nerds, stuff that matters</description>
<dc:language>en-us</dc:language>
<dc:rights>Copyright 1997-2004, OSDN - Open Source Development Network, Inc. All Rights Reserved.</dc:rights>
<dc:date>2004-04-21T05:43:48+00:00</dc:date>
<dc:publisher>OSDN</dc:publisher>
<dc:creator>pater@slashdot.org</dc:creator>
<dc:subject>Technology</dc:subject>
<syn:updatePeriod>hourly</syn:updatePeriod>
<syn:updateFrequency>1</syn:updateFrequency>
<syn:updateBase>1970-01-01T00:00+00:00</syn:updateBase>
<items>
<rdf:Seq>
<rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/2229212" />
<rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/2348214" />
<rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/2229256" />
<rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/225254" />
<rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/2030212" />
<rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/1954249" />
<rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/1847206" />
<rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/1827232" />
<rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/186237" />
<rdf:li rdf:resource="http://slashdot.org/article.pl?sid=04/04/20/1738217" />
</rdf:Seq>
</items>
<image rdf:resource="http://images.slashdot.org/topics/topicslashdot.gif" />
<textinput rdf:resource="http://slashdot.org/search.pl" />
</channel>
<image rdf:about="http://images.slashdot.org/topics/topicslashdot.gif">
<title>Slashdot</title>
<url>http://images.slashdot.org/topics/topicslashdot.gif</url>
<link>http://slashdot.org/</link>
</image>
<item rdf:about="http://slashdot.org/article.pl?sid=04/04/20/2229212">
<title>Why MySQL Grew So Fast</title>
<link>http://slashdot.org/article.pl?sid=04/04/20/2229212</link>
<description>jpkunst writes "Andy Oram, who attended the MySQL Users Conference which was held April 16-18 in Orlando, Florida, attempts to explain MySQL's popularity in ...</description>
<dc:creator>timothy</dc:creator>
<dc:subject>storage</dc:subject>
<dc:date>2004-04-21T03:57:00+00:00</dc:date>
<slash:department>should-be-has-grown</slash:department>
<slash:section>developers</slash:section>
<slash:comments>172</slash:comments>
<slash:hitparade>172,157,100,77,32,18,10</slash:hitparade>
</item>
<item rdf:about="http://slashdot.org/article.pl?sid=04/04/20/2348214">
<title>Linus Torvalds: Backporting Is A Good Thing</title>
<link>http://slashdot.org/article.pl?sid=04/04/20/2348214</link>
<description>darthcamaro writes "Looks like we don't need to speculate on what Linus' opinion is on backporting. Internetnews.com is running a story this morning that ...</description>
<dc:creator>timothy</dc:creator>
<dc:subject>linux</dc:subject>
<dc:date>2004-04-21T01:47:00+00:00</dc:date>
<slash:department>martha-agrees</slash:department>
<slash:section>developers</slash:section>
<slash:comments>143</slash:comments>
<slash:hitparade>143,120,82,59,43,32,20</slash:hitparade>
</item>
</rdf:RDF>
To read this feed, a few more tasks need to be added to the last example. Data structures are needed to store the channel and multiple items. The function parse_rss_item() needs to keep track of more than a single item. The main reading function needs to read in the channel. Note that to read an element as complicated as the channel, I normally isolate the code into its own function. But for this example I wanted to show reading both with a callback function (the items) and without one (the channel). This is a lot of code. Because some people will want to copy and paste it, I'll keep the code all together.
typedef struct
{
char * about;
char * title;
char * link;
char * description;
char * creator;
struct tm date;
char * section;
int comment_count;
} rss_item_type;
typedef struct
{
char * about;
char * title;
char * link;
char * description;
char * rights;
struct tm date;
char * publisher;
char * creator;
char * subject;
} channel_type;
typedef struct
{
channel_type channel;
int count;
rss_item_type * item;
} rss_xml_data_type;
static void
parse_rss_item(ali_doc_info *doc, ali_element_ref itemN, void * data, bool new_element)
{
rss_xml_data_type * rss = (rss_xml_data_type *) data;
rss_item_type * item;
char * date_string;
if (ali_is_element_new(doc, itemN))
{
/* Resize the items to add one more */
if (rss->count == 0)
{
rss->count++;
rss->item = (rss_item_type *) malloc(sizeof(*(rss->item)));
}
else
{
rss->count++;
rss->item = (rss_item_type *) realloc(rss->item, rss->count * sizeof(*(rss->item)));
}
}
item = &rss->item[rss->count - 1];
ali_in(doc, itemN, "^a%as", 0, "rdf:about", &item->about);
ali_in(doc, itemN, "^e%as", 0, "title", &item->title);
ali_in(doc, itemN, "^e%as", 0, "link", &item->link);
ali_in(doc, itemN, "^e%as", 0, "description", &item->description);
ali_in(doc, itemN, "^e%as", 0, "dc:creator", &item->creator);
if (ali_in(doc, itemN, "^e%as", 0, "dc:date", &date_string))
{
memset(&item->date, 0, sizeof(item->date));
sscanf(date_string, "%d-%d-%dT%d:%d",
&item->date.tm_year, &item->date.tm_mon, &item->date.tm_mday,
&item->date.tm_hour, &item->date.tm_min);
free(date_string);
date_string = NULL;
}
ali_in(doc, itemN, "^e%as", 0, "slash:section", &item->section);
ali_in(doc, itemN, "^e%d", 0, "slash:comments", &item->comment_count);
}
static void read_rss_feed()
{
rss_xml_data_type rss;
ali_doc_info * doc;
ali_element_ref root;
ali_element_ref rdf;
ali_element_ref channel;
char * date_string;
rss.count = 0;
rss.item = NULL;
root = ali_open(&doc, "rss_feed.xml", ALI_OPTION_INPUT_XML_DECLARATION, &rss);
if (doc != NULL)
{
rdf = ali_in(doc, root, "^e", 0, "rdf:RDF");
/* read the channel */
if ((channel = ali_in(doc, rdf, "^e", 0, "channel")) != 0)
{
ali_in(doc, channel, "^a%as", 0, "rdf:about", &rss.channel.about);
ali_in(doc, channel, "^e%as", 0, "title", &rss.channel.title);
ali_in(doc, channel, "^e%as", 0, "link", &rss.channel.link);
ali_in(doc, channel, "^e%as", 0, "description", &rss.channel.description);
ali_in(doc, channel, "^e%as", 0, "dc:rights", &rss.channel.rights);
if (ali_in(doc, channel, "^e%as", 0, "dc:date", &date_string))
{
memset(&rss.channel.date, 0, sizeof(rss.channel.date));
sscanf(date_string, "%d-%d-%dT%d:%d",
&rss.channel.date.tm_year, &rss.channel.date.tm_mon, &rss.channel.date.tm_mday,
&rss.channel.date.tm_hour, &rss.channel.date.tm_min);
free(date_string);
date_string = NULL;
}
ali_in(doc, channel, "^e%as", 0, "dc:publisher", &rss.channel.publisher);
ali_in(doc, channel, "^e%as", 0, "dc:creator", &rss.channel.creator);
ali_in(doc, channel, "^e%as", 0, "dc:subject", &rss.channel.subject);
}
/* read all items until there are no more. */
while (ali_in(doc, rdf, "^oe%F", 0, "item", parse_rss_item))
{
;
}
ali_close(doc);
}
assert(strcmp(rss.channel.about, "http://slashdot.org/") == 0);
assert(rss.count == 2);
assert(strcmp(rss.item[0].title, "Why MySQL Grew So Fast") == 0);
assert(strcmp(rss.item[1].title, "Linus Torvalds: Backporting Is A Good Thing") == 0);
free(rss.item[0].about);
free(rss.item[0].title);
free(rss.item[0].link);
free(rss.item[0].description);
free(rss.item[0].creator);
free(rss.item[0].section);
free(rss.item[1].about);
free(rss.item[1].title);
free(rss.item[1].link);
free(rss.item[1].description);
free(rss.item[1].creator);
free(rss.item[1].section);
free(rss.item);
free(rss.channel.about);
free(rss.channel.title);
free(rss.channel.link);
free(rss.channel.description);
free(rss.channel.rights);
free(rss.channel.publisher);
free(rss.channel.creator);
free(rss.channel.subject);
}
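The item array in parse_rss_item() grows by one slot per item and assigns the result of realloc straight back to rss->item, which loses the old array if realloc ever fails. A sketch of the same growth step with a failure check is shown below. It uses only standard C. The names demo_item, demo_list and demo_list_append are hypothetical stand-ins for the tutorial's types and are not Ali API.

```c
#include <stdlib.h>

/* Stand-in for rss_item_type, reduced to one field for brevity. */
typedef struct
{
    int comment_count;
} demo_item;

/* Stand-in for the count and item members of rss_xml_data_type. */
typedef struct
{
    int count;
    demo_item *item;
} demo_list;

/* Appends one zeroed slot and returns a pointer to it.
 * Returns NULL if memory runs out; the list is unchanged in that case. */
static demo_item *demo_list_append(demo_list *list)
{
    /* realloc(NULL, n) behaves like malloc(n), so an empty list
     * needs no special case. */
    demo_item *grown = realloc(list->item,
                               (size_t)(list->count + 1) * sizeof(*list->item));
    demo_item *slot;

    if (grown == NULL)
        return NULL;

    list->item = grown;
    slot = &list->item[list->count];
    list->count++;
    slot->comment_count = 0;
    return slot;
}
```

Because the array may move each time it grows, a slot pointer should not be kept across a later append. Index into list->item instead, as the asserts in read_rss_feed() do with rss.item[0] and rss.item[1].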
Whew! That is quite a bit to read! There is the RSS XML feed, the data structures to store the information, and the Ali calls to map the XML to the data structures. Reading the RSS feed takes everything we've learned about reading the different kinds of markup. The example shows how to read elements either inline or with a callback function. You've seen that you can read documents with arbitrary data, and that you can handle large and complicated XML documents by decomposing the complicated elements into callback functions. Nothing can stop you now!