magic/xml tutorial

magic/xml aims to be the most powerful XML processing solution for Ruby ever. It wants to solve all your XML generation, analysis, and transformation problems, with easy things being really easy, and who cares about the rest anyway.

This tutorial describes only the basics of magic/xml. The library contains a lot more power than that - if you don't know whether it supports something, just try it. More often than not it will do what you meant!

Introduction

The basic objects you will be working with are XML and String objects. An XML object has a name (like :h3, :body, or :comments), a hash table attrs containing attributes (like :href being "http://en.wikipedia.org/"), and contents - an array of XML and String children.

XML objects do not know anything about their parents. In particular, they do not belong to any "Document". We avoid plenty of complexity this way. How do you build an XML object? It's easy:

hello = XML.parse("<hello>world!</hello>")

The hello object will be an XML with name :hello, no attributes, and a single String child "world!".

You can also load XML from a file or a URL:

hello = XML.load("test.xml")
xml_from_internet = XML.load("http://t-a-w.blogspot.com/atom.xml")

Or build it directly with the constructor:

hello = XML.new(:hello, "world!")

How about printing your XMLs? There couldn't be anything simpler than that. You simply say:

print xml_object # <hello>world!</hello>

If you want to inspect an XML object, just p it. It does the right thing - namely, it prints the name and attributes and hides the children, if any.

p xml_object # <hello>...</hello>

Building XML objects

magic/xml is a "There's more than one way to do it" kind of library, especially as far as building XML objects is concerned. One way to build XML objects is to use the object-oriented interface:

foo = XML.new(:foo, {:color => "blue"}, "Hello")
print foo # <foo color='blue'>Hello</foo>

The first argument to the constructor is a tag name, the optional second argument is a hash of attributes, and the rest are the object's children. Because this is so common, magic/xml also provides an xml method which simply calls XML.new:

print xml(:hello, "world") # <hello>world</hello>

node = xml(:hello)
node[:color] = "blue" # Change an attribute
node << xml(:world) # Add a new child (at the end)
print node # <hello color='blue'><world/></hello>

Of course magic/xml wouldn't be a Ruby library if the constructor didn't accept block arguments.

node = xml(:html) {
    add! xml(:head) {
        add! xml(:title, "Hello world in HTML")
    }
    add! xml(:body) {
        add! xml(:h3) {
            add! "Hello, world"
        }
    }
}

The blocks are instance-eval'ed, so special commands like add! modify the objects we're building. The example above obviously builds a hello world XHTML document. All unknown !-commands are automatically converted to add! xml, so you can also code the last example as:
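
To make the mechanism concrete, here is a minimal sketch in plain Ruby of how a block-based builder like this could work. It is a toy and not magic/xml's actual implementation: the class name TinyNode and its to_s format are assumptions made for illustration only.

```ruby
# Toy sketch only: TinyNode is a hypothetical stand-in, not magic/xml's XML class.
class TinyNode
  attr_reader :name, :contents

  def initialize(name, *children, &block)
    @name = name
    @contents = children
    instance_eval(&block) if block # run the block with self set to this node
  end

  # Append a String or another node to the children being built.
  def add!(child)
    @contents << child
    self
  end

  # Any unknown method ending in "!" is treated as add! TinyNode.new(:tag, ...).
  def method_missing(meth, *args, &block)
    tag = meth.to_s
    return super unless tag.end_with?("!")
    add! TinyNode.new(tag.chomp("!").to_sym, *args, &block)
  end

  def respond_to_missing?(meth, include_private = false)
    meth.to_s.end_with?("!") || super
  end

  def to_s
    return "<#{name}/>" if contents.empty?
    "<#{name}>#{contents.map(&:to_s).join}</#{name}>"
  end
end

doc = TinyNode.new(:html) {
  head! { title! "Hello world in HTML" }
  body! { h3! "Hello, world" }
}
puts doc.to_s
```

Because each block is instance-eval'ed on the node under construction, a bare head! or title! call reaches that node's method_missing, which is what lets the nesting of blocks mirror the nesting of tags.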

node = xml(:html) {
    head! {
        title! "Hello world in HTML"
    }
    body! {
        h3! "Hello, world"
    }
}

Can you imagine an easier way to do it? The only irregularity is that you cannot build the top element with html! - we don't want to take over all !-commands everywhere in the program.

It is easy to create your own functions that build XMLs from templates. For example, a function that builds a simple web page might be:

def build_html(title, body)
    xml(:html) {
        head! {
            title! title
        }
        body! {
            add! body
        }
    }
end

As the add! command (and its equivalent <<) accepts Strings, XMLs, and Arrays of children, it is very easy to do the right thing.

If you want to make an XML-building function that accepts a block, you must use instance_eval, otherwise special commands like add! will not work inside the block.

def build_html(title, &block)
    xml(:html) {
        head! {
            title! title
        }
        body! {
            instance_eval(&block) # & passes the Proc as a block
        }
    }
end
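
The difference between calling a block and instance-eval'ing it can be seen in a small plain-Ruby sketch. Collector and its two build methods are hypothetical names used only for this illustration; they are not part of magic/xml.

```ruby
# Toy sketch only: Collector is a hypothetical class, not part of magic/xml.
class Collector
  attr_reader :items

  def initialize
    @items = []
  end

  def add!(item)
    @items << item
    self
  end

  # Calls the block as-is: self inside the block stays the caller's self.
  def build_with_call(&block)
    block.call
    self
  end

  # Evaluates the block with self set to this Collector, so add! resolves here.
  def build_with_instance_eval(&block)
    instance_eval(&block)
    self
  end
end

good = Collector.new.build_with_instance_eval { add! "Hello" }
puts good.items.inspect # ["Hello"]

failed = begin
  Collector.new.build_with_call { add! "Hello" }
  false
rescue NoMethodError
  true # add! was looked up on the caller, which does not define it
end
puts failed # true
```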

XML.parse accepts either a String or an IO object such as STDIN:

doc = XML.parse("<hello>world</hello>")
doc = XML.parse(STDIN)

Analyzing XML objects

Of course we want to do more with XMLs than just building them and printing them out. Let's load some interesting document first, like the Reddit RSS feed for Web 2.0:

feed = XML.load("http://web2.reddit.com/browse.rss?t=all&s=highscore")

Now let's find all item objects inside the feed, and print their titles and links.

items = feed.descendants(:item)
items.each{|item|
    title_node = item.child(:title)
    link_node = item.child(:link)
    print title_node.text, " - ", link_node.text, "\n"
}

Each item in the feed looks something like this:

<item>
<title>reddit</title>
<link>http://web2.reddit.com/goto?rss=true&amp;id=4v0s</link>
</item>

Doesn't it look as if it really wanted to be this instead?

<item title="reddit" link="http://web2.reddit.com/goto?rss=true&amp;id=4v0s" />

Well, it does, and many other XMLs do, but magic/xml lets you pretend the children are real attributes:

items.each{|item|
    print item[:@title], " - ", item[:@link], "\n"
}

item[:@foo] is exactly equivalent to item.child(:foo).text - it returns the text of the first child with tag :foo. You can even assign to such pseudoattributes, so item[:@title] = "New title" does the right thing.
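
As a rough illustration of the idea, pseudoattribute lookup can be sketched in plain Ruby as follows. TinyItem is a hypothetical toy class, and its behavior for a missing child (returning nil on read, appending a new child on write) is an assumption; the tutorial does not say what magic/xml does in those cases.

```ruby
# Toy sketch only: TinyItem is hypothetical and is not magic/xml's XML class.
class TinyItem
  attr_reader :name, :attrs, :contents

  def initialize(name, attrs = {}, *contents)
    @name = name
    @attrs = attrs
    @contents = contents
  end

  # Concatenated text of all String descendants.
  def text
    contents.map { |c| c.is_a?(TinyItem) ? c.text : c.to_s }.join
  end

  # First child node with the given tag name, or nil.
  def child(tag)
    contents.find { |c| c.is_a?(TinyItem) && c.name == tag }
  end

  # Keys starting with "@" are looked up among children, others among attrs.
  def [](key)
    if key.to_s.start_with?("@")
      node = child(key.to_s[1..-1].to_sym)
      node && node.text # assumption: nil when no such child exists
    else
      attrs[key]
    end
  end

  def []=(key, value)
    if key.to_s.start_with?("@")
      tag = key.to_s[1..-1].to_sym
      node = child(tag)
      if node
        node.contents.replace([value])
      else
        @contents << TinyItem.new(tag, {}, value) # assumption: create the child
      end
    else
      attrs[key] = value
    end
  end
end

item = TinyItem.new(:item, {},
                    TinyItem.new(:title, {}, "reddit"),
                    TinyItem.new(:link, {}, "http://example.com/"))

puts item[:@title]           # reddit
item[:@title] = "New title"
puts item.child(:title).text # New title
```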

If you pass a block to descendants or children, it does an implicit each, so we can condense the whole example into:

feed = XML.load("http://web2.reddit.com/browse.rss?t=all&s=highscore")
feed.descendants(:item) {|item|
    print item[:@title], " - ", item[:@link], "\n"
}

We can use any selector we want as an argument to child, children, or descendants - tag names, hash tables, regular expressions, and even complex selectors built using any(...) and all(...).

feed = XML.load("http://del.icio.us/rss/tag/ruby")
print feed.descendants(all(:item, {:@title => /Rails/i}))

As you can see, we first load the RSS into an XML object, then we select those descendants of it that match all listed criteria: the tag name is :item, and the :@title pseudoattribute matches /Rails/i.
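
One way to think about composable selectors is sketched below in plain Ruby. This is only an illustration of the idea: the helpers matches?, all_of and any_of are hypothetical, nodes are modeled as plain hashes, and none of this is claimed to match how magic/xml implements any(...) and all(...).

```ruby
# Toy sketch only: nodes are plain hashes and all helper names are hypothetical.
def matches?(node, pattern)
  case pattern
  when Symbol then node[:name] == pattern                 # tag name
  when Hash   then pattern.all? { |key, want| want === node[:fields][key] }
  when Proc   then pattern.call(node)                     # composed selector
  else false
  end
end

# Selector that requires every listed pattern to match.
def all_of(*patterns)
  ->(node) { patterns.all? { |pat| matches?(node, pat) } }
end

# Selector that requires at least one listed pattern to match.
def any_of(*patterns)
  ->(node) { patterns.any? { |pat| matches?(node, pat) } }
end

nodes = [
  { name: :item,    fields: { title: "Ruby on Rails 1.2" } },
  { name: :item,    fields: { title: "Python news" } },
  { name: :channel, fields: { title: "Rails feed" } }
]

rails_items = nodes.select { |n| matches?(n, all_of(:item, { title: /Rails/i })) }
puts rails_items.map { |n| n[:fields][:title] }.inspect # ["Ruby on Rails 1.2"]
```

Using === for the hash values means a Regexp, a String, or a Class can all serve as the expected value without special cases.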

Stream processing

magic/xml can also efficiently process long XML streams without loading everything into memory. You call XML.parse_as_twigs(file_name_or_file_handler) with a block that receives each node as it is read. The nodes do not have children yet (otherwise we'd need to load everything into memory), so if you want to read the children, just tell the node to complete!. If you exit the block without calling complete!, the block will be called with the children as they are read. A simple example that extracts article IDs and names from a Wikipedia dump:

XML.parse_as_twigs(STDIN) {|node|
  next unless node.name == :page
  node.complete! # Read all children of <page>...</page> node
  t = node[:@title] # :@title is a child
  i = node[:@id]    # :@id is another child
  print "#{i}: #{t}\n"
}

magic/xml can parse huge files in a very convenient way using very little memory. However, the processing speed is somewhat limited by its all-Ruby REXML parser backend. This will be fixed in the future.
