magic/xml aims to be the most powerful XML processing solution for Ruby ever. It wants to solve all your XML generation, XML analysis, and XML transformation problems, with easy things being really easy, and who cares about the rest anyway.
This tutorial describes only the basics of magic/xml. The library contains a lot more power than that - if you don't know whether it supports something, just try it - more often than not it will do what you meant!
The basic objects you will be working with are XML and String objects. An XML object has a name (like :h3, :body, or :comments), a hash table attrs containing attributes (like :href being "http://en.wikipedia.org/"), and contents - an array of XML and String children.
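The three parts named above can be pictured with a plain Ruby Struct. This is only an illustrative stand-in, not the real magic/xml class: the name ToyXML and its layout are assumptions made for this sketch.

```ruby
# Hypothetical stand-in for the data model described above.
# ToyXML is NOT the magic/xml class; it only mirrors the three parts
# the tutorial names: name, attrs and contents.
ToyXML = Struct.new(:name, :attrs, :contents)

link = ToyXML.new(:a, { :href => "http://en.wikipedia.org/" }, ["Wikipedia"])
para = ToyXML.new(:p, {}, ["See ", link, " for details."])

puts para.name                                  # the tag name, a Symbol
puts para.contents[1].attrs[:href]              # an attribute of the nested child
puts para.contents.map { |c| c.class }.inspect  # String and ToyXML children mixed
```

Note that nothing in the child points back at para, which matches the "no parents, no Document" design.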
XML objects do not know anything about their parents. In particular, they do not belong to any "Document". We avoid plenty of complexity this way.
How do you build an XML object? It's easy:
hello = XML.parse("<hello>world!</hello>")
The hello object will be an XML with name :hello, no attributes, and a single child, the String "world!".
You can also load them from a file or from the Internet:
hello = XML.load("test.xml")
xml_from_internet = XML.load("http://t-a-w.blogspot.com/atom.xml")
Or you can create XML objects on your own (more on that later):
hello = XML.new(:hello, "world!")
How about printing your XMLs? There couldn't be anything simpler than that. You simply say:
print xml_object # <hello>world!</hello>
If you want to inspect an XML object, just p it. It does the right thing - namely, it prints the name and attributes and hides the children, if any.
p xml_object # <hello>...</hello>
magic/xml is a "There's more than one way to do it" kind of library, especially as far as building XML objects is concerned. One way to build XML objects is to use the object-oriented interface:
foo = XML.new(:foo, {:color => "blue"}, "Hello")
print foo # <foo color='blue'>Hello</foo>
The first argument to the constructor is a tag name, the optional second argument is a hash with attributes, and the rest are the object's children. Because it's so common, magic/xml also provides an xml method, which simply calls XML.new:
print xml(:hello, "world") # <hello>world</hello>
The XML objects can be freely modified using the usual methods:
node = xml(:hello)
node[:color] = "blue" # Change an attribute
node << xml(:world) # Add a new child (at the end)
print node # <hello color='blue'><world/></hello>
Of course magic/xml wouldn't be a Ruby library if the constructor didn't accept block arguments.
node = xml(:html) {
  add! xml(:head) {
    add! xml(:title, "Hello world in HTML")
  }
  add! xml(:body) {
    add! xml(:h3) {
      add! "Hello, world"
    }
  }
}
The blocks are instance_eval'ed, so special commands like add! modify the objects we're building. The example above obviously builds a hello world XHTML document.
All unknown !-commands are automatically converted to add! xml, so you can also code the last example as:
node = xml(:html) {
  head! {
    title! "Hello world in HTML"
  }
  body! {
    h3! "Hello, world"
  }
}
Can you imagine an easier way to do it? The only irregularity is that you cannot build the top element with html! - we don't want to take over all !-commands everywhere in the program.
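How unknown !-commands could be turned into child elements can be sketched in plain Ruby with instance_eval and method_missing. This is a toy, not magic/xml's actual implementation, which may work differently; ToyNode and toy_xml are hypothetical names, and the toy does no escaping and has no attributes.

```ruby
# Toy builder illustrating the idea only. ToyNode and toy_xml are
# hypothetical names and are not part of magic/xml.
class ToyNode
  attr_reader :name, :contents

  def initialize(name, *contents, &block)
    @name = name
    @contents = contents
    instance_eval(&block) if block # the block runs with self set to this node
  end

  def add!(child)
    @contents << child
    self
  end

  # Any unknown method ending in "!" means "add a child with that tag name".
  def method_missing(meth, *args, &block)
    return super unless meth.to_s.end_with?("!")
    add!(ToyNode.new(meth.to_s.chomp("!").to_sym, *args, &block))
  end

  def respond_to_missing?(meth, include_private = false)
    meth.to_s.end_with?("!") || super
  end

  def to_s
    inner = @contents.map(&:to_s).join
    inner.empty? ? "<#{@name}/>" : "<#{@name}>#{inner}</#{@name}>"
  end
end

# The top element still needs an ordinary method, as the tutorial notes:
# only calls made inside a node's block are intercepted.
def toy_xml(name, *contents, &block)
  ToyNode.new(name, *contents, &block)
end

page = toy_xml(:html) {
  head! { title! "Hello world in HTML" }
  body! { h3! "Hello, world" }
}
puts page
```

Because method_missing is defined only on the node class, a bare html! at the top level of the program stays an ordinary NoMethodError, which is the trade-off described above.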
It is easy to create your own functions that build XMLs from templates. For example, a function that builds a simple website could be:
def build_html(title, body)
  xml(:html) {
    head! {
      title! title
    }
    body! {
      add! body
    }
  }
end
As the add! command (and its equivalent <<) accepts Strings, XMLs, and Arrays of children, it is very easy to do the right thing.
If you want to make an XML-building function that accepts a block, you must use instance_eval, otherwise special commands like add! will not work inside the block.
def build_html(title, &block)
  xml(:html) {
    head! {
      title! title
    }
    body! {
      # Pass the Proc as a block (&block), so it runs with self set to the body node
      instance_eval(&block)
    }
  }
end
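Why instance_eval is needed can be seen in plain Ruby, without magic/xml. In the sketch below, Basket, fill_basket and fill_basket_with_yield are hypothetical names made up for illustration.

```ruby
# Plain-Ruby illustration; Basket is a hypothetical class, not part of magic/xml.
class Basket
  attr_reader :items

  def initialize
    @items = []
  end

  def add!(item)
    @items << item
    self
  end
end

# Runs the caller's block with self switched to a fresh Basket,
# so a bare add! inside the block reaches the Basket.
def fill_basket(&block)
  basket = Basket.new
  basket.instance_eval(&block)
  basket.items
end

# Plain yield keeps the caller's self, so the block has to use the
# object it is handed explicitly.
def fill_basket_with_yield
  basket = Basket.new
  yield basket
  basket.items
end

p(fill_basket { add! "apple"; add! "pear" })
p(fill_basket_with_yield { |b| b.add! "apple" })
```

With the yield variant, a bare add! inside the block would be looked up on the caller and fail with NoMethodError, which is the failure mode the paragraph above warns about.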
You can also parse a String or a File object:
doc = XML.parse("<hello>world</hello>")
doc = XML.parse(STDIN)
Of course we want to do more with XMLs than just build them and print them out. Let's load some interesting document first, like the Reddit RSS feed for Web 2.0:
feed = XML.load("http://web2.reddit.com/browse.rss?t=all&s=highscore")
Now let's find all item objects inside the feed, and print their titles and links.
items = feed.descendants(:item)
items.each{|item|
  title_node = item.child(:title)
  link_node = item.child(:link)
  print title_node.text, " - ", link_node.text, "\n"
}
This is ugly. And why does RSS even use those elements:
<item>
  <title>reddit</title>
  <link>http://web2.reddit.com/goto?rss=true&id=4v0s</link>
</item>
instead of real attributes:
<item title="reddit" link="http://web2.reddit.com/goto?rss=true&id=4v0s" />
Well, it does, and many other XMLs do, but magic/xml lets you pretend they are real attributes:
items.each{|item| print item[:@title], " - ", item[:@link], "\n" }
item[:@foo] is exactly equivalent to item.child(:foo).text - it returns the text of the first child with tag :foo. You can even assign to such pseudoattributes, so item[:@title] = "New title" does the right thing.
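The pseudoattribute convention can be modelled in a few lines of plain Ruby. The class below is a toy written for this explanation: ToyItem is a hypothetical name, and details such as creating a missing child on assignment are assumptions of the sketch, not documented magic/xml behaviour.

```ruby
# Toy model of the :@name convention; ToyItem is hypothetical and
# only mimics the behaviour described above.
class ToyItem
  attr_reader :name, :attrs, :contents

  def initialize(name, attrs = {}, *contents)
    @name, @attrs, @contents = name, attrs, contents
  end

  # First child element with the given tag name, or nil.
  def child(tag)
    @contents.find { |c| c.is_a?(ToyItem) && c.name == tag }
  end

  # Concatenated text of this node and its descendants.
  def text
    @contents.map { |c| c.is_a?(ToyItem) ? c.text : c }.join
  end

  # :@foo reads the text of the first <foo> child; any other key
  # reads a real attribute.
  def [](key)
    if key.to_s.start_with?("@")
      node = child(key.to_s[1..-1].to_sym)
      node && node.text
    else
      @attrs[key]
    end
  end

  # Assigning to :@foo replaces the contents of the first <foo> child.
  # Creating the child when it is missing is an assumption of this sketch.
  def []=(key, value)
    if key.to_s.start_with?("@")
      tag = key.to_s[1..-1].to_sym
      node = child(tag)
      @contents << (node = ToyItem.new(tag)) unless node
      node.contents.replace([value])
    else
      @attrs[key] = value
    end
  end
end

item = ToyItem.new(:item, { :id => "4v0s" },
                   ToyItem.new(:title, {}, "reddit"),
                   ToyItem.new(:link, {}, "http://web2.reddit.com/"))
puts item[:@title]              # text of the <title> child
item[:@title] = "New title"
puts item.child(:title).text    # the child itself was updated
puts item[:id]                  # a real attribute, no "@" in the key
```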
Now if you pass a block to descendants or children, it will do an implicit each, so we can condense the whole example into:
feed = XML.load("http://web2.reddit.com/browse.rss?t=all&s=highscore")
feed.descendants(:item) {|item|
  print item[:@title], " - ", item[:@link], "\n"
}
That wasn't so bad, was it?
Now we can use any selector we want as an argument to child, children, or descendants - tag names, hash tables, regular expressions, and even complex selectors built using any(...) and all(...).
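One way to picture how such composable selectors could be evaluated is a small recursive matcher. Everything in this sketch is hypothetical (ToyEl, toy_all, toy_any, toy_match?), and the matching rules are assumptions: here a Regexp is tested against the element's text and a Hash against its real attributes, which may not be what magic/xml does.

```ruby
# Toy selector matcher; all names here are hypothetical and the rules
# are assumptions made for illustration, not magic/xml's implementation.
ToyEl  = Struct.new(:name, :attrs, :text)
ToyAll = Struct.new(:selectors)
ToyAny = Struct.new(:selectors)

def toy_all(*selectors)
  ToyAll.new(selectors)
end

def toy_any(*selectors)
  ToyAny.new(selectors)
end

def toy_match?(el, selector)
  case selector
  when Symbol then el.name == selector                 # tag name
  when Regexp then !!(selector =~ el.text)             # text matches
  when Hash   then selector.all? { |key, want| want === el.attrs[key] }
  when ToyAll then selector.selectors.all? { |s| toy_match?(el, s) }
  when ToyAny then selector.selectors.any? { |s| toy_match?(el, s) }
  else false
  end
end

items = [
  ToyEl.new(:item, { :lang => "en" }, "Rails 1.2 released"),
  ToyEl.new(:item, { :lang => "pl" }, "Ruby quiz"),
  ToyEl.new(:channel, {}, "Rails feed"),
]

selector = toy_all(:item, toy_any(/rails/i, { :lang => "pl" }))
puts items.select { |el| toy_match?(el, selector) }.map(&:text)
```

Using === for hash values means a value can itself be a String or a Regexp, so nested criteria compose without special cases.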
Let's load the del.icio.us RSS feed for Ruby this time and print those items that have "Rails" in their titles.
feed = XML.load("http://del.icio.us/rss/tag/ruby")
print feed.descendants(all(:item, {:@title => /Rails/i}))
As you can see, we first load the RSS into an XML object, then we select those descendants of it that match all listed criteria: the tag name is :item, and :@title (that is, the text of their first child with tag name :title) matches the regular expression /Rails/i (/i means case-insensitive).
magic/xml can also efficiently process long XML streams without loading everything into memory.
Basically, you call XML.parse_as_twigs(file_name_or_file_handler) with a block that will be getting all the nodes. The nodes do not have children yet (well, otherwise we'd need to load everything into memory), so if you want to read the children, just tell the node to complete!. If you exit the block without calling complete!, the block will be called with the children as they are read. A simple example that extracts article IDs and names from a Wikipedia dump:
XML.parse_as_twigs(STDIN) {|node|
  next unless node.name == :page
  node.complete! # Read all children of <page>...</page> node
  t = node[:@title] # :@title is a child
  i = node[:@id] # :@id is another child
  print "#{i}: #{t}\n"
}
magic/xml can parse huge files in a very convenient way using very little memory. However, the processing speed is somewhat limited by its all-Ruby REXML parser backend. This will be fixed in the future.
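To give a feel for what streaming means at the level of the REXML backend mentioned above, here is a minimal sketch that uses REXML's pull parser directly. It is not magic/xml code and says nothing about how parse_as_twigs is implemented; each_page is a hypothetical helper, and it assumes that title and id are direct children of each page element.

```ruby
require "rexml/parsers/pullparser"
require "stringio"

# Hypothetical helper: yields one small Hash per <page> element while
# reading the input as a stream of events, never building a full tree.
def each_page(io)
  parser = REXML::Parsers::PullParser.new(io)
  path = []   # stack of currently open tag names
  page = nil  # fields collected for the <page> being read, if any

  while parser.has_next?
    event = parser.pull
    if event.start_element?
      path.push(event[0])
      page = {} if event[0] == "page"
    elsif event.text?
      # Keep text only from direct <title> and <id> children of <page>.
      if page && path[-2] == "page" && %w[title id].include?(path[-1])
        page[path[-1]] = (page[path[-1]] || "") + event[0]
      end
    elsif event.end_element?
      if event[0] == "page"
        yield page
        page = nil
      end
      path.pop
    end
  end
end

dump = "<mediawiki><page><title>Ruby</title><id>7</id>" \
       "<revision><id>99</id></revision></page></mediawiki>"
each_page(StringIO.new(dump)) { |page| puts "#{page['id']}: #{page['title']}" }
```

The nested revision id is ignored because its parent is not the page element; only one page's worth of fields is held in memory at a time.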
magic/xml uses Symbols for tag names and attribute names. Sometimes they contain funny characters like a dash (-) or accented letters. In such cases you can use the :"funny-name" syntax in Ruby to get the right symbol. Such funny tags also cannot have automatic !-commands, so just use add! if you want to build XML with such tags.
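The quoted Symbol syntax is plain Ruby and can be tried without magic/xml. In the sketch below, plain_identifier? is a hypothetical helper (a conservative, ASCII-only approximation) that shows which tag names could be written as a bare name! call at all.

```ruby
# Plain Ruby: quoting lets a Symbol contain characters such as "-" or ":".
dashed     = :"funny-name"
namespaced = :"xml:lang"

puts dashed.inspect
puts namespaced.to_s

# The same Symbol can also be made from a String at run time.
puts "funny-name".to_sym == dashed

# Hypothetical helper: true when a tag name is a plain ASCII identifier,
# i.e. something that could be typed as a bare `name!` method call.
# Written without quotes, funny-name! would parse as `funny - name!`.
def plain_identifier?(tag)
  !!(tag.to_s =~ /\A[a-z_][A-Za-z0-9_]*\z/)
end

puts plain_identifier?(:title)
puts plain_identifier?(dashed)
```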