Saturday, May 14, 2011

Using XPath in Xml With Namespaces.

Hello,

[Beginner level, although the presented case might not be so common]

Today I want to consider the following case:
In your application, the user may load an Xml file and search for certain Xml nodes using an XPath expression. For the sake of the examples in this post, I assume the user wants only the number of the elements returned by the XPath query. First, I present a solution written in code and then an example UI for the end user.

In my examples I am using System.Xml (XmlDocument, XmlNode, etc.), but the same approach works for both the XPathNavigator and XElement classes. I searched the internet for a good explanation of this topic but couldn't find a useful one. I did find some posts on Stack Overflow (e.g. a somewhat useful one here), and the somewhat vague explanation in MSDN can be found here.

In my first example the user loads a simple XML:

<Products>
  <Product>
    <Book>
      <Author>Orson Scott Card</Author>
      <Name>Ender's Game</Name>
    </Book>
  </Product>
 
 
  <Product>
    <Book>
      <Author>Douglas Adams</Author>
      <Name>The Hitchhiker's Guide to the Galaxy</Name>
    </Book>
  </Product>
</Products>

The user would like to get the number of Book elements, which is 2. Here is the code:
 XmlDocument document = new XmlDocument();
 document.LoadXml(sb.ToString());
 int count = document.SelectNodes("/Products/Product/Book").Count;
 Console.Out.WriteLine("Example 1: The number of returned elements is {0}", count);

Indeed, when running the application we see:
Example 1: The number of returned elements is 2



Now things get a bit more complicated and the Xml becomes (two namespaces are added to the root element):


<a:Products xmlns:a="MyUri1" xmlns:b="MyUri2">
  <Product>
    <b:Book>
      <Author>Orson Scott Card</Author>
      <Name>Ender's Game</Name>
    </b:Book>
  </Product>
  <Product>
    <Book>
      <Author>Douglas Adams</Author>
      <Name>The Hitchhiker's Guide to the Galaxy</Name>
    </Book>
  </Product>
</a:Products>

When we use the same code as in the first example the number of the elements returned by the query is 0. But why? Looking at the new Xml we see that the Products element is in the namespace MyUri1 yet we did not specify it anywhere in our query. A common mistake would be to assume the XmlDocument will detect that the prefix "a" is mapped to MyUri1 and try the following code:

int count = document.SelectNodes("/a:Products/Product/b:Book").Count;
This code results in the following exception:
 (System.Xml.XPath.XPathException: Namespace Manager or XsltContext needed. This query has a prefix, variable, or user-defined function. ) telling you that the XPath engine does not know what to do with the given prefixes ("a" and "b" in the query).

Looking closely at the SelectNodes method you can see that it has an overload that accepts a second argument: an XmlNamespaceManager. You can consider the XmlNamespaceManager a framework-provided implementation of the IXmlNamespaceResolver interface. This class is essentially a dictionary that maps a prefix to a namespace and vice versa. Inspecting the properties of any XmlNode reveals that the namespace of the node is stored in a property called NamespaceURI. For example, these are the properties of the Products XmlElement:
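The same properties can also be printed from code. The following is a minimal sketch, not part of the original sample: the class and method names are made up for illustration, and the document is a shortened version of the one above. It reads Prefix, LocalName and NamespaceURI from the root element and shows the two-way lookup that XmlNamespaceManager offers.

```csharp
using System;
using System.Xml;

// Hypothetical helper class, for illustration only.
public static class NamespacePropertiesDemo
{
    // Shortened version of the second sample document (one product only).
    private const string Xml =
        "<a:Products xmlns:a='MyUri1' xmlns:b='MyUri2'>" +
        "<Product><b:Book><Name>Ender's Game</Name></b:Book></Product>" +
        "</a:Products>";

    // Returns "Prefix|LocalName|NamespaceURI" for the root element.
    public static string DescribeRoot()
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(Xml);
        XmlElement root = document.DocumentElement;
        return root.Prefix + "|" + root.LocalName + "|" + root.NamespaceURI;
    }

    // The manager can be queried in both directions:
    // prefix -> URI and URI -> prefix.
    public static string LookupBothWays()
    {
        XmlNamespaceManager manager = new XmlNamespaceManager(new NameTable());
        manager.AddNamespace("a", "MyUri1");
        return manager.LookupNamespace("a") + "|" + manager.LookupPrefix("MyUri1");
    }

    public static void Main()
    {
        Console.WriteLine(DescribeRoot());   // a|Products|MyUri1
        Console.WriteLine(LookupBothWays()); // MyUri1|a
    }
}
```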


You can also see that the LocalName of the element is "Products" and the prefix is "a" as expected. To make our XPath work we need to map this prefix to the namespace Uri:


      XmlNamespaceManager manager = new XmlNamespaceManager(document.NameTable);
      manager.AddNamespace("a", "MyUri1");
      manager.AddNamespace("b", "MyUri2");
 
      int count = document.SelectNodes("/a:Products/Product/b:Book",manager).Count;
      Console.Out.WriteLine("Example 2: The number of returned elements is {0}", count);

Running the application we see:
 Example 2: The number of returned elements is 1
(note that the other book element is in a different namespace and therefore will not be counted)
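As mentioned at the beginning of the post, the same idea carries over to the XElement family of classes. Here is a rough sketch of the equivalent query using XDocument and the XPathSelectElements extension method from System.Xml.XPath. The class and method names are my own invention for this illustration; the manager is built on a fresh NameTable because XDocument does not expose one.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper class, for illustration only.
public static class XDocumentNamespaceDemo
{
    // The second sample document from this post.
    private const string Xml =
        "<a:Products xmlns:a='MyUri1' xmlns:b='MyUri2'>" +
        "<Product><b:Book><Author>Orson Scott Card</Author>" +
        "<Name>Ender's Game</Name></b:Book></Product>" +
        "<Product><Book><Author>Douglas Adams</Author>" +
        "<Name>The Hitchhiker's Guide to the Galaxy</Name></Book></Product>" +
        "</a:Products>";

    public static int CountBooks()
    {
        XDocument document = XDocument.Parse(Xml);

        // Same prefix-to-URI mappings as in the XmlDocument example.
        XmlNamespaceManager manager = new XmlNamespaceManager(new NameTable());
        manager.AddNamespace("a", "MyUri1");
        manager.AddNamespace("b", "MyUri2");

        // XPathSelectElements accepts any IXmlNamespaceResolver.
        return document
            .XPathSelectElements("/a:Products/Product/b:Book", manager)
            .Count();
    }

    public static void Main()
    {
        Console.WriteLine("XDocument: The number of returned elements is {0}", CountBooks());
    }
}
```

As with the XmlDocument version, only the Book element in MyUri2 is counted.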

This is a good point to ask "how important are those prefixes?". The simple answer is that they are not important. When an XPath query "meets" a prefix it will invoke the method 
string LookupNamespace(string prefix);
from the provided IXmlNamespaceResolver and compare the Uri returned by this method to the NamespaceURI property of every node it could return. This means that the prefixes used in the original Xml file do not matter; only the mappings you make in your manager do.

For example, the code:


      XmlNamespaceManager manager = new XmlNamespaceManager(document.NameTable);        
      manager.AddNamespace("M", "MyUri1");
      manager.AddNamespace("N", "MyUri2");
      int count =  document.SelectNodes("/M:Products/Product/N:Book", manager).Count;
      Console.Out.WriteLine("Example 3: The number of returned elements is {0}", count);


would work exactly like the previous example, even though the prefixes "M" and "N" do not appear anywhere in the document.


Now let's add an additional complication to the Xml: a default namespace and a duplicate Uri. The Xml now looks like this:


<a:Products xmlns:a="MyUri1" xmlns:b="MyUri2">
  <Product xmlns="MyUri3">
    <b:Book>
      <Author>Orson Scott Card</Author>
      <Name>Ender's Game</Name>
    </b:Book>
  </Product>
  <p:Product xmlns:p="MyUri3">
    <b:Book>
      <Author>Douglas Adams</Author>
      <Name>The Hitchhiker's Guide to the Galaxy</Name>
    </b:Book>
  </p:Product>
</a:Products>



MyUri3 is the default namespace for the first Product element and for every node inside it that does not have another namespace mapping (e.g. the Book element has the prefix "b", which maps to the namespace MyUri2). You should be able to work out how to handle it. Remember that it doesn't matter how the namespaces are mapped within the document, only how you map them in the manager. Note that the XPath from the previous example would return 0 elements: the first Product element now has a namespace (its NamespaceURI is MyUri3, taken from the default namespace declaration because it has no prefix), and the second Product element has an explicit namespace, which is once again MyUri3 but bound to the prefix "p".
The correct code to get the Book elements is therefore:


      XmlNamespaceManager manager = new XmlNamespaceManager(document.NameTable);                  
      manager.AddNamespace("a", "MyUri1");
      manager.AddNamespace("b", "MyUri2");
      manager.AddNamespace("dns", "MyUri3");
      int count = document.SelectNodes("/a:Products/dns:Product/b:Book", manager).Count;       
      Console.Out.WriteLine("Example 5: The number of returned elements is {0}", count);



Note that the prefix "dns" is just an arbitrary string I invented. What is important is that it points to the right namespace within the manager.


The output here is: 
Example 5: The number of returned elements is 2
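To see why both Product elements match "dns:Product", it can help to print the namespace of every element in the document. The sketch below is only an illustration (the class and method names are hypothetical); it lists each element's qualified name next to its NamespaceURI, using "(none)" for elements without a namespace.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper class, for illustration only.
public static class DefaultNamespaceDemo
{
    // The third sample document from this post.
    private const string Xml =
        "<a:Products xmlns:a='MyUri1' xmlns:b='MyUri2'>" +
        "<Product xmlns='MyUri3'><b:Book><Author>Orson Scott Card</Author>" +
        "<Name>Ender's Game</Name></b:Book></Product>" +
        "<p:Product xmlns:p='MyUri3'><b:Book><Author>Douglas Adams</Author>" +
        "<Name>The Hitchhiker's Guide to the Galaxy</Name></b:Book></p:Product>" +
        "</a:Products>";

    // Returns one "QualifiedName -> NamespaceURI" line per element,
    // in document order.
    public static List<string> DescribeElements()
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(Xml);

        List<string> lines = new List<string>();
        foreach (XmlNode node in document.SelectNodes("//*"))
        {
            string uri = node.NamespaceURI.Length == 0 ? "(none)" : node.NamespaceURI;
            lines.Add(node.Name + " -> " + uri);
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in DescribeElements())
        {
            Console.WriteLine(line);
        }
    }
}
```

Both "Product" and "p:Product" report MyUri3. The Author and Name elements under the first Product also inherit MyUri3 from the default namespace declaration, while the ones under p:Product have no namespace at all, which matters if you extend the query down to those elements.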


In my final example I am showing a different approach. You can instruct the XPath engine to ignore namespaces. I would use this approach only in specific cases when I know namespaces are not important. The code in this case would look like this:


      int count = document.SelectNodes("/*[local-name() = 'Products']/*[local-name() = 'Product']/*[local-name() = 'Book']").Count;       
      Console.Out.WriteLine("Example 6: The number of returned elements is {0}", count);

There is no need to use a manager since we instruct the engine to ignore the namespace in each step of the XPath query. We do so by specifying that the engine should compare only the LocalName property and ignore the prefix and the NamespaceURI of the elements.


As expected this code outputs:
Example 6: The number of returned elements is 2
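If ignoring namespaces completely is too permissive, the standard XPath function namespace-uri() can be combined with local-name() in the same predicate, still without a manager. The following sketch is an illustration under my own assumptions (hypothetical class and method names, run against the second sample document, where one Book is in MyUri2 and the other has no namespace) and compares the two variants.

```csharp
using System;
using System.Xml;

// Hypothetical helper class, for illustration only.
public static class LocalNameDemo
{
    // The second sample document: one b:Book and one Book without a namespace.
    private const string Xml =
        "<a:Products xmlns:a='MyUri1' xmlns:b='MyUri2'>" +
        "<Product><b:Book><Name>Ender's Game</Name></b:Book></Product>" +
        "<Product><Book><Name>The Hitchhiker's Guide to the Galaxy</Name></Book></Product>" +
        "</a:Products>";

    private static int Count(string xpath)
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(Xml);
        // No prefixes appear in the query, so no namespace manager is needed.
        return document.SelectNodes(xpath).Count;
    }

    // Matches on the local name only, so both Book elements are counted.
    public static int CountIgnoringNamespaces()
    {
        return Count("/*[local-name() = 'Products']" +
                     "/*[local-name() = 'Product']" +
                     "/*[local-name() = 'Book']");
    }

    // Additionally requires the Book element to be in MyUri2.
    public static int CountBooksInMyUri2()
    {
        return Count("/*[local-name() = 'Products']" +
                     "/*[local-name() = 'Product']" +
                     "/*[local-name() = 'Book' and namespace-uri() = 'MyUri2']");
    }

    public static void Main()
    {
        Console.WriteLine("Ignoring namespaces: {0}", CountIgnoringNamespaces());
        Console.WriteLine("Books in MyUri2 only: {0}", CountBooksInMyUri2());
    }
}
```

The first query counts both Book elements, the second only the one in MyUri2. The price is a much longer query, so for anything beyond a quick check the manager-based approach is usually easier to read.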


This is all nice and dandy but we still have the user to consider.
I created a simple example of what the UI might look like in this case:


In the example UI you can see the XPath query Textbox at the top and, underneath it, a table of all the namespaces the user wants to include in the query. I build the initial table directly from the prefixes in the document using the following code:

 private void PopulateInitialNamespaces()
    {
      XmlNodeList nodes = _document.SelectNodes("//*");
      HashSet<string> knownUris = new HashSet<string>();
      string defaultNamespaceHeader = "dns";
      int defaultNamespaceCount = 0;
      foreach (XmlNode node in nodes)
      {
        foreach (XmlAttribute attribute in node.Attributes)
        {
          if (attribute.Prefix == "xmlns" && !knownUris.Contains(attribute.Value))   //Handle named namespace
          {
            knownUris.Add(attribute.Value);
            NamespaceItems.Add(new NamespaceItem(attribute.LocalName, attribute.Value));
          }
 
          if (attribute.LocalName == "xmlns" && !knownUris.Contains(attribute.Value))
          {
            knownUris.Add(attribute.Value);
            defaultNamespaceCount++;
            NamespaceItems.Add(new NamespaceItem(defaultNamespaceHeader + defaultNamespaceCount.ToString(), attribute.Value));
          }
 
        }
      }
    }

(If anyone has a better way to do it, please add a comment explaining how.)
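One possible alternative, offered as a sketch rather than a definitive answer: XPathNavigator exposes GetNamespacesInScope, and with XmlNamespaceScope.Local it returns only the declarations made on the current element, so the attribute names do not have to be inspected by hand. The class name, method name and return type below are hypothetical and are not part of the sample project; the sketch keeps the same "one entry per Uri" rule and the same "dns" naming for default namespaces.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.XPath;

// Hypothetical helper class, for illustration only.
public static class NamespaceCollector
{
    // Returns prefix/Uri pairs, one per distinct Uri.
    public static List<KeyValuePair<string, string>> Collect(XmlDocument document)
    {
        List<KeyValuePair<string, string>> result = new List<KeyValuePair<string, string>>();
        HashSet<string> knownUris = new HashSet<string>();
        int defaultNamespaceCount = 0;

        foreach (XmlNode node in document.SelectNodes("//*"))
        {
            XPathNavigator navigator = node.CreateNavigator();
            // Local scope: only the namespaces declared on this element.
            IDictionary<string, string> declared =
                navigator.GetNamespacesInScope(XmlNamespaceScope.Local);

            foreach (KeyValuePair<string, string> pair in declared)
            {
                // Skip xmlns="" (an undeclared default) and Uris seen before.
                if (pair.Value.Length == 0 || !knownUris.Add(pair.Value))
                {
                    continue;
                }

                // A default namespace is reported with an empty prefix,
                // so give it an invented one that XPath can use.
                string prefix = pair.Key;
                if (prefix.Length == 0)
                {
                    defaultNamespaceCount++;
                    prefix = "dns" + defaultNamespaceCount;
                }
                result.Add(new KeyValuePair<string, string>(prefix, pair.Value));
            }
        }
        return result;
    }

    public static void Main()
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(
            "<a:Products xmlns:a='MyUri1' xmlns:b='MyUri2'>" +
            "<Product xmlns='MyUri3'><b:Book /></Product>" +
            "<p:Product xmlns:p='MyUri3'><b:Book /></p:Product>" +
            "</a:Products>");

        foreach (KeyValuePair<string, string> item in Collect(document))
        {
            Console.WriteLine("{0} = {1}", item.Key, item.Value);
        }
    }
}
```

For the third sample document this should yield entries for "a", "b" and "dns1"; the "p" prefix is skipped because MyUri3 is already known by then.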

The UI allows adding and removing namespace mappings and, finally, executing the query.
The code is:

    private void RunExecuted(object sender, ExecutedRoutedEventArgs e)
    {
      XmlNamespaceManager manager = new XmlNamespaceManager(_document.NameTable);
      foreach (NamespaceItem item in NamespaceItems)
      {
        if (!string.IsNullOrEmpty(item.Prefix) && !string.IsNullOrEmpty(item.Namesapce))
        {
          manager.AddNamespace(item.Prefix, item.Namesapce);
        }
      }
 
      try
      {
        int count = _document.SelectNodes(XPathText, manager).Count;
        ResultText = string.Format("The query will return {0} elements.", count);
      }
      catch (Exception ex)
      {
        ResultText = ex.Message;
      }
 
    }
The full code for everything I discuss in this post can be found here:
(link)

Thanks again for reading. Feel free to leave your comments and thoughts.
Boris.