What is XPath? How to use it?

Basic Operations with XPath

In recent articles on the R programming language, we performed various operations on XML and used Rcrawler to scrape a website and generate a sitemap for it. The Python Selenium examples rely on similar techniques. Since those articles use XPath, it is worth looking at XPath itself.

XML is a widely used format for storing and exchanging data between different systems. To work with XML data, it's often necessary to extract information from specific fields or nodes within the XML file. XPath is a powerful tool for querying XML data and selecting the information you need. In this article, we'll explore what XPath is, how it works, and some common XPath expressions.

In various programming languages such as R, Python, and PHP, we read and/or save data in XML format according to our requirements. In such cases, various functions/methods are used to extract data from relevant fields (nodes and/or attributes) or to create these fields. In my article titled XML and Basic Concepts, I briefly explained the XML structure. You can check out this article for more information on XML structure.

XPath

XPath is a query language that is frequently used to extract data from XML documents (including HTML) and is still under active development [1]. XPath is a W3C standard that lets us move through an XML document from node to node by means of path expressions [2]. The current version is XPath 3.1, published in 2017 [3], but XPath 1.0 is still the most widely used version. CSS Selectors, another W3C standard, can be considered a simpler alternative to XPath.

Syntax and Semantics

As with any language, there are certain rules that must be followed in order to perform operations with XPath. In order to relate it to previous articles, I will use the sitemap structure.

| Axis | Abbreviation | Description |
| --- | --- | --- |
| ancestor | | Selects all ancestors (parent, grandparent, etc.) of the context node |
| ancestor-or-self | | Selects all ancestors of the context node and the context node itself |
| attribute | @ | Selects the attributes of the context node. @class is short for attribute::class |
| child | (default axis) | Selects the children of the context node. loc is short for child::loc |
| descendant | | Selects all descendants (children, grandchildren, etc.) of the context node |
| descendant-or-self | // | Selects the context node itself and all of its descendants (children, grandchildren, etc.). // is short for /descendant-or-self::node()/ |
| parent | .. | Selects the parent of the context node. .. is short for parent::node() |
| preceding | | Selects all nodes that appear before the context node in document order, excluding its ancestors and attribute nodes |
| preceding-sibling | | Selects all siblings that appear before the context node |
| self | . | Selects the context node itself. . is short for self::node() |

You can test the expressions listed below using any XPath tester or your browser's web developer tools [4].

<urlset>
    <url>
        <loc>https://domain.com</loc>
        <priority>1.00</priority>
        <lastmod>2021-03-26T11:40:09+03:00</lastmod>
        <changefreq>always</changefreq>
    </url>
    <url>
        <loc>https://domain.com/about</loc>
        <priority>1.00</priority>
        <lastmod>2021-03-26T11:40:09+03:00</lastmod>
        <changefreq>always</changefreq>
    </url>
</urlset>

In XPath, positions start at 1 (one), not 0. The position of a node among its siblings (text nodes included) is counted from this base. Levels of the hierarchy are separated by / (slash).

Let's go through some examples that access node contents in the XML above, moving from the most explicit form to shorter ones. First, let's look at an absolute path that spells out the whole hierarchy.

/urlset/url/loc

Here each level is spelled out, starting from the root. This expression returns the loc nodes whose values are https://domain.com and https://domain.com/about. The deeper and more complex the hierarchy becomes, the easier it is to make a mistake in such an exact path, and in complex documents writing it out may not be practical or even possible. In that case, a relative expression can be used, which does not specify the exact path.
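As a rough illustration, the absolute path can be tried from Python with the standard library's xml.etree.ElementTree module. This is only a sketch: ElementTree implements a limited subset of XPath 1.0 and evaluates paths relative to an element, so the article's /urlset/url/loc is written here as ./url/loc on the root element. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The sitemap sample from the article.
SITEMAP = """
<urlset>
    <url>
        <loc>https://domain.com</loc>
        <priority>1.00</priority>
        <lastmod>2021-03-26T11:40:09+03:00</lastmod>
        <changefreq>always</changefreq>
    </url>
    <url>
        <loc>https://domain.com/about</loc>
        <priority>1.00</priority>
        <lastmod>2021-03-26T11:40:09+03:00</lastmod>
        <changefreq>always</changefreq>
    </url>
</urlset>
"""

root = ET.fromstring(SITEMAP)  # root is the <urlset> element

# ElementTree cannot take a path that starts with "/" on an element,
# so /urlset/url/loc becomes ./url/loc evaluated on <urlset>.
absolute_locs = [loc.text for loc in root.findall("./url/loc")]
print(absolute_locs)
```

A full XPath engine would accept /urlset/url/loc as written; the rewrite to ./url/loc is a limitation of ElementTree, not of XPath.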

//loc

This expression returns the same values. XPath searches for nodes named loc at any depth of the document and returns them. Wildcards can also be used: the * (asterisk) in the example below matches any element, whatever its name.

//*/loc

This expression returns every loc that is a child of any element, without naming the elements in between. //./loc gives the same result; the . used here represents the context node itself, just like self::node(). In a predicate, . can also be used instead of text(), as in h3[.='See also']. We can access the parent node with .. or parent::node().
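The relative forms can be sketched in Python as well, again assuming the standard library's xml.etree.ElementTree and its limited XPath subset. Paths start with . because ElementTree evaluates them relative to the root element, and the [.='text'] predicate needs Python 3.7 or later. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The sitemap sample from the article.
SITEMAP = """
<urlset>
    <url>
        <loc>https://domain.com</loc>
        <priority>1.00</priority>
        <lastmod>2021-03-26T11:40:09+03:00</lastmod>
        <changefreq>always</changefreq>
    </url>
    <url>
        <loc>https://domain.com/about</loc>
        <priority>1.00</priority>
        <lastmod>2021-03-26T11:40:09+03:00</lastmod>
        <changefreq>always</changefreq>
    </url>
</urlset>
"""

root = ET.fromstring(SITEMAP)

# //loc : every loc element, at any depth.
any_depth = [loc.text for loc in root.findall(".//loc")]

# //*/loc : every loc that is a child of some element.
via_wildcard = [loc.text for loc in root.findall(".//*/loc")]

# //loc/.. : the parent of each loc, i.e. the url elements.
parent_tags = [el.tag for el in root.findall(".//loc/..")]

# loc[.='...'] : filter on the element's own text value.
about = root.findall(".//loc[.='https://domain.com/about']")

print(any_depth, via_wildcard, parent_tags, len(about))
```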

To select a node by its position, we put the position in square brackets []. Below is an example that uses both a position and the .. abbreviation.

//../url[1]/loc

The expression above returns only the loc of the first url. Now, let's look at positions in more detail, together with the identifiers called node tests.
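A short Python sketch of positional selection, assuming the standard library's xml.etree.ElementTree. ElementTree does not accept //../url[1]/loc, so the equivalent ./url[1]/loc is used on the root element instead. Note the contrast between XPath's 1-based positions and Python's 0-based list indexes.

```python
import xml.etree.ElementTree as ET

# The sitemap sample from the article.
SITEMAP = """
<urlset>
    <url>
        <loc>https://domain.com</loc>
        <priority>1.00</priority>
        <lastmod>2021-03-26T11:40:09+03:00</lastmod>
        <changefreq>always</changefreq>
    </url>
    <url>
        <loc>https://domain.com/about</loc>
        <priority>1.00</priority>
        <lastmod>2021-03-26T11:40:09+03:00</lastmod>
        <changefreq>always</changefreq>
    </url>
</urlset>
"""

root = ET.fromstring(SITEMAP)

# url[1] is the FIRST url child: XPath positions start at 1.
first_loc = [loc.text for loc in root.findall("./url[1]/loc")]

# last() selects the final url child.
last_loc = [loc.text for loc in root.findall("./url[last()]/loc")]

# The same element through Python indexing, which starts at 0.
same_element = root.find("./url[1]") is root.findall("./url")[0]

print(first_loc, last_loc, same_element)
```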

//../url[1]/node()

This expression returns all child nodes of the first url: the elements and the whitespace text nodes between them. Besides node(), we can select text nodes with text() and comments, written as <!-- Comment -->, with comment(). Now, let's combine a few of these.

(//url/*)[1]
//url/*[1]

The parentheses make XPath evaluate //url/* as a whole first, and the [1] that follows picks the first node of that result: the loc of the first url. When we change the position to [2], we reach priority. Without the parentheses, [1] is applied under each parent separately, so the expression returns the first child of every url. Now, let's add text() to this query.

(//url/*/text())[1]
(//url/*[2]/text())[1]
//url/*/text()[1]
//url/*[2]/text()[1]

The first expression returns the loc value of the first url, and the second returns the priority value of that same url. The third expression returns the values of all child nodes of all url nodes, and the last one returns the value of the second child (priority) of every url.
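The difference between a position applied per parent and a position applied to the whole result can be imitated in Python. This is an emulation rather than XPath: the standard library's xml.etree.ElementTree supports neither parenthesised expressions nor text(), so plain list indexing (0-based) stands in for them. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The sitemap sample from the article.
SITEMAP = """
<urlset>
    <url>
        <loc>https://domain.com</loc>
        <priority>1.00</priority>
        <lastmod>2021-03-26T11:40:09+03:00</lastmod>
        <changefreq>always</changefreq>
    </url>
    <url>
        <loc>https://domain.com/about</loc>
        <priority>1.00</priority>
        <lastmod>2021-03-26T11:40:09+03:00</lastmod>
        <changefreq>always</changefreq>
    </url>
</urlset>
"""

root = ET.fromstring(SITEMAP)
urls = root.findall("./url")

# Like //url/*[1]: the position is applied under EACH url.
first_child_per_url = [url[0].text for url in urls]

# Like //url/*[2]/text(): the second child of each url.
second_child_per_url = [url[1].text for url in urls]

# Like (//url/*)[1] and (//url/*)[2]: collect every child first,
# then index into the combined result.
all_children = root.findall("./url/*")
first_overall = all_children[0].text
second_overall = all_children[1].text

print(first_child_per_url, second_child_per_url)
print(first_overall, second_overall)
```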

Of course, we can select not only nodes but also their attributes. For this, we use @ or the attribute:: axis.

Since there is no attribute in the above XML structure, we will add a different example.

<html lang="en">
    <head>
        <title>Title</title>
    </head>
    <body class="body-class">
        <h1 id="mainTitle" class="title">Main Title</h1>
        <p lang="en" class="text p-1">Lorem ipsum dolor sit amet, <br /> consectetur <a href="https://google.com" class="link ">adipiscing</a> elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
        <p lang="tr" class="text p-2">Ut enim ad minim veniam, quis nostrud <a href="https://google.com" class="link" target="_blank" rel="noopener">exercitation ullamco</a> laboris nisi ut aliquip ex ea commodo consequat.</p>
        <p lang="fr" class="text p-3">Excepteur sint occaecat cupidatat non proident, <a href="https://google.com" class="link external" target="_blank" rel="noopener">exercitation</a> in culpa qui officia deserunt mollit anim id est laborum.</p>
    </body>
</html>

Now, let's select the paragraph with the text p-1 class in the above HTML example.

//body/*[@class='text p-1']
//body/p[@class='text p-1']

Let's see what changes when you add the following extensions to the above expression:

//body/p[@class='text p-1']/*
//body/p[@class='text p-1']/node()
//body/p[@class='text p-1']/text()
//body/p[@class='text p-1']/..
//body/p[contains(@class, 'text')]
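To see what each of these variations selects, here is a sketch in Python using the standard library's xml.etree.ElementTree, which can parse the sample because it is well-formed XML. ElementTree supports [@attr='value'] predicates, /* and /.., but not contains(), node() or text(), so the contains() case is imitated with an ordinary Python filter. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The HTML sample from the article (well-formed, so it parses as XML).
HTML = """
<html lang="en">
    <head>
        <title>Title</title>
    </head>
    <body class="body-class">
        <h1 id="mainTitle" class="title">Main Title</h1>
        <p lang="en" class="text p-1">Lorem ipsum dolor sit amet, <br /> consectetur <a href="https://google.com" class="link ">adipiscing</a> elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
        <p lang="tr" class="text p-2">Ut enim ad minim veniam, quis nostrud <a href="https://google.com" class="link" target="_blank" rel="noopener">exercitation ullamco</a> laboris nisi ut aliquip ex ea commodo consequat.</p>
        <p lang="fr" class="text p-3">Excepteur sint occaecat cupidatat non proident, <a href="https://google.com" class="link external" target="_blank" rel="noopener">exercitation</a> in culpa qui officia deserunt mollit anim id est laborum.</p>
    </body>
</html>
"""

root = ET.fromstring(HTML)
P1 = ".//body/p[@class='text p-1']"

# The paragraph itself, selected by tag name or by wildcard.
by_tag = root.findall(P1)
by_wildcard = root.findall(".//body/*[@class='text p-1']")

# /* : the child elements of that paragraph.
child_tags = [el.tag for el in root.findall(P1 + "/*")]

# /.. : the parent of that paragraph.
parent_tags = [el.tag for el in root.findall(P1 + "/..")]

# contains(@class, 'text') imitated with a Python substring test.
text_langs = [p.get("lang") for p in root.findall(".//body/p")
              if "text" in p.get("class", "")]

print(child_tags, parent_tags, text_langs)
```

The exact match on @class only succeeds when the whole attribute value is 'text p-1', which is why contains() is the more forgiving choice for multi-class attributes.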

We can make these expressions even more complex and add various conditions using the syntax.

//body/p[@lang='tr' and @class='text p-2']/a[contains(@href, 'google')]/@target
//p/a[contains(text(), 'exercitation')]/parent::node()[@lang = 'tr']
//p/a[contains(., 'exercitation')]/parent::node()[@lang = 'tr']
//p[starts-with(., 'Ut')]/a[contains(text(), 'exercitation')]
//p/a[@class = 'link external' and contains(text(), 'exercitation')]
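The intent of some of these combined conditions can be approximated in Python. This is a sketch under clear assumptions: the standard library's xml.etree.ElementTree has no and, contains() or starts-with(), so chained predicates replace and, and ordinary Python string tests replace the functions. A full XPath engine would evaluate the expressions above directly.

```python
import xml.etree.ElementTree as ET

# The HTML sample from the article (well-formed, so it parses as XML).
HTML = """
<html lang="en">
    <head>
        <title>Title</title>
    </head>
    <body class="body-class">
        <h1 id="mainTitle" class="title">Main Title</h1>
        <p lang="en" class="text p-1">Lorem ipsum dolor sit amet, <br /> consectetur <a href="https://google.com" class="link ">adipiscing</a> elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
        <p lang="tr" class="text p-2">Ut enim ad minim veniam, quis nostrud <a href="https://google.com" class="link" target="_blank" rel="noopener">exercitation ullamco</a> laboris nisi ut aliquip ex ea commodo consequat.</p>
        <p lang="fr" class="text p-3">Excepteur sint occaecat cupidatat non proident, <a href="https://google.com" class="link external" target="_blank" rel="noopener">exercitation</a> in culpa qui officia deserunt mollit anim id est laborum.</p>
    </body>
</html>
"""

root = ET.fromstring(HTML)

# Two chained predicates act like "and"; the href test (an href that
# contains 'google') is done in Python, then @target is read.
links = root.findall(".//body/p[@lang='tr'][@class='text p-2']/a")
targets = [a.get("target") for a in links if "google" in a.get("href", "")]

# Paragraphs with lang='tr' that hold a link whose text contains
# 'exercitation' (the parent::node()[@lang = 'tr'] idea).
tr_parents = [p.get("class") for p in root.findall(".//p")
              if p.get("lang") == "tr"
              and any("exercitation" in (a.text or "") for a in p.findall("a"))]

# starts-with(., 'Ut') compares against the paragraph's full string
# value, which itertext() reconstructs.
ut_links = [a.text for p in root.findall(".//p")
            if "".join(p.itertext()).startswith("Ut")
            for a in p.findall("a") if "exercitation" in (a.text or "")]

print(targets, tr_parents, ut_links)
```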

XPath is a powerful tool for accessing XML data. XPath itself only selects nodes and values; editing, deleting, or adding nodes is done by the host language or library, which uses XPath to locate the nodes to change. For this reason, XPath is an indispensable part of applications that work with XML data and is frequently used by developers who work with XML files.

XPath expressions can also use operators and functions such as concat(), string-length(), and substring() [5] [6]. However, I will cover XPath in the context of JavaScript [7] [8] and Selenium WebDriver [9] separately, so I will end the examples here.

Conclusion

XPath is an essential tool for working with XML data. By using path expressions, you can navigate through the nodes and attributes of an XML file and extract the information you need. XPath is widely supported in programming languages like Python, R, and PHP, and is frequently used by developers who work with XML files. We've covered some of the most common XPath expressions in this article, but there are many more operators and functions that you can use to query XML data. With the knowledge gained from this article, you'll be well-equipped to start working with XPath and XML data in your own projects.