Structured Strategy: How to Supercharge Your Content Analysis with XML and XPath

Josh Anderson

Who I am
• Information Architect at Precision Content
• Certified Professional Technical Communicator (CPTC) Foundation
• Co-organized World Information Architecture Day 2023 Toronto
• Master of Information from the University of Toronto

3
We are experts in structured content.
We’re a full-service, end-to-end technical communications consultancy, technology innovator, and systems integrator offering professional services, training, and technology.
Areas of expertise
• structured authoring methods
• content lifecycle management
• DITA/XML design and implementation
• information architecture
• content strategy, and
• structured content delivery.

4
Who is this presentation for?
People who…
• strategize, plan for, and otherwise work with text content
• understand the benefits of structured content
• are familiar with XML but perhaps not with XPath
• want to learn how to take their content analysis skills to the next level

5
Structured content
Content is easier to use and understand when organized in a predictable way.
Content is written to fit a model:
• Title
• Presenter
• Description
• Speaker Bio

6
Structure makes content FAIR
Findable
Accessible
Interoperable
Reusable
“The FAIR Guiding Principles for scientific data management and stewardship” was published in Scientific Data in 2016

7
An example of structure: HTML
Content is contained inside opening and closing tags.
Sometimes elements contain other elements.
All elements are contained within a single root.

8
There’s just one problem…
These elements don’t tell me anything about what the content is about!

9
XML
• XML is a way to store information
• XML stands for “eXtensible Markup Language”
• “Extensible” means that you define your own structure
HTML: Pre-defined tags
XML: Define your own tags
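To make the contrast concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The bookstore document and its element names are invented for illustration and are not the sample file shown on the slides. It simply lists each element name next to its text, to show that self-defined tags describe what the content is.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: the tag names are our own and describe the content.
SAMPLE = """
<bookstore>
  <book category="fiction">
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
</bookstore>
"""

root = ET.fromstring(SAMPLE)

# Walk the tree and pair each element name with its text, if it has any.
described = [(el.tag, (el.text or "").strip()) for el in root.iter()]

for tag, text in described:
    print(f"{tag}: {text}")
```

Unlike a generic HTML div or span, the names title and price tell both a person and a program what each value means.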

10
How do you define your own structure?
You define your structure, or your content model, in a Document Type Definition (DTD).

11
Defining your structure
What you can define with a DTD
• Elements
• Attributes
• Whether an element can contain text, another element, or both
• Order of elements
• Whether something is required or optional
What you can’t define with a DTD
• Length of content
• Exact occurrence counts (a DTD only distinguishes optional, required, and repeatable)
• What text can go inside elements

12
Structure helps you analyze your content
• Structure is a prerequisite to performing content analysis at scale
• You want a way to tell if your content is valid or invalid
• Semantic structures can be understood by both people and computers
• Using a widely adopted standard like XML lets us take advantage of specialized tools
• Oftentimes you can adopt a standard structure rather than inventing your own

13
XML-based standards
Some extensions of XML have become standards in their own right
• Scalable Vector Graphics (SVG)
• Resource Description Framework (RDF)
• Darwin Information Typing Architecture (DITA)

14
Finding structure
• What if your content is unstructured?
• Look for patterns in
  • attributes
  • classes
  • common parent/sibling elements, and
  • common text strings.
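One way to look for such patterns is to count how often each attribute value appears. The sketch below is only an illustration: the page fragment and its class names are made up, and it assumes the content is well-formed enough to parse with Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, well-formed HTML fragment with informal class naming.
PAGE = """
<div>
  <p class="note">Save your work first.</p>
  <p class="warning">Do not unplug the device.</p>
  <p class="note">Restart when prompted.</p>
  <p>Plain paragraph with no class.</p>
</div>
"""

root = ET.fromstring(PAGE)

# Count how often each (element name, class value) pair occurs.
class_counts = Counter(
    (el.tag, el.get("class")) for el in root.iter() if el.get("class")
)

for (tag, cls), n in class_counts.most_common():
    print(f"{tag}.{cls}: {n}")
```

Class values that recur across many pages are good candidates for becoming named elements in a content model.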

15
Creating structure
• Break your content down into microcontent
• Microcontent is content that is
  • about one primary idea, fact, or concept
  • easily scannable
  • labelled for clear identification and meaning, and
  • appropriately written and formatted for use anywhere and anytime it is needed.

16
Microcontent structure
Source: The DITA Style Guide – Best Practices for Authors. Tony Self. www.ditastyle.com
• You do not need code to have structure
• Structure means
  • systematic labelling
  • modular, topic-based architecture
  • constrained writing environments, and
  • separation of content and form.

17
Focus
Information about hours of work
Requirement for unplanned absences
Information about lunch breaks
Requirement for planned absences

18
Function
Reference information
Principle information

21
What is XPath?
• XPath is a language that lets you identify particular parts of XML documents
• In XPath, we write “location paths”
• Example of an XPath location path: //bookstore/book/@id
• XPath can help you answer queries like…
  • “Show me every element called ‘book’.”
  • “Show me the parent element of the element called ‘price’.”
  • “Show me all the elements that have the attribute ‘language’ set to ‘English’.”
  • … and much more
• XPath is used in other XML-related languages like XQuery and XSLT
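The three example queries can be tried out with a short Python sketch. It assumes the standard library's xml.etree.ElementTree, which supports only a subset of XPath and expects paths relative to an element, so each path starts with a dot. The sample document, its id values, and the language attribute are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = """
<bookstore>
  <book id="b1">
    <title language="English">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book id="b2">
    <title language="French">Le Petit Prince</title>
    <price>12.50</price>
  </book>
</bookstore>
"""
root = ET.fromstring(SAMPLE)

# "Show me every element called 'book'."
books = root.findall(".//book")

# "Show me the parent element of the element called 'price'."
price_parents = root.findall(".//price/..")

# "Show me all the elements that have the attribute 'language' set to 'English'."
english = root.findall(".//*[@language='English']")

print([b.get("id") for b in books])
print([p.tag for p in price_parents])
print([e.text for e in english])
```

A full XPath engine, such as the one built into an XML editor, accepts the same queries without the leading dot.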

22
XML is structured like a tree
Image source: https://www.researchgate.net/figure/Example-of-XML-document-and-XML-tree-representation_fig1_315998361

23
Seven kinds of XML nodes
1. The root node
2. Element nodes
3. Text nodes
4. Attribute nodes
5. Comment nodes
6. Processing instruction nodes
7. Namespace nodes

29
Node selectors
Expression   Description
/            Selects the document root node
//           Selects from all descendants of the context node and the context node itself
.            Selects the current node
..           Selects the parent of the current node
@            Selects attribute nodes
*            Selects any element node, regardless of name

30
How to select nodes in XPath
Select the bookstore element node
• /bookstore
• //bookstore

32
Select all book element nodes
• /bookstore/book
• //book

34
Select all price element nodes
• /bookstore/book/price
• //price

36
Select all lang attribute nodes
• //@lang

38
Select all comment nodes
• //comment()

40
Select the parent element nodes of the title element nodes
• //title/..

42
Select the comment nodes that are children of book elements
• /bookstore/book/comment()
• //book/comment()
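Several of these selections can be reproduced in a few lines of Python. This is a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree, whose XPath subset cannot return attribute nodes or comment nodes (its default parser also discards comments), so the attribute query is approximated in plain Python and the comment queries are left out. The sample bookstore is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = """
<bookstore>
  <book category="fiction">
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book category="textbook">
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""
root = ET.fromstring(SAMPLE)  # root is the <bookstore> element

# /bookstore/book  ->  book children of the bookstore element
books = root.findall("./book")

# //price  ->  every price element at any depth
prices = [p.text for p in root.findall(".//price")]

# //@lang  ->  approximated by reading the attribute from each element
# that carries it, because ElementTree paths cannot end in an attribute.
langs = [el.get("lang") for el in root.iter() if "lang" in el.attrib]

# //title/..  ->  the parent elements of title elements
title_parents = [el.tag for el in root.findall(".//title/..")]

print(len(books), prices, langs, title_parents)
```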

44
Axes
An axis is a direction that we travel along to get to different parts of an XML document.
Every step in an XPath location path has an axis. So far, we have used “abbreviated location paths.”
Unabbreviated, each step names its axis, followed by a double colon, before the node test. It looks like this:
//child::bookstore
Image source: https://jrebecchi.github.io/xpath-helper/xpath-axes.html

45
Selecting with axes in XPath
Select any comment that is a descendant of the book element
• //book/descendant::comment()

47
Select the parent element of the price element
• //price/..
• //price/parent::element()
• element() is an XPath 2.0 kind test; in XPath 1.0, write //price/parent::* instead

49
Select the sibling elements following the title element
• //title/following-sibling::element()
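Python's standard xml.etree.ElementTree does not implement named axes, so the sketch below emulates two of them in plain Python to show what they return. The helper following_siblings and the sample document are hypothetical and written only for this illustration; an XPath engine would do the same work from the single expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = """
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J. K. Rowling</author>
    <price>29.99</price>
  </book>
</bookstore>
"""
root = ET.fromstring(SAMPLE)


def following_siblings(root, tag):
    """Emulate //tag/following-sibling::* for elements named `tag`."""
    found = []
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag == tag:
                # Keep each later sibling once, in document order.
                for sibling in children[i + 1:]:
                    if sibling not in found:
                        found.append(sibling)
    return found


after_title = [el.tag for el in following_siblings(root, "title")]

# descendant:: axis: Element.iter() yields the element itself and then
# everything below it, so the element itself is skipped here.
book = root.find("./book")
descendants = [el.tag for el in book.iter() if el is not book]

print(after_title, descendants)
```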

51
Predicates
• Predicates are like a filter on your results
• Predicates appear inside [square brackets]
• Predicates are evaluated as Boolean expressions; a bare number such as [2] is shorthand for [position()=2]
• The full syntax of a step in an XPath location path is axis::node[predicate]
• Every step has an axis and a node test. The predicate is optional.
• If you do not write an axis, it is assumed to be “child::”

52
Selecting with predicates in XPath
Select the book with the title “Harry Potter”
• //book[title="Harry Potter"]

54
Select the titles that are in English
• //title[@lang="eng"]

56
Select the textbooks
• //book[@category="textbook"]

58
Select the second book
• //book[2]

60
Select the second textbook
• //book[@category="textbook"][2]
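These predicate queries can be tried with Python's standard xml.etree.ElementTree, with two caveats that are assumptions of this sketch and not part of the slides: paths must be relative (hence the leading dot), and ElementTree supports a position predicate only directly after a tag name, so the "second textbook" query is done by indexing the filtered list. The sample data is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = """
<bookstore>
  <book category="fiction"><title lang="eng">Harry Potter</title></book>
  <book category="textbook"><title lang="eng">Learning XML</title></book>
  <book category="textbook"><title lang="fra">Apprendre XML</title></book>
</bookstore>
"""
root = ET.fromstring(SAMPLE)

# //book[title="Harry Potter"]
harry = root.findall(".//book[title='Harry Potter']")

# //title[@lang="eng"]
english_titles = [t.text for t in root.findall(".//title[@lang='eng']")]

# //book[@category="textbook"]
textbooks = root.findall(".//book[@category='textbook']")

# //book[2]  (XPath positions start at 1)
second_book = root.find(".//book[2]")

# //book[@category="textbook"][2]: index the filtered list instead,
# because ElementTree cannot apply a position after another predicate.
second_textbook = textbooks[1]

print(english_titles)
print(second_book.find("title").text)
print(second_textbook.find("title").text)
```

Note the straight quotation marks around the string values: curly quotes pasted from slides or word processors are not valid XPath string delimiters.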

64
Real-world content analysis with XPath
• From experience, I know that tables inside tables often have unpredictable issues. I want to check on them.
  • //table//table
• I need to change all the section titles called “Introduction” to “Overview.” Did I miss any?
  • //section[title="Introduction"]
• The client wants a disclaimer paragraph at the very end of the topic. Are there any disclaimers that are in the wrong place?
  • //p[@outputclass="disclaimer"]/following-sibling::element()
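As a rough illustration, the three checks can be scripted with Python's standard xml.etree.ElementTree. The topic below is simplified, invented markup (real DITA tables have more layers), and because ElementTree lacks the following-sibling axis, the third check is written as a loop that reports the misplaced disclaimers themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified topic used only for this sketch.
TOPIC = """
<topic>
  <section><title>Introduction</title>
    <p outputclass="disclaimer">Results may vary.</p>
    <p>This paragraph should come before the disclaimer.</p>
  </section>
  <section><title>Overview</title>
    <table><row><entry><table/></entry></row></table>
    <p outputclass="disclaimer">Results may vary.</p>
  </section>
</topic>
"""
root = ET.fromstring(TOPIC)

# Check 1: tables nested anywhere inside another table.
nested_tables = root.findall(".//table//table")

# Check 2: sections whose title still reads "Introduction".
missed = root.findall(".//section[title='Introduction']")

# Check 3: disclaimer paragraphs followed by at least one sibling element.
misplaced = []
for parent in root.iter():
    children = list(parent)
    for i, child in enumerate(children):
        is_disclaimer = (
            child.tag == "p" and child.get("outputclass") == "disclaimer"
        )
        if is_disclaimer and i < len(children) - 1:
            misplaced.append(child)

print(len(nested_tables), len(missed), len(misplaced))
```

An empty result for each check means there is nothing left to fix; any hits point straight to the elements that need attention.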

66
Ideas for content analysis with XPath
• Look for outliers
• Ensure that elements are used for their intended purpose (not just for some formatting shortcut)
• Check consistency across different types of elements
• Track down unnecessary child elements
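One possible way to look for outliers is to tally every parent/child element combination and review the rare ones. The sketch below assumes Python's standard xml.etree.ElementTree and an invented topic; the threshold of a single occurrence is arbitrary and would need tuning on a real content set.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical topic with one suspicious pattern: bold inside bold.
TOPIC = """
<topic>
  <section><title>Setup</title><p>Step one.</p><p>Step two.</p></section>
  <section><title>Usage</title><p>Run it.</p></section>
  <section><title>Notes</title><p><b><b>Nested bold</b></b></p></section>
</topic>
"""
root = ET.fromstring(TOPIC)

# Tally every parent/child element combination in the document.
pair_counts = Counter(
    (parent.tag, child.tag) for parent in root.iter() for child in parent
)

# Combinations seen only once are candidates for a closer look.
outliers = sorted(pair for pair, n in pair_counts.items() if n == 1)

for parent_tag, child_tag in outliers:
    print(f"{parent_tag}/{child_tag}")
```

Each rare pair can then be turned into a targeted XPath query, for example //b/b, to inspect the affected content.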

68
XML and XPath resources
• W3Schools tutorials
  • https://www.w3schools.com/xml/
• XPath cheat sheet
  • https://devhints.io/xpath

Thank You!
Are you ready to upgrade, transform, and future-enable your content?
Contact us and we’ll show you what’s possible.
precisioncontent.com | more-info@precisioncontent.com | 1(647)265-8500
