The best content strategists are limited by how much content they can analyze. There comes a point where a content set becomes too large to analyze using usual methods. Do you have the skills to scale your strategy? Enter XML and XPath, two languages that provide deep insight into content with superhuman efficiency. This session teaches code-shy content strategists enough of these languages to be effective. You may be used to interviewing your users. Now learn how to query your content itself!
2. Who I am
• Information Architect at Precision Content
• Certified Professional Technical Communicator (CPTC) Foundation
• Co-organized World Information Architecture Day 2023 Toronto
• Master of Information from the University of Toronto
3. 3
We are experts in structured content.
We’re a full-service, end-to-end technical communications consultancy, technology innovator, and systems integrator offering professional services, training, and technology.
Areas of expertise
• structured authoring methods
• content lifecycle management
• DITA/XML design and implementation
• information architecture
• content strategy, and
• structured content delivery.
4. 4
Who is this presentation for?
People who…
• strategize, plan for, and otherwise work with text content
• understand the benefits of structured content
• are familiar with XML but perhaps not with XPath
• want to learn how to take their content analysis skills to the next level
5. 5
Structured content
Content is easier to use and understand when organized in a predictable way.
Content is written to fit a model:
• Title
• Presenter
• Description
• Speaker Bio
6. 6
Structure makes content FAIR
Findable
Accessible
Interoperable
Reusable
“The FAIR Guiding Principles for scientific data management and stewardship” was published in Scientific Data in 2016
7. 7
An example of structure: HTML
Content is contained inside opening and closing tags.
Sometimes elements contain other elements.
All elements are contained within a single root.
8. 8
There’s just one problem…
These elements don’t tell me anything about what the content is about!
9. 9
XML
• XML is a way to store information
• XML stands for “eXtensible Markup Language”
• “Extensible” means that you define your own structure
HTML: Pre-defined tags
XML: Define your own tags
10. 10
How do you define your own structure?
You define your structure, or your content model, in a Document Type Definition (DTD).
11. 11
Defining your structure
What you can define with a DTD
• Elements
• Attributes
• If an element can contain text, another element, or both
• Order of elements
• If something is required or optional
What you can’t define with a DTD
• Length of content
• Occurrence constraints
• What text can go inside elements
12. 12
Structure helps you analyze your content
• Structure is a prerequisite to performing content analysis at scale
• You want a way to tell if your content is valid or invalid
• Semantic structures can be understood by both people and computers
• Using a widely adopted standard like XML lets us take advantage of specialized tools
• Oftentimes you can adopt a standard structure rather than inventing your own
13. 13
XML-based standards
Some extensions of XML have become standards in their own right
• Scalable Vector Graphics (SVG)
• Resource Description Framework (RDF)
• Darwin Information Typing Architecture (DITA)
14. 14
Finding structure
• What if your content is unstructured?
• Look for patterns in
• attributes
• classes
• common parent/sibling elements, and
• common text strings.
15. 15
Creating structure
• Break your content down into microcontent
• Microcontent is content that is
• about one primary idea, fact, or concept
• easily scannable
• labelled for clear identification and meaning, and
• appropriately written and formatted for use anywhere and anytime it is needed.
16. 16
Microcontent structure
Source: The DITA Style Guide – Best Practices for Authors. Tony Self. www.ditastyle.com
• You do not need code to have structure
• Structure means
• systematic labelling
• modular, topic-based architecture
• constrained writing environments, and
• separation of content and form.
17. 17
Focus
Information about hours of work
Requirement for unplanned absences
Information about lunch breaks
Requirement for planned absences
21. 21
What is XPath?
• XPath is a language that lets you identify particular parts of XML documents
• In XPath, we write “location paths”
• Example of an XPath location path: //bookstore/book/@id
• XPath can help you answer queries like…
• “Show me every element called ‘book’.”
• “Show me the parent element of the element called ‘price’.”
• “Show me all the elements that have the attribute ‘language’ set to ‘English’.”
• … and much more
• XPath is used in other XML-related languages like XQuery and XSLT
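As a rough illustration of the three example queries above, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. The sample document and its element names, attribute names, and values are assumptions made up for this example. Because ElementTree cannot return attribute nodes, the `@id` lookup is approximated by selecting the elements and reading the attribute from each one.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; names and values are invented for illustration.
SAMPLE = """
<bookstore>
  <book id="b1" language="English">
    <title>Harry Potter</title>
    <price>29.99</price>
  </book>
  <book id="b2" language="French">
    <title>Le Petit Prince</title>
    <price>12.50</price>
  </book>
</bookstore>
"""

root = ET.fromstring(SAMPLE)

# "Show me every element called 'book'."
books = root.findall(".//book")

# "Show me the parent element of the element called 'price'."
price_parents = root.findall(".//price/..")

# "Show me all the elements that have the attribute 'language' set to 'English'."
english = root.findall(".//*[@language='English']")

# Approximation of //bookstore/book/@id: select the elements, then read the attribute.
book_ids = [book.get("id") for book in root.findall("./book")]

print(len(books))
print([parent.tag for parent in price_parents])
print([element.get("id") for element in english])
print(book_ids)
```

A full XPath processor would accept the slide's expressions as written; the leading `.` here is an ElementTree convention for searching from the element you already hold.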
23. 23
Seven kinds of XML nodes
1. The root node
2. Element nodes
3. Text nodes
4. Attribute nodes
5. Comment nodes
6. Processing instruction nodes
7. Namespace nodes
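To make the node kinds more concrete, the sketch below walks a tiny document with Python's `xml.dom.minidom` and records the kind of each node it meets. This is only an approximation: minidom follows the DOM rather than the XPath data model, so its document node stands in for the root node, and namespace nodes are not exposed as a separate kind. The sample document and the `walk` helper are hypothetical.

```python
from xml.dom import Node, minidom

# Hypothetical sample document containing several kinds of nodes.
SAMPLE = (
    '<?xml-stylesheet href="style.css"?>'
    "<bookstore>"
    "<!-- staff picks -->"
    '<book id="b1">Harry Potter</book>'
    "</bookstore>"
)

# Map DOM node type constants to the names used on the slide.
NAMES = {
    Node.DOCUMENT_NODE: "root",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
}


def walk(node, found):
    """Record the kind of this node, its attributes, and its descendants."""
    found.append(NAMES.get(node.nodeType, "other"))
    if node.nodeType == Node.ELEMENT_NODE:
        # Attributes are not children in the DOM, so count them separately.
        for _ in range(node.attributes.length):
            found.append("attribute")
    for child in node.childNodes:
        walk(child, found)
    return found


kinds = walk(minidom.parseString(SAMPLE), [])
print(kinds)
```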
29. 29
Node selectors
Expression   Description
/            Selects the document root node
//           Selects from all descendants of the context node and the context node itself
.            Selects the current node
..           Selects the parent of the current node
@            Selects attribute nodes
*            Selects any element node, regardless of name
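The selectors in this table can be tried out with a short Python sketch. It assumes the standard library's `xml.etree.ElementTree`, which implements only part of XPath: paths are written relative to an element (hence the leading `.`), there is no separate document node for `/`, and `@` can only be used inside a predicate. The sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = """<bookstore>
  <book id="b1"><title>Harry Potter</title><price>29.99</price></book>
  <magazine id="m1"><title>Wired</title></magazine>
</bookstore>"""

root = ET.fromstring(SAMPLE)

# "."  : the current (context) node, here the bookstore element itself
current = root.find(".")

# "*"  : any child element, whatever its name
children = [child.tag for child in root.findall("./*")]

# "//" : search all descendants of the context node
titles = [title.text for title in root.findall(".//title")]

# ".." : step up from each title to its parent
parents = [parent.tag for parent in root.findall(".//title/..")]

# "@"  : test for an attribute (ElementTree allows this only in a predicate)
with_id = [element.get("id") for element in root.findall(".//*[@id]")]

print(current.tag, children, titles, parents, with_id)
```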
43. 43
How to select nodes in XPath
Select the comment nodes that are children of book elements
• /bookstore/book/comment()
• //book/comment()
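Python's `xml.etree.ElementTree` does not support the `comment()` node test, and its default parser discards comments. The sketch below is one way to approximate `//book/comment()` anyway, assuming Python 3.8 or later, where `TreeBuilder` accepts `insert_comments=True`. The sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with one comment inside a book element.
SAMPLE = """<bookstore>
  <!-- store-level comment -->
  <book id="b1"><!-- check the price --><title>Harry Potter</title></book>
  <book id="b2"><title>Le Petit Prince</title></book>
</bookstore>"""

# Keep comments in the tree instead of discarding them (Python 3.8+).
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(SAMPLE, parser=parser)

# Approximation of //book/comment(): comment children of every book element.
# Comment nodes are stored with ET.Comment as their tag.
book_comments = [
    child.text.strip()
    for book in root.iter("book")
    for child in book
    if child.tag is ET.Comment
]

print(book_comments)
```

The store-level comment is skipped because it is a child of the bookstore element, not of a book element.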
44. 44
Axes
An axis is a direction that we travel along to get to different parts of an XML document.
All XPath location paths have an axis. So far, we have used “abbreviated location paths.”
Unabbreviated location paths write the axis name, then a double colon, before the node test. It looks like this: //child::bookstore
Image source: https://jrebecchi.github.io/xpath-helper/xpath-axes.html
50. 50
Selecting with axes in XPath
Select the sibling elements following the title element
• //title/following-sibling::element()
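Python's standard `xml.etree.ElementTree` has no axis support, so the following-sibling axis has to be emulated by hand. The sketch below is one possible emulation, not an XPath implementation; the `following_siblings` helper and the sample document are hypothetical. Unlike a real XPath node-set, it could yield the same element twice if a parent held several matching elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = """<bookstore>
  <book id="b1">
    <author>J. K. Rowling</author>
    <title>Harry Potter</title>
    <year>1997</year>
    <price>29.99</price>
  </book>
</bookstore>"""

root = ET.fromstring(SAMPLE)


def following_siblings(tree_root, tag):
    """Yield the elements that come after each <tag> element under the same parent."""
    for parent in tree_root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == tag:
                yield from children[index + 1:]


# Emulates //title/following-sibling::element()
after_title = [element.tag for element in following_siblings(root, "title")]
print(after_title)
```

The author element is not returned because it comes before the title, which is the point of choosing this axis.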
51. 51
Predicates
• Predicates act as a filter on your results
• Predicates appear inside [square brackets]
• Predicates are Boolean expressions
• The full syntax of a step in an XPath location path is axis::node-test[predicate]
• Every step has an axis and a node test. The predicate is optional.
• If you do not write an axis, it is assumed to be “child::”
53. 53
Selecting with predicates in XPath
Select the book with the title “Harry Potter”
• //book[title="Harry Potter"]
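The predicate above can be tried in Python with the standard library's `xml.etree.ElementTree`, which supports simple predicates of this form. This is a minimal sketch with a made-up sample document. Note that the string inside the predicate needs straight quotation marks; curly quotes pasted from a slide will not work.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = """<bookstore>
  <book id="b1"><title>Harry Potter</title><price>29.99</price></book>
  <book id="b2"><title>Le Petit Prince</title></book>
</bookstore>"""

root = ET.fromstring(SAMPLE)

# Without a predicate: every book.
all_books = root.findall(".//book")

# With a predicate: only books whose title child has the text "Harry Potter".
harry = root.findall(".//book[title='Harry Potter']")

# A predicate can also test for the presence of a child element.
priced = root.findall(".//book[price]")

print(len(all_books), [book.get("id") for book in harry], [book.get("id") for book in priced])
```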
62. 64
Real-world content analysis with XPath
• From experience, I know that tables inside tables often have unpredictable issues. I want to check on them.
• //table//table
• I need to change all the section titles called “Introduction” to “Overview.” Did I miss any?
• //section[title="Introduction"]
• The client wants a disclaimer paragraph at the very end of the topic. Are there any disclaimers that are in the wrong place?
• //p[@outputclass="disclaimer"]/following-sibling::element()
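The three checks above can be sketched in Python with the standard library's `xml.etree.ElementTree`. The sample topic, including its element and attribute names, is invented for illustration. Two assumptions are worth stating: nested tables are searched for as descendants, since a nested table normally sits inside a cell rather than directly under the outer table, and the disclaimer check is done in plain Python because ElementTree has no following-sibling axis.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample topic containing one example of each problem.
SAMPLE = """<topic>
  <section><title>Overview</title><p>Text</p></section>
  <section>
    <title>Introduction</title>
    <table><row><entry><table/></entry></row></table>
  </section>
  <body>
    <p outputclass="disclaimer">Not legal advice.</p>
    <p>Trailing paragraph.</p>
  </body>
</topic>"""

root = ET.fromstring(SAMPLE)

# Tables nested anywhere inside another table.
nested_tables = root.findall(".//table//table")

# Sections still titled "Introduction", meaning the rename missed them.
missed = root.findall(".//section[title='Introduction']")

# Disclaimer paragraphs that are followed by at least one sibling element.
misplaced = [
    child.text
    for parent in root.iter()
    for index, child in enumerate(list(parent))
    if child.tag == "p"
    and child.get("outputclass") == "disclaimer"
    and index < len(parent) - 1
]

print(len(nested_tables), len(missed), misplaced)
```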
64. 66
Ideas for content analysis with XPath
• Look for outliers
• Ensure that elements are used for their intended purpose (not just for some formatting shortcut)
• Check consistency across different types of elements
• Track down unnecessary child elements
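One way to look for outliers is to count how often each parent and child combination occurs and then inspect the rare ones. The sketch below does this with Python's standard library; the sample topic and the threshold of a single occurrence are assumptions chosen for illustration, and a real content set would need a threshold suited to its size.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical sample: the third section uses <b> as a makeshift title.
SAMPLE = """<topic>
  <section><title>Hours</title><p>Text</p></section>
  <section><title>Breaks</title><p>Text</p></section>
  <section><b>Absences</b><p>Text</p></section>
</topic>"""

root = ET.fromstring(SAMPLE)

# Count every (parent tag, child tag) pair in the document.
child_patterns = Counter(
    (parent.tag, child.tag) for parent in root.iter() for child in parent
)

# Pairs that occur only once are candidates for a closer look.
rare = [pair for pair, count in child_patterns.items() if count == 1]

print(child_patterns.most_common())
print(rare)
```

Here the rare pair points at a section that uses bold text where the other sections use a title, which is the kind of formatting shortcut the list above warns about.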
66. 68
XML and XPath resources
• W3Schools tutorials
• https://www.w3schools.com/xml/
• XPath cheat sheet
• https://devhints.io/xpath
67. Thank You!
Are you ready to upgrade, transform, and future-enable your content?
Contact us and we’ll show you what’s possible.
precisioncontent.com | more-info@precisioncontent.com | 1(647)265-8500
Editor's Notes
Precision Content is a consultancy specializing in end-to-end services for technical communications.
We provide services in writer training, content strategy, information architecture, content lifecycle management, systems integration, and content publishing.
We use our expertise in microcontent and structured authoring with DITA/XML to empower our clients across a variety of industries to modernize their content.
[Image – “Hours of Work” section from the old handbook]
[Image – The series of briefer microcontent topics in the updated handbook. “Work Hour Limits,” “Time Tracking Requirement,” “Your Work Environment,” etc.]
[Image – highlight both reference and principle information in the original employee handbook topic “Hours of Work”]
[Image – show two separate topics (with type info, if possible) that were broken out of the single mixed-function topic “Hours of Work”]