|
- ---
- title: XML Schema Validation for the command line
- description: >
- XML Schema Validation for the command line
- created: !!timestamp '2015-05-07'
- time: 2:17 PM
- tags:
- - xml
- - schema
- ---
-
- It turns out that unless you use a full-fledged XML editor, validating
- your XML document against a schema is difficult. Most tools require you
- to specify a single schema file. If you have an XML document that
- contains more than one namespace, this doesn't work well, as each
- namespace is often defined in a separate schema file.
-
- The XML document has `xmlns` attributes which use a URI as the identifier.
- These URIs only identify the namespace; they are not URLs, so a validator
- cannot use them to fetch the schema. In fact, URIs that differ only in
- case name different namespaces, even in the "host" part, though that is
- not the case with URLs. In order for validators to find the schema, the
- attribute
- <code>[xsi:schemaLocation](https://www.w3.org/TR/xmlschema-1/#schema-loc)</code> is
- used to map the namespace URIs to the URLs of the schemas.
-
- The `xsi:schemaLocation` mapping is very simple: it is a
- whitespace-delimited list of URI/URL pairs. None of the command line tools
- that I tried use this attribute to make schema validation simple.
- This includes [xmllint](https://web.archive.org/web/20210415145100/http://xmlsoft.org/xmllint.html)<label for="sn-xmlintarchive"
- class="margin-toggle sidenote-number"></label>
- <input type="checkbox" id="sn-xmlintarchive" class="margin-toggle"/><span class="sidenote">Via the Wayback Machine, as the original link is HTTP only.</span>
- which uses the libxml2 library. I also tried to use the Java XML library
- Xerces, but was unable to get it to work. Xerces did not provide a
- simple command line utility, and I couldn't figure out the correct java
- command line to invoke the validator class.
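
To illustrate the pair structure of `xsi:schemaLocation` described above, here is a minimal sketch in plain shell. The helper name `list_schema_pairs` and the sample URIs are made up for illustration; the sketch assumes the default `IFS`, so spaces, tabs, and newlines all act as separators.

```shell
#!/bin/sh
# Hypothetical helper: print each namespace-URI / schema-URL pair found in
# an xsi:schemaLocation value passed as the first argument.
# The body runs in a subshell so `set -f` and `set --` do not leak out.
list_schema_pairs() (
    set -f        # disable globbing while the value is word-split
    set -- $1     # unquoted on purpose: split on spaces, tabs, newlines
    while [ $# -ge 2 ]; do
        printf 'namespace=%s schema=%s\n' "$1" "$2"
        shift 2
    done
    # A dangling token without a partner is silently ignored.
)

# Sample value with made-up URIs, wrapped over two lines as is common.
list_schema_pairs 'http://www.example.com/ns/a http://www.example.com/a.xsd
    http://www.example.com/ns/b   http://www.example.com/b.xsd'
```

The first token of each pair is only a name; the second is where a tool would actually fetch the schema.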
-
- My coworker, [Patrick](https://web.archive.org/web/20151012162546/http://fivetwentysix.com/)<label for="sn-526archive" class="margin-toggle sidenote-number"></label>
- <input type="checkbox" id="sn-526archive" class="margin-toggle"/>
- <span class="sidenote">Via the Wayback Machine, as the original link is now defunct.</span>, found the blog entry,
- [Nokogiri XML schema validation with multiple schema files](https://avinmathew.com/nokogiri-xml-schema-validation-with-multiple-schema-files/),
- which talks about using `xs:import` to have a single schema file support
- multiple namespaces. With this, we realized that we could finally get
- our XML document validated.
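
To make the `xs:import` idea concrete, here is a rough sketch of a shell function that writes such a wrapper schema. The name `emit_wrapper_schema`, the one-pair-per-line input format, and the sample URIs are assumptions for illustration. The `targetNamespace` is just a placeholder that must differ from every imported namespace.

```shell
#!/bin/sh
# Hypothetical helper: read "namespace-URI schema-URL" lines from stdin and
# print a wrapper schema containing one xs:import per line.
emit_wrapper_schema() {
    echo '<?xml version="1.0" encoding="UTF-8"?>'
    # Placeholder targetNamespace; it must not match an imported namespace.
    echo '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.example.com/nospace">'
    while read -r ns url; do
        # No XML escaping is done here: a URL containing & or a double
        # quote would need extra handling.
        printf '  <xs:import namespace="%s" schemaLocation="%s"/>\n' "$ns" "$url"
    done
    echo '</xs:schema>'
}

# Sample input with made-up URIs.
emit_wrapper_schema <<'EOF'
http://www.example.com/ns/a http://www.example.com/a.xsd
http://www.example.com/ns/b http://www.example.com/b.xsd
EOF
```

A validator that accepts only one schema file can then be pointed at this wrapper and will pull in the per-namespace schemas through the imports.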
-
- As I know shell scripting well, I decided to write a script to automate
- creating a unified schema and validating a document. The tools don't cache
- the schema documents, so the schemas are fetched each time you
- validate the XML document. We did attempt to write the schema files
- to disk and reuse those, *but* some schemas reference other
- resources. If the schema is not retrieved from
- the web, these referenced resources are not retrieved either, causing errors
- when validating some XML documents.
-
- With a little bit of help from `xsltproc` to extract `xsi:schemaLocation`,
- it wasn't too hard to generate the schema document and provide it to
- `xmllint`.
-
- The code ([xmlval.sh](https://www.funkthat.com/~jmg/xmlval.sh)):
-
- ``` { .shell .showlines }
- #!/bin/sh -
-
- cat <<EOF |
- <?xml version="1.0"?>
- <xsl:stylesheet version="1.0"
- xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
- xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
- >
-
- <xsl:output method="text"/>
- <xsl:template match="/">
- <xsl:value-of select="/*/@xsi:schemaLocation"/>
- </xsl:template>
-
- </xsl:stylesheet>
- EOF
- xsltproc - "$1" |
- sed -e 's/  */\
- /g' |
- sed -e '/^$/d' |
- (echo '<?xml version="1.0" encoding="UTF-8"?>'
- echo '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:nospace="nospace" targetNamespace="http://www.example.com/nospace">'
- while :; do
- if ! read a; then
- break
- fi
- if ! read b; then
- break
- fi
- echo '<xs:import namespace="'"$a"'" schemaLocation="'"$b"'"/>'
- done
- echo '</xs:schema>') |
- xmllint --noout --schema - "$1"
- ```
-
-
- Though the script looks complicated, it is a straightforward pipeline:
-
- 1. Lines 3-16 provide the XSLT document to `xsltproc` on line 17 to
- extract the `xsi:schemaLocation` attribute.
- 1. Lines 18-20 replace runs of spaces with newlines and delete any
- blank lines. It should probably also handle tabs, but none of the
- documents that I have had tabs. After this, the odd
- lines contain the URI of the namespace, and the even lines
- contain the URL of the schema.
- 1. Lines 21 and 22 are the header for the new schema document.
- 1. Lines 23-31 read these line pairs and create the necessary
- `xs:import` lines.
- 1. Line 32 provides the closing element for the schema document.
- 1. Line 33 gives the schema document to `xmllint` for validation.
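
The tab handling mentioned in the second step could be sketched as below. This is a hypothetical variant, not part of the original script: `tokens_per_line` is a made-up name, and it swaps the first `sed` for `tr` so that tabs are treated like spaces.

```shell
#!/bin/sh
# Hypothetical stand-in for the space-splitting step: put every
# whitespace-separated token on its own line, treating tabs like spaces,
# then drop the empty lines left behind by runs of separators.
tokens_per_line() {
    tr ' \t' '\n\n' | sed -e '/^$/d'
}

# Sample input standing in for the xsltproc output (made-up URIs).
printf 'urn:example:a  a.xsd\turn:example:b b.xsd\n' | tokens_per_line
```

As in the original, odd output lines are namespace URIs and even lines are schema URLs, so the `read a` / `read b` loop would work unchanged on this output.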
|