|
|
|
|
|
--- |
|
|
|
title: XML Schema Validation for the command line |
|
|
|
description: > |
|
|
|
XML Schema Validation for the command line |
|
|
|
created: !!timestamp '2015-05-07' |
|
|
|
time: 2:17 PM |
|
|
|
tags: |
|
|
|
- xml |
|
|
|
- schema |
|
|
|
--- |
|
|
|
|
|
|
|
It turns out that unless you use a full-fledged XML editor, validating |
|
|
|
your XML document against a schema is difficult. Most tools require you |
|
|
|
to specify a single schema file. If you have an XML document that |
|
|
|
contains more than one name space, this doesn't work well, because often |
|
|
|
each name space is in a separate schema file. |
|
|
|
|
|
|
|
The XML document has xmlns attributes which use a URI as the identifier. |
|
|
|
These URIs only identify the name space. They are not necessarily URLs, so a validator cannot use them to locate the schema. |
|
|
|
In fact, URIs that differ only in case identify different name spaces, even |
|
|
|
in the "host" part, where case does not matter for URLs. In order for |
|
|
|
validators to find the schema, the attribute |
|
|
|
[xsi:schemaLocation](http://www.w3.org/TR/xmlschema-1/#schema-loc) is |
|
|
|
used to map the name space URIs to the URLs of the schema. |
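
To make the mapping concrete, here is a minimal sketch of a document that uses two name spaces and declares where each schema lives. Every URI and URL in it is a made-up placeholder under `www.example.com`, and `sample_doc` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Sketch: print a small document that uses two name spaces.
# All name space URIs and schema URLs below are hypothetical placeholders.
sample_doc() {
    cat <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<a:order xmlns:a="http://www.example.com/ns/order"
         xmlns:b="http://www.example.com/ns/customer"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.example.com/ns/order http://www.example.com/schema/order.xsd
                             http://www.example.com/ns/customer http://www.example.com/schema/customer.xsd">
  <b:customer>Example</b:customer>
</a:order>
EOF
}

sample_doc
```

The first token of each pair in `xsi:schemaLocation` is the name space URI, and the second is the URL a validator could fetch the schema from.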
|
|
|
|
|
|
|
The `xsi:schemaLocation` mapping is very simple: it is just a |
|
|
|
whitespace-delimited list of URI/URL pairs. None of the command line tools |
|
|
|
that I tried uses this attribute to make schema validation simple. |
|
|
|
This includes [xmllint](http://xmlsoft.org/xmllint.html) which uses |
|
|
|
the libxml2 library. I also tried to use the Java XML library |
|
|
|
Xerces, but was unable to get it to work. Xerces did not provide a |
|
|
|
simple command line utility, and I couldn't figure out the correct java |
|
|
|
command line to invoke the validator class. |
|
|
|
|
|
|
|
My coworker, [Patrick](http://fivetwentysix.com/), found the blog entry, |
|
|
|
[Nokogiri XML schema validation with multiple schema files](http://avinmathew.com/nokogiri-xml-schema-validation-with-multiple-schema-files/), |
|
|
|
which talks about using `xs:import` to have a single schema file support |
|
|
|
multiple name spaces. With this, we realized that we could finally get |
|
|
|
our XML document validated. |
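
As a rough illustration of the `xs:import` idea, the sketch below prints a wrapper schema with one `xs:import` per name space. The helper name `wrapper_schema`, the wrapper's `targetNamespace`, and the example URIs are all assumptions made for this sketch. They are not taken from the blog entry.

```shell
#!/bin/sh
# Sketch: print a wrapper schema that imports one schema per name space.
# Arguments are name-space-URI / schema-URL pairs. wrapper_schema is a
# hypothetical helper name, and the targetNamespace is a placeholder.
wrapper_schema() {
    echo '<?xml version="1.0" encoding="UTF-8"?>'
    echo '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.example.com/wrapper">'
    # Consume the arguments two at a time: name space first, then location.
    while [ "$#" -ge 2 ]; do
        printf '  <xs:import namespace="%s" schemaLocation="%s"/>\n' "$1" "$2"
        shift 2
    done
    echo '</xs:schema>'
}

# Example with placeholder URIs and URLs.
wrapper_schema \
    http://www.example.com/ns/order http://www.example.com/schema/order.xsd \
    http://www.example.com/ns/customer http://www.example.com/schema/customer.xsd
```

Saved to a file, the output could then be handed to `xmllint --noout --schema` as the single schema that the tool expects.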
|
|
|
|
|
|
|
As I know shell scripting well, I decided to write a script to automate |
|
|
|
creating a unified schema and validating a document. The tools don't cache |
|
|
|
the schema documents, requiring fetching the schema each time you want |
|
|
|
to validate the XML document. We did attempt to write the schema files |
|
|
|
to disk, and reuse those, *but* there are issues in that some schemas |
|
|
|
reference other resources in them. If the schema is not retrieved from |
|
|
|
the web, these internal resources are not retrieved either, causing errors |
|
|
|
when validating some XML documents. |
|
|
|
|
|
|
|
With a little bit of help from `xsltproc` to extract `xsi:schemaLocation`, |
|
|
|
it wasn't too hard to generate the schema document and provide it to |
|
|
|
xmllint. |
|
|
|
|
|
|
|
The code ([xmlval.sh](http://www.funkthat.com/~jmg/xmlval.sh)): |
|
|
|
|
|
|
|
``` { .shell .showlines } |
|
|
|
#!/bin/sh - |
|
|
|
|
|
|
|
cat <<EOF | |
|
|
|
<?xml version="1.0"?> |
|
|
|
<xsl:stylesheet version="1.0" |
|
|
|
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" |
|
|
|
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" |
|
|
|
> |
|
|
|
|
|
|
|
<xsl:output method="text"/> |
|
|
|
<xsl:template match="/"> |
|
|
|
<xsl:value-of select="/*/@xsi:schemaLocation"/> |
|
|
|
</xsl:template> |
|
|
|
|
|
|
|
</xsl:stylesheet> |
|
|
|
EOF |
|
|
|
xsltproc - "$1" | |
|
|
|
sed -e 's/  */\ |
|
|
|
/g' | |
|
|
|
sed -e '/^$/d' | |
|
|
|
(echo '<?xml version="1.0" encoding="UTF-8"?>' |
|
|
|
echo '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:nospace="nospace" targetNamespace="http://www.example.com/nospace">' |
|
|
|
while :; do |
|
|
|
if ! read a; then |
|
|
|
break |
|
|
|
fi |
|
|
|
if ! read b; then |
|
|
|
break |
|
|
|
fi |
|
|
|
echo '<xs:import namespace="'"$a"'" schemaLocation="'"$b"'"/>' |
|
|
|
done |
|
|
|
echo '</xs:schema>') | |
|
|
|
xmllint --noout --schema - "$1" |
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
Though the script looks complicated, it is a straightforward pipeline: |
|
|
|
|
|
|
|
1. Lines 3-16 provide the XSLT document to `xsltproc` on line 17 to |
|
|
|
extract the `xsi:schemaLocation` attribute. |
|
|
|
1. Lines 18-20 replace runs of spaces with newlines and delete any |
|
|
|
blank lines. It should probably also handle tabs, but none of the |
|
|
|
documents that I have had tabs. After this, we now have the odd |
|
|
|
lines containing the URI of the name space, and the even lines |
|
|
|
containing the URL for the schema. |
|
|
|
1. Lines 21 and 22 are the header for the new schema document. |
|
|
|
1. Lines 23-31 pull in these line pairs and create the necessary |
|
|
|
`xs:import` lines. |
|
|
|
1. Line 32 provides the closing element for the schema document. |
|
|
|
1. Line 33 gives the schema document to xmllint for validation. |
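
The second step notes that tabs are not handled. One possible way to cover them, sketched here and not part of `xmlval.sh`, is to swap the two `sed` commands for `tr`, which can treat spaces, tabs, and newlines alike. The name `split_tokens` and the sample tokens are hypothetical.

```shell
#!/bin/sh
# Sketch: split a schemaLocation value into one token per line,
# treating spaces, tabs, and newlines alike.
# split_tokens is a hypothetical helper name, not part of xmlval.sh.
split_tokens() {
    # Map each space, tab, or newline to a newline and squeeze runs (-s),
    # then drop the blank line that leading whitespace leaves behind.
    tr -s ' \t\n' '\n\n\n' | sed -e '/^$/d'
}

# Example: tokens separated by a mix of spaces, a tab, and a newline.
printf ' urn:example:a\thttp://www.example.com/a.xsd\n  urn:example:b http://www.example.com/b.xsd\n' |
    split_tokens
```

As in the original pipeline, the odd output lines hold the name space URIs and the even lines hold the schema URLs, so the `while` loop could consume them unchanged.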