---
title: XML Schema Validation for the command line
description: >
  XML Schema Validation for the command line
posted: !!timestamp '2015-05-07'
created: !!timestamp '2015-05-07'
time: 2:17 PM
tags:
  - xml
  - schema
---

It turns out that unless you use a full-fledged XML editor, validating
your XML document against a schema is difficult. Most tools require you
to specify a single schema file. That doesn't work well for an XML
document that contains more than one namespace, as each namespace is
often defined in a separate schema file.

The XML document has `xmlns` attributes which use a URI as the identifier.
These URIs only identify the namespace; they are not URLs, so they cannot
be used to locate the schema. In fact, namespace URIs that differ only in
case identify different namespaces, even when the difference is in the
"host" part, which is not the case for URLs. In order for
validators to find the schema, the attribute
<code>[xsi:schemaLocation](https://www.w3.org/TR/xmlschema-1/#schema-loc)</code> is
used to map the namespace URIs to the URLs of the schemas.

The `xsi:schemaLocation` mapping is simply a whitespace-delimited
list of URI/URL pairs. None of the command line tools
that I tried uses this attribute to make schema validation simple.
This includes [xmllint](https://web.archive.org/web/20210415145100/http://xmlsoft.org/xmllint.html)<label for="sn-xmlintarchive"
class="margin-toggle sidenote-number"></label>
<input type="checkbox" id="sn-xmlintarchive" class="margin-toggle"/><span class="sidenote">Via Wayback Machine as the original link is HTTP only.</span>
which uses the libxml2 library. I also tried to use the Java XML library
Xerces, but was unable to get it to work. Xerces did not provide a
simple command line utility, and I couldn't figure out the correct `java`
command line to invoke the validator class.
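
To make the pair format concrete, here is a minimal sketch that splits an `xsi:schemaLocation` value into its namespace/URL pairs using plain shell word splitting. The function name `print_pairs` and the `example.com` URIs are made up for illustration; they are not taken from any real document.

```shell
#!/bin/sh
# Print each "namespace -> schema URL" pair found in an
# xsi:schemaLocation value passed as the first argument.
print_pairs() {
    set -f          # disable globbing while the value is split
    set -- $1       # unquoted on purpose: split on spaces, tabs, newlines
    set +f
    while [ "$#" -ge 2 ]; do
        printf '%s -> %s\n' "$1" "$2"
        shift 2
    done
}

# Hypothetical attribute value with two namespace/URL pairs.
print_pairs 'http://www.example.com/ns/a http://www.example.com/a.xsd
             http://www.example.com/ns/b http://www.example.com/b.xsd'
```

Each output line shows one namespace URI and the URL where its schema can be fetched. A trailing token without a partner is silently ignored.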

My coworker, [Patrick](https://web.archive.org/web/20151012162546/http://fivetwentysix.com/)<label for="sn-526archive" class="margin-toggle sidenote-number"></label>
<input type="checkbox" id="sn-526archive" class="margin-toggle"/>
<span class="sidenote">Via Wayback Machine as the original link is now defunct.</span>, found the blog entry,
[Nokogiri XML schema validation with multiple schema files](https://avinmathew.com/nokogiri-xml-schema-validation-with-multiple-schema-files/),
which talks about using `xs:import` to have a single schema file support
multiple namespaces. With this, we realized that we could finally get
our XML document validated.

As I know shell scripting well, I decided to write a script to automate
creating a unified schema and validating a document. The tools don't cache
the schema documents, so the schemas are fetched each time you
validate an XML document. We did attempt to write the schema files
to disk and reuse those, *but* some schemas
reference other resources. If the schema is not retrieved from
the web, these referenced resources are not retrieved either, causing errors
when validating some XML documents.

With a little bit of help from `xsltproc` to extract `xsi:schemaLocation`,
it wasn't too hard to generate the schema document and provide it to
`xmllint`.

The code ([xmlval.sh](https://www.funkthat.com/~jmg/xmlval.sh)):

``` { .shell .showlines }
#!/bin/sh -

cat <<EOF |
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>
<xsl:output method="text"/>

<xsl:template match="/">
	<xsl:value-of select="/*/@xsi:schemaLocation"/>
</xsl:template>

</xsl:stylesheet>
EOF
	xsltproc - "$1" |
	sed -e 's/  */\
/g' |
	sed -e '/^$/d' |
	(echo '<?xml version="1.0" encoding="UTF-8"?>'
	echo '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:nospace="nospace" targetNamespace="http://www.example.com/nospace">'
	while :; do	# consume the lines two at a time
		if ! read -r a; then	# namespace URI
			break
		fi
		if ! read -r b; then	# schema URL
			break
		fi
		echo '<xs:import namespace="'"$a"'" schemaLocation="'"$b"'"/>'
	done
	echo '</xs:schema>') |
	xmllint --noout --schema - "$1"
```

Though the script looks complicated, it is a straightforward pipeline:

1. Lines 3-16 provide the XSLT document to `xsltproc` on line 17 to
   extract the schema location attribute.
1. Lines 18-20 replace runs of spaces with newlines and delete any
   blank lines. It should probably also handle tabs, but none of the
   documents that I have had tabs. After this, the odd
   lines contain the URI of the namespace, and the even lines
   contain the URL for the schema.
1. Lines 21 and 22 are the header for the new schema document.
1. Lines 23-31 pull in these line pairs and create the necessary
   `xs:import` lines.
1. Line 32 provides the closing element for the schema document.
1. Line 33 gives the schema document to `xmllint` for validation.
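
The script only splits on spaces, so tabs are left alone. One possible way to cover tabs as well is to let `tr` do the splitting, as sketched below. The function name `split_tokens` is hypothetical, and this variant has not been tried against the documents discussed here.

```shell
#!/bin/sh
# Read whitespace-separated text on stdin and print one token per line.
split_tokens() {
    # Map spaces and tabs to newlines; -s squeezes runs of newlines
    # into one. The sed stage drops a possible leading empty line.
    tr -s ' \t' '\n\n' | sed -e '/^$/d'
}

# Example input mixing a tab and multiple spaces.
printf 'a\tb  c\n' | split_tokens
```

The example prints `a`, `b`, and `c` on separate lines. In the script above, a function like this could stand in for the two `sed` stages, leaving the rest of the pipeline unchanged.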