docx4java aka docx4j - OpenXML office documents in Java

docx4j 2.7.0 release candidate is now available at http://dev.plutext.org/docx4j/docx4j-2.7.0-rc1.jar

This will form the basis of the 2.7.0 release. In fact, unless there are significant issues over the next week or so, this will become the 2.7.0 release! So please try it out and report back, positive or negative…

It is mainly a maintenance release, but things of note include:

* Improvements to Maven build

* ContentAccessor interface

* AlteredParts: identify parts in this pkg which are new or altered; Patcher which adds new or altered parts.

* Support for .glox SmartArt package (/src/glox/)

* JAXB RI 2.2.3 compatibility

For contributors to this release and a more complete list of changes, please see http://dev.plutext.org/svn/docx4j/trunk … README.txt

There are 2 new dependencies (required for OpenDoPE processing): antlr-runtime-3.3.jar and stringtemplate-3.2.1.jar. For convenience, copies of these can be found in the same dir as the rc jar.

Thanks very much to everyone who contributed to this release (candidate!).

I’m pleased to announce the release today of docx4j 2.7.0.

What is docx4j?

docx4j is an open source (Apache v2) library for creating, editing, and saving OpenXML “packages”, including docx, pptx, and xlsx. It is similar to Microsoft’s OpenXML SDK, but for Java rather than .NET. It uses JAXB to create the Java objects out of the OpenXML parts.

Notable features for docx include export as HTML or PDF, and CustomXML databinding for document generation (including our OpenDoPE convention support for processing repeats and conditions).

The docx4j project started in October 2007.

What’s new?

This is mainly a maintenance release; things of note include:

Improvements to Maven build
ContentAccessor interface
AlteredParts: identify parts in this pkg which are new or altered; Patcher which adds new or altered parts.
Support for .glox SmartArt package (/src/glox/)
JAXB RI 2.2.3 compatibility
OpenDoPE support improvements

Where do you get it?

Binaries: You can download a jar alone or a tar.gz with all deps or pick and choose.

Source: Check out the source from SVN (use the pom.xml file to satisfy the dependencies, eg with m2eclipse, or download them from one of the links above)

Maven: Please see forum for details (since XML doesn’t paste nicely here right now).

Dependency changes

Antlr is now required for OpenDoPE processing; this gives us better XPath processing. The required jars are antlr-runtime-3.3.jar and stringtemplate-3.2.1.jar.

Getting Started

See the “Getting Started” guide.

Thanks to our contributors

A number of contributions have made this release what it is; thanks very much to those who contributed.

Contributors to this release and a more complete list of changes may be found in README.txt

A request to docx4j users

If you are happily using docx4j, it would be great if you could reply to this post with some words of recommendation for others who might be wondering whether docx4j is a good choice. I know there are thousands of you out there :-)

Some users have been kind enough to make such statements already; these may be found on the trac homepage.

Of course, there are a number of other ways you can contribute back. Please consider doing so, especially if you think you might find yourself looking for support from volunteers in the docx4j forums.

For reasons best known (or only known) to Google, dev.plutext.org has never been on the first page of results when you search for “docx java”, despite all the relevant posts in our forums over more than 3 years.

I can only think Google doesn’t at all like a hostname other than “www”.

So I’ve moved everything to www.docx4java.org

This shouldn’t impact you (other than having to find this new site, and update any bookmarks) unless you are using svn and have docx4j checked out.

If you have the docx4j repository checked out, you’ll want to do something like:

svn switch --relocate http://dev.plutext.org/svn/docx4j/trunk/docx4j http://www.docx4java.org/svn/docx4j/trunk/docx4j

If you are on Windows and using TortoiseSVN, use Tortoise’s “relocate” command (not its “switch” command).

That should make your SVN checkout work again.

There may be various broken or outdated links on the website. I guess I’ll fix these over time.

If you encounter any other issues, then please post to http://www.docx4java.org/forums/announces/docx4j-has-a-new-home-t815.html

The source code for the OpenDoPE Word Add-In developer edition is available at last at http://opendope.codeplex.com/

(A binary download has been available for 10 months or so now)

OpenDoPE stands for Open Document Processing Ecosystem; it’s a standards-based approach to document automation / document assembly.

Fundamentally, it is a set of conventions for doing document assembly using Open XML (the ISO-standard Microsoft Word docx file format), specifically, its content control databinding architecture.

Its real attraction is that it enables users to do document production without getting locked in to some vendor’s proprietary file format: in adopting OpenDoPE, you aren’t making any commitment above and beyond continued use of the docx file format, and a conventional approach to use of its content controls.

For further details, see the OpenDoPE website.

docx4j can combine an XML data file with an OpenDoPE docx template for you; the point of the OpenDoPE Word Add-In is to help your authors with the initial step of creating OpenDoPE docx templates.

The Word Add-In is relatively straightforward; it uses VSTO (Visual Studio Tools for Office). You’ll need Visual Studio (2010) and basic C# skills to modify it.

The point of releasing the source code is to make it easy for developers to contribute back fixes and enhancements (which has worked really well for docx4j), or extend the Add-In to create their own specialised authoring tool. The source code is in Mercurial, which – because of its distributed nature – should facilitate the latter especially.

The source code for the OpenDoPE Word Add-In (developer edition) is dual licensed, the primary license being GPL v2.

The Add-In is made possible because of the availability of the SharpDevelop “Avalon” and XML editor components. Thanks guys!

With version 2.7.1, docx4j – a library for manipulating Word docx, PowerPoint pptx, and Excel xlsx files in Java – and all its dependencies are available from Maven Central.

This makes it really easy to get going with docx4j. With Eclipse and m2eclipse installed, you just add docx4j, and you’re done. No need to mess around with manually installing jars, setting class paths etc.

This post demonstrates that, starting with a fresh OS (Win 7 is used, but these steps would work equally well on OSX or Linux).

Step 1 – Install the JDK

For the purposes of this article, I used JDK 7, but docx4j works with Java 6 and 1.5.

Step 2 – Install Eclipse Indigo (3.7.1)

I normally download the version for J2EE developers. Unzip it and run eclipse.

Step 3 – Install m2eclipse.

In Eclipse, click Help > Install New Software.

Type “http://download.eclipse.org/technology/m2e/releases” in the “Work with” field, then follow the prompts.

Step 4 – Create your Maven project

In Eclipse, File > New > Project.., then choose Maven project

You should see:

Check “Create a simple project (skip archetype selection)” then press next.

Allocate group and artifact id (what you choose as your artifact id will become the name of your new project in Eclipse):

Press Finish.

This will create a project with directories using Maven conventions:

(Note: If your starting point is a new or existing Java project in Eclipse, you can right click on the project, then choose Configure > Convert to Maven project)

Step 5 – Add docx4j to your POM

Double-click on pom.xml.

Next click on the dependencies tab, then click the “add dependency” button, and enter the docx4j coordinates (group id org.docx4j, artifact id docx4j, version 2.7.1).

The result is this pom:


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>mygroup</groupId>
  <artifactId>myartifact</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>org.docx4j</groupId>
      <artifactId>docx4j</artifactId>
      <version>2.7.1</version>
    </dependency>
  </dependencies>
</project>

Ctrl-S to save it.

m2eclipse may take some time to download the dependencies.

When it has finished, you should be able to see them:

Step 6 – Create HelloMavenCentral.java

If you made a Maven project as per step 4 above, you should already have src/main/java on your build path.

If not, create the folder and add it.

Now add a new class:

import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class HelloMavenCentral {

	public static void main(String[] args) throws Exception {

		WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();

		wordMLPackage.getMainDocumentPart()
			.addStyledParagraphOfText("Title", "Hello Maven Central");

		wordMLPackage.getMainDocumentPart().addParagraphOfText("from docx4j!");

		// Now save it
		wordMLPackage.save(new java.io.File(System.getProperty("user.dir") + "/helloMavenCentral.docx") );

	}
}

Step 7 – Click Run

When you click run, all being well, a new docx called helloMavenCentral.docx will be saved.

You can open it in Word (or anything else which can read docx), or unzip it to inspect its contents.
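
If you would rather inspect the contents without leaving Java, the following is a minimal sketch that lists the parts inside a docx using only the JDK's zip classes. The class name DocxPartLister is made up for this illustration and is not part of docx4j; the default file name assumes you ran HelloMavenCentral from the same working directory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper (not part of docx4j): lists the parts in a docx, which is an ordinary zip. */
final class DocxPartLister {

    /** Reads the zip stream to its end and returns the name of every non-directory entry. */
    static List<String> listParts(InputStream docx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: defaults to the file saved by HelloMavenCentral in the working directory.
        String path = args.length > 0 ? args[0] : "helloMavenCentral.docx";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            for (String name : listParts(in)) {
                System.out.println(name);
            }
        }
    }
}
```

Among the entries you should see word/document.xml, which is the main document part.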

Step 8 – Adding docx4j.properties

One final thing. If you plan on creating documents from scratch using docx4j, it is useful to set paper size etc, via docx4j.properties. Put something like the following on your classpath:

# Page size: use a value from org.docx4j.model.structure.PageSizePaper enum
# eg A4, LETTER
docx4j.PageSize=LETTER
# Page margins: use a value from org.docx4j.model.structure.MarginsWellKnown enum
docx4j.PageMargins=NORMAL
docx4j.PageOrientationLandscape=false

# Page size: use a value from org.pptx4j.model.SlideSizesWellKnown enum
# eg A4, LETTER
pptx4j.PageSize=LETTER
pptx4j.PageOrientationLandscape=false

# These will be injected into docProps/app.xml
# if docx4j.App.write=true
docx4j.App.write=true
docx4j.Application=docx4j
docx4j.AppVersion=2.7.1
# of the form XX.YYYY where X and Y represent numerical values

# These will be injected into docProps/core.xml
docx4j.dc.write=true
docx4j.dc.creator.value=docx4j
docx4j.dc.lastModifiedBy.value=docx4j

#
#docx4j.McPreprocessor=true

# If you haven't configured log4j yourself
# docx4j will autoconfigure it.  Set this to true to disable that
docx4j.Log4j.Configurator.disabled=false
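
If the settings don't seem to take effect, it is worth checking that the file really is visible on the classpath. The sketch below does that with plain java.util.Properties. It is only a diagnostic and makes no claim about how docx4j itself reads the file; the class name Docx4jPropertiesCheck is invented for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

/** Hypothetical diagnostic (not part of docx4j): checks whether docx4j.properties is on the classpath. */
final class Docx4jPropertiesCheck {

    /** Loads properties from the stream; a null stream (resource not found) gives an empty set. */
    static Properties read(InputStream in) throws IOException {
        Properties props = new Properties();
        if (in != null) {
            props.load(in);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // getResourceAsStream returns null when the resource is not on the classpath.
        try (InputStream in = Docx4jPropertiesCheck.class.getClassLoader()
                .getResourceAsStream("docx4j.properties")) {
            Properties props = read(in);
            if (props.isEmpty()) {
                System.out.println("docx4j.properties not found on the classpath (or it is empty)");
            } else {
                System.out.println("docx4j.PageSize = " + props.getProperty("docx4j.PageSize", "(not set)"));
                System.out.println("docx4j.PageMargins = " + props.getProperty("docx4j.PageMargins", "(not set)"));
            }
        }
    }
}
```

In a Maven project, src/main/resources is the conventional place for a file that needs to end up on the classpath.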

And that’s it. For more information on docx4j, see our Getting Started document.

I’m pleased to announce the release of docx4j 2.7.1. It was actually released 2 weeks ago, but this announcement has been delayed until I was able to publish the accompanying post on docx4j now being in Maven Central.

What is docx4j?

docx4j is an open source (Apache v2) library for creating, editing, and saving OpenXML “packages”, including docx, pptx, and xlsx. It is similar to Microsoft’s OpenXML SDK, but for Java rather than .NET. It uses JAXB to create the Java objects out of the OpenXML parts.

Notable features for docx include export as HTML or PDF, and CustomXML databinding for document generation (including our OpenDoPE convention support for processing repeats and conditions).

The docx4j project started in October 2007.

What’s new?

This is mainly a maintenance release; things of note include:

Preparation for including docx4j in Maven Central
mc:AlternateContent preprocessor, allowing graceful degradation of Word 2010 specific content
docx4j.properties, supports configuration of default page size, margins, orientation; also ability to set some of the doc props metadata (Application & AppVersion; dc.creator & dc.lastModifiedBy).
HtmlExporterNG2, (Pdf)Conversion, SvgExporter: storing any images is delegated to a ConversionImageHandler that may be passed as a conversion parameter. Default implementation: DefaultConversionImageHandler
OpenDoPE changes – see summary post in the sub-forum

Where do you get it?

Binaries: You can download a jar alone or a tar.gz with all deps or pick and choose.

Source: Check out the source from SVN (use the pom.xml file to satisfy the dependencies, eg with m2eclipse as explained in the Maven blog post, or download them from one of the links above)

Maven: From Maven Central; please see the blog post referenced above.

Getting Started

See the “Getting Started” guide.

Thanks to our contributors

A number of contributions have made this release what it is; thanks very much to those who contributed.

Contributors to this release and a more complete list of changes may be found in README.txt

There have been a couple of posts on the forum lately regarding adding hyperlinks to other parts of a docx.

This blog post walks you through the generic process for investigating an issue like this.

First, create a sample docx in Word which exhibits the issue of interest.

Here I’m interested in hyperlinks to a heading, and to a bookmark. So see this docx.

Second, look inside it (it’s a zip file). For the link to the heading, document.xml contains a w:p containing:

      <w:hyperlink w:anchor="_My_heading" w:history="1">
        <w:r>
          <w:rPr>
            <w:rStyle w:val="Hyperlink"/>
          </w:rPr>
          <w:t>My heading</w:t>
        </w:r>
      </w:hyperlink>

The heading itself is automatically given a bookmark:

    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading1"/>
      </w:pPr>
      <w:bookmarkStart w:id="0" w:name="_My_heading"/>
      <w:bookmarkEnd w:id="0"/>
      <w:r>
        <w:t>My heading</w:t>
      </w:r>
    </w:p>

For the link to my bookmark, Word 2010 used the legacy field formulation:

    <w:p>
      <w:r>
        <w:fldChar w:fldCharType="begin"/>
      </w:r>
      <w:r>
        <w:instrText xml:space="preserve"> HYPERLINK  \l "bm1" </w:instrText>
      </w:r>
      <w:r>
        <w:fldChar w:fldCharType="separate"/>
      </w:r>
      <w:r w:rsidRPr="00D16ABA">
        <w:rPr>
          <w:rStyle w:val="Hyperlink"/>
        </w:rPr>
        <w:t>bm1</w:t>
      </w:r>
      <w:r>
        <w:fldChar w:fldCharType="end"/>
      </w:r>
    </w:p>

Third, what rels are involved? To answer this, I run the docx through docx4j’s PartsList sample. It shows me that these hyperlinks don’t create any rels. Alternatively, to see this, you could have looked at the rels part when you unzipped the docx.
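
If you want to check the rels part programmatically without docx4j, here is a rough JDK-only sketch. It reads word/_rels/document.xml.rels from the zip and returns the targets of any relationships of the hyperlink type. The class name RelsInspector is invented for this example; the part path assumes the main document part is word/document.xml, which is what Word produces.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not part of docx4j): finds hyperlink relationships in a .rels part. */
final class RelsInspector {

    static final String RELS_NS =
        "http://schemas.openxmlformats.org/package/2006/relationships";
    static final String HYPERLINK_TYPE =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink";

    /** Returns the Target attribute of every Relationship whose Type is the hyperlink type. */
    static List<String> hyperlinkTargets(InputStream relsXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(relsXml);
        NodeList rels = doc.getElementsByTagNameNS(RELS_NS, "Relationship");
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            if (HYPERLINK_TYPE.equals(rel.getAttribute("Type"))) {
                targets.add(rel.getAttribute("Target"));
            }
        }
        return targets;
    }

    public static void main(String[] args) throws Exception {
        try (ZipFile zip = new ZipFile(args[0])) {
            // Assumption: the main document part is word/document.xml.
            ZipEntry entry = zip.getEntry("word/_rels/document.xml.rels");
            if (entry == null) {
                System.out.println("No rels part for the main document");
                return;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                System.out.println("Hyperlink rels: " + hyperlinkTargets(in));
            }
        }
    }
}
```

For a docx containing only internal links like the ones above, the list should come back empty, which matches what PartsList shows.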

So we can see that adding an internal hyperlink to a heading requires that it be bookmarked first. Once you have a bookmark, you use a w:hyperlink to refer to the bookmark by name (not id). There doesn’t appear to be any reason to use fields for this.
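
Since the link only works if a bookmark with that name exists, a quick sanity check can be handy. The following JDK-only sketch scans a document.xml stream and reports w:anchor values with no matching w:bookmarkStart name. AnchorChecker is a made-up name, the comparison is an exact string match (an assumption on my part), and field-style HYPERLINK \l links are not examined.

```java
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not part of docx4j): finds w:hyperlink anchors that have no bookmark. */
final class AnchorChecker {

    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the w:anchor values that do not match the w:name of any w:bookmarkStart. */
    static Set<String> danglingAnchors(InputStream documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(documentXml);

        // Collect every bookmark name in the part.
        Set<String> bookmarks = new HashSet<>();
        NodeList starts = doc.getElementsByTagNameNS(W_NS, "bookmarkStart");
        for (int i = 0; i < starts.getLength(); i++) {
            bookmarks.add(((Element) starts.item(i)).getAttributeNS(W_NS, "name"));
        }

        // Report anchors (exact match assumed) that point at no bookmark.
        Set<String> dangling = new TreeSet<>();
        NodeList links = doc.getElementsByTagNameNS(W_NS, "hyperlink");
        for (int i = 0; i < links.getLength(); i++) {
            Element link = (Element) links.item(i);
            if (link.hasAttributeNS(W_NS, "anchor")) {
                String anchor = link.getAttributeNS(W_NS, "anchor");
                if (!bookmarks.contains(anchor)) {
                    dangling.add(anchor);
                }
            }
        }
        return dangling;
    }
}
```

You could feed it the word/document.xml entry taken from the unzipped docx.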

Here’s a suitable method:

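	/**
	 * Create a Hyperlink object, which is suitable for adding to a w:p.
	 * Note that bookmarkName and linkText are inserted into the XML as-is,
	 * so they must not contain unescaped XML special characters.
	 * @param bookmarkName name (not id) of the bookmark to link to
	 * @param linkText the text displayed for the link
	 * @return the Hyperlink, or null if the XML could not be unmarshalled
	 */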
	public static Hyperlink hyperlinkToBookmark(String bookmarkName, String linkText) {

		try {

			String hpl = "<w:hyperlink w:anchor=\"" + bookmarkName + "\" xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\" " +
            "w:history=\"1\" >" +
            "<w:r>" +
            "<w:rPr>" +
            "<w:rStyle w:val=\"Hyperlink\" />" +  // TODO: enable this style in the document!
            "</w:rPr>" +
            "<w:t>" + linkText + "</w:t>" +
            "</w:r>" +
            "</w:hyperlink>";

			return (Hyperlink)XmlUtils.unmarshalString(hpl);

		} catch (Exception e) {
			// Shouldn't happen
			e.printStackTrace();
			return null;
		}

	}
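
One caveat with building the XML by string concatenation: if bookmarkName or linkText contains a character such as & or <, the string is no longer well-formed and unmarshalling will fail. A small escaping helper along the following lines would avoid that. XmlText.escape is a hypothetical name for this sketch, not a docx4j method.

```java
/** Hypothetical helper (not part of docx4j): escapes text for use in XML content or attributes. */
final class XmlText {

    /** Replaces the five XML special characters with their predefined entities. */
    static String escape(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Terms &amp; Conditions &lt;draft&gt;
        System.out.println(escape("Terms & Conditions <draft>"));
    }
}
```

With something like this in place, the method would concatenate escape(bookmarkName) and escape(linkText) rather than the raw strings.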

We can test it by altering the BookmarkAdd sample to add a link:

Hyperlink h = MainDocumentPart.hyperlinkToBookmark(bookmarkName, "link to bookmark");
wordMLPackage.getMainDocumentPart().addParagraphOfText("some text").getContent().add(h);

then check that the result opens in Word.

That’s all. Added to docx4j in revision 1777.

A customer asked me to prepare a sample Android project which converts docx to HTML.

The result is AndroidDocxToHtml

Since docx4j relies heavily on JAXB, the key to getting it working was getting JAXB – the reference implementation – to run on Android.

Android presents us with a number of challenges:

it won’t let you add a jar which includes classes in the javax.xml namespace (which is where the JAXB API lives)
JAXB uses JAXP 1.3 DatatypeFactory, but Android doesn’t provide it
JAXB uses javax.activation.DataHandler
Dalvik has a limit of 65536 method references per dex file
it doesn’t support package level annotations (which JAXB uses, and which in docx4j supply namespaces)

Ill-advised or mistaken usage of a core class (java.* or javax.*)

You’ll get this message if you try to add a jar containing classes in java.* or the following javax packages:

accessibility crypto imageio management naming net print rmi security sound sql swing transaction xml

Android doesn’t provide javax.xml.bind, and it won’t let you add it yourself. It forces you to re-package it. Just like on Google AppEngine, until Google eventually added it.

OK, done that; see https://github.com/plutext/jaxb-2_2_5_1/tree/android2 (the 2 in android2 is meaningless)

Repackaging is easy enough; the problem with it is that any library which uses the repackaged code, must also be changed. In the case of docx4j, this means a new branch, and ongoing maintenance.

JAXB uses JAXP 1.3 DatatypeFactory, but Android doesn’t provide it

com.sun.xml.bind invokes javax.xml.datatype.DatatypeFactory.newInstance, whereupon Android throws javax.xml.datatype.DatatypeConfigurationException: Provider org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl not found.

Easy solution: jar it up and provide it.
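
To confirm whether a given runtime has a usable implementation (before and after adding the jar), a tiny probe such as the following may help. It uses only the standard javax.xml.datatype API; the class name DatatypeFactoryProbe is invented for this sketch.

```java
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeFactory;

/** Hypothetical probe: reports whether a JAXP DatatypeFactory can be created on this runtime. */
final class DatatypeFactoryProbe {

    /** Returns true if DatatypeFactory.newInstance() succeeds, false if no provider is found. */
    static boolean isAvailable() {
        try {
            DatatypeFactory.newInstance();
            return true;
        } catch (DatatypeConfigurationException e) {
            // This is the exception described above when the provider class is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("DatatypeFactory available: " + isAvailable());
    }
}
```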

JAXB uses javax.activation.DataHandler

Easy solution: use the activation and additionnal jars from http://code.google.com/p/javamail-android/downloads/list

Dalvik limit of 65536 method references per dex file

This is more an issue running docx4j on Android than one related to JAXB, but it is worth noting. We’re running very close to this limit. Vote for the issue at http://code.google.com/p/android/issues/detail?id=7147

Also, you may need to give Eclipse more heap space (the symptom is the error ‘Unable to execute dex: Java heap space’). In eclipse.ini, I used:

-Xms256m

-Xmx4096m

In Eclipse, Windows > Preferences > General > Show Heap Status gives you an entry on the bottom row which is useful.

Just when I thought it would all work…

I found that my XML was not unmarshalling, because it contains namespaces, and for some reason the objects in my JAXB were being read as not having any.

The problem is that Android doesn’t support package annotations: http://code.google.com/p/android/issues/detail?id=16149 (vote), but JAXB needs to read them. For example:

@javax.xml.bind.annotation.XmlSchema(namespace = "http://schemas.openxmlformats.org/package/2006/relationships", elementFormDefault = javax.xml.bind.annotation.XmlNsForm.QUALIFIED)

I ended up devising a simple minded way to tell JAXB about these programmatically. See Context.java. Hmmm, I probably should have created my own RuntimeInlineAnnotationReader implementation (Google ‘JAXBIntroductions’).

That done, it more or less works (if you need support for other package level annotations, you’ve got a bit more to do). The re-packaged JAXB is here. You can build it using ant -f build-repackaged.xml dist

It should work on Android 3 or 4.

To use it, where your code would otherwise import javax.xml.bind, use ae.java.xml.bind.

docx4j is now on GitHub! https://github.com/plutext/docx4j

This should make it easier for users to maintain their own branches (public or private), and contribute improvements back.

As of now, GitHub is the project’s authoritative version control. We’re no longer updating the existing svn repository.

It’s pretty easy to work with docx4j sources in Eclipse. This post shows you how.

First, make sure you have eGit installed in Eclipse. Install it from here. On Windows, it is also useful to have msysgit. Refer elsewhere for how to set these up. Update: there is a GitHub Windows client now (I haven’t tried it) which apparently includes msysgit.

You also need m2eclipse

Assuming you’ve done all that, setting up the docx4j source code is just a few steps.

But first, be aware there is a difference between cloning and forking. Cloning gives you a copy of the source code you can work on, but without more, no easy way to contribute changes back. Forking sets you up with the source code, and makes it easy to contribute changes back.

If you think you might be making changes to the docx4j source code, you’re probably best to create a fork on GitHub right from the start.

Step 1 (optional, but recommended): To create a fork, log in to GitHub, visit https://github.com/plutext/docx4j then press the “Fork” button.

Step 2: Create your local repository (git clone)

This can be done from within Eclipse, or using Git Gui (easiest), or Git Bash Shell.

To do it from within Eclipse, File > Import .. > Repositories from GitHub:

If you forked docx4j, find your fork (it might not appear immediately, which is why Git Gui or Git Bash Shell are better for this step), select it, and click next.

If you didn’t fork docx4j, type ‘docx4j’ then press ‘search’; the plutext/docx4j repository should come up:

Select plutext/docx4j, then click next.

This creates a local git repository on your computer.

Step 3: Now you need to import that repository into Eclipse as a project:

File > Import .. > Projects from Git

Eclipse should find the existing project settings:

(If it didn’t and you had to use the new projects wizard, be sure to set the file location to wherever your git repository is, rather than letting Eclipse create a new empty project in the workspace.)

Now you should have a docx4j project in Eclipse, and it should be properly configured (since the project settings come with the project).

You should be done. But if something isn’t right, you can configure it manually (see further below).

Next steps? Improve the docx4j source code in Eclipse :-) , then Team > Commit, to commit those changes to your local repository.

Made a change which would be useful to others? If you forked docx4j as per step 1 above, you can push your changes to your repository on GitHub, then send a pull request.

If you didn’t fork docx4j, do that now on GitHub, then configure your local repository to push to your fork. After that you’ll be right to push your changes to your repository on GitHub, then send a pull request. Other docx4j users will thank you for this :-)

Manual configuration:

Configure > Convert to Maven Project

Properties > Java Compiler > Compiler compliance level: change to 1.6

Java Build Path > Libraries: remove 1.5 system library; Add Library … JRE System Library .. 1.6

Java Build Path > Source: check none of the entries say “Excluded: **” (remove the exclusion)

I’m pleased to say that docx4j 2.8.0 is now released.

What is docx4j?

docx4j is an open source (Apache v2) library for working with docx, pptx, and xlsx files, based around JAXB.

What’s new?

The headline feature is XHTML import. docx4j can convert XHTML to Word document content, formatting it based on the CSS. Images and tables are supported. See the ConvertInXHTMLDocument and ConvertInXHTMLFragment samples.

Where do you get it?

See our downloads page or:

Binaries: You can download a jar alone or a zip with all deps or pick and choose. If you’re upgrading from 2.7.1, you need the docx4j jar and:

for XHTML importing, xhtmlrenderer-1.0.0.jar and itext-2.1.7.jar
for PDF output, jaxb-xslfo-1.0.1.jar (now a separate project)
for pptx conversion to SVG: jaxb-svg11-1.0.2.jar (now a separate project)
for digitally signed documents, jaxb-xmldsig-core-1.0.0.jar (new)

Source: the source code is on GitHub at https://github.com/plutext/docx4j; here’s how to set up docx4j source code

Maven: docx4j 2.8.0 is in Maven Central. Here is a guide to getting started (where it says 2.7.1, just use 2.8.0).

Getting Started

See the “Getting Started” guide, in html, docx or pdf flavours.

There is lots of sample code here (freshly reviewed for 2.8.0).

Support

If you are looking for help (and have read the Getting Started Guide :-) ), you can post in our forums, or on Stack Overflow (where there is a docx4j tag).

Thanks to our contributors

A number of contributions have made this release what it is; thanks very much to those who contributed.

Contributors to this release and a more complete list of changes may be found in README.txt

Thanks also to those who have +1’d pages on this website, or tweeted or blogged about docx4j, which is critical to expanding the docx4j community!

Just launched is http://webapp.docx4java.org

You should be able to see it in the menu at the top right of this website (if not, reload the web page…).

There are three things you can do with it right now:

• Explore your docx/pptx/xlsx and its representation in docx4j

• Convert docx to PDF or XSL FO

• Merge docx files (eg cover letter plus contract) into a single docx, using Plutext’s MergeDocx. Or the same thing for pptx files, using MergePptx.

Here I want to focus on the first of these.

After you’ve uploaded your docx/pptx/xlsx, the first thing you see is like docx4j’s PartsList sample:

Here, I’ll click in the left hand column to look at the main document part, document.xml

When I do that, I see the XML:

No surprises there.

But notice the hyperlinks. Here I’ll just click on the first w:p.

What you get back is Java source code to create that complete structure:

As you can see from the image above, both styles of code (as described in docx4j’s Getting Started document) are produced for you. With a bit of luck, you can cut/paste either into your IDE (Eclipse or whatever), and just run with it!

To actually see the created object in an Office document, you’ll still need to add the created object to a part. See Getting Started, or the cheat sheet for how to do that.

I hope this helps you to create/modify your Office documents more efficiently with docx4j!

Do let us know what you think in the comments, or in docx4j’s forums.

Here’s a single A4 page reference/overview of docx4j aka a cheat sheet, in PDF or PNG format.

This one is focused on docx files (WordprocessingML).

I’ll create something similar for pptx and xlsx over coming days.

docx4j 3.0 (beta for which will be available shortly) contains a lot of changes, some big, some small.

Here are the most visible (see our changelog for the rest):

Logging

docx4j 3.0 uses slf4j, instead of log4j.

As the slf4j website puts it:

The Simple Logging Facade for Java (SLF4J) serves as a simple facade or abstraction for various logging frameworks (e.g. java.util.logging, logback, log4j) allowing the end user to plug in the desired logging framework at deployment time.

So you need the slf4j api jar on your classpath:

If you want to use log4j, then include it, and:

XHTML Import

The XHTML Import functionality is now a separate project on GitHub.

The reason is that its main dependency – Flying Saucer – is licensed under LGPL v2.1 (as opposed to ASL v2, which docx4j’s other dependencies use).

If you want this functionality, you have to add these jars to your classpath. We’ll update this post with their coordinates once they are in Maven Central.

Docx4j facade

3.0 contains a facade providing clean access to some typical uses of docx4j:

Loading a document
Saving a document
Binding xml to content controls in a document
Exporting the document (to HTML, or PDF and other formats supported by the FO renderer)

You don’t have to use this – in that existing code should continue to work – but the facade is the right way to do things. Behind the facade is a major rethink/cleanup to the export architecture/implementation, contributed by Alberto.

MOXy

The key technology underlying docx4j – and a major differentiator from Apache POI – is JAXB.

There is a JAXB reference implementation; the JAXB baked into Java 6 and 7 is based on it.

Prior to v3, you had to use the reference implementation, or the implementation included in the JDK.

With v3, you can choose to use EclipseLink MOXy instead. To do so, simply include docx4j-MOXy-JAXBContext-3.0.0.jar and the MOXy jars on your classpath.

Sample code

The docx4j samples have relocated to src/samples

A beta of docx4j 3.0 is now available, at:

http://www.docx4java.org/docx4j/docx4j-3_0-beta2.zip [link updated 15 Nov]

That zip file contains docx4j, and all its dependencies. To use it, add all the jars to your classpath.

Alternatively, Maven users can get the beta from our staging repo on GitHub.

<repositories>
    <repository>
        <id>docx4j-mvn-repo</id>
        <url>https://raw.github.com/plutext/docx4j/mvn-repo/</url>
        <snapshots>
            <enabled>true</enabled>
            <updatePolicy>always</updatePolicy>
        </snapshots>
    </repository>
</repositories>

docx4j 3.0 beta is:


<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j</artifactId>
    <version>3.0.0-SNAPSHOT</version>
</dependency>

Our last blog post outlines the major things to be aware of in v3.

Additional notes:

For convenience, the zip file also contains docx4j-ImportXHTML, and its dependencies, which are LGPL. You can delete these if you wish. They aren’t in the mvn staging repo.
To see any logging, you’ll need to add an slf4j implementation.
You might want to add a docx4j.properties file

You can find the updated Getting Started guide in docx and pdf formats at http://www.docx4java.org/docx4j/.

Feedback welcome. You can reply here, or to the post in the docx4j forums.

All going smoothly, we’ll progress to final release over the next couple of weeks, so the sooner your feedback, the better!

On behalf of everyone who has contributed to docx4j, Plutext is pleased to announce that version 3 was released today.

You can get it from Maven Central, or from http://www.docx4java.org/docx4j/ (the jar, the dependencies, or everything including documentation zipped up)

Source code is available at GitHub or from the Maven Central link above. Javadoc is at Maven Central.

For what you need to know about docx4j 3.0, please see this post.

The XHTML Import stuff is now a separate project (since it and its dependencies are LGPL, not ASLv2 like docx4j).

The three jars you need (docx4j-ImportXHTML, xhtmlrenderer, and iText) are included for convenience in the zip file above. You can delete them if you don’t need or want XHTML import.
Alternatively, you can get them from Maven Central.

docx4j 3.0 uses slf4j for logging. For convenience, log4j is the default implementation. A follow-up post will explain more about logging config.

Thanks to everyone who has helped to make this release our best yet!

If you have questions pertaining to the use of docx4j, please post them in our forum, or on StackOverflow (rather than in comments to this post).

The earlier post at blog/2011/10/hello-maven-central/ walks you through the basics of using docx4j in an Eclipse project with the help of m2eclipse.

This post is about the different ways you can set up docx4j 3.0 with the help of Maven.

We’ll be using the following skeleton pom.xml:


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
	<modelVersion>4.0.0</modelVersion>

	<groupId>your.group</groupId>
	<artifactId>your.artifact</artifactId>
	<name>nameless</name>
	<version>0.0.1-SNAPSHOT</version>
	<description>
		some description
	</description>

	<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
	</properties>

	<build>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-dependency-plugin</artifactId>
				<version>2.0</version>
			</plugin>
		</plugins>
	</build>

	<dependencies>

		<!-- dependencies go here -->

	</dependencies>

</project>

Adding the core dependency

To use docx4j, including its LGPL XHTML import capability, just include the following dependency in your pom.xml:


		<dependency>
			<groupId>org.docx4j</groupId>
			<artifactId>docx4j-ImportXHTML</artifactId>
			<version>3.0.0</version>
		</dependency>

That’ll drag in docx4j and all the other dependencies (you should be able to see them in Eclipse under Maven Dependencies, or by running mvn dependency:tree at a command prompt).

If you don’t want the XHTML import stuff, just use:


		<dependency>
			<groupId>org.docx4j</groupId>
			<artifactId>docx4j</artifactId>
			<version>3.0.0</version>
		</dependency>

(You should consider adding a docx4j.properties to your classpath)

Logging

Both of the above default to using log4j. If you are happy with log4j, you’ll need a log4j.xml file on your classpath. If you don’t already have one, you can adapt https://github.com/plutext/docx4j/blob/master/src/samples/_resources/log4j.xml to suit.

If you want to use something other than log4j for logging, well you can, since docx4j uses slf4j.

First you need to exclude the log4j stuff.


		<dependency>
			<groupId>org.docx4j</groupId>
			<artifactId>docx4j-ImportXHTML</artifactId>
			<version>3.0.0</version>
			<exclusions>
				<exclusion>
					  <groupId>org.slf4j</groupId>
					  <artifactId>slf4j-log4j12</artifactId>
				</exclusion>
				<exclusion>
					<groupId>log4j</groupId>
					<artifactId>log4j</artifactId>
				</exclusion>
			</exclusions>
		</dependency>

Then you add in the dependencies for your other logging frameworks. See further http://www.slf4j.org/ and slf4j in search.maven.org

JAXB

docx4j relies very heavily on JAXB. With Java 6 or 7, usually it’ll use the JAXB included in that (though things can be different with application servers – see the deployment forums for details).

The point here is that there is an alternative JAXB implementation, called EclipseLink MOXy (see http://www.eclipse.org/eclipselink/moxy.php), which is very well supported by its developers. You can try it with docx4j. To do so, just include the following additional dependencies:

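
		<dependency>
			<groupId>org.docx4j</groupId>
			<artifactId>docx4j-MOXy-JAXBContext</artifactId>
			<version>3.0.0</version>
		</dependency>
		<dependency>
			<groupId>org.eclipse.persistence</groupId>
			<artifactId>org.eclipse.persistence.moxy</artifactId>
			<version>2.5.1</version>
		</dependency>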

Since using MOXy with docx4j is all quite new, you may run into some minor issues. If you do, please let us know in the docx4j forums (with sufficient info for us to reproduce what you are seeing!). Thanks.

Given the news this week about Google lowering prices per GB on Google Drive, I thought it would be timely to explore interop with docx4j.

https://github.com/plutext/docx4j-cloud-GoogleDrive is a small project which demonstrates:

Upload a wordMLPackage (or presentationML or spreadsheetML pkg) to Google Drive as a docx
Download a file from Google Drive as a docx4j WordprocessingMLPackage
Convert a WordprocessingMLPackage to the specified output format, using Google Drive

Clone the project, and set it up using Maven in your IDE. I’m not going to tell you how to do that.

Enabling the Drive API

From there, it is fairly straightforward (assuming you have a Google account); you just need to enable the Drive API: set up a project and application in the Developers Console:

press the red “CREATE NEW CLIENT ID” button, then choose application type “Installed Application”; I then chose subtype “Other”
hit the “Download JSON” button; save it as client_secret.json in your project dir

Run our code

OK, now try running Docx4jUploadToGoogleDrive

It ought to say something like:

Please open the following URL in your browser then type the authorization code:

https://accounts.google.com/o/oauth2/auth?access_type=online&client_id=622239…

Paste the auth code into your IDE’s console (System.in, probably the same place which displayed the above message) then press enter. If you aren’t logged into your Google account in your browser, it’s at this point that you’ll be asked to log in.

The code will create a new docx file, and after uploading it, if successful, report the File ID allocated by Google Drive:

File ID: 0CyHdofN18p16OF9YWWNFUFdmTjg

The other 2 samples require you to provide an auth code the same way (each time you run them). Obviously, you’d be more sophisticated than this in a production application. See further https://developers.google.com/drive/web/about-auth

By now we’re used to products which emit docx files which are umm, not .. quite .. right.

But it’s more noteworthy when the product in question is from Microsoft. After all, it’s their file format (ECMA etc standardisation notwithstanding).

The product in question here is SQL Server Reporting Services 2012 and its Word export.

It seems they didn’t bother to validate their documents (eg using Open XML SDK 2.0 Productivity Tool):

Apparently there’s a reason for this:

“Word and SSRS treat page headers and footers differently. Word actually positions them inside the page margins, whereas SSRS positions them inside the area that the margins surround. As a result, in Word, the page margins do not control the distance between the top edge of the page and that of the page header (or similarly for the page footer). Instead, Word has separate “Header from Top” and “Footer from Bottom” properties to control those distances. Since RDL does not have equivalent properties, the Word renderer sets these properties to zero.”

But the problem is that it is actually setting them to blank (as opposed to zero), which is not valid.

Another problem:

JAXB doesn’t like invalid documents, so docx4j has to fix these sorts of things before it can construct a content model. (Maybe that’s why SSRS calls it Word export, not docx export:- they just check Word can open the document, then call it job done)

There are other problems with SSRS docx which the Productivity Tool doesn’t report.

Take a look at the styles part:

Notice anything wrong? It’d be better if the EmptyCellLayoutStyle had @w:styleId and @w:type, like so:

It’d also be nice if it defined the “Normal” style it is basedOn!

docx4j and other consumers could/should detect such problems and degrade gracefully in the face of them, but Microsoft (of all companies!) should exercise better quality control.

How to convert docx to PDF without using Microsoft Word?

If your docx is mainly text, tables and images, docx4j.NET may work well for you. docx4j.NET is open source (Apache software license v2), identical to the Java version, but made into a DLL using IKVM. Currently we’re at v3.2.0, released last week.

It is easy to test; you can upload your docx to the docx4j demo webapp

Or with very little effort, you can run it from a sample project in Visual Studio. It’s very easy, because docx4j.NET is in the NuGet.org repository:

To create your sample project:

make sure you have NuGet Package Manager installed
- for VS 2012 and later, it’s installed by default
- for VS 2010, NuGet is available through the Visual Studio Extension Manager; see the above link.
create a new project in Visual Studio (File > New > Project). A Console Application is fine. I chose that from the .NET 3.5 list.
from the Tools menu, choose NuGet Package Manager > Package Manager Console
type Install-Package docx4j.NET

You should see something like:

And then, your project/solution will be populated to look like:

We’re nearly there! Notice the file src/samples/c_sharp/Docx4NET/DocxToPDF.cs

Click on your project in Solution Explorer, then right click (or hit Alt+Enter) to get the properties pane:

Then set the “startup object” as shown in the above image.

Now you can hit Ctrl+F5 (“Start without Debugging”) – you don’t want to debug, since that’s really slow.

You should see some logging in the console window, culminating in “done! Press any key to continue..”

What just happened? All being well, the sample docx “src\samples\resources\sample-docx.docx” was saved as a PDF “OUT_sample-docx.pdf” in your project directory.

You can modify src/samples/c_sharp/Docx4NET/DocxToPDF.cs to read your own test docx.

A few comments.

XSL FO; Apache FOP. docx4j creates PDF via XSL FO. It generates XSL FO, then uses Apache FOP (v1.1) to convert the XSL FO to PDF. FOP also supports other output formats (the subject of another blog post).

Logging, Commons Logging. Logging is via Commons Logging. In the demo, it is configured programmatically (ie in DocxToPDF.cs). Alternatively, you could do it in app.config.

OpenXML SDK interop: src/main/c_sharp/Plutext/Docx4NET contains code for converting between a docx4j representation of a docx package, and the Open XML SDK’s representation.

Improving PDF support. To improve the quality of the PDF output, typically you’d make the improvement to docx4j first (ie the Java version), then create a new DLL using the ant build target dist.NET. docx4j is on GitHub, and is most easily set up using Maven (see earlier blog post).

Help/support/discussion. You can post in the docx4j PDF output forum, or on StackOverflow (be sure to use tag docx4j, plus some/all of c#, docx, pdf, fop, xslfo as you think appropriate). Please don’t cross post at both!

How to import HTML into a Word document without using Microsoft Word?

Honouring the CSS, so the Word document looks similar to the input XHTML. Alternatively, converting @class values to Word styles.

It’s a common requirement in our increasingly web-centric world.

docx4j-ImportXHTML.NET is open source (LGPL v2.1 or later), identical to the Java version, but made into a DLL using IKVM. Currently we’re at v3.2.0, released last week.

It is easy to test; with very little effort, you can run it from a sample project in Visual Studio. It’s very easy, because docx4j-ImportXHTML.NET is in the NuGet.org repository:

To create your sample project:

make sure you have NuGet Package Manager installed
- for VS 2012 and later, it’s installed by default
- for VS 2010, NuGet is available through the Visual Studio Extension Manager; see the above link.
create a new project in Visual Studio (File > New > Project). A Console Application is fine. I chose that from the .NET 3.5 list.
from the Tools menu, choose NuGet Package Manager > Package Manager Console
type Install-Package docx4j-ImportXHTML.NET

You should see something like:

And then, your project/solution will be populated to look like:

We’re nearly there! Notice the docx4j-ImportXHTML DLL, and the file src/samples/c_sharp/docx/ConvertInXHTMLFragment.cs. Most of the rest of the stuff comes from the docx4j dependency, which NuGet fetches.

If you have a look at ConvertInXHTMLFragment.cs, you’ll see it contains

Let’s run it, to convert that xhtml to docx content.

Click on your project in Solution Explorer, then right click (or hit Alt+Enter) to get the properties pane:

Then set the “startup object” as shown in the above image.

Now you can hit Ctrl+F5 (“Start without Debugging”) – you don’t want to debug, since that’s really slow.

You should see some logging in the console window, culminating in something like:

You can see there the WordML equivalent for the tail of the XHTML list we were converting.

Obviously, you can modify src/samples/c_sharp/docx/ConvertInXHTMLFragment.cs to read your own XHTML.

A few comments.

Well formed XML! Only well formed XML works, ie XHTML, not tag-soup HTML. If you have tag soup, it’s your responsibility to convert that to XHTML with some tidy tool. You’ll get a SAXParseException if your input is not well formed.
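
Since only well formed input is accepted, it can be handy to check a fragment before handing it to the importer. The following is a minimal sketch in Java (the language of the underlying library), using only the JDK's built-in SAX parser. The class and method names (WellFormedCheck, firstError) are made up for illustration and are not part of docx4j, and the sketch makes no claim about how docx4j-ImportXHTML parses its input internally.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of docx4j.
final class WellFormedCheck {

    /**
     * Parses the given markup with the JDK's SAX parser.
     * Returns null if it is well formed XML, otherwise a short
     * description of the first problem the parser reported.
     */
    static String firstError(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // DefaultHandler ignores content and rethrows fatal errors,
            // which is how well-formedness problems are reported.
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xhtml)),
                    new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems
            return e.toString();
        }
    }

    public static void main(String[] args) {
        String good = "<ul><li>one</li><li>two</li></ul>";
        String tagSoup = "<ul><li>one<li>two</ul>"; // unclosed <li> elements
        System.out.println("good:     " + firstError(good));
        System.out.println("tag soup: " + firstError(tagSoup));
    }
}
```

Note that named HTML entities such as &nbsp; are not defined in plain XML, so a fragment which uses them without a suitable DOCTYPE will also be reported as an error by a check like this.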

Word styles: if the target docx contains a style matching @class, it can be used. This’ll be the subject of a separate blog post.

Other examples: the Java repository on GitHub contains examples for reading from a file etc. Converting these to C# is left as an exercise for the reader. If you do that, we’d be delighted to receive a pull request on https://github.com/plutext/docx4j-ImportXHTML.NET

Logging, Commons Logging. Logging is via Commons Logging. In the demo, it is configured programmatically (ie in DocxToPDF.cs). Alternatively, you could do it in app.config.

OpenXML SDK interop: src/main/c_sharp/Plutext/Docx4NET contains code for converting between a docx4j representation of a docx package, and the Open XML SDK’s representation.

Improving XHTML import support. To implement a new feature in the XHTML import, typically you’d make the improvement to docx4j-ImportXHTML first (ie the Java version), then create a new DLL using the ant build target dist.NET. docx4j-ImportXHTML is on GitHub, and is most easily set up using Maven (see earlier blog post).

Alternatives. There are a couple of projects on CodePlex you could try:

html2openxml
htmltodocx (PHP)

I’d be interested in feedback on how they compare.

Help/support/discussion. You can post in the docx4j XHTML import forum, or on StackOverflow (be sure to use tag docx4j, plus some/all of c#, docx, xhtml etc as you think appropriate). Please don’t cross post at both!

Feedback on docx4j 2.7.0 release candidate?

docx4j 2.7.0 released

docx4j has a new home

OpenDoPE Word Add-In source code released

Hello Maven Central

docx4j 2.7.1 released

docx – internal hyperlinks

JAXB can be made to run on Android

docx4j from GitHub in Eclipse

docx4j 2.8.0 released

docx4j/pptx/xlsx online code generation

docx4j in a single page

docx4j 3.0 – what you need to know

docx4j 3.0 beta

docx4j 3.0 released

docx4j 3.0 and Maven

docx4j and Google Drive

SQL Server Reporting Services (SSRS) emits dodgy Word docx documents

docx to PDF in C#/.NET

C#/.NET: Import XHTML into docx without Word