Awesome Document: Dockerize Document Processing
17 Sep 2023
Document processing is a common task in many industries, but it can be time-consuming and complex. Docker can be used to simplify and streamline document processing by providing a containerized environment for each tool.
This article will introduce three Docker images for document processing:
docker-pandoc: A Docker image for Pandoc, a universal document converter.
docker-unoserver: A Docker image for LibreOffice, a free and open-source office suite.
docker-poppler: A Docker image for Poppler, a PDF library.
docker-pandoc
docker-pandoc is a Docker image for Pandoc, a universal document converter. Pandoc converts between many document formats, including Markdown, HTML, DOCX, and LaTeX, and can also produce PDF output.
Start the API server
docker run --rm -p 5000:5000 chanmo/pandoc
Convert a DOCX file to HTML using HTTPie
http -f POST :5000/convert/html file@~/demo.docx
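For more than one file, the same request can be wrapped in a small shell function. The following is only a sketch under a few assumptions: the server is the one started above on port 5000, and `PANDOC_API`, `html_name`, and `docx_to_html` are hypothetical names introduced for illustration, not part of docker-pandoc. The HTTPie flags `--ignore-stdin`, `--check-status`, and `-o` are standard HTTPie options.

```shell
# Sketch: batch-convert DOCX files to HTML through the docker-pandoc API.
# PANDOC_API is a hypothetical variable; its default matches the
# `-p 5000:5000` mapping above (`:5000` is HTTPie shorthand for localhost:5000).
PANDOC_API="${PANDOC_API:-http://localhost:5000}"

# Print the output path for a source file: same name, .html extension.
html_name() {
  printf '%s.html\n' "${1%.*}"
}

# Convert every file passed as an argument; stop at the first failure.
docx_to_html() {
  for src in "$@"; do
    # --ignore-stdin keeps HTTPie from reading stdin in non-interactive runs;
    # --check-status makes it exit non-zero on 4xx/5xx responses.
    http --ignore-stdin --check-status -f -o "$(html_name "$src")" \
      POST "$PANDOC_API/convert/html" "file@$src" || return 1
  done
}

# Usage: docx_to_html ~/docs/*.docx
```

Each HTML file is written next to its source. If a request fails, the output file for that document may contain the server's error response rather than HTML, so check the exit status before using the results.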
docker-unoserver
docker-unoserver is a Docker image for LibreOffice, a free and open-source office suite. LibreOffice includes a variety of applications for word processing, spreadsheets, presentations, and more.
Start the API server
docker run -p 5000:5000 chanmo/unoserver
Convert a DOCX file to PDF
http -f POST :5000/convert/pdf file@/path/to/demo.docx -o demo.pdf
If you just want to refresh the table of contents (TOC) in a DOCX file, convert it to DOCX again
http -f POST :5000/convert/docx file@/path/to/demo.docx -o demo.docx
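When the refreshed file should replace the original, it is safer to write the response to a temporary file first and only move it into place if the request succeeded. The sketch below assumes the server started above on port 5000; `UNOSERVER_API` and `refresh_toc` are hypothetical names introduced for illustration.

```shell
# Sketch: refresh the TOC of a DOCX file in place via the docker-unoserver API.
# UNOSERVER_API is a hypothetical variable; the default matches `-p 5000:5000`.
UNOSERVER_API="${UNOSERVER_API:-http://localhost:5000}"

refresh_toc() {
  src=$1
  # Write the response next to the source so the final mv stays on one filesystem.
  tmp=$(mktemp "${src}.XXXXXX") || return 1
  if http --ignore-stdin --check-status -f -o "$tmp" \
       POST "$UNOSERVER_API/convert/docx" "file@$src"; then
    # Success: replace the original with the refreshed document.
    mv "$tmp" "$src"
  else
    # Failure: keep the original untouched and discard the partial output.
    rm -f "$tmp"
    return 1
  fi
}

# Usage: refresh_toc /path/to/demo.docx
```

With `--check-status`, HTTPie exits non-zero on 4xx and 5xx responses, so an error page never overwrites the document.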
docker-poppler
docker-poppler is a Docker image for Poppler, a PDF library. Poppler can be used to render, split, and merge PDF files.
Start the API server
docker run --rm -p 5000:5000 chanmo/poppler
Convert a PDF file to HTML
http -f POST :5000/pdftohtml file@/path/to/file.pdf -o demo.html
Extract the text of a PDF file
http -f POST :5000/pdftotext file@/path/to/file.pdf
Convert a PDF file to multiple images
http -f POST :5000/pdftocairo file@/path/to/file.pdf
Get information about a PDF file
http -f POST :5000/pdfinfo file@/path/to/file.pdf
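Since the four endpoints share the same request shape, they can sit behind one helper that rejects unknown endpoint names before anything is sent. This is a minimal sketch that assumes the server started above on port 5000; `POPPLER_API` and `poppler_api` are hypothetical names, and the endpoint list is limited to the four shown in this article.

```shell
# Sketch: one wrapper for the docker-poppler endpoints used in this article.
# POPPLER_API is a hypothetical variable; the default matches `-p 5000:5000`.
POPPLER_API="${POPPLER_API:-http://localhost:5000}"

poppler_api() {
  tool=$1
  pdf=$2
  # Only the endpoints shown above are accepted; anything else fails early.
  case "$tool" in
    pdftohtml|pdftotext|pdftocairo|pdfinfo) ;;
    *) echo "unknown endpoint: $tool" >&2; return 2 ;;
  esac
  # The response goes to stdout; redirect it or pipe it as needed.
  http --ignore-stdin --check-status -f \
    POST "$POPPLER_API/$tool" "file@$pdf"
}

# Usage: poppler_api pdftotext /path/to/file.pdf > file.txt
```

The response is passed through as is, so the caller decides whether to print it, save it, or pipe it to another tool.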
Conclusion
Docker simplifies and streamlines document processing by giving each tool its own containerized environment. The three Docker images introduced in this article cover a variety of document processing tasks, including converting documents between formats, refreshing the TOC of a DOCX file with LibreOffice, and extracting text from PDF files.