PDF templating

imageHabrahabr, dear colleagues!



The problem of imprinting data in a PDF document is not new, I am not the first and I am not the last who comes across it, so I decided to share my experience of solving and at the same time present to your attention a small web application on this topic.



1. PDF format is good because it is not editable. In any case, an ordinary user is unlikely to be involved in making edits to a PDF document. This means that the PDF format is well suited for the exchange of legally significant documents.



2. PDF format is bad because it is not editable). templating, filling a PDF document form with a set of data in automatic mode is difficult, and manual mode requires the installation of paid, heavy applications.



As a programmer, I am primarily concerned about the 2nd point. How do I print the required dataset into a PDF document in a software application?



Another googling on this topic did not bring any results.

I only managed to google that everything is bad with typing ( Why is it so difficult to extract text from PDF ? , PDF from the point of view of a programmer ) and there is an option to template the docx file first, this is not difficult to do ( Filling documents in Microsoft Word ... ), and then convert in console libreoffice (librewrite) docx file to PDF. All this can be done automatically from the application.



But first, such a decision means that the project will have a heavy dependency on libreoffice.



And secondly, when converting docx to PDF in libreoffice, the document looks a little different from how it looks in word, and / or PDF generated in word from a docx file.



Finally, let's move on to the essence of the solution under consideration. Of course, "templating" in this case is a loud word, but the proposed solution is quite suitable and useful.



In python (and php) there are several libraries (not hard to google) that allow you to print strings and images into PDF files, we use pdfrw + reportlab.Canvas. That is, in principle, there is no problem to type the data, the problem with these libraries is that for each field you need to set your exact coordinates in the document, which means that



1.We need some kind of unified functionality that would store the coordinates of the fields not inside the source code, but in a separate file. I will clarify right away that from experience I recommend storing these coordinates in files and under version control, i.e. commit coordinates along with the corresponding PDF forms and methods that generate a particular set of documents. And do not put these coordinates into the database, because this will make it difficult to rollback to previous versions (coordinates) of documents, if the need arises. Everything seems to be clear here.



2. These coordinates must be calculated somehow, and this is a sad task if you do it manually.



The main idea here is to create movable div elements in the browser, use the mouse to adjust their position to the desired place in the document and save the coordinates of the elements obtained in the browser to a file on the backend. Actually, these two points are implemented in the application.



Mode of application



It looks like a small web application with a front-end and a back-end, i.e. to issue it as a python package, perhaps, will not work.



1. Download the sources from the gita



2. Install dependencies



3. Read README.md (install and configure nginx for static files)



4. In the documents folder, create a subfolder with the name of the document that needs to be generated and inside this subfolder create two files and (if necessary ) one directory with pictures:



- form.pdf # document form into which you need to

print data - fields.json # parameters of fields that need to be printed

- images # is optional, a set of images that need to be typed I

recommend that you also save the original docx file (if any), which is not involved in the generation of the document, but it will come in handy if you need to make changes and regenerate the PDF document form

- form.docx # is not necessary, any name



The fields.json file has the following structure, for example:



{
    "0": [
        [32.25, 710.25, "fio", "DejaVuSans", 12, 420],
        [425.25, 681.75, "gender", "DejaVuSans", 12, 18],
        [206.25, 681.75, "birth_date", "DejaVuSans", 12, 173],
        [462.75, 681.53, "foto.jpg", "DejaVuSans", 12, 92],
        [146.25, 665.25, "birth_place", "DejaVuSans", 12, 418],
        [228.0, 634.5, "registration", "DejaVuSans", 12, 340]
    ],
    "1": [
        [132.0, 720.76, "1_work", "DejaVuSans", 10, 260],
        [132.0, 697.51, "2_work", "DejaVuSans", 10, 260],
        [132.0, 673.51, "3_work", "DejaVuSans", 10, 141]
    ]
}

      
      





Adding / removing lines to this file adds / removes fields that are printed into the form



5. Open the page for setting fields (http://127.0.0.1/tpdf/positioning?pdf_name=ZayavlenieNaZagranpasport&page_num=1)



6. Adjust the position of the fields with the mouse in browser and save this position



7. The mouse is not always able to accurately set the desired position of the fields, in order to adjust the position of the fields, you can open the file fieldd.json and correct the coordinates manually. The data in the file is ordered by the Y coordinate and each field is stored on its own line in the file. Those. the file with the coordinates of the fields is formatted neatly, which allows you to manually make the necessary adjustments easily.



eight.We create another method for printing this type of document (if you need to somehow prepare the initial data and / or take it not from the front, but from the backend).



9. If everything is in order, then commit the resulting dataset fields.json and files (just not to my git, but to your local git, although if the document can be useful to someone else, then you can collect a public document bank, that's an idea).



The resulting file with coordinates can be used in another project, in another programming language, for example php, because the coordinates in the file are written in units of measurement (points) that are used in PDF files.



If you have a python project, then the source code of this application can be simply embedded into the project and, through the use of the main Tpdf class, generate PDF in any convenient place in the code.



Often it is necessary to generate not just one document from several pages, but to assemble several documents into one PDF file, each of which must be printed in the correct order and some of them more than once. The main class of this application has a special method for these needs that generates a set of documents, see the processing method / tpdf / example /.



Data must be passed to the main class when instantiating it. The main class can be extended with properties (@property), which will be calculated based on the input data and inserted into the PDF by property name = field name. So in the example, the fio field is displayed, and the data is transmitted last_name, first_name, middle_name.You



can deploy this small application as an independent service, and all other applications in the environment will access it for the necessary document over the network, but then there will be costs of transmission over the network, files PDFs are not too "light", the document generation itself is fast.



Instead of hundreds of words, sometimes it's better to watch a video instruction (I didn't record a sound).





Implementation experience (rake).



  1. PyPDF2, 28 3 , - . , , , , , , , - . , — . , pdfrw , , . .. 28 0.3 . : , , , .
  2. ajax, , .
  3. PDF. , , 3/4, PDF. , , .


(TODO)



  1. .



    , / / / () fields.json
  2. , , .
  3. , PDF-.
  4. A generic method that takes a dataset as input and returns a collection of documents.



All Articles