Last modified: 2023-04-08
Paperless
Paperless is an open source document management system that indexes your scanned documents and allows you to easily search for documents and store metadata alongside your documents.
A supercharged version of paperless: scan, index and archive all your physical documents.
I am going with a bare-metal installation, and to manage all dependencies I use a meta package called wht_nas_paperless.
Most of these steps are taken from the documentation.
Bug
If you see this error log, it is a known issue:
bad escape \d at position 7 : Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/django_q/cluster.py", line 432, in worker
res = f(*task["args"], **task["kwargs"])
File "/usr/share/paperless/src/documents/tasks.py", line 154, in consume_file
document = Consumer().try_consume_file(
File "/usr/share/paperless/src/documents/consumer.py", line 334, in try_consume_file
date = parse_date(self.filename, text)
File "/usr/share/paperless/src/documents/parsers.py", line 221, in parse_date
return next(parse_date_generator(filename, text), None)
File "/usr/share/paperless/src/documents/parsers.py", line 280, in parse_date_generator
yield from __process_content(text, settings.DATE_ORDER)
File "/usr/share/paperless/src/documents/parsers.py", line 271, in __process_content
date = __process_match(m, date_order)
File "/usr/share/paperless/src/documents/parsers.py", line 262, in __process_match
date = __parser(date_string, date_order)
File "/usr/share/paperless/src/documents/parsers.py", line 235, in __parser
return dateparser.parse(
File "/usr/lib/python3.10/site-packages/dateparser/conf.py", line 92, in wrapper
return f(*args, **kwargs)
File "/usr/lib/python3.10/site-packages/dateparser/__init__.py", line 61, in parse
data = parser.get_date_data(date_string, date_formats)
File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 456, in get_date_data
parsed_date = _DateLocaleParser.parse(
File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 200, in parse
return instance._parse()
File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 204, in _parse
date_data = self._parsers[parser_name]()
File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 224, in _try_freshness_parser
return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
File "/usr/lib/python3.10/site-packages/dateparser/date.py", line 262, in _get_translated_date
self._translated_date = self.locale.translate(
File "/usr/lib/python3.10/site-packages/dateparser/languages/locale.py", line 131, in translate
relative_translations = self._get_relative_translations(settings=settings)
File "/usr/lib/python3.10/site-packages/dateparser/languages/locale.py", line 159, in _get_relative_translations
self._generate_relative_translations(normalize=True))
File "/usr/lib/python3.10/site-packages/dateparser/languages/locale.py", line 173, in _generate_relative_translations
pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
File "/usr/lib/python3.10/site-packages/regex/regex.py", line 710, in _compile_replacement_helper
is_group, items = _compile_replacement(source, pattern, is_unicode)
File "/usr/lib/python3.10/site-packages/regex/_regex_core.py", line 1737, in _compile_replacement
raise error("bad escape \\%s" % ch, source.string, source.pos)
regex._regex_core.error: bad escape \d at position 7
- [BUG] regex._regex_core.error: bad escape \d at position 7 #1684
- [BUG] Paperless fails import at date parsing stage #1201
- [BUG] Paperless fails import at date parsing stage #1200
- [BUG] Upload fails due to date parsing errors #1188
They mostly blame Arch Linux packaging, which does not follow Python's requirements.txt. In my opinion, pinning a dependency to a specific version is not the solution.
The underlying problem is an incompatibility between two or more Python packages; they should already have open issues about it, and hopefully it will be fixed soon.
UPDATE: The problem is fixed with python-dateparser v1.1.4.
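To tell whether a machine is still affected, one option is to compare the installed python-dateparser version against 1.1.4. The following is only a minimal sketch: the helper name version_ge is my own invention, and it assumes GNU sort with -V (version sort) is available.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Relies on GNU `sort -V`, which orders strings as version numbers.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical usage on Arch Linux (not run here).
# `pacman -Q` prints "name version-pkgrel"; the pkgrel suffix is stripped
# with ${installed%-*}. Assumes the version has no epoch prefix.
#   installed="$(pacman -Q python-dateparser | awk '{print $2}')"
#   version_ge "${installed%-*}" "1.1.4" && echo "fixed" || echo "still affected"
```

A plain string comparison would get 1.1.10 versus 1.1.4 wrong, which is why the sketch leans on version sort.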
Bug
As described in AUR/paperless-ngx, if you see the following error:
Mar 10 21:36:21 ark systemd[1]: Started Paperless Celery Workers.
Mar 10 21:36:22 ark celery[284848]: Traceback (most recent call last):
Mar 10 21:36:22 ark celery[284848]: File "/usr/bin/celery", line 33, in <module>
Mar 10 21:36:22 ark celery[284848]: sys.exit(load_entry_point('celery==5.2.7', 'console_scripts', 'celery')())
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/celery/__main__.py", line 15, in main
Mar 10 21:36:22 ark celery[284848]: sys.exit(_main())
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/celery/bin/celery.py", line 217, in main
Mar 10 21:36:22 ark celery[284848]: return celery(auto_envvar_prefix="CELERY")
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
Mar 10 21:36:22 ark celery[284848]: return self.main(*args, **kwargs)
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/click/core.py", line 1055, in main
Mar 10 21:36:22 ark celery[284848]: rv = self.invoke(ctx)
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/click/core.py", line 1655, in invoke
Mar 10 21:36:22 ark celery[284848]: sub_ctx = cmd.make_context(cmd_name, args, parent=ctx)
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/click/core.py", line 920, in make_context
Mar 10 21:36:22 ark celery[284848]: self.parse_args(ctx, args)
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/click/core.py", line 1378, in parse_args
Mar 10 21:36:22 ark celery[284848]: value, args = param.handle_parse_result(ctx, opts, args)
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/click/core.py", line 2360, in handle_parse_result
Mar 10 21:36:22 ark celery[284848]: value = self.process_value(ctx, value)
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/click/core.py", line 2316, in process_value
Mar 10 21:36:22 ark celery[284848]: value = self.type_cast_value(ctx, value)
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/click/core.py", line 2304, in type_cast_value
Mar 10 21:36:22 ark celery[284848]: return convert(value)
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/click/types.py", line 82, in __call__
Mar 10 21:36:22 ark celery[284848]: return self.convert(value, param, ctx)
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/celery/bin/worker.py", line 58, in convert
Mar 10 21:36:22 ark celery[284848]: value = concurrency.get_implementation(worker_pool)
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/celery/concurrency/__init__.py", line 28, in get_implementation
Mar 10 21:36:22 ark celery[284848]: return symbol_by_name(cls, ALIASES)
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/kombu/utils/imports.py", line 56, in symbol_by_name
Mar 10 21:36:22 ark celery[284848]: module = imp(module_name, package=package, **kwargs)
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
Mar 10 21:36:22 ark celery[284848]: return _bootstrap._gcd_import(name[level:], package, level)
Mar 10 21:36:22 ark celery[284848]: File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
Mar 10 21:36:22 ark celery[284848]: File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
Mar 10 21:36:22 ark celery[284848]: File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
Mar 10 21:36:22 ark celery[284848]: File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
Mar 10 21:36:22 ark celery[284848]: File "<frozen importlib._bootstrap_external>", line 883, in exec_module
Mar 10 21:36:22 ark celery[284848]: File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/celery/concurrency/prefork.py", line 19, in <module>
Mar 10 21:36:22 ark celery[284848]: from .asynpool import AsynPool
Mar 10 21:36:22 ark celery[284848]: File "/usr/lib/python3.10/site-packages/celery/concurrency/asynpool.py", line 29, in <module>
Mar 10 21:36:22 ark celery[284848]: from billiard.compat import buf_t, isblocking, setblocking
Mar 10 21:36:22 ark celery[284848]: ImportError: cannot import name 'buf_t' from 'billiard.compat' (/usr/lib/python3.10/site-packages/billiard/compat.py)
Mar 10 21:36:22 ark systemd[1]: paperless-task-queue.service: Main process exited, code=exited, status=1/FAILURE
Mar 10 21:36:22 ark systemd[1]: paperless-task-queue.service: Failed with result 'exit-code'.
The problem is that paperless-ngx requires python-billiard v3.x.x, and Arch Linux already ships python-billiard v4.x.x. To fix this, you have to manually install the old python-billiard version.
Then do
# systemctl restart paperless.target
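Before downgrading, it may help to confirm which major version of python-billiard is actually installed. This is a rough sketch; billiard_is_compatible is a made-up helper name, and the downgrade command in the comments assumes an older 3.x package file is still in the local pacman cache.

```shell
# Succeed only when the given billiard version string has major version 3,
# which is what paperless-ngx required at the time of writing.
billiard_is_compatible() {
    [ "${1%%.*}" = "3" ]
}

# Hypothetical usage on Arch Linux (not run here; assumes no epoch prefix):
#   installed="$(pacman -Q python-billiard | awk '{print $2}')"
#   billiard_is_compatible "$installed" || echo "python-billiard $installed is too new"
#
# Assumption: an old 3.x package file is still in the pacman cache.
# The exact file name depends on your system, so it stays a placeholder:
#   pacman -U /var/cache/pacman/pkg/<old python-billiard 3.x package file>
```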
Installation
The package itself comes from the AUR. I have added it to my personal repository to be built.
Unfortunately, many of its dependencies are also in the AUR, and I had to add them all manually :/
wht_server_paperless/PKGBUILD
# Maintainer: Vojtech Vesely <vojtech.vesely@protonmail.com>
pkgname=wht_server_paperless
pkgver=1.0.0
pkgrel=1
pkgdesc='archlinux meta package - set of multiple meta packages for paperless server'
arch=('x86_64')
url='https://git.sr.ht/~atomicfs/atomicfs-repo-arch'
license=('MIT')
depends=(
# Other meta packages
'wht_server_https'
'wht_server_mariadb'
# Services
'paperless-ngx' # A supercharged version of paperless: scan, index and archive all your physical documents
# tesseract-data
# mostly for paperless-ngx
'tesseract-data-afr'
'tesseract-data-amh'
'tesseract-data-ara'
'tesseract-data-asm'
'tesseract-data-aze'
'tesseract-data-aze_cyrl'
'tesseract-data-bel'
'tesseract-data-ben'
'tesseract-data-bod'
'tesseract-data-bos'
'tesseract-data-bre'
'tesseract-data-bul'
'tesseract-data-cat'
'tesseract-data-ceb'
'tesseract-data-ces'
'tesseract-data-chi_sim'
'tesseract-data-chi_tra'
'tesseract-data-chr'
'tesseract-data-cos'
'tesseract-data-cym'
'tesseract-data-dan'
'tesseract-data-dan_frak'
'tesseract-data-deu'
'tesseract-data-deu_frak'
'tesseract-data-div'
'tesseract-data-dzo'
'tesseract-data-ell'
'tesseract-data-eng'
'tesseract-data-enm'
'tesseract-data-epo'
'tesseract-data-equ'
'tesseract-data-est'
'tesseract-data-eus'
'tesseract-data-fao'
'tesseract-data-fas'
'tesseract-data-fil'
'tesseract-data-fin'
'tesseract-data-fra'
'tesseract-data-frk'
'tesseract-data-frm'
'tesseract-data-fry'
'tesseract-data-gla'
'tesseract-data-gle'
'tesseract-data-glg'
'tesseract-data-grc'
'tesseract-data-guj'
'tesseract-data-hat'
'tesseract-data-heb'
'tesseract-data-hin'
'tesseract-data-hrv'
'tesseract-data-hun'
'tesseract-data-hye'
'tesseract-data-iku'
'tesseract-data-ind'
'tesseract-data-isl'
'tesseract-data-ita'
'tesseract-data-ita_old'
'tesseract-data-jav'
'tesseract-data-jpn'
'tesseract-data-jpn_vert'
'tesseract-data-kan'
'tesseract-data-kat'
'tesseract-data-kat_old'
'tesseract-data-kaz'
'tesseract-data-khm'
'tesseract-data-kir'
'tesseract-data-kmr'
'tesseract-data-kor'
'tesseract-data-kor_vert'
'tesseract-data-lao'
'tesseract-data-lat'
'tesseract-data-lav'
'tesseract-data-lit'
'tesseract-data-ltz'
'tesseract-data-mal'
'tesseract-data-mar'
'tesseract-data-mkd'
'tesseract-data-mlt'
'tesseract-data-mon'
'tesseract-data-mri'
'tesseract-data-msa'
'tesseract-data-mya'
'tesseract-data-nep'
'tesseract-data-nld'
'tesseract-data-nor'
'tesseract-data-oci'
'tesseract-data-ori'
'tesseract-data-pan'
'tesseract-data-pol'
'tesseract-data-por'
'tesseract-data-pus'
'tesseract-data-que'
'tesseract-data-ron'
'tesseract-data-rus'
'tesseract-data-san'
'tesseract-data-sin'
'tesseract-data-slk'
'tesseract-data-slk_frak'
'tesseract-data-slv'
'tesseract-data-snd'
'tesseract-data-spa'
'tesseract-data-spa_old'
'tesseract-data-sqi'
'tesseract-data-srp'
'tesseract-data-srp_latn'
'tesseract-data-sun'
'tesseract-data-swa'
'tesseract-data-swe'
'tesseract-data-syr'
'tesseract-data-tam'
'tesseract-data-tat'
'tesseract-data-tel'
'tesseract-data-tgk'
'tesseract-data-tgl'
'tesseract-data-tha'
'tesseract-data-tir'
'tesseract-data-ton'
'tesseract-data-tur'
'tesseract-data-uig'
'tesseract-data-ukr'
'tesseract-data-urd'
'tesseract-data-uzb'
'tesseract-data-uzb_cyrl'
'tesseract-data-vie'
'tesseract-data-yid'
'tesseract-data-yor'
)
Paperless will not allow symlinks to persistent storage! Since I run this on a NAS with RAID (the system disk is separate), my first thought was to create symlinks. That is not possible.
You can change the location of persistent storage for paperless, but I prefer to keep the default.
Since I use btrfs, I just created a new subvolume and mounted it at /var/lib/paperless.
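The subvolume setup could look roughly like the sketch below. Everything specific here is an assumption for illustration: the helper name btrfs_fstab_line, the /mnt/pool mount point of the top-level btrfs filesystem, and the subvolume name paperless.

```shell
# Print an /etc/fstab line that mounts a btrfs subvolume at a mount point.
# $1 = filesystem UUID, $2 = subvolume path inside the filesystem, $3 = mount point
btrfs_fstab_line() {
    printf 'UUID=%s %s btrfs subvol=%s,defaults 0 0\n' "$1" "$3" "$2"
}

# Hypothetical usage (run as root, not run here; paths are placeholders):
#   btrfs subvolume create /mnt/pool/paperless
#   btrfs_fstab_line "<uuid of the RAID filesystem>" /paperless /var/lib/paperless >> /etc/fstab
#   mount /var/lib/paperless
#   chown paperless: /var/lib/paperless   # assumes the service runs as user paperless
```

Because this is a real mount and not a symlink, paperless sees an ordinary directory at its default location.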
Configuration
As for the configuration, most of it is default; I have changed just a few things (mostly secrets, passwords and stuff). Importantly, I have changed PAPERLESS_DBPORT to the port my MariaDB database is configured to use.
/etc/paperless.conf##template
This is a yadm template. I used a template so that I can store passwords separately in encrypted files.
# WARNING: Do not edit this file.
# It was generated by processing {{ yadm.source }}
# Have a look at the docs for documentation.
# https://paperless-ngx.readthedocs.io/en/latest/configuration.html
# Debug. Only enable this for development.
#PAPERLESS_DEBUG=false
# Required services
#PAPERLESS_REDIS=redis://localhost:6379
PAPERLESS_DBHOST=localhost
PAPERLESS_DBPORT=3306
#PAPERLESS_DBNAME=paperless
#PAPERLESS_DBUSER=paperless
#PAPERLESS_DBPASS=paperless
{% include "paperless.passwd" %}
#PAPERLESS_DBSSLMODE=prefer
PAPERLESS_DBENGINE=mariadb
# Paths and folders
PAPERLESS_CONSUMPTION_DIR=/var/lib/paperless/consume
PAPERLESS_DATA_DIR=/var/lib/paperless/data
#PAPERLESS_TRASH_DIR=
PAPERLESS_MEDIA_ROOT=/var/lib/paperless/media
PAPERLESS_STATICDIR=/usr/share/paperless/static
PAPERLESS_FILENAME_FORMAT={created_year}/{correspondent}/{created}__{title}
PAPERLESS_FILENAME_FORMAT_REMOVE_NONE=True
# Security and hosting
#PAPERLESS_SECRET_KEY=change-me
{% include "paperless.key" %}
#PAPERLESS_URL=https://example.com
#PAPERLESS_CSRF_TRUSTED_ORIGINS=https://example.com # can be set using PAPERLESS_URL
#PAPERLESS_ALLOWED_HOSTS=example.com,www.example.com # can be set using PAPERLESS_URL
#PAPERLESS_CORS_ALLOWED_HOSTS=https://localhost:8080,https://example.com # can be set using PAPERLESS_URL
#PAPERLESS_FORCE_SCRIPT_NAME=
#PAPERLESS_STATIC_URL=/static/
#PAPERLESS_AUTO_LOGIN_USERNAME=
#PAPERLESS_COOKIE_PREFIX=
#PAPERLESS_ENABLE_HTTP_REMOTE_USER=false
# OCR settings
#PAPERLESS_OCR_LANGUAGE=eng
#PAPERLESS_OCR_MODE=skip
#PAPERLESS_OCR_OUTPUT_TYPE=pdfa
#PAPERLESS_OCR_PAGES=1
#PAPERLESS_OCR_IMAGE_DPI=300
#PAPERLESS_OCR_CLEAN=clean
#PAPERLESS_OCR_DESKEW=true
#PAPERLESS_OCR_ROTATE_PAGES=true
#PAPERLESS_OCR_ROTATE_PAGES_THRESHOLD=12.0
#PAPERLESS_OCR_USER_ARGS={}
#PAPERLESS_CONVERT_MEMORY_LIMIT=0
PAPERLESS_CONVERT_TMPDIR=/var/lib/paperless/tmp
# Software tweaks
#PAPERLESS_TASK_WORKERS=1
#PAPERLESS_THREADS_PER_WORKER=1
#PAPERLESS_TIME_ZONE=UTC
#PAPERLESS_CONSUMER_POLLING=10
#PAPERLESS_CONSUMER_DELETE_DUPLICATES=false
#PAPERLESS_CONSUMER_RECURSIVE=false
#PAPERLESS_CONSUMER_IGNORE_PATTERNS=[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini"]
#PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=false
#PAPERLESS_CONSUMER_ENABLE_BARCODES=false
#PAPERLESS_CONSUMER_BARCODE_STRING=PATCHT
#PAPERLESS_PRE_CONSUME_SCRIPT=/path/to/an/arbitrary/script.sh
#PAPERLESS_POST_CONSUME_SCRIPT=/path/to/an/arbitrary/script.sh
#PAPERLESS_FILENAME_DATE_ORDER=YMD
#PAPERLESS_FILENAME_PARSE_TRANSFORMS=[]
#PAPERLESS_NUMBER_OF_SUGGESTED_DATES=5
#PAPERLESS_THUMBNAIL_FONT_NAME=
#PAPERLESS_IGNORE_DATES=
#PAPERLESS_ENABLE_UPDATE_CHECK=
# Tika settings
#PAPERLESS_TIKA_ENABLED=false
#PAPERLESS_TIKA_ENDPOINT=http://localhost:9998
#PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://localhost:3000
# Binaries
#PAPERLESS_CONVERT_BINARY=/usr/bin/convert
#PAPERLESS_GS_BINARY=/usr/bin/gs
# Uploads
PAPERLESS_SCRATCH_DIR=/var/lib/paperless/uploads
# Webserver
GUNICORN_CMD_ARGS='--bind=127.0.0.1:7998'
Next, create the paperless database (as defined in PAPERLESS_DBNAME). Also create a user paperless (as defined in PAPERLESS_DBUSER) in MariaDB and give it the password that you set in PAPERLESS_DBPASS. Do not forget to give this user access to the database.
# mariadb -u root -p
> CREATE DATABASE paperless;
> CREATE USER 'paperless'@'localhost' IDENTIFIED BY '<password>';
> GRANT ALL PRIVILEGES ON paperless.* TO 'paperless'@'localhost';
> FLUSH PRIVILEGES;
> quit
To list all databases:
SHOW DATABASES;
To list all users:
SELECT User FROM mysql.user;
Show user permissions:
SHOW GRANTS FOR 'user'@'localhost';
After the initial setup (and also after updates), run the database migrations:
# sudo -u paperless paperless-manage migrate
Create admin:
# sudo -u paperless paperless-manage createsuperuser
At this point the paperless server should be available at localhost:7998 (I have changed the default port!). Now let's set up nginx as a reverse proxy.
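A quick way to confirm that the server answers before putting nginx in front of it is to ask curl for the HTTP status code. This is a small sketch; the two function names are hypothetical helpers, and treating any 2xx or 3xx status as "up" is my own assumption (a redirect to the login page would also count).

```shell
# Print the HTTP status code returned by a URL (prints 000 if the connection fails).
http_status() {
    curl -sS -o /dev/null -w '%{http_code}' "$1"
}

# Treat any 2xx or 3xx status as "the server is answering".
status_is_reachable() {
    case "$1" in
        2??|3??) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (not run here):
#   status_is_reachable "$(http_status http://localhost:7998/)" && echo "paperless is up"
```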
nginx
For HTTPS, check out Self-signed certificates for local network.
The /etc/nginx/nginx.conf is not that interesting.
/etc/nginx/nginx.conf
user http http;
worker_processes 1;
#error_log logs/error.log;
#error_log logs/error.log notice;
#error_log logs/error.log info;
#pid logs/nginx.pid;
events {
    multi_accept on;
    worker_connections 1024;
}
http {
    # MIME
    include mime.types;
    default_type application/octet-stream;
    # logging
    #log_format main '$remote_addr - $remote_user [$time_local] "$request" '
    #                '$status $body_bytes_sent "$http_referer" '
    #                '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log warn;
    charset utf-8;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    server_tokens off;
    log_not_found off;
    types_hash_max_size 4096;
    client_max_body_size 100M;
    keepalive_timeout 65;
    # GZip
    gzip on;
    gzip_min_length 1000;
    gzip_proxied expired no-cache no-store private auth;
    gzip_types text/plain;
    gzip_types application/xml;
    gzip_types application/json;
    gzip_types application/javascript;
    gzip_types application/octet-stream;
    gzip_types text/css;
    # load configs
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}
However, the /etc/nginx/sites-available/paperless.conf might give you a headache.
I looked at ArchLinux wiki / nginx / TLS, tried the Mozilla SSL Configuration Generator, and I was still getting odd behavior. And when I finally got the login screen for paperless, I got the error CSRF verification failed after logging in.
Thankfully, with a bit of searching the internet, I found a simple solution: add proxy_set_header X-Forwarded-Proto https;.
/etc/nginx/sites-available/paperless.conf
server {
    # generated 2022-12-02, Mozilla Guideline v5.6, nginx 1.17.7, OpenSSL 1.1.1k, modern configuration, no HSTS, no OCSP
    # https://ssl-config.mozilla.org/#server=nginx&version=1.17.7&config=modern&openssl=1.1.1k&hsts=false&ocsp=false&guideline=5.6
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name paperless.ark;
    include private.conf;
    # SSL
    ssl_certificate /etc/ssl/private/ark.crt;
    ssl_certificate_key /etc/ssl/private/ark.key;
    ssl_session_timeout 1d;
    ssl_session_cache shared:MozSSL:10m; # about 40000 sessions
    ssl_session_tickets off;
    # modern configuration
    ssl_protocols TLSv1.3;
    ssl_prefer_server_ciphers off;
    location / {
        # Adjust host and port as required.
        proxy_pass http://localhost:7998/;
        proxy_set_header X-Forwarded-Proto https;
        # These configuration options are required for WebSockets to work.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Host $server_name;
    }
}
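Since nginx.conf only includes /etc/nginx/sites-enabled/*, the file in sites-available presumably still has to be linked there before nginx picks it up. The sketch below shows one way to do that; enable_site is a hypothetical helper, and the default directories are simply the ones used in this setup.

```shell
# Symlink a site config from sites-available into sites-enabled.
# $1 = file name, $2 = available dir (optional), $3 = enabled dir (optional)
enable_site() {
    local site="$1"
    local available_dir="${2:-/etc/nginx/sites-available}"
    local enabled_dir="${3:-/etc/nginx/sites-enabled}"

    # Refuse to create a dangling link if the config file does not exist.
    if [ ! -f "$available_dir/$site" ]; then
        echo "missing: $available_dir/$site" >&2
        return 1
    fi

    mkdir -p "$enabled_dir"
    ln -sfn "$available_dir/$site" "$enabled_dir/$site"
}

# Usage (as root, not run here): link the site, validate the config, reload nginx.
#   enable_site paperless.conf && nginx -t && systemctl reload nginx
```

Running nginx -t before the reload keeps a typo in the new file from taking down the running server.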
So now you should be able to reach paperless over HTTPS :D
You will likely see a certificate warning from your browser.
That is simply because you are using self-signed certificates. It is possible to fix this, but I am too lazy right now (maybe later).
Usage
I highly recommend reading the documentation / usage overview section.