Datakeeper Service
Provides durable storage and fast access to job data and cached web pages, enabling efficient scraping and re-use of previously downloaded content.
This service is not open source, but its image is publicly available on DockerHub.
Configuration
The following environment variables are used to configure this service:
Name | Description |
---|---|
DB_CONNECTION_STRING | Required. MongoDB connection string that includes the database name |
CACHE_CONNECTION_STRING | Optional. Cache connection string. See Caching |
IDEALER_ORIGIN | Required. The origin of the Idealer service |
LICENSE_KEY | Required. A license key. See Plans |
MIN_LOG_LEVEL | Optional. Minimum log level. Default value is Information |
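As a rough pre-flight check before starting the container, the table above can be turned into a small validation script. This is only a sketch: the variable names come from the table, but the `load_settings` helper and its behavior are illustrative assumptions, not the service's actual startup logic.

```python
import os

# Names are taken from the configuration table; everything else here is a
# hypothetical helper for checking an environment before deployment.
REQUIRED = ("DB_CONNECTION_STRING", "IDEALER_ORIGIN", "LICENSE_KEY")
OPTIONAL_DEFAULTS = {
    "CACHE_CONNECTION_STRING": None,   # unset means pages are cached in the system DB
    "MIN_LOG_LEVEL": "Information",    # documented default
}


def load_settings(env=None):
    """Return a settings dict; raise ValueError if a required variable is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    for name, default in OPTIONAL_DEFAULTS.items():
        settings[name] = env.get(name) or default
    return settings
```

Calling `load_settings()` with no argument reads the current process environment; passing a plain dict makes the check easy to try without touching real variables.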
Caching
All downloaded pages are cached to reduce load on web resources. Cached pages can be scraped again for additional data without sending new requests to those resources.
If a web page is in the cache and the web resource supports cache control (sends ETag and/or Last-Modified headers in responses), all requests to this page's URL are sent with If-None-Match and/or If-Modified-Since headers, so the cached data is reused if it is still current.
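The revalidation flow described above follows standard HTTP conditional requests. The sketch below illustrates the idea only; the `cached` dictionary layout and both function names are hypothetical and do not reflect how the service stores or fetches pages internally.

```python
def conditional_headers(cached):
    """Build revalidation headers from validators stored with a cached page.

    `cached` is an assumed dict with optional "etag" and "last_modified"
    keys holding the values of the ETag and Last-Modified response headers.
    """
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def resolve_body(status, response_body, cached):
    """Reuse the cached body on 304 Not Modified; otherwise take the fresh one."""
    return cached["body"] if status == 304 else response_body
```

A server that still considers the page unchanged answers with status 304 and no body, so the cached copy is served; any other successful status carries a new body that replaces it.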
By default, web pages are cached in the system DB (MongoDB). This behavior can be changed by providing the service with a CACHE_CONNECTION_STRING. The following providers are supported:
- MongoDB
- S3 compatible services
MongoDB
To cache web pages on a MongoDB instance other than the one used as the system database, set CACHE_CONNECTION_STRING to a valid MongoDB connection string that includes the database name. For instance:
mongodb://<username>:<password>@localhost:27017/<database-name>
mongodb+srv://<username>:<password>@<cluster-name>.mongodb.net/<database-name>
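Because the database name is a required part of the connection string, it can be worth checking for it before deploying. The following is a minimal sketch using only the Python standard library; `mongo_database_name` is a hypothetical helper, not something the service provides.

```python
from urllib.parse import urlsplit


def mongo_database_name(connection_string):
    """Return the database name from a MongoDB connection string.

    Raises ValueError if the scheme is not MongoDB or no database is given.
    """
    parts = urlsplit(connection_string)
    if parts.scheme not in ("mongodb", "mongodb+srv"):
        raise ValueError("not a MongoDB connection string")
    name = parts.path.lstrip("/")
    if not name:
        raise ValueError("connection string has no database name")
    return name
```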
S3 compatible services
To cache web pages on an S3 compatible service, set CACHE_CONNECTION_STRING to a connection string in the following format:
s3://<key>:<secret>@<host>:<port(optional)>/<bucket>?ssl=(true|false)
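To make the parts of this format concrete, the sketch below splits such a string into its components with the Python standard library. It is an illustration under assumptions: the function name and returned keys are hypothetical, the key and secret are assumed to be percent-encoded when they contain reserved characters such as `/` or `@`, and `ssl` is returned as `None` when the parameter is absent because the default is not stated here.

```python
from urllib.parse import urlsplit, parse_qs, unquote


def parse_s3_connection_string(value):
    """Split s3://<key>:<secret>@<host>:<port>/<bucket>?ssl=... into a dict."""
    parts = urlsplit(value)
    if parts.scheme != "s3":
        raise ValueError("expected an s3:// connection string")
    bucket = parts.path.lstrip("/")
    if not (parts.username and parts.password and parts.hostname and bucket):
        raise ValueError("key, secret, host and bucket are all required")
    ssl_values = parse_qs(parts.query).get("ssl")
    return {
        "key": unquote(parts.username),
        "secret": unquote(parts.password),
        "host": parts.hostname,
        "port": parts.port,  # None when the optional port is omitted
        "bucket": bucket,
        # None means the parameter was not supplied (default is an unknown here)
        "ssl": None if ssl_values is None else ssl_values[0].lower() == "true",
    }
```

For example, `parse_s3_connection_string("s3://mykey:mysecret@minio.local:9000/pages?ssl=false")` yields host `minio.local`, port `9000`, bucket `pages`, and `ssl` set to `False`; the host and credentials in that string are made-up placeholders.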