Datakeeper Service

Provides durable storage and fast access to job data and cached web pages, enabling efficient scraping and re-use of previously downloaded content.

This service is not open source, but its image is publicly available on Docker Hub.

Configuration

The following environment variables are used to configure this service:

Name                      Description
DB_CONNECTION_STRING      Required. MongoDB connection string with database name.
CACHE_CONNECTION_STRING   Optional. Cache connection string. See Caching.
IDEALER_ORIGIN            Required. The origin of the Idealer service.
LICENSE_KEY               Required. A license key. See Plans.
MIN_LOG_LEVEL             Optional. Minimum log level. Default value is Information.
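
As a rough illustration of the variables listed above, the following Python sketch checks that the required ones are set before the container is started. The helper name load_datakeeper_env and the validation rules are assumptions made for this example; how the service itself validates its environment at startup is not described here.

```python
import os

# Variable names and the MIN_LOG_LEVEL default come from the table above.
REQUIRED = ("DB_CONNECTION_STRING", "IDEALER_ORIGIN", "LICENSE_KEY")
OPTIONAL_DEFAULTS = {"CACHE_CONNECTION_STRING": None, "MIN_LOG_LEVEL": "Information"}


def load_datakeeper_env(env=None):
    """Return the Datakeeper settings found in env; raise if a required one is missing.

    Hypothetical pre-flight helper, not part of the service.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing required variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    for name, default in OPTIONAL_DEFAULTS.items():
        # Empty or unset optional variables fall back to the documented default.
        settings[name] = env.get(name) or default
    return settings
```

Calling load_datakeeper_env() with no argument reads from the current process environment, so it can be run just before launching the container to catch a missing LICENSE_KEY or connection string early.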

Caching

All downloaded pages are cached to reduce load on the target websites. Cached pages can be scraped again for additional data without sending further requests to those sites.
If a web page is in the cache and the web resource supports cache validation (it sends ETag and/or Last-Modified headers in its responses), every request for that page URL is sent with If-None-Match and/or If-Modified-Since headers, so the cached copy is reused as long as it is still current.
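
The revalidation flow described above can be sketched in a few lines of Python. This is only an illustration of standard HTTP conditional requests, not the service's internal implementation: the shape of the cache entry (a dict with etag, last_modified, and body keys) and both function names are hypothetical.

```python
def conditional_headers(cached):
    """Build revalidation headers from the validators stored with a cached page."""
    headers = {}
    if cached.get("etag"):
        # ETag from the earlier response is echoed back in If-None-Match.
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        # Last-Modified from the earlier response is echoed back in If-Modified-Since.
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def select_body(status, cached, fresh_body):
    """Reuse the cached body on 304 Not Modified; otherwise use the new response body."""
    return cached["body"] if status == 304 else fresh_body
```

When the web resource answers with 304 Not Modified, no page body is transferred and the cached copy is used; any other successful response replaces it.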

By default, web pages are cached in the system DB (MongoDB). This behavior can be changed by providing the service with a CACHE_CONNECTION_STRING. The following providers are supported:

MongoDB

To cache web pages on a MongoDB instance other than the one used as the system database, initialize CACHE_CONNECTION_STRING with a valid MongoDB connection string with database name. For instance:
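
A value in the standard MongoDB URI form, with the database name as the path segment, might look like the one in the sketch below. The host cache-mongo, the port, and the database name webcache are hypothetical placeholders, and the helper has_database_name is an assumption added for illustration; it only checks that a database name is present.

```python
from urllib.parse import urlsplit

# Hypothetical value: host, port, and database name are placeholders.
EXAMPLE_CACHE_CONNECTION_STRING = "mongodb://cache-mongo:27017/webcache"


def has_database_name(connection_string):
    """Return True if the string is a MongoDB URI with a non-empty database path."""
    parts = urlsplit(connection_string)
    # The path is "/<database>", so anything longer than "/" names a database.
    return parts.scheme in ("mongodb", "mongodb+srv") and len(parts.path) > 1
```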

S3 compatible services

To cache web pages on an S3 compatible service, initialize CACHE_CONNECTION_STRING with a connection string of the following format:
