Skip to main content

Docker images

To enable easy deployment of NewsFetch, we have created Docker images for the various components of NewsFetch.

NewsFetch Common Crawl

The NewsFetch Common Crawl image is available on Docker Hub at newsfetch/newsfetch-common-crawl.

Usage

To use the NewsFetch Common Crawl image, you will need to have Docker installed on your system. Docker can be downloaded from here.

Now you can pull the image using the following command:

docker pull newsfetch/newsfetch-common-crawl

Fetch the latest Common Crawl data

It is assumed that there is a directory named commoncrawl-data in the current directory. This directoy will be used to store the CommonCrawl data.

First use the docker image to download the latest CommonCrawl data.

docker run -e COMMON_CRAWL_DATA_DIR=/data -v $(pwd)/commoncrawl-data:/data -it --name newsfetch-download-warc newsfetch/newsfetch-common-crawl sh ./get_latest_warc.sh

This will download the latest WARC file to the commoncrawl-data directory.

Make a note of the name of the WARC file that was downloaded. Let us say the name was CC-NEWS-20220915230049-00936.warc.gz.

Extract the data from the WARC file

Now use the image to extract the news articles from the WARC file.

Be sure to map the volumes correctly.

In the following example, The commoncrawl-data directory is mapped to /data in the container. The WARC file name is provided in reference to this volume name. It will be /data/CC-NEWS-20220915230049-00936.warc.gz

docker run -e COMMON_CRAWL_DATA_DIR=/data -v $(pwd)/commoncrawl-data:/data -it --name newsfetch-extract-warc newsfetch/newsfetch-common-crawl sh ./extract_warc.sh /data/CC-NEWS-20220915230049-00936.warc.gz

Process extracted data with NewsFetch

Finally, process the extracted news articles.

docker run -e COMMON_CRAWL_DATA_DIR=/data -v $(pwd)/commoncrawl-data:/data -it --name newsfetch-process-warc newsfetch/newsfetch-common-crawl sh ./process_extracted_warc_files.sh /data/CC-NEWS-20220915230049-00936.warc.gz

NewsFetch Sample API

The NewsFetch sample API image is available on Docker Hub at newsfetch/newsfetch-api.

Usage

Now you can pull the image using the following command:

docker pull newsfetch/newsfetch-api

Run the API

Run the API using the following command:

docker run -p 8000:8000 -it --name newsfetch-api newsfetch/newsfetch-api