Note: Please do not merge your downloaded data back into this repository.

What is Docker?

Docker is an open-source tool for building, sharing, and running software. It’s a great way to manage environment and package dependences in a lightweight manner without having to download everything to your local machine.

Docker uses images to create isolated containers. Images contain the source code, libraries, dependencies, tools and other files that an application or script needs to run. Think of an image as a read-only template with instructions for opening and running a Docker container.

A container is a runnable instance of an image that you interact with. A Docker container can be seen as a smaller computer inside your computer. A helpful metaphor is to think of a container as a cake, while an image is the recipe to bake the cake. If the cake is good, you generally don’t want to change the recipe, and you want to reuse it to make multiple cakes in the future. You also might want to share that recipe with your friends to use in their own kitchens. Docker helps us manage the multiple cakes (containers) we’re baking (computers we’re running code in) and in different kitchens (operating systems).

The image below shows a concept map of the Docker lifecycle:

Docker concept map from DevOps for Data Science

Why do we need it?

Some of the data we use at the Equity Center is stored in PDFs. To extract data out of PDFs and get it into RStudio to clean and analyze, we use the tabulapdf package. Developed by the folks at ROpenSci, tabulapdf provides R bindings to the Tabula java library, which can be used to computationaly extract tables from PDF documents. The package requires Java to function, and unfortunately Java does not play nicely with the new Mac M1 chips which many of us use.

As a workaround to installing Java locally, we’ll use a Docker container with R and rJava installed to extract the table data. From there, we can export the raw data back into RStudio for cleaning and analysis.

Quickstart: the Dockerfile

Docker builds images by reading the instructions from a Dockerfile. A Dockerfile is a text file containing instructions for building your source code. You’ll need to navigate using the terminal, and general bash instructions are listed below.

Step 0: Install Docker

Before using the Equity Center Dockerfile, you need to have Docker installed on your computer. Download Docker Desktop here: https://docs.docker.com/get-started/get-docker/

Step 1: Clone the Git Repo

In your terminal, change the current working directory to the location where you want the cloned directory. For this example, I’m using my Desktop so that I can find it easily.

Then type or copy/paste the below:

git clone https://github.com/virginiaequitycenter/docker

And jump into this new directory.

Step 2: Build the Image

Still in the terminal, type:

docker build --tag "ec" .

(Make sure you remember the . at the end)

Step 3: Run the Image as a Container & Link a Volume

Because Docker containers are ephemeral, any data inside of the container will be lost when you close it. As a result we need to link a volume to the container so that you can save things locally from inside of it.

To run the container, type:

docker run --rm -ti -p 8787:8787 -v /Users/sct2td/Desktop/docker/data:/home/rstudio/data ec

But replace /Users/sct2td/Desktop/ with the location where you cloned this repo.

Step 4: Open the RStudio Image in a Browser

After you run the command above, you will see a temporary password printed in the terminal. This is your one-time login password for using RStudio in the browser. Each time you launch a new RStudio window you’ll need a new password.

Open up a browser, and in the URL field type:

localhost:8787

You will be sent to a login screen. The username is rstudio (all lowercase) and the password is the password you retrieved in Step 3.

Once you type in the user name and password, the familar RStudio IDE will be available in your browser.

Step 5: Load the R Libraries

In the browser-based RStudio IDE load the rJava and tabulapdf libraries by typing the code below. Note this will take a few minutes to download.

library(rJava)
install.packages("tabulapdf")
library(tabulapdf)

Step 6: Extract Data

Still in the browser-based RStudio IDE, use the instructions from the tabulapdf documentation or another PDF scraper, to extract the data you need. Don’t worry about tidying it too much, you’ll be able to do this step in your local RStudio instance. The goal here is to get the data out of the PDF and stored locally, not to tidy, analyze, or model it.

Step 7: Save It

Now that you have a data object, still in the browser-based IDE, save that in the docker/data folder. This is a temporary location that you will then access from your local RStudio instance. Please do not merge this data back into the original Equity Center docker repository.

Step 8: Access Locally

Navigate back to your local RStudio instance. You should now see your saved data object in the docker/data/ folder. Since this is still a temporary location, move this data to the directory you would like to save it in. For example, if I’m looking at eviction data from court records, and I extract a file called filings.csv I would then move that file out of the docker/data/ directory and into my evictions/data/ directory or RStudio Project.

Now that your data is extracted and saved locally, it’s ready to be tidied and analyzed.

Happy Dockering!

Instructions for Using the Equity Center Dockerfile

Samantha Toet