Note: Please do not merge your downloaded data back into this repository.
Docker is an open-source tool for building, sharing, and running software. It’s a great way to manage environment and package dependencies in a lightweight manner without having to download everything to your local machine.
Docker uses images to create isolated containers. Images contain the source code, libraries, dependencies, tools and other files that an application or script needs to run. Think of an image as a read-only template with instructions for opening and running a Docker container.
A container is a runnable instance of an image that you interact with. A Docker container can be seen as a smaller computer inside your computer. A helpful metaphor is to think of a container as a cake, while an image is the recipe to bake the cake. If the cake is good, you generally don’t want to change the recipe, and you want to reuse it to make multiple cakes in the future. You also might want to share that recipe with your friends to use in their own kitchens. Docker helps us manage the multiple cakes (containers) we’re baking and the different kitchens (operating systems) we’re baking them in.
The image below shows a concept map of the Docker lifecycle:
Some of the data we use at the Equity Center is stored in PDFs. To
extract data out of PDFs and get it into RStudio to clean and analyze,
we use the tabulapdf package. Developed by the
folks at ROpenSci, tabulapdf provides R bindings to the
Tabula Java library, which can be used to computationally extract tables
from PDF documents. The package requires Java to function, and
unfortunately Java does not play nicely with the new Mac M1 chips which
many of us use.
As a workaround to installing Java locally, we’ll use a Docker container with R and rJava installed to extract the table data. From there, we can export the raw data back into RStudio for cleaning and analysis.
Docker builds images by reading the instructions from a Dockerfile. A Dockerfile is a text file containing instructions for building your source code. You’ll need to navigate using the terminal, and general bash instructions are listed below.
Before using the Equity Center Dockerfile, you need to have Docker installed on your computer. Download Docker Desktop here: https://docs.docker.com/get-started/get-docker/
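Before going further, it can help to confirm that the tools this walkthrough relies on are available in your terminal. The sketch below is one way to check. The name check_tool is a hypothetical helper chosen for illustration, and it only reports whether a command is on your PATH, not whether Docker Desktop is actually running.

```shell
# Report whether a command-line tool is available on the PATH.
# check_tool is a hypothetical helper name, not part of Docker or Git.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: not found"
    return 1
  fi
}

# The steps that follow use both of these tools.
check_tool docker || echo "Install Docker Desktop before continuing."
check_tool git || echo "Install Git before cloning the repository."
```

If either tool is reported as not found, install it and open a new terminal window before continuing.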
In your terminal, change the current working directory to the location where you want the cloned directory. For this example, I’m using my Desktop so that I can find it easily.
Then type or copy/paste the below:
git clone https://github.com/virginiaequitycenter/docker
Then move into the new docker directory that the clone created (cd docker).
Still in the terminal, type:
docker build --tag "ec" .
(Don’t forget the . at the end; it tells Docker to use the current directory as the build context.)
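A build started from the wrong directory fails because Docker cannot find the Dockerfile there. As a rough safeguard, the sketch below checks for one before printing the build command. It assumes the file is named Dockerfile and sits at the top level of the cloned directory; build_command is a hypothetical helper name, and it prints the command rather than running it.

```shell
# Print the build command if the given directory contains a Dockerfile.
# build_command is a hypothetical helper; it only prints, it does not build.
build_command() {
  context_dir="${1:-.}"   # directory to build from; defaults to the current one
  if [ -f "$context_dir/Dockerfile" ]; then
    echo "docker build --tag \"ec\" \"$context_dir\""
  else
    echo "No Dockerfile found in $context_dir" >&2
    return 1
  fi
}

# Example: check the current directory before building.
build_command . || echo "Change into the cloned directory and try again."
```

If the command is printed, you are in the right place and can run it as shown.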
Because Docker containers are ephemeral, any data inside the container will be lost when you close it. As a result, we need to link a volume to the container so that you can save files to your local machine from inside it.
To run the container, type:
docker run --rm -ti -p 8787:8787 -v /Users/sct2td/Desktop/docker/data:/home/rstudio/data ec
Replace /Users/sct2td/Desktop/ with the path to the location
where you cloned this repo.
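One way to avoid typing the host path by hand is to build it from the current directory. The sketch below assumes you run it from inside the cloned docker directory; run_command is a hypothetical helper name, and it only prints the command so you can inspect it before running it yourself.

```shell
# Print a docker run command whose volume path is built from the clone location.
# run_command is a hypothetical helper; it prints the command rather than running it.
run_command() {
  repo_dir="$1"   # absolute path to the cloned docker directory
  # -p 8787:8787 publishes the container's port 8787 on the same host port.
  # -v host_path:container_path links the local data folder into the container.
  echo "docker run --rm -ti -p 8787:8787 -v \"$repo_dir/data:/home/rstudio/data\" ec"
}

# From inside the cloned directory, "$(pwd)" expands to its absolute path.
run_command "$(pwd)"
```

The printed path is wrapped in quotes so that a clone location containing spaces is still passed to Docker as a single argument.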
After you run the command above, you will see a temporary password printed in the terminal. This is your one-time login password for using RStudio in the browser. Each time you launch a new RStudio window you’ll need a new password.
Open up a browser, and in the URL field type:
localhost:8787
You will be sent to a login screen. The username is rstudio (all lowercase) and the password is the temporary password printed in the terminal when you ran the container.
Once you type in the username and password, the familiar RStudio IDE will be available in your browser.
In the browser-based RStudio IDE, install tabulapdf and load the rJava and
tabulapdf libraries by running the code below. Note that the installation
will take a few minutes.
library(rJava)
install.packages("tabulapdf")
library(tabulapdf)
Still in the browser-based RStudio IDE, use the instructions from the
tabulapdf
documentation, or another PDF scraper, to extract the data you need.
Don’t worry about tidying it too much; you can do that in
your local RStudio instance. The goal here is to get the data out of the
PDF and stored locally, not to tidy, analyze, or model it.
Now that you have a data object, still in the browser-based IDE, save
it in the docker/data folder, which appears inside the container as
/home/rstudio/data. This is a temporary
location that you will then access from your local RStudio instance.
Please do not merge this data back into the original Equity Center
docker repository.
Navigate back to your local RStudio instance. You should now see your
saved data object in the docker/data/ folder. Since this is
still a temporary location, move this data to the directory you would
like to save it in. For example, if I’m looking at eviction data from
court records and I extract a file called filings.csv, I
would then move that file out of the docker/data/ directory
and into my evictions/data/ directory or RStudio
Project.
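If you prefer to do the move from the terminal, it might look like the sketch below. The name move_extracted is a hypothetical helper, and the paths in the usage comment simply reuse the filings.csv example above; adjust them to your own project layout.

```shell
# Move an extracted file out of the temporary docker/data folder.
# move_extracted is a hypothetical helper name chosen for illustration.
move_extracted() {
  src="$1"        # file to move, e.g. docker/data/filings.csv
  dest_dir="$2"   # destination folder, e.g. evictions/data
  if [ ! -f "$src" ]; then
    echo "No such file: $src" >&2
    return 1
  fi
  mkdir -p "$dest_dir"     # create the destination folder if it is missing
  mv "$src" "$dest_dir"/
}

# Usage (run from the folder that contains both directories):
#   move_extracted docker/data/filings.csv evictions/data
```

Because the file is moved rather than copied, nothing is left behind in docker/data, which helps keep extracted data out of the Equity Center docker repository.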
Now that your data is extracted and saved locally, it’s ready to be tidied and analyzed.
Happy Dockering!