How to Make Interactive Maps With R and Leaflet

I love Python, but sadly the best way to work with GIS data is to use R and Leaflet. There's no one way to make an interactive map, but here are a few steps that I would use:

Pre-Processing:

  • Acquire the necessary shapefile: This step can often take longer than you think. Governments are frequently the source of the shapefile, since they're the ones who usually employ the surveyors and engineers needed to build the file. https://census2011.adrianfrith.com/place/798 (This is where I acquired the necessary files for my project on Johannesburg.)
    • Dealing with these files can often be intimidating because they are stored in extremely strange formats. For example, a .kml file is used by Google Earth but can be converted into the .shp file format.
    • The best way to work with any GIS-related files is with QGIS. Shapefiles are composed of many files, and all of them need to be present when you load one into QGIS. Strange, right? It's a good thing that you can export them into a single GeoJSON file.
  • Join your shapefile (or whichever file type you settle on) with a text file that shares the same unique identifier. The reason we're doing this is so that we can display the data associated with each geographical boundary or point (see the sketch after this list).
  • Here's another thing to consider while you prepare your map: are you going to display geographical boundaries (for example, voting precincts) or points on a map?
    • If you need geographical boundaries, that data most likely lives in a shapefile; if you need points on a map, you can usually import them from a .csv file with lat/long coordinates.
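
Here's a minimal sketch of that load-and-join step in R, using the sf package. The file names and the shared "id" column are hypothetical stand-ins for whatever your data actually uses:

library(sf)     # reads shapefiles and GeoJSON
library(dplyr)  # for left_join

# Load the boundaries exported from QGIS (hypothetical file name)
boundaries <- st_read("johannesburg_wards.geojson")

# Load the text file that shares the same unique identifier
attributes <- read.csv("census_data.csv")

# Join so each boundary carries the attribute data we want to display
city <- left_join(boundaries, attributes, by = "id")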

Leaflet:

Leaflet is a package for R. I'm not going to teach you how to use R in this post (maybe in a future one), but I will run you through some of the stuff I found difficult, and soon enough you'll be off making maps that look as great as the one below.

Download the full interactive map here. It's a Google Drive link; download the HTML file and then open it.

Most of the stuff you can teach yourself by copying the code on the R Leaflet site, but the part I found most difficult was the labeling system. Labels are important because they allow you to display attribute data for geographical boundaries: How many people live in a district? What's the poverty rate in a district?

Here's the full code for this map. Notice how I'm creating a list of display labels and using R subsetting to build a label object. In R, you can use 'city[[1]]' to access the first column of a dataframe. Then, to label the column, notice how we're using a '%s' to indicate where a string value gets substituted into the label, and how we're putting the word "City" before the '%s' to set the static text displayed in front of the dynamic string.
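
Here's a minimal sketch of that labeling pattern. It assumes the joined dataframe from the earlier sketch, named city, with district names in the first column and a hypothetical population column:

library(leaflet)
library(htmltools)

# Build one HTML label per row: the static text sits in front of each %s,
# and sprintf substitutes in the dynamic values
labels <- sprintf(
  "<strong>City: %s</strong><br/>Population: %s",
  city[[1]], city$population
) %>% lapply(HTML)

leaflet(city) %>%
  addTiles() %>%
  addPolygons(label = labels)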

I could spend pages spelling out, step by step, how to build a Leaflet map in R using QGIS, but I think I'll let you try this by yourself. The only way to truly get good at GIS is by trial and error, but I hope I've done enough to explain some of the more difficult concepts.

A Straightforward Guide to Extracting Data From APIs

One of the most useful features invented in the modern computer era is the API, or application programming interface. This may sound confusing or complicated, but it's actually not.

There are more complicated iterations of an API, but for the most part people use APIs to extract data from a database. If you think about it, most applications are just elaborate tools to access and send data. The Uber app allows you to send data about your current location and trip plan to a driver, who can then send you data on when they'll arrive to pick you up.

Dictionaries

Most of the time, applications will send you data in a format that will appear in the form of a dictionary. What's a dictionary? Well, perhaps it's easiest to start by showing you what one looks like:
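
# A simple dictionary (this example comes from w3schools)
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}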

Great, that's somewhat helpful. Imagine you're trying to connect to a website API that tracks cars. To extract the make of the car, you would type:

thisdict['brand']

…to extract “Ford.”

Now, most API calls are a lot more complicated than this and usually require two more important features to work properly:

  • A proper request command.
  • A JSON method to interpret the response data.

Let's use the API for Tiingo.com as an example of how to do this correctly. Tiingo is a financial services website that provides near-real-time trading data that can be accessed programmatically via an API.

import requests

# Tell the API we want JSON back
headers = {
    'Content-Type': 'application/json'
}

# Call Tiingo's end-of-day price endpoint for AAPL
requestResponse = requests.get("https://api.tiingo.com/tiingo/daily/aapl/prices?startDate=2019-01-02&token=7471448cb0d6ac0ee714d622d7cada65b28a552e", headers=headers)
print(requestResponse.json())

# https://api.tiingo.com/documentation/end-of-day

Let’s examine this code block we acquired from the Tiingo documentation website.

  • First, we can see that we have to 'import requests'. Requests is the name of a package that handles standard HTTP(S) requests for Python.
  • In this case we're modifying the headers to work with JSON. Not all APIs use a step like this. Consult your API documentation, but in this case the docs specify that we modify the header parameter when requesting data from the API.
  • The next step is to call the 'requests.get()' command to hit the API using the URL structure provided by the website.
  • Then you'll usually want to take the object where you store the response and run the .json() method on it to explicitly parse the data with a JSON structure in mind.
  • Python will then hand you your data in a dictionary format that you can subset and manipulate the same way you would a regular Python dictionary, as sketched below.
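
For example, here's one way you might subset the response. This is a hypothetical continuation of the code above; the 'close' field name follows Tiingo's end-of-day documentation, where the response is a list with one dictionary per trading day:

# Parse the response into a Python list of dictionaries
data = requestResponse.json()

# Each element is one trading day; grab the closing price of the first
print(data[0]['close'])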

SQL Shortcuts

SQL is a beautiful language that's highly useful for extracting large amounts of data from data warehouses.

There are a few cool shorthands you can use to make your experience go faster.

Abbreviating Table Names

/* here's how you would normally write a shortcut for 
a table name */

select *
from table as t

/* but here's how you can write it without the word as */

select *
from table t

/* seems a bit silly but if you're writing a lot of code
every abbreviation helps, plus it doesn't hamper readability,
 in my opinion */

Abbreviating Group By and Order By Columns

/* here's how you would normally write out a group by
statement */

select column_1, column_2, count(agg_col)
from table t
group by t.column_1, t.column_2

/* here's how you can speed up the process */

select column_1, column_2, count(agg_col)
from table t
group by 1, 2

/* it's not as readable, but I feel like it's
not too confusing */

Getting Rid of Multiple Or Statements

/* here's how you would normally write out multiple 
or statements */

select column_1, column_2
from table t
where column_1 = 'x' or column_1 = 'y'

/* instead you can just do this */

select column_1, column_2
from table t
where column_1 in ('x', 'y')

/* ok, so this one is more commonly known
but if you don't know this trick you can waste a lot of time
in my opinion */

Why Elections Are So Hard to Predict (Hint: It's Not the Math)

One of my favorite scenes from The West Wing is when Joey Lucas, their pollster, tells Josh not to let the White House get led around by bad poll numbers. Not because data isn't important or the math doesn't add up, but because people are often complicated liars who don't know what they want all the time.

How does this relate to data science?

Most voter targeting models use a binary classifier (with a bit of secret sauce running under the hood) to predict the probability that a person in the state's voter file will support or oppose the intended candidate.

If we wanted to generate some example data, this is what the first five rows of a voter file might look like, plus the support column as a predicted column, where 0 = doesn't support and 1 = supports, because the decision in an election usually comes down to a binary choice.
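
Here's a quick sketch of what that might look like; every name and value below is made up for illustration:

import pandas as pd

# Hypothetical voter file: every column except "support" lives in the file;
# "support" is the target we can only estimate by polling
voters = pd.DataFrame({
    "voter_id": [1001, 1002, 1003, 1004, 1005],
    "age":      [34, 67, 25, 52, 41],
    "gender":   ["F", "M", "F", "M", "F"],
    "party_id": ["D", "R", "I", "R", "D"],
    "support":  [1, 0, 1, 0, 1],  # 0 = doesn't support, 1 = supports
})
print(voters)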

Most predictive models for voting work by polling a sample of people in the voter file about their likely candidate preference, then using attributes like age, gender, party ID, and other x-factors to predict what other people who share similar characteristics with the polled sample might do.

There's a more complicated way to do this that generates a direct probability for each individual voter, but for the purposes of this post, let's keep this hypothetical restricted to a [0, 1]-style outcome, as in the sketch below.
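
As a toy illustration of the binary-classifier idea, continuing from the hypothetical voters dataframe above (logistic regression here is just a stand-in for whatever secret sauce a real campaign runs):

from sklearn.linear_model import LogisticRegression
import pandas as pd

# One-hot encode the categorical attributes for the toy model
X = pd.get_dummies(voters[["age", "gender", "party_id"]])
y = voters["support"]

model = LogisticRegression().fit(X, y)

# Predict a 0/1 support outcome for each voter in the file
print(model.predict(X))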

The important thing to remember is that the target support column:

  • Is something that we can only predict by polling people.
  • Something that is almost impossible to validate after the fact, because we have a secret ballot in this country.
  • Something that changes constantly throughout the course of the election and is therefore extremely prone to error.

Let's circle back to the West Wing episode where Lucas accuses the voters of lying. Now, lying is a bit of a stretch in my opinion, but there are certainly issues that come up when collecting opinions from the public.

If there's one common data adage that applies to predicting voter behavior, it's: "garbage in, garbage out." You can have as many MIT-trained data scientists working for your campaign as you want, but if the public is uncooperative, there's a limit to how useful your model can be.

The good news is that the problem may be fixable. Many believe that the increased volume of robocalls is what has discouraged people from picking up the phone for pollsters.

The volume of robocalls has skyrocketed in recent years, reaching an estimated 3.4 billion per month. Since public opinion polls typically appear as an unknown or unfamiliar number, they are easily mistaken for telemarketing appeals.

https://www.pewresearch.org/fact-tank/2019/02/27/response-rates-in-telephone-surveys-have-resumed-their-decline/

But, in light of the onslaught, Congress just passed a new law in 2019 that puts restrictions on robocalls. My hope is that people will be more responsive once the law takes full effect.