What’s the Best Way to Enforce Data Science Ethics?

The data science professional will likely wield more and more power as time goes on. Therefore it may be a good idea to start thinking through several ideas that could help safeguard our society from bad actors, disasters and honest mistakes with minimal disruption.

Here are a few ideas & things to consider when crafting policy.
(Disclosure: I am not a lawyer.)

Starting Out With a National Breach Notification Law

Policing the data science profession is an extremely complex proposition, so I suggest we start off with easier-to-digest problems that aren't likely to change any time soon and that we can tackle right now.

Currently, data scientists who collect large data sets with personally identifiable information should do their best to anonymize the data, but strengthening, streamlining and standardizing our breach notification laws at the federal level could push companies to police their own employees.

Currently, companies that house personal data can delay notifying their customers of a data breach for a very long time; in fact, the time allowed to disclose varies from state to state. This is extremely confusing, burdensome and terrible for all parties involved. I agree with a lot of what's been written on the subject.

It shouldn't be that hard to craft a law that requires companies to disclose data breaches in a timely manner, incentivizes them to care more about protecting personal data and creates a national standard that is easier for companies to comply with.

Make It Easier for People & Shareholders to Sue For Unethical Practices

Writing laws & rules to govern behavior can be tricky, and it's often tough to strike the right balance between what's safe and what's flexible enough to ensure that data scientists are still employable.

Perhaps one way to deal with the inherent awkwardness is to have Congress make certain unethical behavior punishable in civil court rather than making it a criminal offense or the business of a federal agency.

In the legal world there's a standard called "standing" that prevents one party from bringing suit against another without concrete damages. This is great because it generally keeps completely frivolous suits out of court and ensures the enforcement mechanism is used only when someone feels an actual injustice has been done to them.

Congress has a lot to say in how the civil courts work or don’t work in this country, sometimes for better and sometimes for worse.

Set Up a Professional Organization to Self-Police Members

Doctors have the American Medical Association, realtors have the National Association of Realtors and the financial services industry has the Financial Industry Regulatory Authority, also known as FINRA.

Why not create a professional licensing organization for data scientists?

Pros:

  • Enforcement of professionals by professionals, who have better knowledge of what's ethical and what's not.
  • Can change the rules more quickly than an official government enforcement body and can generally be more flexible.
  • Would burnish the reputation of a profession that relies so heavily on trust.

Cons:

  • Some view professional standards enforcement as weak tea.
  • Can't address criminal behavior directly, but can refer such cases to the authorities.
  • May not be interested in enforcement against influential members, though this probably applies to government enforcers as well.

How to Make Interactive Maps With R and Leaflet

I love Python, but sadly the best way to work with GIS data is to use R and Leaflet. There is no one way to make an interactive map, but here are a few steps that I would use:

Pre-Processing:

  • Acquire the necessary shapefile: This step can often take longer than you think. Governments are frequently the source of the shapefile since they're the ones who usually employ the surveyors and engineers needed to build the file. https://census2011.adrianfrith.com/place/798 (this is where I acquired the files for my project on Johannesburg).
    • Dealing with these files can often be intimidating because they are stored in unusual formats. For example, a .kml file is used by Google Earth but can be converted into the .shp format.
    • The best way to work with any GIS-related files is with QGIS.
    • Shapefiles consist of many files, and all of them need to be present when you load them into QGIS. Strange, right? It's a good thing that you can export them into a single GeoJSON file.
  • Join your shapefile (or whichever file type you settle on) with a text file that shares the same unique identifier. We do this so that we can display the data associated with each geographical boundary or point.
  • Here's another thing to consider while you prepare your map: are you going to display geographical boundaries (for example, voting precincts) or points on a map?
    • If you need geographical boundaries, that data most likely lives in a shapefile; if you need points on a map, you can usually import them from a .csv file with lat/long coordinates.

Leaflet:

Leaflet is a package for R. I'm not going to teach you how to use R in this post (maybe in a future one), but I will run you through some of the things I found difficult, and soon enough you'll be off making maps that look as great as the one linked below.

Download the full interactive map here. It's a Google Drive link; download the HTML file and then open it.

Most of the stuff you can teach yourself by copying the code on the R Leaflet site, but the part I found most difficult was the labeling system. Labels are important because they allow you to display attribute data for geographical boundaries: how many people live in a district, what's the poverty rate in a district?

Here's the full code to this map. Notice how I'm creating a list of display labels and using R subsetting to build a label object. In R, you can use 'city[[1]]' to access the first column of a dataframe. Then notice how we're using a '%s' to indicate where a string value gets inserted into the label, and how we're putting the word "City" before the '%s' to indicate what text we'd like to display before the dynamic string.

I could spend pages spelling out, step by step, how to build a Leaflet map in R using QGIS, but I think I'll let you try this by yourself. The only way to truly get good at GIS is by trial and error, but I hope I've done enough to explain some of the more difficult concepts.

A Straightforward Guide to Extracting Data From APIs

One of the most useful features invented in the modern computer era is the API, or application programming interface. This may sound confusing or complicated, but it's actually not.

There are more complicated iterations of an API, but for the most part people use APIs to extract data from a database. If you think about it, most applications are just elaborate tools to access and send data. The Uber app allows you to send data about your current location and trip plan to a driver, who can then send you data on when they'll arrive to pick you up.

Dictionaries

Most of the time, applications will send you data in a format that appears in the form of a dictionary. What's a dictionary? Perhaps it's easiest to start by showing you what one looks like:

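Here's a minimal reconstruction of the kind of example dictionary shown on W3Schools; the exact model and year below are just illustrative:

thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}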

Great, that's somewhat helpful. Imagine you're trying to connect to a website API that tracks cars; to extract the make of the car, you would type:

thisdict['brand']

…to extract “Ford.”

Now, most API calls are a lot more complicated than this and usually require two more important features to work properly:

  • A proper request command.
  • A JSON method to interpret the response data.

Let's use the API for Tiingo.com as an example of how to do this correctly. Tiingo is a financial services website that provides near-real-time trading data that can be accessed programmatically via an API.

import requests
headers = {
    'Content-Type': 'application/json'
}
requestResponse = requests.get("https://api.tiingo.com/tiingo/daily/aapl/prices?startDate=2019-01-02&token=7471448cb0d6ac0ee714d622d7cada65b28a552e", headers=headers)
print(requestResponse.json())

# https://api.tiingo.com/documentation/end-of-day

Let’s examine this code block we acquired from the Tiingo documentation website.

  • First, we can see that we have to "import requests." Requests is the name of a package that handles standard HTTP requests for Python.
  • In this case we're modifying the headers to work with JSON. Not all APIs use a step like this; consult your API's documentation. In this case, the documentation says to modify the header parameter to specify JSON when requesting data from the API.
  • The next step is to call the requests.get() command to hit the API using the URL structure provided by the website.
  • Then you'll usually want to take the object where you stored the response and run the .json() method to explicitly parse the data, with a JSON structure assumed.
  • Python will then hand you the data in a dictionary format that you can subset and manipulate the same way you would a regular Python dictionary, as in the sketch below.
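
To make that last point concrete, here's a short sketch of subsetting the response. Tiingo's end-of-day endpoint returns a list of dictionaries, one per trading day; the field names below ('date', 'close') follow the end-of-day documentation, but check them against your own response:

# requestResponse.json() is a list of dictionaries, one per trading day
prices = requestResponse.json()

first_day = prices[0]        # subset it like any Python list
print(first_day['date'])     # the trading date
print(first_day['close'])    # that day's closing price

# pull out just the closing prices with a list comprehension
closes = [day['close'] for day in prices]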

SQL Shortcuts

SQL is a beautiful language that’s highly useful for extracting large amounts of data from large data warehouses.

There are a few cool shorthands you can use to make your experience go faster.

Abbreviating Table Aliases

/* here's how you would normally write a shortcut for 
a table name */

select *
from table as t

/* but here's how you can write it without the word as */

select *
from table t

/* seems a bit silly but if you're writing a lot of code
every abbreviation helps, plus it doesn't hamper readability,
 in my opinion */

Abbreviating Group By and Order By Columns

/* here's how you would normally write out a group by
statement */

select column_1, column_2, count(agg_col)
from table t
group by t.column_1, t.column_2

/* here's how you can speed up the process */

select column_1, column_2, count(agg_col)
from table t
group by 1, 2

/* it's not as readable, but I feel like it's
not too confusing */

Getting Rid of Multiple Or Statements

/* here's how you would normally write out multiple 
or statements */

select column_1, column_2
from table t
where column_1 = 'x' or column_1 = 'y'

/* instead you can just do this */

select column_1, column_2
from table t
where column_1 in ('x', 'y')

/* ok, so this one is more commonly known
but if you don't know this trick you can waste a lot of time
in my opinion */

Why Elections Are So Hard to Predict (Hint: It's Not the Math)

One of my favorite scenes from The West Wing is when Joey Lucas, their pollster, tells Josh not to let the White House get led around by bad poll numbers. Not because data isn't important or the math doesn't add up, but because people are often complicated liars who don't know what they want all the time.

How does this relate to data science?

Most voter targeting models use a binary classifier (with a bit of secret sauce running under the hood) to predict the probability that a person in the state's voter file will support or oppose the intended candidate.

If we wanted to generate some example data, here is what the first five rows of a voter file might look like, plus a support column as the predicted column, where 0 = does not support and 1 = supports, because the choice in an election usually comes down to a binary outcome.
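
Here's a rough sketch, using pandas, of what that example data might look like. Every column name and value is invented purely for illustration:

import pandas as pd

# A made-up voter file sample: attribute columns plus a binary support
# column (0 = does not support the candidate, 1 = supports).
voters = pd.DataFrame({
    'voter_id': [101, 102, 103, 104, 105],
    'age':      [23, 45, 67, 34, 52],
    'gender':   ['F', 'M', 'F', 'F', 'M'],
    'party_id': ['D', 'R', 'I', 'D', 'R'],
    'support':  [1, 0, 0, 1, 0],
})

print(voters.head())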

Most predictive models for voting work by polling people in the voter file to establish their likely candidate preference, and then use attributes like age, gender, party ID and other x-factors to predict what other people, who share similar characteristics with the polled sample, might do.

There's a more complicated way to do this that generates a direct probability for each individual voter, but for the purposes of this post, let's keep this hypothetical restricted to a [0, 1]-style outcome.
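
To sketch that workflow, here's roughly what the simple version might look like, with scikit-learn's logistic regression standing in for the "secret sauce." It continues the made-up voters dataframe from the sketch above, and in a real campaign the polled sample would be thousands of people, not five rows:

from sklearn.linear_model import LogisticRegression
import pandas as pd

# One-hot encode the categorical attributes so the model can use them.
X = pd.get_dummies(voters[['age', 'gender', 'party_id']])
y = voters['support']

# Fit on the polled sample...
model = LogisticRegression()
model.fit(X, y)

# ...then score everyone in the voter file who shares these attributes.
# predict_proba returns the probability of support for each person.
support_probability = model.predict_proba(X)[:, 1]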

The important thing to remember is that the target support column:

  • Is something that we can only predict by polling people.
  • Something that is almost impossible to validate after the fact, because we have a secret ballot in this country.
  • Something that changes constantly throughout the course of the election and is therefore extremely prone to error.

Let's circle back to the West Wing episode where Lucas accuses the voters of lying. Now, lying is a bit of a stretch in my opinion, but there are certainly issues that come up when collecting opinions from the public.


If there's one common data adage that applies to predicting voter behavior, it's "garbage in, garbage out." You can have as many MIT-trained data scientists working for your campaign as you like, but if the public is uncooperative, there's a limit to how useful your model can be.

The good news is that not all hope is lost. Many believe that the main culprit is the increased volume of robocalls, which has discouraged people from picking up the phone for pollsters.

The volume of robocalls has skyrocketed in recent years, reaching an estimated 3.4 billion per month. Since public opinion polls typically appear as an unknown or unfamiliar number, they are easily mistaken for telemarketing appeals.

Source: Pew Research Center, "Response rates in telephone surveys have resumed their decline" (https://www.pewresearch.org/fact-tank/2019/02/27/response-rates-in-telephone-surveys-have-resumed-their-decline/)

But in light of the onslaught, Congress passed a new law in 2019 that puts restrictions on robocalls. My hope is that people will be more responsive to pollsters once the law takes full effect.