Python & SQL for Data Science Applications

The thing about the world’s data is that it is not sitting in perfectly formatted static JSON files, ready to feed your D3 visualizations. Data does not queue itself up to train your machine learning models. Even with Hadoop and the multitude of distributed file systems out there, a huge amount of the data available to institutions is stored in fairly traditional SQL relational databases. As a data scientist trying to leverage the value of your organization’s data, a big chunk of the job is getting data out of these databases to support advanced analysis, AI, recommendation engines, and all the other stuff that is “the real data science”. Here I’ll discuss some helpful tips and tricks for using Python and SQL to query relational databases.

As a Python programmer, one of the most helpful things I have learned is how to effectively manage the mess of SQL needed for the applications I build. In this post, I explain how to write a custom class to manage your queries and database connections. I weigh when it makes sense to skip SQL parameterization (there are a lot of conditions that have to be met). I provide guidance on using Python 3’s named argument/dictionary-style string formatting to keep track of arguments, and I explain how to construct complex queries that use variables for column and table names.

Write a custom class to manage your SQL queries

I am a fierce believer that if you write any Python code that could ever be reused (even if only by you), it is worth taking an object-oriented view of the project and organizing your code into functions and classes. Establishing classes dramatically increases readability, maintainability, and reusability. Well-structured classes allow different developers to hop on and off projects without having to reinvent the project from scratch. This absolutely applies to managing SQL in your Python applications. It’s easy to manage SQL queries with classes so that you can flexibly and dynamically serve data to your application.

The Query Class

For example:

import psycopg2

class query():
    def __init__(self):
        """Set credential information here, pass it as arguments to the
        class, or read your credentials from environment variables.
        However you choose to handle credentials, assign them to these
        class variables."""
        self.DB_USER = '<username>'
        self.DB_PWD = '<password>'
        self.DB_HOST = '<host>'
        self.DB_NAME = '<database name>'
        self.cn = self.connect_to_db()

    def connect_to_db(self):
        """Creates a psycopg2 connection to the host database and
        returns a cursor."""
        self.conn = psycopg2.connect(
            host=self.DB_HOST,
            dbname=self.DB_NAME,
            user=self.DB_USER,
            password=self.DB_PWD
        )
        return self.conn.cursor()

    def read_db(self, query):
        """Passes a query string to the database and returns all results."""
        self.cn.execute(query)
        return self.cn.fetchall()

    def generate_query(self, args):
        """Constructs a query statement from args and executes the query."""
        query = '''select {select_clause}
                   from {schema}.{table}
                   where 1=1
                   {where_clause}
                '''.format(**args)
        return self.read_db(query)

db_query = query()
args = {'select_clause': 'first, last, age, address',
        'schema': 'university',
        'table': 'students',
        'where_clause': ' AND age > 30'}
results = db_query.generate_query(args)

Connect to the database

This simple class has three standard methods plus the special __init__ method. The __init__ method executes when you first initialize the class (which happens in the db_query = query() line near the end of the example). In the __init__ method I am using placeholders to show where you would pass your credential information. In a live application you should pass credential information from a credential management system or encrypted environment variables rather than hardcoding it.
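For instance, here is a minimal sketch of pulling credentials from environment variables, values you would then assign to self.DB_USER and friends in __init__ (the environment variable names are illustrative, not prescribed):

import os

# Read credentials from environment variables rather than hardcoding them.
# The variable names here are assumptions for illustration.
DB_USER = os.environ.get('DB_USER')
DB_PWD = os.environ.get('DB_PWD')
DB_HOST = os.environ.get('DB_HOST')
DB_NAME = os.environ.get('DB_NAME')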

After accepting credential information, the __init__ method calls the connect_to_db method, which uses the credentials you provided to establish a connection to the database via the appropriate Python database connector for your database (see the section below on the best Python database connectors).

The connect_to_db method returns a cursor object, which is assigned to the cn class attribute. This cursor maintains your session with the database and allows you to make multiple calls without resetting the connection.

The read_db method accepts a query string as an argument. It handles the process of passing the query to the database and returns an object containing the results of the query.
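For example, once the class is initialized you can pass any raw query string straight to read_db (the query below is just illustrative):

db_query = query()
rows = db_query.read_db('select first, last from university.students')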

Construct the query

The generate_query method uses string formatting to construct a properly formatted SQL query. The method takes a dictionary of values to be passed to the string as an argument, processes the string formatting, then passes that query to the read_db method and returns the results object. The next section dives further into using string formatting to write custom queries.

Calling the class

The last few lines of the example demonstrate how to interact with this class. db_query = query() instantiates the class, and the args dictionary is then passed to the generate_query method, which constructs your query using Python’s built-in named-argument/dictionary-style string formatting.
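For reference, with the args dictionary above, the query string that gets sent to the database renders roughly as follows (extra leading whitespace from the triple-quoted string is harmless to SQL and omitted here for readability):

select first, last, age, address
from university.students
where 1=1
 AND age > 30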

When can you skip SQL parameterization?

First, the rest of this article probably is not for you if you are building a public-facing application on top of any sort of sensitive data. Public-facing databases should use query parameterization, a topic into which I will delve another day.

This post is about writing queries for data science applications; it is not guidance for general web development. That means I make explicit assumptions that the data you are working with is appropriately locked down at the database and user level. Anyone with access to your application must already have approved access to all of the underlying data. Users accessing the application must be set up with individualized accounts: never use the same account for multiple users, and make sure that the access levels are correct for each user. Appropriately managed permissions prevent someone who gains unauthorized access to the application from reaching the underlying data. If your application does not meet these minimum criteria, you should only use parameterized queries and you can skip the rest of this post.

Okay, with that disclaimer out of the way, there are two reasons it can make sense to skip rigid parameterization and embrace string formatting. The first is that parameterization requires you to track variables in ordered lists, which makes code complicated to read and maintain very quickly. The second is that parameterization is structured so that you can only feed values to your query, not database names, schema names, table names, or clauses. This severely limits how flexible you can make your code.
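For contrast, a parameterized psycopg2 query looks roughly like the sketch below, reusing the cursor from the class example above; note that only the value (30) can be passed as a parameter, while the schema, table, and column names must be hardcoded:

# Only values can be parameterized; identifiers (schema, table, columns) cannot.
db_query.cn.execute(
    'select first, last, age, address from university.students where age > %s',
    (30,)
)
rows = db_query.cn.fetchall()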

Using Python 3 named arguments or dictionary style string formatting to build SQL queries

In version 3.0 and later, Python introduced new and tremendously helpful ways to manipulate strings that eliminate both of these issues (these features were later back-ported all the way to version 2.6). You can now pass named arguments and dictionaries to your strings via the .format() method.

Previously, if you wanted to use string formatting you had to specify the type for each variable you passed, and you had to keep track of your arguments by their positional index in the tuple of values you passed to the string. With the newer formatting, forced types have been defenestrated and order doesn’t matter.
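For reference, the old %-style formatting looked like this (illustrative):

# Each value needs a type code (%s, %d) and must arrive in exactly this order.
old = "The student's name is %s, age %d." % ('George', 32)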

Take the example:

args = {'name': 'George', 'age': 32}
string = "The student's name is {name}, age {age}.".format(**args)

Notice that we pass the string and the integer using identical syntax; there is no need to specify the variable type. You can also keep track of values by their keys in the dictionary passed to the string. Note that you must use the ** operator to unpack the dictionary, because string formatting actually takes named arguments; the ** operator converts the dictionary into named arguments.

Alternatively you can use named arguments explicitly:

string = "The student's name is {name}, age {age}.".format(name='George', age=32)

Together, these two features make string formatting straightforward, and they make it remarkably easy to customize SQL queries.

Construct complex queries with variables for column and table names

Take the skeleton query from the generate_query method in the class above:

query = '''select {select_clause}
           from {schema}.{table}
           where 1=1
           {where_clause}
        '''.format(**args)

and the dictionary we pass to it here:

args = {'select_clause': 'first, last, age, address',
        'schema': 'university',
        'table': 'students',
        'where_clause': ' AND age > 30'}

Above, I’ve set up what I like to think of as a skeleton query. This consists of the basic syntax of a correctly formatted SQL query plus placeholders for both the clauses and values that need to adapt to the parameters of the application. 

A couple of things about this are worth pointing out. First, because we aren’t using SQL parameterization (see the caveat above), we are not limited to passing values to known database structures. For example, we can pass an entire select clause, specifying on the fly which fields to include.
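For instance, if users of your application pick which fields they want to see, the select clause could be assembled on the fly (a hypothetical sketch):

# Build the select clause from whatever fields the user selected.
selected_fields = ['first', 'last', 'age']   # e.g., collected from a UI form
args['select_clause'] = ', '.join(selected_fields)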

Think about how easy this makes it to maintain the application if data structures are migrated on the backend: you just need to swap out the schema and table names where you set your default values.
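A minimal sketch of that idea, with the default names kept in one place (the names here are illustrative):

# If the data moves to a new schema or table, only these defaults change.
DEFAULTS = {'schema': 'university', 'table': 'students'}
args.update(DEFAULTS)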

The snippet where 1=1 might seem a bit odd in this example. It is a where condition that always evaluates to true. This is a helpful trick, as it allows you to pass either additional where conditions or an empty string and still get back a valid query. For example, if you were building an application where users can select multiple filters, you could pass those filters to the query just by appending additional conditions to the where clause.
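A hypothetical sketch of that pattern, where each selected filter appends another condition after the where 1=1:

# Each user-selected filter becomes another AND condition; selecting no
# filters yields an empty where_clause and the query still runs.
filters = ['age > 30', "address like '%Main%'"]
args['where_clause'] = ''.join(' AND ' + f for f in filters)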

I hope this post provided helpful principles for querying data with Python and SQL in your data science applications.   

Best Python database connectors for Postgresql, Microsoft SQL Server, MySQL and more

Figuring out which Python database connector package to use with which database can be tricky. There are many packages out there that work with each database, but some are much easier to use and better maintained than others. A generic search may turn up sub-par packages that are difficult to install and poorly documented.

From personal experience, these are the best Python packages for connecting to PostgreSQL, Microsoft SQL Server, and MySQL. Each of these packages installs via pip, works on Windows, Linux, and macOS, and is really well documented.

Use these packages when you’re connecting to your database. You will spend less time trying to resolve obscure installation and ODBC errors, and you will get querying faster.

Database                  Python connector package
PostgreSQL                psycopg2
Microsoft SQL Server      pyodbc
MySQL                     PyMySQL
MongoDB                   PyMongo
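As a quick illustration, connecting with PyMySQL looks roughly like this (host, credentials, and database name are placeholders):

import pymysql

# Minimal connection sketch; swap in your own connection details.
conn = pymysql.connect(host='<host>', user='<username>',
                       password='<password>', database='<database name>')
cursor = conn.cursor()
cursor.execute('select 1')
print(cursor.fetchone())
conn.close()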

Deploy Python applications with Docker, Makefile and Virtualenv

The process of taking code from development, to staging, to production is always a challenge. It seems to me that on every project I’ve worked on, and even on different iterations of the same project, that process has changed as teams have tried to identify the fastest and easiest way to ensure that code will run in any environment. Recently, consensus seems to have coalesced around Docker. Docker is an almost magical solution that lets you use prebuilt images of thousands of different software setups as base environments for your application. The learning curve is short, and once you’ve wrapped your head around it, it makes some of the most tedious elements of development (building and deploying environments) blazing fast and super reliable.

I’m primarily a Python developer. For years, my go-to for deploying Python code has been a virtual environment with pip to install packages and a Makefile to invoke the virtual environment and call the application. A lot of discussion around Docker suggests that virtual environments are unnecessary, that you can skip the effort (and MBs) of managing packages via a virtual environment and just install directly to the Docker container. For the most part this is true, but for a number of edge cases you’ll still need a virtual environment to install packages. Even setting those aside, for the sake of continuity with older versions of an application, maintaining your virtual environment setup often just makes sense.

So, if you find yourself compelled to adopt Docker but aren’t ready to give up your virtual environment, here is an easy way to integrate the two. I will demonstrate how to set up your Dockerfile to call a Makefile that builds your virtual environment, and how to invoke that virtual environment when you call your application from the Makefile.

FROM python:3.7.3-alpine3.9
RUN apk add build-base
RUN apk update && apk add postgresql-dev gcc python3-dev musl-dev
COPY requirements.txt /myapp/requirements.txt
COPY Makefile /myapp/Makefile
RUN python3 -m pip install virtualenv
RUN cd myapp && make venv
venv:
	if [ -d .py ] ; \
	then \
		echo "virtualenv already built, skipping…"; \
	else \
		python3 -m venv .py; \
		.py/bin/python3 -m pip install -r requirements.txt; \
	fi

myapp_run: venv
	.py/bin/python3 myapp.py
import numpy as np
from pandas import DataFrame, Series
import pandas as pd
print("Hello World")
python-dateutil==2.7.5
numpy
pandas==0.23.4
psycopg2==2.7.6.1
urllib3==1.24.1

When you build your image from the Dockerfile, you need to install the build-base package for Alpine, which includes the make utility. You can also install package dependencies directly in the Docker container, as we do here with postgresql-dev, gcc, python3-dev, and musl-dev for psycopg2. The Dockerfile then copies requirements.txt and the Makefile into your new Docker container; in a more real-life setup, you might clone the git repository for your application into the container at this step. The next instruction installs virtualenv in the Docker container, and the final instruction calls the Makefile to build the virtual environment.

The venv target of the Makefile checks whether a virtual environment already exists. If it does, the target skips the build; if not, it generates a new virtual environment in the .py directory and installs the packages from requirements.txt. Because myapp_run declares venv as a prerequisite, any target that depends on venv can count on the environment being in place.

Once you’ve built and launched the container, you can run make myapp_run to start your application with the packages installed in your virtual environment.

Hello World!

Greetings!

This is a whim, a bit of a resolution… I will be making an attempt to write about the issues that I run across as I do data science. I dabble in whatever tools will get the job done. My most reached-for tools in the past year or so have been Python, Pandas, Postgres, scikit-learn, Linux, Docker, AWS, and Flask. I’ll try to dip into issues on each of these. I intend to keep posts brief. I’ll try to focus on a particular problem and provide a solution.

Posts will only trickle out when I can get to them, so I imagine you’ll only find this blog if it describes exactly the problem you are trying to solve.

Adam