Python & SQL for Data Science Applications


The thing about the world’s data is that it is not sitting in perfectly formatted static JSON files ready to feed your D3 visualizations. Data does not queue itself up to train your machine learning models. Even with Hadoop and the multitude of distributed file systems out there, a huge amount of the data available to institutions is stored in fairly traditional SQL relational databases. As a data scientist trying to leverage the value of your organization’s data, a big chunk of the job is getting data out of these databases in order to support advanced analysis, AI, recommendation engines, and all the other stuff that is “the real data science”. Here I’ll discuss some helpful tips and tricks for using Python and SQL to query relational databases.

As a Python programmer, one of the most helpful things I have learned is how to effectively manage the mess of SQL needed for the applications I build. In this post, I explain how to write a custom class to manage your queries and database connections. I weigh when it makes sense to skip SQL parameterization (there are a lot of conditions that have to be met). I provide guidance on using Python 3’s named argument/dictionary-style string formatting to keep track of arguments, and I explain how to construct complex queries that use variables for column and table names.

Write a custom class to manage your SQL queries

I am a fierce believer that if you write any Python code that could ever be reused (even if it’s just by you), it is worth it to take an object-oriented view of the project and organize your code into functions and classes. Establishing classes dramatically increases readability, maintainability and reusability. Well-structured classes allow different developers to hop on and off projects without having to re-invent the project from scratch. This absolutely applies to managing SQL in your Python applications. It’s easy to manage SQL queries with classes so that you can flexibly and dynamically serve data to your application.

The Query Class

For example:

import psycopg2


class Query:
    def __init__(self):
        """Set credential information here, pass the information as
        arguments to the class, or read your credentials from environment
        variables. However you choose to handle credentials, assign them
        to these class variables."""
        self.DB_USER = '<username>'
        self.DB_PWD = '<password>'
        self.DB_HOST = '<host>'
        self.DB_NAME = '<database name>'
        self.cn = self.connect_to_db()

    def connect_to_db(self):
        """Creates a psycopg2 connection to the host database and returns a cursor"""
        self.conn = psycopg2.connect(
            host=self.DB_HOST,
            dbname=self.DB_NAME,
            user=self.DB_USER,
            password=self.DB_PWD
        )
        return self.conn.cursor()

    def read_db(self, query):
        """Passes the query to the database and returns all results"""
        self.cn.execute(query)
        return self.cn.fetchall()

    def generate_query(self, args):
        """Constructs the query statement from args and executes the query"""
        query = '''select {select_clause}
                   from {schema}.{table}
                   where 1=1
                   {where_clause}
                '''.format(**args)
        return self.read_db(query)


db_query = Query()
args = {'select_clause': 'first, last, age, address',
        'schema': 'university',
        'table': 'students',
        'where_clause': ' AND age > 30'}
results = db_query.generate_query(args)

Connect to the database

This simple class has three standard methods plus the special __init__ method. The __init__ method executes when you first instantiate the class (which happens at the bottom of the example, where db_query is created). In the __init__ method I am using placeholders to show where you would pass your credential information. In a live application you should pass credential information from a credential management system or encrypted variables rather than hardcoding that information.
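
As a minimal sketch, reading credentials from environment variables might look like the snippet below (assuming you have exported DB_USER, DB_PWD, DB_HOST, and DB_NAME in your environment; the variable names are just illustrative). You would then assign these values to the class variables in __init__ instead of the placeholder strings.

import os

# Pull credentials from the environment instead of hardcoding them.
# os.environ.get returns None if the variable is not set.
DB_USER = os.environ.get('DB_USER')
DB_PWD = os.environ.get('DB_PWD')
DB_HOST = os.environ.get('DB_HOST')
DB_NAME = os.environ.get('DB_NAME')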

After accepting credential information, the __init__ method calls the connect_to_db method, which uses the credentials you provided to establish a connection to the database with the appropriate Python database connector for your database (check out this post for help picking the best Python database connector for your database).

The cursor object maintains the connection to the database. Here, the cursor object is assigned to the class attribute cn. This allows you to make multiple calls to the database without resetting the connection.
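
For example, once the class is instantiated you can issue several queries over the same connection (the table and column names below are just illustrative):

# Reuse the same connection and cursor for multiple queries
db_query = Query()
students = db_query.read_db('select first, last from university.students')
courses = db_query.read_db('select title from university.courses')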

The read_db method accepts a query string as an argument. It handles the process of passing the query to the database and returns an object containing the results of the query.

Construct the query

The generate_query method uses string formatting to construct a properly formatted SQL query. The method takes a dictionary of values to be passed to the string as an argument, processes the string formatting, then passes that query to the read_db method and returns the results object. The next section dives further into using string formatting to write custom queries.

Calling the class

The last few lines of the example demonstrate how to interact with this class. First we instantiate the class, then build the dictionary of arguments that we pass to the generate_query method. This constructs your query using Python’s built-in named argument or dictionary-style string formatting.

When can you skip SQL parameterization?

First, the rest of this article probably is not for you if you are building a public-facing application on top of any sort of sensitive data. Public-facing databases should use query parameterization, a topic into which I will delve another day.

This post is about writing queries for data science applications. It is not guidance for general web development. That means I make explicit assumptions that the data you are working with is appropriately locked down at the database and user level. Anyone with access to your application must already have approved access to all of the underlying data. Users accessing the application must be set up with individualized accounts; never use the same account for multiple users, and make sure that the access levels are correct for each user. Appropriately managed permissions ensure that even if someone gains unauthorized access to the application, the underlying data remains protected. If your application does not meet these minimum criteria, you should only use parameterized queries and you can skip the rest of this post.

Okay, with that disclaimer out of the way, there are two reasons it makes sense to skip rigid parameterization and embrace string formatting. The first is that parameterization requires you to track variables in ordered lists, which makes code complicated to read and maintain very quickly. The second is that parameterization is structured so that you can only feed values to your query, not database names, schema names, table names or clauses. This severely limits how flexible you can make your code.
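
For contrast, here is a rough sketch of what a parameterized query looks like with psycopg2 (assuming db_query is an instance of the class above, so db_query.cn is a psycopg2 cursor). Note that the %s placeholders can only stand in for values; the schema, table, and column names have to be hardcoded into the string.

# Parameterized query: the driver substitutes only the values,
# passed as an ordered tuple; identifiers cannot be parameterized.
db_query.cn.execute(
    'select first, last, age from university.students where age > %s',
    (30,)
)
rows = db_query.cn.fetchall()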

Using Python 3 named arguments or dictionary-style string formatting to build SQL queries

In version 3.0 and later, Python introduced new and tremendously helpful ways to manipulate strings that eliminate both of these issues (these features were later back-ported all the way to version 2.6). You are now able to pass named arguments and dictionaries to your strings via the .format() method.

Previously, if you wanted to use string formatting you had to specify the type for each variable you passed. You also had to keep track of your arguments by their positional index in the tuple of values you passed to the string. With Python 3, forced types have been defenestrated and order doesn’t matter.
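
For comparison, the older %-style formatting makes you choose a type code for each value and keep the tuple in exactly the right order:

# Old positional style: type codes and an ordered tuple
string = "The student's name is %s, age %d." % ('George', 32)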

Take the example:

args = {'name': 'George', 'age': 32}
string = "The student's name is {name}, age {age}.".format(**args)

Notice that we pass strings and integers using identical syntax; there is no need to specify the variable type. You can also keep track of values by their keys in the dictionary passed to the string. Note that you must use the ** operator to unpack the dictionary, because .format() actually takes named arguments; the ** operator converts the dictionary into named arguments.

Alternatively you can use named arguments explicitly:

string = "The student's name is {name}, age {age}.".format(name='George', age=32)

These two features make string formatting simple to work with, and they make it unreasonably easy to customize SQL queries.

Construct complex queries with variable column and table names

Take the query template used in the generate_query method of the class above:

query = '''select {select_clause}
           from {schema}.{table}
           where 1=1
           {where_clause}
        '''.format(**args)

and the dictionary we pass to it here:

args = {'select_clause': 'first, last, age, address',
        'schema': 'university',
        'table': 'students',
        'where_clause': ' AND age > 30'}

Above, I’ve set up what I like to think of as a skeleton query. This consists of the basic syntax of a correctly formatted SQL query plus placeholders for both the clauses and values that need to adapt to the parameters of the application. 

A couple of things about this to point out. First, because we aren’t using SQL parameterization (see the caveat above), we are not limited to passing values to known database structures. For example, we can pass an entire select clause, specifying on the fly which fields to include.

Think about how easy this makes it to maintain the application if data structures are migrated on the backend: you just need to swap out the schema and table names where you set your default values.
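
One way to sketch this idea (the helper function and names below are hypothetical, not part of the class above) is to keep the structural pieces in a single defaults dictionary and merge the per-request pieces into it:

# Structural defaults live in one place; if the table is migrated,
# only this dictionary needs to change.
DEFAULT_ARGS = {'schema': 'university', 'table': 'students'}

def build_args(select_clause, where_clause=''):
    """Merge per-request clauses with the structural defaults."""
    args = dict(DEFAULT_ARGS)
    args.update({'select_clause': select_clause, 'where_clause': where_clause})
    return args

results = db_query.generate_query(build_args('first, last', ' AND age > 30'))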

The snippet where 1=1 might seem a bit odd in this example. It is a where condition that always evaluates to true. This is a helpful trick, as it allows you to pass either additional where conditions or an empty string and still get back the results you intended. For example, if you were building an application where users can select multiple filters, you could pass those filters to the query just by adding additional conditions to the where clause, as sketched below.
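
Here is a rough sketch of that approach (the filter names and conditions are hypothetical, and db_query is the instance created in the earlier example):

# Build the where clause from whatever filters the user selected;
# with no filters it is an empty string and 1=1 still yields a valid query.
user_filters = {'age': '> 30', 'address': "like '%Main St%'"}
where_clause = ''.join(
    ' AND {} {}'.format(column, condition)
    for column, condition in user_filters.items()
)

args = {'select_clause': 'first, last, age, address',
        'schema': 'university',
        'table': 'students',
        'where_clause': where_clause}
results = db_query.generate_query(args)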

I hope this post provided helpful principles for querying data with Python and SQL in your data science applications.