Lesson 34. Efficiently Write Large Database Query Results To Disk

What Gets Uploaded, Must Get Downloaded

Data in a database is pointless unless someone drags it back out to do analysis with it. Most often, the dragging is simply a query to a screen, or maybe a pull of data into a spreadsheet. The spreadsheet, of course, represents a tiny subset of the total historical data assets.

Sometimes, however, you exchange data with a third party, and that exchange takes the form of a physical file you have to produce. If you are lucky, it is some sort of daily file that only reports information from a 24-hour period.

If you are not lucky, you might have to produce a huge file. It may even only be from a 24-hour period, but so much stuff happens in 24 hours that the file is a monster.

Ok, so now what? You know Pandas has limitations, so you cannot just pull all of the data into a data frame. Sure, you could mess around with the chunksize parameter, but there is a better way.
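For the record, the chunksize route does work; it just makes you build a pile of little data frames you never actually need. Here is a rough sketch of what it looks like, reusing the ETL DSN and table from the script below, with an arbitrary chunk size of 50,000 rows and a stand-in output file name:

import pandas as pd
import pyodbc as db

conn = db.connect('DSN=ETL;')

# Pull the result set in 50,000-row chunks instead of one giant data frame.
first_chunk = True
for chunk in pd.read_sql('SELECT * FROM EurostatData', conn, chunksize=50000):
    # Write the header with the first chunk, then append the rest.
    chunk.to_csv('EurostatDataOut.csv', sep='|', index=False,
                 mode='w' if first_chunk else 'a', header=first_chunk)
    first_chunk = False

conn.close()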

Instead of using Pandas, we are going to skip the data frame and write data directly to disk one line at a time.

Script Performance

Writing large amounts of data to disk in CSV format will eat some resources, but it is not nearly as bad as storing it in a database. When you store stuff in a database, all kinds of other information gets written to disk at the same time. Here we are just dropping the pure 1s and 0s of the file with no logging, shuffling, or metadata.

This script completed on my box in about 10 minutes. The really fun thing is, if you go look, the file we kicked out is much smaller than the file we imported in Lesson 33.
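If you want to verify that for yourself, a quick size check will do it. Here is a minimal sketch; the input file name is a placeholder for whatever you imported in Lesson 33, so adjust both paths to match your setup:

import os

# Placeholder paths: point these at your Lesson 33 input file and this lesson's output file.
source_path = 'data\\LoadLargeFilesExample\\EurostatDataIn.csv'
target_path = 'data\\WriteLargeFilesExample\\EurostatDataOut.csv'

print('Imported file: {:,} bytes'.format(os.path.getsize(source_path)))
print('Exported file: {:,} bytes'.format(os.path.getsize(target_path)))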

Examples

Example #1: Let’s Pack ‘Em Up And Move ‘Em Out!

We are going to drag back out the data that we just got through loading in the last lesson.

Since the data is currently sitting in a stage table, we are going to query the view. That way, we do not wind up with audit columns in our file.

If this were a real-world ETL process, we would have cleansed the data right after we loaded it in Lesson 33. As it turns out, the data got loaded with some hitchhikers: the carriage return and line feed characters came along for the ride in the last column, so they need to be removed.

Once that is done, we can proceed with creating the file.

Most of the code here should be familiar. The new bits involve the csv module, which has functions that allow us to create CSV files in the format that most of the world expects.

--remove carriage return and line feed characters
UPDATE euro.EurostatData
SET SUP_QUANTITY = REPLACE(REPLACE(SUP_QUANTITY,CHAR(13),''),CHAR(10),'')
WHERE 1 = 1
AND Cleansed = 0
AND ErrorRecord = 0
AND Processed = 0
import pyodbc as db
import os
import time
import csv

if 'script_dir' not in globals():
    script_dir = os.getcwd()
data_directory = 'data\\'
example_directory = 'WriteLargeFilesExample\\'
file_name = 'EurostatDataOut.csv'
target_path = os.path.join(script_dir,data_directory,example_directory,file_name)

# Query the view so the audit columns stay out of the file.
sql = 'SELECT * FROM EurostatData'

# Set up the connection.
print('Connecting to SQL Server database' + time.strftime(' %H:%M:%S'))
connection_string = 'DSN=ETL;'
conn = db.connect(connection_string)
print('Executing query' + time.strftime(' %H:%M:%S'))
csr = conn.cursor()
csr.execute(sql)

with open(target_path, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='|', quoting=csv.QUOTE_NONE)
    # Write the header row using the column names from the cursor metadata.
    writer.writerow([x[0] for x in csr.description])
    # Stream rows straight from the cursor to disk, one line at a time.
    for row in csr:
        writer.writerow(row)
        
print('Processing file {} complete.'.format(file_name) + time.strftime(' %H:%M:%S'))
conn.commit()
csr.close()
conn.close()
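Once the script finishes, it is worth spot checking the output to make sure the header and the pipe delimiter came out the way you expect. Here is a minimal sketch using the same csv module, pointed at the same output path the script built above (relative to the script directory):

import csv

# Same output file the script above produced.
target_path = 'data\\WriteLargeFilesExample\\EurostatDataOut.csv'

with open(target_path, newline='') as f:
    reader = csv.reader(f, delimiter='|')
    # Print the header row plus the first two data rows.
    for i, row in enumerate(reader):
        print(row)
        if i >= 2:
            break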
