Tutorials
Hands-On Python
  • Hands-On Python Tutorial For Real-World Business Analytics Problems
  • Preface
    • Section I. A Note From The Author
    • Section II. Tutorial Overview
    • Section III. What Is The Preflight Checklist?
    • Section IV. Supplementary Material
  • Preflight Checklist
    • Section V. Select Your Difficulty Setting
    • Section VI. Download Anaconda
    • Section VII. Download PyCharm (Optional)
    • Section VIII. Download SQL Server Developer Edition
    • Section IX. Configure Database Environment
    • Section X. Download The Source Code
    • Section XI. Starting JupyterLab
    • Section XII. How To Get Help With This Tutorial
  • Language Basics
    • Lesson 1. Obligatory Hello World
    • Lesson 2. Code Comments
    • Lesson 3. Data Types
    • Lesson 4. Variables
    • Lesson 5. String Concatenation
    • Lesson 6. Arithmetic Operators
    • Lesson 7. Making Decisions
    • Lesson 8. Control Flow With if-elif-else
    • Lesson 9. Control Flow With while
    • Lesson 10. Data Structures Part I: List
    • Lesson 11. Data Structures Part II: Tuples
    • Lesson 12. Data Structures Part III: Dictionaries
    • Lesson 13. Looping With for
    • Lesson 14. Functions
    • Lesson 15. Importing Modules
    • Lesson 16. Python Programming Standards
  • Advanced Topics
    • Lesson 17. Functional Programming With map
    • Lesson 18. Generators
    • Lesson 19. Comprehensions
    • Lesson 20. Basic File Operations
    • Lesson 21. Working With Data In Numpy
    • Lesson 22. Working With Data In Pandas
    • Lesson 23. Working With JSON
    • Lesson 24. Making File Request Over HTTP And SFTP
    • Lesson 25. Interacting With Databases
    • Lesson 26. Saving Objects With Pickle
    • Lesson 27. Error Handling
    • Lesson 28. Bringing It All Together
  • Solutions To Real World Problems
    • Lesson 29. Download A Zip File Over HTTP
    • Lesson 30. Looping Over Files In A Directory
    • Lesson 31. Convert Comma Delimited Files To Pipe Delimited
    • Lesson 32. Combining Multiple CSVs Into One File
    • Lesson 33. Load Large CSVs Into Data Warehouse Staging Tables
    • Lesson 34. Efficiently Write Large Database Query Results To Disk
    • Lesson 35. Working With SFTP In The Real World
    • Lesson 36. Executing Python From SQL Server Agent

Lesson 34. Efficiently Write Large Database Query Results To Disk

What Gets Uploaded, Must Get Downloaded

Data in a database is pointless unless someone eventually drags it back out to do analysis with it. Most often, the dragging is simply a query rendered to a screen, or maybe a pull of data into a spreadsheet. The spreadsheet, of course, represents a tiny subset of the total historical data assets.

However, sometimes you exchange data with a third party, and you do that in the form of a physical file that has to be produced. If you are lucky, the file is some sort of daily file that only reports information from a 24-hour period.

If you are not lucky, you might have to produce a huge file. It may even only be from a 24-hour period, but so much stuff happens in 24 hours that the file is a monster.

Ok, so now what? You know Pandas has memory limitations, so you cannot just pull the whole result set into a data frame. Sure, you could mess around with the chunksize parameter, but there is a better way.
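For context, the chunksize route looks roughly like this. This is a hedged sketch, not the lesson's method: it uses an in-memory SQLite table with made-up names as a stand-in for the warehouse, and streams the query in fixed-size chunks so the full result set never sits in memory.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory stand-in for the warehouse table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE trades (id INTEGER, qty INTEGER)')
conn.executemany('INSERT INTO trades VALUES (?, ?)',
                 [(i, i * 10) for i in range(5)])

# chunksize turns read_sql into an iterator of small DataFrames;
# each chunk is appended to the CSV as it arrives.
for i, chunk in enumerate(pd.read_sql('SELECT * FROM trades', conn, chunksize=2)):
    chunk.to_csv('trades_out.csv', sep='|', index=False,
                 mode='w' if i == 0 else 'a', header=(i == 0))

conn.close()
```

This works, but you pay DataFrame construction overhead on every chunk. Skipping Pandas entirely, as this lesson does, avoids that cost.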

Instead of using Pandas, we are going to skip the data frame and write data directly to disk one line at a time.

Script Performance

Writing large amounts of data to disk in CSV format will eat some resources, but it is nowhere near as bad as storing it in a database. When you store data in a database, all kinds of other information gets written to disk at the same time. Here we are just dropping the pure 1s and 0s of the file with no logging, shuffling, or metadata.

This script completed on my box in about 10 minutes. The really fun thing is, if you go look, the file we kicked out is much smaller than the file we imported in Lesson 33.

Examples

Example #1: Let’s Pack ‘Em Up And Move ‘Em Out!

We are going to drag back out the data that we just got through loading in the last lesson.

Since the data is currently sitting in a stage table, we are going to query the view instead. That way, we do not wind up with audit columns in our file.

If this were a real-world ETL process, we would have cleansed the data right after loading it in Lesson 33. As it turns out, the data got loaded with some hitchhikers: the carriage return and line feed characters got loaded along with the last column, so they need to be removed.

Once that is done, we can proceed with creating the file.

Most of the code here should be familiar. The new bits involve the csv module, which has functions that allow us to create CSV files in the format most of the world expects when dealing with CSVs.

--remove carriage return and line feed characters
UPDATE euro.EurostatData
SET SUP_QUANTITY = REPLACE(REPLACE(SUP_QUANTITY,CHAR(13),''),CHAR(10),'')
WHERE 1 = 1
AND Cleansed = 0
AND ErrorRecord = 0
AND Processed = 0

import pyodbc as db
import os
import time
import csv

# Fall back to the current working directory when run outside JupyterLab.
if 'script_dir' not in globals():
    script_dir = os.getcwd()
data_directory = 'data\\'
example_directory = 'WriteLargeFilesExample\\'
file_name = 'EurostatDataOut.csv'
target_path = os.path.join(script_dir,data_directory,example_directory,file_name)

sql = 'SELECT * FROM EurostatData'

#Set up the connection.
print('Connecting to SQL Server database' + time.strftime(' %H:%M:%S'))
connection_string = 'DSN=ETL;'
conn = db.connect(connection_string)
print('Executing query' + time.strftime(' %H:%M:%S'))
csr = conn.cursor()
csr.execute(sql)

with open(target_path,'w',newline='') as f:
    writer = csv.writer(f, delimiter = '|', quoting=csv.QUOTE_NONE)
    writer.writerow([x[0] for x in csr.description])
    for row in csr:
        writer.writerow(row)
        
print('Processing file {} complete.'.format(file_name) + time.strftime(' %H:%M:%S'))
conn.commit()
csr.close()
conn.close()
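If the row-at-a-time writerow calls prove slow over a network connection, fetching in batches with fetchmany and writing with writerows can cut round trips. A sketch of that variant, again using an in-memory SQLite table with illustrative names rather than the lesson's actual warehouse:

```python
import csv
import sqlite3

# Hypothetical in-memory stand-in for the warehouse table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE trades (id INTEGER, qty INTEGER)')
conn.executemany('INSERT INTO trades VALUES (?, ?)',
                 [(i, i * 10) for i in range(10)])

csr = conn.cursor()
csr.execute('SELECT * FROM trades')

with open('trades_batched.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='|', quoting=csv.QUOTE_NONE)
    # Header row from the cursor metadata, same as in the lesson.
    writer.writerow([col[0] for col in csr.description])
    while True:
        # Pull rows from the server in batches instead of one at a time.
        batch = csr.fetchmany(1000)
        if not batch:
            break
        writer.writerows(batch)

csr.close()
conn.close()
```

Only the fetch loop changes; memory use stays bounded by the batch size, which you can tune to your network and driver.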
