Lesson 24. Making File Requests Over HTTP And SFTP

So, all the cool data is in the cloud, right? The ubiquitous cloud? Is a third-party bank’s on-prem file server part of the cloud? Whatever. The point is the data is not local so we gotta go out and get it.

Up to this point, when we snag a file from the internet, we have been doing it with Pandas. That’s cool, but the problem is that when you download a file with Pandas, it gets loaded into a data frame, which is limited by your machine’s memory, before it gets written to disk.

Today, I am going to show you how to skip the data frame and just write a file straight to disk. We will be using a module called urllib to yank files off the internet using the same protocols used to retrieve webpages.

I am also going to show you how to get files from an FTP server which can be kind of a goat rope.

Sometimes you need to get data from an FTP server. However, the data is probably sensitive so you cannot transfer the file in clear text. Here is where SFTP (SSH File Transfer Protocol) comes in. It runs the transfer over an encrypted SSH connection.

In order to run the SFTP example below you will need to do two things.

Run pip install pysftp. This is the module that we are going to use to download our file from a secured server.

We also need to install some sort of graphical FTP tool. I use WinSCP. We are going to need this tool to explore the FTP server so we can find out what we need to write our download script.

Examples

Example #1: Where’s The Beef?!

Acquiring data from the internet is often not as straightforward as dropping a URL in a script. In some cases, in order to get the data that you want, you have to pass in URL variables. Below is a good example of this.

We are going to download some beef data from the USDA website. We are going to pull down an XML file, but the file is not sitting static on the server. The file is built by a report-generating application on the USDA server.

Normally users get their data by using the web application’s GUI. But we need to automate things. So, the strategy is to use the GUI once. When you do that, you take note of the query string that gets built in the URL. Then you make Python variables out of any URL parameters. Now you can bypass the GUI and automate the process of downloading the data.
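To make that strategy concrete, here is a minimal sketch of taking a URL apart and putting it back together with a different date. It assumes the copied URL looks like the one used in this lesson. The helper name build_report_url is made up for illustration, and the set of characters left unescaped is just chosen to mirror how the copied URL is written.

```python
import json
from urllib.parse import urlsplit, parse_qs, quote

# A URL as it might look after using the report GUI once (same one used in this lesson).
copied_url = (
    'https://mpr.datamart.ams.usda.gov/ws/report/v1/beef/LM_XB459?'
    'filter={%22filters%22:[{%22fieldName%22:%22Report%20date%22,'
    '%22operatorType%22:%22GREATER%22,%22values%22:[%2209/01/2020%22]}]}'
)

# Split the URL and decode the query string to see what the GUI really sent.
parts = urlsplit(copied_url)
params = parse_qs(parts.query)
report_filter = json.loads(params['filter'][0])
print(report_filter)


def build_report_url(report_date):
    """Hypothetical helper: rebuild the same URL with a different report date."""
    new_filter = {
        'filters': [{
            'fieldName': 'Report date',
            'operatorType': 'GREATER',
            'values': [report_date],
        }]
    }
    # Compact JSON, then percent-encode quotes and spaces like the copied URL does.
    encoded = quote(json.dumps(new_filter, separators=(',', ':')), safe='/:,[]{}')
    return f'{parts.scheme}://{parts.netloc}{parts.path}?filter={encoded}'


print(build_report_url('10/15/2020'))
```

Once the filter is decoded, it is easy to see that the only thing that changes from run to run is the date, so that is the piece worth turning into a Python variable.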

import urllib.request  # "import urllib" alone does not reliably load the request submodule
import os

report_date = '09/01/2020'

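# Fall back to the current working directory when script_dir is not already defined.
if 'script_dir' not in globals():
    script_dir = os.getcwd()
# Plain folder names; os.path.join adds the right separator for your OS.
data_directory = 'data'
example_directory = 'HTTPAndFTPExample'
file_name = 'BeefReport.xml'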

target_path = os.path.join(script_dir,data_directory,example_directory,file_name)

url = 'https://mpr.datamart.ams.usda.gov/ws/report/v1/beef/LM_XB459?'
url = url + 'filter={%22filters%22:[{%22fieldName%22:%22Report%20date%22,'
url = url + '%22operatorType%22:%22GREATER%22,%22values%22:[%22' + report_date + '%22]}]}'

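# Make sure the target folder exists before writing to it.
os.makedirs(os.path.dirname(target_path), exist_ok=True)

# Request the report and write the response bytes straight to disk.
with urllib.request.urlopen(url) as source_file:
    with open(target_path, 'wb') as target_file:
        target_file.write(source_file.read())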
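One thing to keep in mind: source_file.read() still pulls the whole response into memory before it is written. That is fine for a small report, but for a big file you may want to copy it in chunks instead. Here is a rough sketch of that idea using shutil.copyfileobj from the standard library. The function name download_to_disk and the 64 KB chunk size are my own assumptions, not something the lesson prescribes.

```python
import os
import shutil
import urllib.request


def download_to_disk(url, target_path, chunk_size=64 * 1024):
    """Hypothetical helper: stream a URL to a file without holding it all in memory."""
    # Create the target folder if it is not there yet.
    os.makedirs(os.path.dirname(os.path.abspath(target_path)), exist_ok=True)
    with urllib.request.urlopen(url) as source_file:
        with open(target_path, 'wb') as target_file:
            # copyfileobj reads and writes chunk_size bytes at a time.
            shutil.copyfileobj(source_file, target_file, chunk_size)
    return target_path


# Usage, assuming url and target_path are built as in the example above:
# download_to_disk(url, target_path)
```

With this approach, memory use stays around the chunk size no matter how large the file is.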

Example #2: Download File From SFTP Server

We are going to use a test SFTP server called Rebex. The protocols for server access can be found at https://test.rebex.net.

If you go to that site and look at the SFTP protocol, you will find the following settings.

host: test.rebex.net

username: demo

password: password

port: 22

Plug those settings into your FTP tool and log into the server. Navigate around and see if you can find an interesting file to download. For this example, we are going to keep it simple and just grab the readme.txt file from the root folder.

This is a simplified example for clarity. Below you will see the line where we set cnopts.hostkeys = None. This turns off host key checking, which leaves you open to something called a "man-in-the-middle" attack. In a real scenario, you would have to verify the server with something called a host key. We will tackle that in the solutions section.

import os
import pysftp

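# Fall back to the current working directory when script_dir is not already defined.
if 'script_dir' not in globals():
    script_dir = os.getcwd()
# Plain folder names; os.path.join adds the right separator for your OS.
data_directory = 'data'
example_directory = 'HTTPAndFTPExample'
file_name = 'readme.txt'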

host = 'test.rebex.net'
username = 'demo'
password = 'password'

target_path = os.path.join(script_dir,data_directory,example_directory,file_name)

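# WARNING: hostkeys = None turns off host key checking. Demo only.
cnopts = pysftp.CnOpts()
cnopts.hostkeys = None

# Make sure the target folder exists before downloading into it.
os.makedirs(os.path.dirname(target_path), exist_ok=True)

# Connect on the default port (22) and copy the remote file to target_path.
with pysftp.Connection(host=host, username=username, password=password, cnopts=cnopts) as sftp:
    sftp.get(file_name, target_path)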

Now you try it!

Don't copy and paste. Type the code yourself!
