Fetch and Combine CSV Files from the Web
Written by Jessica Kane, November 11, 2016
If you’ve ever analyzed large amounts of data, you know that it can be time consuming to get all the data you need into one place. Data providers (often government agencies) frequently provide the data online in a comma delimited format (aka CSV). The data is regularly broken down into multiple files to provide smaller downloads or to provide more manageable data sets for users. It is a time-consuming task to fetch and combine CSV files from the web into one file for large scale analysis.
A recent discussion with an engineer inspired us to put together a simple program to solve this problem. We created a tool to fetch and combine CSV files from the web into one locally stored file. This can save you lots of time if you routinely need to retrieve the same data sets over and over for analysis. The tool can be configured to fetch the files you need and then scheduled to run at routine intervals.
Everybody likes free stuff, so we’re making this tool available free of charge to any of our readers. If this interests you, we have some instructions below to help you get started.
Popular data sets include files put out by the Census Bureau, Department of Energy and many other governmental agencies that distribute data for public consumption. The Michigan Public Service Commission (MPSC) compiles data from monthly natural gas production reports submitted by oil and gas operators in the state of Michigan. We’ll be using some CSV files from their website as an example.
Step 1: Download the Tool
You can download the zip file containing the program files here:
After you unzip the file, you will see that the tool consists of two files: a “main.exe” file and a text file called “urls.txt.”
The executable “main.exe” is what you run to fetch your files. The file “urls.txt” is the file you use to configure what files the program will fetch.
Step 2: Locate the desired CSV files
For this step we will use an example site to demonstrate how you can use the tool. When you go to the Michigan Natural Gas Production Reports page of the MPSC’s website, you’ll see that the Commission has compiled natural gas production data from July 1990 through March 2016. There are over a million rows of data for this time period and the Commission has broken it down into five CSV files:
Step 3: Inspect the CSV files to make sure they all have the same columns
When using this tool, you should keep in mind that the tool is designed to fetch and combine multiple CSV files that use the same schema. That means that all of your CSV files should have the same columns in the same order (when opened in Microsoft Excel). You can see that all five of the files we are combining from the MPSC site have the following columns:
Step 4: Copy your URL(s) from the web by right-clicking on the link and selecting “Copy link address”
Step 5: Paste the URL for each file you want included in your output onto a new line in the urls.txt file and save the file
Open the “urls.txt” file in a standard text editor such as Notepad. After you’ve pasted the URL for each file onto a new line, save and close the file.
*PLEASE NOTE: You should paste the URL with the column labels that you want first. The tool will use these column labels and all subsequent labels from the other URLs will be lost.
Step 6: Run the “main.exe” file
Executing the “main.exe” file will initiate a command line dialogue. This dialogue may appear to be doing nothing at first, but be patient. When it is finished with one file, it will show the progress to the next file. No user action is needed in this screen. When the operation is finished, the screen will automatically close.
Step 7: Open “output.csv”
Once the operation is finished a new file will be in the folder, named “output.csv.” This file will have all of the data from the multiple CSVs.
If you find yourself routinely (weekly, monthly… etc.) needing to fetch and combine CSV files from the web because the provider is frequently posting updates, try using Window’s Task Scheduler.
You can set tasks and triggers with Task Scheduler that Windows will automatically run based on set time intervals or in response to other actions.
We hope this is a useful tool for you to fetch and combine CSV files from the web. If you have any issues, would like additional support or would like to us to help you develop a more powerful tool, contact us. We’d love to hear from you!
PLEASE NOTE: This application is provided free of charge. It is still in beta mode and may contain bugs or issues. Users should review the output.csv file to ensure that all data was fetched and combined correctly.
- Welcoming Douglas to our Team
- Python for Quick and Easy GIS Data Manipulation
- Updates from the Line 45 Team
- Fetch and Combine CSV Files from the Web
- Resources for Water, Wastewater and Stormwater Utilities