A couple of weeks ago, I introduced you to an idea I had about testing the viability of Hadoop and/or Spark solutions for ETL. I planned out a real-world type ETL problem, with what I hoped would be real-world scale.
Since that introduction, I’ve had the opportunity to work on the first part of the testing process, the SSIS execution. Keep reading for details on the test plan and the results of the execution.
You’ll remember from the initial post that I’m looking to ingest a bunch of weather data from NOAA, perform some transformations on that data, and write it to SQL Server for later analysis. This procedure mimics many ETL scenarios that are implemented every day.
For this test, I downloaded ALL of the NOAA_daily archive data, totaling nearly 90GB of uncompressed CSV files. Not HUGE data, but not a small amount either. It’s enough that it should offer a good look into the comparison between SMP solutions like SSIS and MPP solutions like Apache Spark.
This CSV data is well formatted and didn’t require too much cleansing for my purposes. One big task I needed to complete was a pivot of the data to ensure that each output row corresponded to a single station/day combination. I also filtered the data down to the following measurements:
- Total Precipitation
- Total Snowfall
- Min Temp
- Max Temp
- Avg Temp
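The pivot step can be sketched in a few lines of pandas. This is a minimal illustration, not the SSIS implementation itself: the column names (`station`, `date`, `element`, `value`) and the element codes (`PRCP`, `SNOW`, `TMIN`, `TMAX`, `TAVG`) are my assumptions about the long-format NOAA CSV layout, where each row holds one measurement for one station on one day.

```python
import pandas as pd

# Tiny sample in the assumed long format: one row per
# station/date/element combination.
rows = [
    ("US1NYWC0003", "2016-01-01", "PRCP", 12),
    ("US1NYWC0003", "2016-01-01", "SNOW", 0),
    ("US1NYWC0003", "2016-01-01", "TMIN", -20),
    ("US1NYWC0003", "2016-01-01", "TMAX", 35),
    ("US1NYWC0003", "2016-01-02", "PRCP", 0),
]
df = pd.DataFrame(rows, columns=["station", "date", "element", "value"])

# Keep only the five measurements of interest, then pivot so each
# station/day pair becomes a single wide row with one column per element.
keep = ["PRCP", "SNOW", "TMIN", "TMAX", "TAVG"]
wide = (
    df[df["element"].isin(keep)]
      .pivot_table(index=["station", "date"],
                   columns="element",
                   values="value")
      .reset_index()
)
print(wide)
```

Days where a station didn’t record a given measurement simply come out as nulls in the wide row, which matches what you’d expect from the SSIS pivot as well.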
I performed a second pivot to also store the time each of these measurements was recorded. The SSIS package I created ended up looking something like this:
The NOAA data is stored in a single file for each year starting with 1796 and ending with 2016. The SSIS package was set up to iterate through each file one at a time — a common method when ingesting lots of CSV files with SSIS.
The environment I executed this package on contained three individual machines:
1 ETL Server (Virtual)
- SQL 2014 SP1
- 4 vCores
- 32GB RAM
- Package deployed to SSISDB
1 File Server (Physical)
- All CSV files stored in a single directory
- ~90GB uncompressed
1 SQL Server (Physical)
- SQL 2014 SP1
- 8 physical cores
- 96GB RAM
- RAID 10 array for logs
- RAID 10 array for data
- Database instantiated with 100GB data and log files
Network
- All machines connected via 1 Gbps LAN
- Located in a single rack
Honestly, when this package started, I thought it was going to move pretty fast! The first years recorded in the NOAA data are quite small, and the package was flying through them. Once it hit the mid-1900s, however, everything started slowing down. Those files are over 1GB in size, and the sorts (a blocking transformation) before each pivot and again before the merge join really took a toll.
The final data landed in a table that looks like this:
Since I executed this from SSISDB, I was able to monitor progress using the SSIS reports — they are a pretty handy tool in modern editions of SSIS.
In the end, the package execution took 61,645 seconds from end to end, which works out to about 17.1 hours. Honestly, that’s a bit longer than I originally expected; my estimates were more in the 8 or 9 hour range.
The next step
Now that I have the SSIS package and testing complete, I’ll be moving on to the Spark development and execution. In a couple of weeks, I’ll be sharing the results of that test. I’m really excited to see how it works out — I’m a bit nervous about the performance of Spark writing data directly to SQL Server. I’m pretty sure Spark will churn through the CSV files faster than SSIS, but I think the database writes are going to be a big bottleneck.