Efficiently Load Excel Files to Snowflake Without ETL Tools

Excel is everywhere! As a data professional, when I hear that the source data is coming from Excel – it causes a visceral reaction knowing there are a lot of potential pitfalls in getting the data from Excel into any database system. Loading an Excel file to Snowflake is no picnic, or is it? Sure, we could use an ETL tool to load the data from Excel to a Snowflake table; however, if we leverage several Python libraries that are pre-loaded into Snowflake, we can load the Excel file in just a few simple steps – all within Snowflake.

Prep Work

The code samples shown assume the desired Excel file is already loaded into an internal Snowflake stage and the user executing the code has proper use/table creation privileges on the stage and target database schema in Snowflake. For the purposes of the demo, I used a sample Excel file from thespreadsheetguru.com

If you’re working with a new Snowflake demo account and need to setup an environment to match the next few code samples, the initial setup script for my environment is on my GitHub Repo.

Confirm Excel File Location

First, let’s confirm our desired Excel file is loaded to our demo stage location. Running the query below will list the files in my Snowflake stage.

USE ROLE ST_DEMO_ROLE;
USE SCHEMA ST_DEMO_DB.ST_INTERNAL_STAGES;
USE WAREHOUSE ST_DEMO_XS_WH;

LIST @ST_DEMO_STAGE;
Employee Sample Data Excel file on ST_DEMO_STAGE

Python Worksheet

Next, create a new Python worksheet in Snowsight. We’ll need to include a few packages first, so on the Packages drop down, under “Anaconda Packages” add pandas and openpyxl

Add Packages to Python Worksheet

Next, we’ll add pandas and os to our list of imports at the top of our script. The os library will be used to assist with joining file paths – not necessary, but helpful to simplify code.

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col
import pandas as pd
import os

Capture Excel Sheet Names

Now, we’re ready to read use pandas to read our Excel file from the stage location. To confirm we have captured the Excel file properly, we’ll temporarily return a Snowpark dataframe containing the sheet names in the file. In this case, the file only has one sheet named “Data”.

def main(session: snowpark.Session): 
    #SET THE FULL STAGE PATH
    str_stg_name = "@ST_DEMO_DB.ST_INTERNAL_STAGES.ST_DEMO_STAGE"
    #SET THE DESIRED EXCEL FILE NAME
    str_excel_file = "Employee Sample Data.xlsx"
    #STREAM THE EXCEL FILE TO AN PANDAS EXCEL OBJECT WITH FILE STREAM
    xl_file = pd.ExcelFile(session.file.get_stream(os.path.join(str_stg_name,str_excel_file)))
    #CAPTURE THE SHEET NAMES
    sp_df = session.createDataFrame(xl_file.sheet_names,schema=["Sheet Name"])
    return(sp_df)
Python Excel Sheet Names

Read Excel Sheet to Screen

Now that we’ve confirmed that the file is accessible and we can get the sheet names, let’s capture that data into a data frame and return it to the screen. With just a few simple modifications to our script above, we can do just that!

def main(session: snowpark.Session): 
    #SET THE FULL STAGE PATH
    str_stg_name = "@ST_DEMO_DB.ST_INTERNAL_STAGES.ST_DEMO_STAGE"
    #SET THE DESIRED EXCEL FILE NAME
    str_excel_file = "Employee Sample Data.xlsx"
    #STREAM THE EXCEL FILE TO AN PANDAS EXCEL OBJECT WITH FILE STREAM
    xl_file = pd.ExcelFile(session.file.get_stream(os.path.join(str_stg_name,str_excel_file)))
    #READ THE "DATA" SHEET TO PANDAS DATA FRAME
    data_df = pd.read_excel(xl_file,sheet_name="Data")
    #CONVERT PANDAS TO SNOWPARK DF
    sp_df = session.createDataFrame(data_df)
    #RETURN THE DATA
    return(sp_df)
Excel Data Output to Screen

Write Excel Sheet to Snowflake

Great! Now we’ve confirmed we can read the data from the Excel file. Next, let’s write it to a transient table in Snowflake in the RAW_DATA schema. I’ll also set the auto_create_table and overwrite options to TRUE so Snowflake will create the table if it doesn’t exist, or overwrite it if it does. These settings are for the purposes of re-running this demo code; however, adjust them as needed for your specific situation.

def main(session: snowpark.Session): 
    #SET THE FULL STAGE PATH
    str_stg_name = "@ST_DEMO_DB.ST_INTERNAL_STAGES.ST_DEMO_STAGE"
    #SET THE DESIRED EXCEL FILE NAME
    str_excel_file = "Employee Sample Data.xlsx"
    #SET THE DESIRED TARGET TABLE
    str_snow_table = "EMPLOYEE_DATA"
    #STREAM THE EXCEL FILE TO AN PANDAS EXCEL OBJECT WITH FILE STREAM
    xl_file = pd.ExcelFile(session.file.get_stream(os.path.join(str_stg_name,str_excel_file)))
    #READ THE "DATA" SHEET TO PANDAS DATA FRAME
    data_df = pd.read_excel(xl_file,sheet_name="Data")
    #WRITE THE DATA TO SNOWFLAKE
    session.write_pandas(df=data_df,table_name=str_snow_table,table_type='transient',auto_create_table=True,overwrite=True)
    #RETURN THE LIST OF TABLES IN RAW_DATA FOR CONFIRMATION
    sp_df = session.sql("SHOW TABLES LIKE 'EMPLOYEE%'")
    #RETURN THE DATA
    return(sp_df)
EMPLOYEE_DATA Table Details
Employee Data in Snowflake Transient Table

Conclusion

By leveraging Python libraries pre-loaded into Snowflake, we can efficiently load Excel files into Snowflake in a few simple steps, avoiding the complexity of traditional ETL tools. In this demo, we walked through how to prepare, read, and write Excel data to Snowflake from a Python worksheet. This could easily be extended to run as a procedure and include much more dynamic methods of determining file names, sheet names and/or the table target.

Follow me on LinkedIn and Medium for more content on Data Management and demos including Snowflake, Streamlit, and SQL Server.

2 thoughts on “Efficiently Load Excel Files to Snowflake Without ETL Tools

  1. Hello,

    First of all, great tutorial.

    I try to follow steps presented on this site and i went successfully through ‘Confirm Excel File location’ (so i am getting confirmation that .xlsx file is where it should be) but next steps are failing in such a way that they do not return anything although script is executed and is successful. I just don’t get to see sheet name neither data in .xlsx file.

    Do you have any idea why is that?

    Regards,

    Alfred

    Like

    1. Do you have the stage, file and sheet name set exactly the same as in the script? If not, make sure you update the variables in the script to match your stage and files.

      Like

Leave a comment