Foundational Principles in Data Engineering: Staying Grounded Amidst Tech Trends

As I scroll through social media – LinkedIn, X, Reddit, and even Facebook – I see many posts and advertisements for various solutions in the data space. Some are for entirely new data solutions such as Snowflake, Databricks, and BigQuery. Others are for ETL/ELT tools offering to solve an age-old problem – collecting data in one place. Still others are new flavors of tools many veteran data engineers were early adopters of – SQL Server, Oracle, PostgreSQL, and others.

The tools that haven’t changed in my 20 years in the data and technology space aren’t the database systems, the cloud options, or even the ETL/ELT tools. The tools that haven’t changed are the foundational concepts all data engineering teams need to apply to ANY solution for it to be successful. The emergence of exciting new cloud-based technology has tempted many of us to chase the shiny new tool because we see a demo that solves our current problem effortlessly – or at least that’s what the sales reps tell us.

In the next few sections, I’ll discuss several “tools” that are still critical in today’s data engineering environment but seem to get lost at times in the whirlwind of emerging tech.

Begin With The End In Mind

In his book The 7 Habits of Highly Effective People, Stephen Covey lists “begin with the end in mind” as one of the habits. Applying this principle to data engineering simply means that before a single database, table, or ingestion pipeline is coded, take the time to determine what the final data solution should be. The “end” in this case doesn’t need to be a full ERD with all the tables and primary keys established; however, at least establish a rough idea of the end solution.

Asking the basic questions we all learned in elementary school can help:

  1. Who is the end consumer of our data?
  2. Where will the data live?
  3. What is the footprint of the initial data set?

You get the idea. Spending some time upfront to sketch a rough idea of the solution gives shape to the project and a north star to reference as work begins.

Rely On Your Expertise

Know your area and tool(s) of expertise and rely on them. With so many different tools and platforms coming to the marketplace daily, it is easy to get caught up in the shiny and new. I’ve been there recently – Snowflake did that to me. Snowflake is everywhere on my social feeds – I took the plunge, achieved my SnowPro Core Certification, and am a member of the Certification Subject Matter Expert (SME) team. It’s very cool technology; however, it is still new, and I’m still learning.

I am certainly not saying that learning new technology is detrimental; however, I am saying that constantly changing your team’s technology approach will impact the team’s ability to deliver. Isn’t embracing new technology a good thing? Yes – but are you suddenly an expert on that technology compared to the data tools you already know? Probably not. Knowing a particular toolset, and knowing it well, goes a long way for data engineering teams trying to solve complex data problems. By sticking with your expertise, time is spent solving the data problem, not tracking down whether the limitation is the data or the technology.

Documentation

Ugh! Really? Documentation? Yes, every developer hates it. I certainly do. But even some rough comments in code, a Jira ticket, Wiki, or design document will go a long way. Your “future self” will thank you.

I’m a drummer and have played for several band leaders over the years. One of them had a saying that applies here: “A short pencil is always better than a long memory.” In data engineering terms, even rough documentation is better than relying on tribal knowledge of how a process came together. When issues arise in the future (and they will), even a rough outline of what a process is doing, or how data flows through the system, helps track down a bug.

It doesn’t have to be pretty out of the gate. I’ve applied this recently with my team. Our documentation tool of choice is Confluence, and it provides a really simple WYSIWYG interface for a wiki – but the part I like best is how easy it is to edit, update, or even move entire documents within the space. Simply “get it in there” and make it pretty as time permits.

Wrapping Up

Not all data engineering work is the fun stuff like coding procedures, building ETL/ELT flows, or learning new data technology. Some of the most “boring” and basic work, like planning and documentation, continues to pay dividends in the long run.

The prep work should be proportional to the overall task at hand. Spending weeks planning for a project that ends up being a single database with five tables and a few stored procs is likely overkill. But the prep work should still occur so the team knows where to go.

Don’t sacrifice the basics for the promise of a magic bullet.

Follow me on LinkedIn and Medium for more content on Data Management and demos including Snowflake, Streamlit, and SQL Server.

TL;DR Made Easy: Summarizing Web Content with Snowflake and Streamlit

This article is inspired by the acronym TL;DR (too long; didn’t read). As I dive deeper into technology topics such as Snowflake, Streamlit, AI, data management, and more, I find a lot of content that can be rather lengthy. Not all of that content has a nice intro or summary. I’m doing a little finger-pointing at myself here because I’ve skipped that crucial intro in several articles.

TL;DR

In this demo, I’ll show a Streamlit app using Snowflake Cortex to extract content from a URL and produce a summary of the content. Prerequisites include a Snowflake account with Cortex enabled and a Python dev environment with Streamlit, Snowflake, and BeautifulSoup libraries installed.

Use AI to Summarize Web Content

With some of the “tinkering” I’ve been doing with Cortex, Snowflake’s integrated AI toolset, I wondered how I could leverage it to summarize web content without copying and pasting the full content into an app. That spawned a simple Streamlit app to do just that.

Getting Started

You’ll need a Snowflake account with the appropriate permissions applied to access the functions. I covered adding this permission set in a previous article.

If you haven’t already, install the Snowflake Snowpark, Connector, and ML libraries, along with Streamlit and BeautifulSoup4, into your Python environment. I use VSCode, so I installed all of these from the command line using pip.
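If it helps, that works out to a single pip command; the package names below are the standard PyPI names as I understand them, so double-check against the docs if anything has changed:

pip install snowflake-snowpark-python snowflake-connector-python snowflake-ml-python streamlit beautifulsoup4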

Import Libraries and Snowflake Credentials

Now that all the prerequisites are handled, we can start building the app in our Python dev environment. First, we’ll import all our needed libraries and create our Snowflake credentials dictionary. Enter your Snowflake account information where specified and be sure to use the user and role with Cortex access granted in the first step.

#IMPORT SNOWFLAKE LIBRARIES
from snowflake.snowpark import Session as sp
from snowflake import cortex as cx
import streamlit as st
#IMPORT "EXTRA" LIBRARIES
import requests
from bs4 import BeautifulSoup

#SET SNOWFLAKE CREDS
snowflake_creds = {
    "account": "YOUR ACCOUNT",
    "user": "YOUR USERNAME",
    "password": "YOUR PASSWORD",
    "database": "YOUR DB",
    "schema": "YOUR SCHEMA",
    "role": "YOUR ROLE",
    "warehouse": "YOUR WAREHOUSE"
}
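One thing the snippets in this article don’t show is creating the Snowpark session that the Cortex call will reference later as sf_session. Here is a minimal sketch, assuming the standard Session.builder pattern with the credentials dictionary above:

#CREATE THE SNOWPARK SESSION USED BY CORTEX LATER IN THE APP
sf_session = sp.builder.configs(snowflake_creds).create()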

URL Text Input

For this app, I’m going to keep it pretty simple and have a header and a single input box for the URL of the content to summarize. Once a URL is entered and submitted to the app, some conditional logic can check whether a URL was actually provided, whether it is valid, whether the content can be read, and so on (see the sketch after the snippet below).

st.set_page_config(page_title="Snowflake Cortex URL Summarizer")
st.header("Snowflake Cortex URL Summarizer")
#EXTRACT TEXT FROM PUBLIC WEBSITE WITH "GET"
summary_url = st.text_input(label="Enter url to summarize (must be public)")
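Those conditional checks don’t appear in the article’s snippets, which run the GET unconditionally in the next section. Here is a minimal sketch of what they might look like (my assumption, not the author’s exact code); if you use it, it replaces the plain requests.get shown below:

#SKETCH ONLY: GUARD AGAINST AN EMPTY OR UNREACHABLE URL BEFORE FETCHING
if not summary_url:
    st.info("Enter a public URL above to get a summary.")
    st.stop()
try:
    response = requests.get(summary_url, timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as err:
    st.error(f"Could not read that URL: {err}")
    st.stop()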

Making Soup

BeautifulSoup is a Python library that helps with parsing HTML and XML content. For this app, we’ll use it to parse the HTML content from a GET request and extract just the text. We don’t need all the HTML code like JavaScript or tags – just the “body” of the article or page. BeautifulSoup does that in just a few lines of code.

response = requests.get(summary_url)
#MAKE SOUP
soup = BeautifulSoup(response.content,features='html.parser')

Summarize the Content

Now that we’ve made soup, we’ll send that soup to Snowflake to run the Cortex Summarize function on the text and then write that summary to a chatbot window. I used the chatbot window for fun, but it could be a standard write function, a text area, or whatever you choose.

content_summary = cx.Summarize(soup.get_text(), sf_session)
with st.chat_message(name="ai"):
    st.write(content_summary)

Running the App

Now that the app is complete, run the app by calling the streamlit run utility from the command line (or terminal in VSCode). Once the app loads in a browser window, paste a URL into the box and press Enter. Within a few seconds, your app will read the document and then provide a summary of the content. The animation below shows the app in action on one of my recent blog articles. 
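For reference, the command looks like the line below; url_summarizer.py is my placeholder, so substitute whatever you named your script:

streamlit run url_summarizer.py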

Streamlit app in action

Conclusion

In this article I showed how to use Snowflake Cortex Summarize in a Streamlit app to provide a summary of web content – content that’s not even in Snowflake! Imagine using Snowflake’s AI functions with Python or Streamlit for rapid POCs for datasets not yet loaded to Snowflake – the possibilities are endless.

The complete demo code is available in my GitHub repository. Follow me on LinkedIn and Medium for more Snowflake, Streamlit, and Data Management content.

Streamlit Power-Up: Advanced Image Display Options for Snowflake Users

In my last article, I showed two ways to add images to your Streamlit in Snowflake app – one from an external link and another from the same internal stage as the app. What if the images or other resources needed for your app aren’t in the same internal stage or even in Snowflake at all? Below I walk through two more options for displaying images in a Streamlit in Snowflake app.

“Other” Internal Stage

In the previous demo, I simply loaded the images I wanted to display into a folder in the same internal stage as the app. Perhaps you have a set of “common” images for your Streamlit apps – this is where a common stage would come in handy.

Create Stage

First, create a new stage in your Snowflake account. In this example, I’ll put it in the same database and schema as my app, but as long as you set permissions correctly, the location of the stage shouldn’t matter. When creating the stage, you’ll want to set the encryption to SNOWFLAKE_SSE and enable DIRECTORY. SNOWFLAKE_SSE uses Snowflake Server-Side Encryption on the files in the stage so that they can be accessed outside of Snowflake with a PRESIGNED URL. Enabling the directory allows us to query the stage using the DIRECTORY table function, which we’ll need to get our image URLs.

/*SET ENVIRONMENT*/
USE ROLE ST_DEMO_ROLE;
USE WAREHOUSE STREAMLIT_XS_WH;
USE DATABASE STREAMLIT_DB;
USE SCHEMA STREAMLIT_STAGES;


/*CREATE DEDICATED INTERNAL IMAGES STAGE*/
CREATE OR REPLACE STAGE STG_IMAGES
    ENCRYPTION = (TYPE='SNOWFLAKE_SSE')    
    DIRECTORY = (ENABLE=TRUE)
    ;

Once the stage is created, upload the app images using your preferred method. For this demo, I grabbed three images and uploaded them using Snowsight.
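Snowsight works fine for a handful of files; if you would rather script the upload, a PUT from SnowSQL might look like the line below (the local path and file name are placeholders). Keeping AUTO_COMPRESS off means the files stay in their original, browser-readable format:

/*SKETCH: UPLOAD AN IMAGE VIA SNOWSQL - LOCAL PATH AND FILE NAME ARE PLACEHOLDERS*/
PUT file:///local/path/image1.png @STG_IMAGES AUTO_COMPRESS = FALSE;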

Images loaded to STG_IMAGES

Get Images

Next, using the GET_PRESIGNED_URL function and the DIRECTORY table function, build a query against the stage to return the relative path of the images and a URL for accessing them. If you have a complex folder structure on the stage, you may also want to include a column that parses the file name out of the relative path (see the sketch after the query below).

SET STAGE_NAME = '@STREAMLIT_DB.STREAMLIT_STAGES.STG_IMAGES';

SELECT
    RELATIVE_PATH AS FILE_NAME,
    GET_PRESIGNED_URL($STAGE_NAME,RELATIVE_PATH) AS IMG_URL
FROM DIRECTORY($STAGE_NAME);
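That extra file-name column could be derived from the relative path; here is one sketch using SPLIT_PART (my addition, not from the article):

/*SKETCH: PULL JUST THE FILE NAME OUT OF A NESTED RELATIVE PATH*/
SELECT
    RELATIVE_PATH,
    SPLIT_PART(RELATIVE_PATH,'/',-1) AS FILE_NAME,
    GET_PRESIGNED_URL($STAGE_NAME,RELATIVE_PATH) AS IMG_URL
FROM DIRECTORY($STAGE_NAME);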

Add Images to Streamlit

Using the query created in the previous step, add a few lines of code to your app to retrieve the image information from your stage. In this example, I’m looping through all the images on the stage and displaying them to the screen with the relative path added as a caption to the image.

stg_name = "@STREAMLIT_DB.STREAMLIT_STAGES.STG_IMAGES"
img_sql = f"""
SELECT
RELATIVE_PATH AS FILE_NAME,
GET_PRESIGNED_URL({stg_name},RELATIVE_PATH) AS IMG_URL
FROM DIRECTORY({stg_name})
"""
img_df = session.sql(img_sql).collect()
for img in img_df:
    st.image(image=img["IMG_URL"],caption=img["FILE_NAME"],width=200)
    link = f'[Download {img["FILE_NAME"]}]({img["IMG_URL"]})'
    st.markdown(link)

Images in Streamlit App

External Stage

What if you already have all your images or assets located externally from Snowflake? Perhaps they’re on Amazon S3 or Microsoft Azure – no worries, we can do that too!

External S3 Bucket

The external stage implementation of this method differs from the internal stage only in how the stage itself is set up. For this example, I created an S3 bucket in my AWS account and generated an Access Key and Secret for it. Note: connecting to S3 buckets using an Access Key and Secret is allowed, but it is not best practice – a storage integration is the recommended approach (see the sketch after the statement below); however, the Access Key and Secret are “good enough” for this demo. I also loaded the same three images to the S3 bucket as my internal stage.

CREATE OR REPLACE STAGE STG_S3_IMAGES
   URL = 's3://eheilman-streamlit-images'
   CREDENTIALS = (AWS_KEY_ID = 'YOUR KEY' AWS_SECRET_KEY = 'YOUR SECRET')
   ENCRYPTION = (TYPE = 'AWS_SSE_S3') /*MATCH YOUR S3 BUCKET ENCRYPTION*/
   DIRECTORY = (ENABLE=TRUE)
    ;
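For completeness, the storage integration approach mentioned above might look roughly like the sketch below. The integration name and role ARN are placeholders, creating an integration typically requires the ACCOUNTADMIN role, and the AWS-side IAM trust setup is omitted here:

/*SKETCH: STORAGE INTEGRATION ALTERNATIVE - NAME AND ARN ARE PLACEHOLDERS*/
CREATE OR REPLACE STORAGE INTEGRATION S3_IMAGES_INT
    TYPE = EXTERNAL_STAGE
    STORAGE_PROVIDER = 'S3'
    ENABLED = TRUE
    STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-images-role'
    STORAGE_ALLOWED_LOCATIONS = ('s3://eheilman-streamlit-images/');

/*THEN REFERENCE THE INTEGRATION INSTEAD OF A KEY AND SECRET*/
CREATE OR REPLACE STAGE STG_S3_IMAGES
    URL = 's3://eheilman-streamlit-images'
    STORAGE_INTEGRATION = S3_IMAGES_INT
    DIRECTORY = (ENABLE=TRUE)
    ;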

Use External Stage in Streamlit

Switching to the external stage in the app is as easy as changing the stage name in the variable. Since we enabled directory access on the external stage during creation, we can reuse the same query we used for the internal stage.

stg_name = "@STREAMLIT_DB.STREAMLIT_STAGES.STG_S3_IMAGES"
img_sql = f"""
SELECT
RELATIVE_PATH AS FILE_NAME,
GET_PRESIGNED_URL({stg_name},RELATIVE_PATH) AS IMG_URL
FROM DIRECTORY({stg_name})
"""
img_df = session.sql(img_sql).collect()
for img in img_df:
    st.image(image=img["IMG_URL"],caption=img["FILE_NAME"],width=200)
    link = f'[Download {img["FILE_NAME"]}]({img["IMG_URL"]})'
    st.markdown(link)

Conclusion

In this demo, we walked through two additional methods for incorporating images into a Streamlit in Snowflake application. With these methods, you can display images in your apps whether they are stored internally in a Snowflake stage or externally on platforms like Amazon S3 or Microsoft Azure.