Create Rich Metadata Using Snowflake Cortex with Matillion Data Productivity Cloud

Matillion Data Productivity Cloud (DPC) has been quick to integrate some great AI features. AI accelerates data engineering processes and can automate tasks that previously required laborious human effort. One use case where AI is particularly beneficial is enriching data with metadata. In this blog post, Erick Ferreira illustrates this by enriching the publicly available IMDb dataset (see reference below). The enrichment is done with Cortex, Snowflake’s managed AI and machine learning service, and Matillion’s DPC, for fast and reliable engineering of data pipelines.

About the IMDb Dataset

IMDb provides an online database containing information about films and other sorts of digital content, such as television shows, video games and streaming content. The database was launched in 1990 as a fan-operated movie database and is nowadays owned and operated by IMDb.com, Inc., a subsidiary of Amazon. The non-commercial datasets can be used free of charge for non-commercial purposes (see footnote 1).

At Snap Analytics, we use the IMDb dataset for training purposes. Snap’s dimensional model (pictured on this page) for the IMDb database gives us a good starting point for metadata enrichment with AI. Here is the simplified model:

On top of this dimensional model, we added a reporting table with consolidated information about each film, such as the name of the film, year of release, the director’s name, a summary of the film, duration, genre, and so on.

IMDb data model

The challenge

We have the following two requirements for our use case:

  1. We would like to know which film directors might have German ancestors. This information is not available in the metadata. We will use Cortex AI Completion to analyse the directors’ names and add a column indicating whether each name has a German origin or not.
  2. For Spanish-speaking users, we would like to add a translation of the English film summary into Spanish. We will use Cortex Translate to create this translation automatically.

Below are the steps to achieve both requirements by creating a DPC pipeline with some out-of-the-box components.

Step 1: Create a Transformation Pipeline

First, we created a Transformation pipeline with an Input Table component pointing to the IMDb reporting table. Have a look at the screenshot below, which contains a sample of the data.

Step 2: Getting Directors with German Ancestors

From the reporting table, we would like to recommend the films from German directors. We used the Cortex Completions component with the llama-70b-chat model and the following user prompt: “Could you tell me the origin of this name? I am not interested in the nationality of the director; I would like to know the background of the name.” We applied the question against the Director column, and the answers will later let us identify which surnames have a German background.
Look at the image below for the component settings and what the pipeline looks like:

After sampling the output of the Cortex Completions component, we can see the “completion_result” column in JSON format, with the answer describing the origin of the director’s name held in the “messages” attribute.
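For those curious about what the component does behind the scenes, it wraps Snowflake’s SNOWFLAKE.CORTEX.COMPLETE function. The sketch below is a hand-written approximation rather than the component’s exact output: the table name film_report, the appended “Name:” suffix and the llama2-70b-chat model identifier (the model name as listed in Snowflake’s documentation) are our own assumptions.

```sql
-- Illustrative, hand-written equivalent of the Cortex Completions component.
-- film_report and the appended "Name:" suffix are assumptions for this sketch.
SELECT
    director,
    SNOWFLAKE.CORTEX.COMPLETE(
        'llama2-70b-chat',                     -- model name as listed in the Snowflake docs
        [
            {
                'role': 'user',
                'content': 'Could you tell me the origin of this name? ' ||
                           'I am not interested in the nationality of the director; ' ||
                           'I would like to know the background of the name. Name: ' || director
            }
        ],
        { 'temperature': 0 }                   -- passing an options object makes COMPLETE return a JSON string
    ) AS completion_result
FROM film_report;
```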

Finding the background of a surname would have been a very laborious task without leveraging Large Language Models (LLMs). It would likely have required extensive research and potentially integration with a third-party tool.

Step 3: Translate the Film Summary to Spanish

We use the Cortex Translate component in DPC to generate a Spanish translation. The “Overview” column contains the English original summary, which is used as input. Look at the example below for the settings and the outcome after sampling the data:

The Cortex Translate component will not only let you translate from English to Spanish; there are several languages you can select from the component’s drop-down menu. You can even have multiple columns in the output, one for each language that you need.
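If you prefer SQL, the same result can be sketched directly with Snowflake’s SNOWFLAKE.CORTEX.TRANSLATE function; the table name film_report and the extra German column are illustrative assumptions:

```sql
-- Illustrative equivalent of the Cortex Translate component.
SELECT
    overview,                                                         -- original English summary
    SNOWFLAKE.CORTEX.TRANSLATE(overview, 'en', 'es') AS overview_es,
    SNOWFLAKE.CORTEX.TRANSLATE(overview, 'en', 'de') AS overview_de   -- one extra column per additional language, if needed
FROM film_report;
```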

Step 4: Extract from semi-structured data

Now, as we are working in a Transformation pipeline and the outcome for the director’s surname is in JSON format, we can extract the “messages” attribute and then filter on it later in our pipeline. To do this, we can use the Extract Nested Data component. After autofilling the columns, we tick only the messages field, which gives us a new column containing just the value of the “messages” attribute in plain text. Look at this below:
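As a rough SQL sketch of what this extraction does (assuming the completion output follows the documented COMPLETE structure, where the answer sits under choices[0].messages; the table name enriched_films is illustrative):

```sql
-- Illustrative extraction of the "messages" attribute from the completion JSON.
SELECT
    director,
    PARSE_JSON(completion_result):choices[0]:messages::STRING AS messages
FROM enriched_films;
```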

Step 5: Filtering the data

As we mentioned earlier, we only want the films whose directors have German surnames, so we can use the Filter component to keep only the rows where the text “Germany” or “German” appears. Check the filter condition property in the screenshot below:

This simple step will help us to remove all the rows in which the LLM did not identify a surname with a German background.
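Expressed as SQL, the filter condition amounts to something like the following (enriched_films and the messages column are carried over from the illustrative sketches above):

```sql
-- Keep only the rows where the LLM's answer mentions a German background.
SELECT *
FROM enriched_films
WHERE messages ILIKE '%german%';   -- case-insensitive, so it matches both "German" and "Germany"
```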

Step 6: Create a new table in Snowflake

Finally, we just need to write the outcome of our pipeline into a table, and the Rewrite Table component achieves this very easily. We will call our table “german_directed_films”. After running the pipeline, we should have a new table in Snowflake showing the specific outcome we built.
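In SQL terms, the Rewrite Table component behaves much like a CREATE OR REPLACE TABLE … AS SELECT statement; again, enriched_films is an illustrative name for the pipeline’s output:

```sql
-- Illustrative equivalent of the Rewrite Table step: persist the final result set.
CREATE OR REPLACE TABLE german_directed_films AS
SELECT *
FROM enriched_films;
```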

Step 7: The results

From the original reporting table containing 1,000 rows, we now have a smaller table containing only 100 rows. Each row represents a film with a director who has a German surname. Also, we have an extra column with the overview of the film in Spanish. Have a quick look at a sample of the data with the three new columns we created while building our pipeline:

Final thoughts

Snowflake Cortex is a game changer for analytics projects. It can enrich metadata, as we have demonstrated, and it is easy to see the value for more complex use cases, such as incorporating sentiment analysis or performing in-depth document analysis to generate outcomes that can be joined with more traditional business metrics.

Even though DPC is a platform mainly built for data engineering, you can see how easy it is to leverage the new AI features offered by Snowflake with a fully no-code approach. DPC also brings other out-of-the-box Orchestration components that leverage Large Language Models, such as support for vector databases and embeddings, which we plan to cover in future posts.

If you are looking to build data pipelines quickly and combine your analytics project with new AI features, DPC is definitely a platform you should try out. Here at Snap Analytics, we can also support you with your data and AI projects, especially if you are already a Snowflake user!


References and footnotes

Footnote 1:
The IMDb non-commercial datasets can be used free of charge within the restrictions as set out by the non-commercial licensing agreement and within the limitations specified by copyright / license.
Please refer to this page for more information:
IMDb (2024) IMDb Non-Commercial Datasets. Available at: https://developer.imdb.com/non-commercial-datasets/ (Accessed 16 September 2024)

Useful links:
Snowflake (2024) Snowflake Cortex Functions. Available at: https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions (Accessed 16 September 2024)

Matillion (2024) Cortex Completions. Available at: https://docs.matillion.com/data-productivity-cloud/designer/docs/cortex-completions/ (Accessed 16 September 2024)

Matillion (2024) Cortex Translate. Available at: https://docs.matillion.com/data-productivity-cloud/designer/docs/cortex-translate/ (Accessed 16 September 2024)

Be a Data Hero and deliver Net Zero!

The biggest problem in the WORLD!

It is clear that we need radical changes to save our planet. Governments, the private sector and individuals aspire to achieve ‘Net Zero’ – but radically changing the way we operate is not going to be easy.

Achieving this goal is going to be a huge challenge for big, complex organisations. There are so many areas to explore, from reducing travel and fossil fuel consumption to leveraging renewable energy, improving the efficiency of existing equipment, or simply changing behaviour. With so much complexity, the task can be daunting.

Can data save us?…

Starting with data helps you to understand where the quickest and biggest wins are, and therefore what to focus on first. As Peter Drucker once famously said, “You can’t manage what you don’t measure”.

To create a link between desired outcomes and measurable targets you can use a ‘Data Value Map’. Whilst I love technology and data…it’s only useful when it drives actions and creates positive change.  The Data Value Map helps to visualise how data can help you to achieve your goals.  If your goal is Net Zero…it could look something like this:

Data Value Maps can be achieved using a mind mapping or collaboration tool (I like Mindmeister and Miro) and are best done as a highly collaborative team workshop…don’t forget to bring the coffee and cakes!

Now you have a clear view of what data is required to measure and act (your “use cases”) to deliver the Net Zero goal. Next, you can score these in terms of Value and Complexity. Something like a prioritisation matrix can help:

By focusing on the ‘high priority’ and ‘low complexity’ use cases you can deliver quick wins to the business. This will help you to demonstrate you are a true Data Hero and can help your organisation to fly!

Once you have prioritised your use cases, you can start to map out the underpinning systems and processes that are needed to deliver connected, structured data to drive your Net Zero goals. 

Delivering at lightning speed…

There are numerous technologies out there that can help you connect all of this data, but we love Matillion for being able to easily and quickly connect to almost any source and transform and join data to make it useful. As a data platform, Snowflake is fantastic for virtually unlimited storage, blistering speed, data warehousing and data science capabilities. These technologies will certainly enable you to hone your capabilities as a true Data Hero!! There are also many other fantastic cloud solutions that can help you to supercharge your Net Zero data capabilities.

Join the Data League!

Snap Analytics’ team of Data Heroes are helping one of the UK’s largest food manufacturers to leverage data to drive positive change…but if we’re going to solve humanity’s greatest threat…it’s going to take a whole Justice League of Data Heroes.  So join us on this mission to save the planet, and let’s all make sure the decision makers in our organisations have the data they need to drive positive change.  Don’t delay…be a Data Hero today!

We believe that businesses have a responsibility to look after our earth…it’s the only one we have!  We will give any organisation a 15% discount on our standard rates for any work directly linked to making a positive change to the environment!

AI for good – how data is helping to change the world

Artificial intelligence has been with us since the 1950s, but many people’s understanding of it still comes through sci-fi movies or shock newspaper headlines. Many worry that this technology is taking away our ability to think and act for ourselves, invading our privacy and taking our jobs. A recent poll by YouGov even found that 41 percent of the British public saw AI as a threat equivalent to nuclear weapons!

The reality is generally much more low-key. Rather than creating new dystopias, AI has been most successful when applied to small, specific tasks which are either too difficult or too time-consuming for humans to carry out. As the potential of AI becomes clearer, ethical or ‘Responsible’ AI has begun to be embraced by members of the tech community involved in solving some of humanity’s more intractable problems, potentially changing the world for good.

Agriculture

We’ve heard about the future of agriculture before, thanks to the great GM revolution which promised to feed the planet with crops free from blight and disease. But it didn’t take long for GM foods to become as reviled as Victor Frankenstein’s final creation. Now AI is attempting to help farmers produce more food. Sainsbury’s supermarket is testing sensors that can instantly provide data to let farmers know which areas of their farm most need water – perfect in drought-prone countries or inaccessible locations. Meanwhile, large-scale farming is becoming more efficient thanks to AI-powered drones. These drones can scan extremely large areas of farmland, producing a large number of images, and use AI techniques such as image recognition to detect diseased areas quickly and accurately, in a way even the most dedicated farmer could only dream of.

Healthcare

AI’s role in assisting and sometimes replacing doctors is one of the more sensitive areas. In 2017 the UK’s data protection watchdog ruled that the NHS had illegally handed over the data of 1.6 million British patients to Google. The case showed that safeguards are needed whenever personal data is being used. However, when data is accessed responsibly, there is no denying that the results can be impressive. A two-year partnership between Google’s DeepMind and London’s Moorfields Eye Hospital used data from thousands of retinal scans to train AI algorithms to detect signs of eye disease. It worked more quickly and efficiently than any human, cutting the work of a highly trained and expensive specialist from hours to just seconds. The next step is to use the same AI to analyse radiotherapy scans for cancer.

Endangered species

The appearance of AI-driven drones in our skies brings with it fears of inescapable state surveillance. But scientists tracking populations of endangered animals have found a new use for the technology – detecting and tracking species in the most remote locations. Using satellite data together with thermal and infrared imaging, drones are able to spot animals with between 43% and 96% more accuracy than human observers. At the moment, the limited range of drones means that success in tracking wide-ranging species like polar bears is proving harder to achieve.

With the world now facing unprecedented challenges caused by climate change, epidemics and an ageing population, the importance of AI’s role in tackling these problems has never been greater. The battle to convince the public that it is in their best interest, however, is only just beginning.

Why AI is a lot like teenage sex – and how you can get better at it

In 2013 Dan Ariely, professor of behavioural economics at Duke University, got the analytics world all aquiver when he stated: “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.” His risqué comments appeared in the midst of a data revolution, the experience of which was less than satisfactory for many companies.

Six years on and we’re doing it better. However, AI is in danger of being in the same position as Big Data was six years ago. Some data strategists would have you believe that AI and its bedfellow machine learning are better than sex. But as with anything in the data revolution, don’t expect to achieve total fulfilment overnight.

Cost vs kudos

A report from the McKinsey Global Institute predicts that early adopters of this technology could grow the value of their business by 120%. It adds that those who fail to jump on board the AI gravy train could lose a fifth of their cash flow. No surprise that companies are throwing money at the problem – but not necessarily always for the right reasons.

Kudos is not enough of a reward for a business to pump considerable sums of money into an AI project. A survey by the US analytics company Figure Eight showed that the majority of companies were spending at the very least $50,000, rising to over $5 million for those serious about making it a central part of their business.

AI has now been around for decades. Advances in tech plus heavy investment from the likes of Google mean that the cost of leveraging AI tools will continue to fall, levelling the playing field and allowing smaller businesses to utilise AI. If you’ve made it your business to amass a wealth of clean, properly managed data, you are already well positioned to launch an effective AI project.

Let’s not get ahead of ourselves

With any data project, there are things you need to think about before you get started. If you’re looking to get started with AI specifically, first consider whether the problem you would like to address is best served by the technology. Don’t expect AI to act as a sort of panacea; you need to be deploying it in the right way and for the right reasons. If you’re unsure, talk to an expert first (yes, we can help with that) to assess what sort of data analysis would be best for your particular problem.

Maybe there is an area of your business you are certain would benefit from an AI solution – if only you could convince the CFO to invest. If they’re keeping a tight hold of the purse strings, ask yourself: does this align with the greater corporate strategy? If not, you’d probably be better off focusing your efforts somewhere that does.

Finally, when you have identified the right AI project and hired yourself a crack analytics specialist (hello), don’t assume that the thing will just run itself. AI is smart but it still needs help. That means putting together the right team – and not just a couple of people borrowed from the IT department. Successful AI needs buy-in from people who understand the business need and who are working with the numbers on a daily basis.

Get it right and you’ll transform your AI experience from a meaningless one night stand to a satisfying relationship that grows into something really special.

From wayfinding to driverless cars – explaining the analytics maturity curve

Once upon a time when the world was young, people got around by remembering landmarks, looking at the stars and making the occasional lucky guess. For the most part they didn’t have far to travel so taking a wrong turn here or there did not mean getting lost forever. Until recently, the business world was a bit like this too, with people relying on assumptions about their customers and acting on hunches based on past experience.

But now we’re living in a globally connected society and operating in a sophisticated data driven landscape where chances are, if you rely too heavily on your nose and just hope for the best you’re going to get badly lost. Thankfully analytics can help, whether you’re tracking sales or avoiding traffic jams in an unknown neighbourhood.

The process exists on what we call a ‘maturity curve’, a four-part journey which takes us from the most basic statistics to a process driven entirely by AI. Understanding the different stages will give you an idea of how the business of analytics works and will help you plot a course for your business. Gartner’s model helps to visualise the analytics journey:

Descriptive: Say what happened

One day people got sick of walking through the woods, taking a wrong path and stumbling across a sloth of angry bears. After returning to their cabin and counting their remaining limbs they decided to begin to chart those woods and eventually the rest of the world around them.

Diagnostic: Why did it happen?

Without accurate maps, unpleasant bear encounters seemed inevitable. But once people began to join up all their fragments, accurate maps began to appear. People got lost far less and the bears were left to get on with whatever it is that bears do.

So it was in business that people began to make accurate records of their sales, which they used year on year to measure growth and diagnose where their problems were. In data analytics these first two stages are known as descriptive and diagnostic analysis, and they are the bedrock of understanding your business.

Predictive: What’s going to happen?

The paper maps were all well and good, but what if you hit roadworks and need to stray beyond the confines of your usual route? SatNav provided the solution, removing the need for even basic wayfinding skills – it simply tells you where to go.

This is how the third, ‘predictive’, stage on the maturity curve functions. It combines the historical (descriptive) data with current variables that may affect your business, things like weather or an influx of tourists; it then predicts how your business will fare in the months and years ahead.

Prescriptive: What do I need to do?

Now you no longer need to worry about how to get somewhere, and your fancy SatNav can even tell you what time you will arrive. The next stage removes the need to engage in the mechanical process of driving at all, as all that crucial information is accessed by a driverless car that makes all the key decisions for you. Traffic jam forming up ahead? Sit back and relax while it swerves past the accident and takes you the scenic route through the woods (don’t forget to wave to the bears).

The final ‘prescriptive’ stage of the maturity process offers you the ability to hand over more and more business decisions to AI. So, for example, if you sell ice cream, the data will look at the weather forecast and automatically send extra stock to shops in areas where there is a heatwave. And when you reach the top of the maturity curve, the system can be set up to read a huge variety of cues and make automated decisions right across your business.

In analytics – as in life – there are no shortcuts to reaching the top of the curve. It is a long and sometimes difficult journey. But thanks to technology it is becoming increasingly rewarding, if done right.