50/50 Vision: Working Toward a Gender Balanced Workforce 

“Seeing is believing” is an expression used to emphasise that people are more likely to believe something when they see it with their own eyes. It can be applied to many situations, but here I will apply it to the representation of women in STEM.

“I wish I had more awareness of my female predecessors prior to entering college. I stumbled into this field.”

Shannon Loftis, Former VP of Microsoft Xbox Games Studios

We’re fortunate to live in a time when events like Women In Technology, Women in Data and Women of Silicon Roundabout serve as significant platforms for women in the technology sector to share their insights. These events, accessible both online and in person, attract attendees from diverse backgrounds, countries, ethnicities and levels of expertise. I recently attended Big Data London, where I had the opportunity to connect with women from diverse backgrounds, each at a different stage in her tech career. Engaging with so many talented women and participating in diversity and inclusion seminars was an eye-opener: it made me realise that there are far more of us in the field than I had initially imagined. Events like Data Science Festival and Big Data London not only foster a sense of community but also offer students and recent graduates invaluable insights and guidance from experienced professionals. Seeing others who look like them, and who have encountered similar challenges, helps boost their confidence and, in some cases, alleviate impostor syndrome. Drawing inspiration from these stories empowers women to assert themselves and pave the way for future generations.

“I am so proud to see Minecraft: Education Edition engaging both boys and girls and teaching STEM subjects like coding and Chemistry in a wildly different way than they’ve been taught in the past 25 years.”

Deirdre Quarnstrom, VP of Microsoft Education

Notions suggesting that girls are less intelligent than boys or that it’s uncommon for girls to pursue STEM subjects have long persisted. From primary school through to university, some girls have grappled with the notion of being an outlier in their classes. While there has been notable improvement in gender balance, disparities still exist. However, the efforts of organizations like Girls Who Code, aimed at narrowing the gender gap by empowering girls to defy stereotypes, are significant. Guided by values of Bravery, Sisterhood, and Activism, they’ve garnered 14.6 billion engagements globally. Their initiatives, including summer immersion programmes, in-person classes, and clubs, have spurred 580,000 girls, women, and non-binary individuals to embark on their coding journeys, with 50% coming from underrepresented groups. This year, a dedicated group from Snap Analytics took on the Three Peaks Challenge, raising funds to support Hayesfield Girls’ School in upgrading their IT suite. Looking ahead, Snap plans to engage further by hosting sessions with the students, aiming to inspire and encourage them to explore careers in STEM. These sessions will also provide valuable insights into what it’s like to be a woman in tech, empowering the next generation of female leaders in the industry. 

“Young girls are digital natives with the creativity and confidence to use STEM to drive positive change, yet we are failing to keep them engaged and excited about the possibilities.”

Mary Snapp, Vice President of Strategic Initiatives at Microsoft Corporate External & Legal Affairs

In 2019, the UK Department for Education reported a 25% increase in the number of women accepted onto full-time STEM undergraduate courses since 2010, with women constituting 54% of UK STEM postgraduates (Department for Education, 2019). Despite this progress, women continue to face challenges in applying for and securing STEM-related jobs. Research indicates that women occupy only 22% of all tech roles across European companies, and a 2022 analysis by McKinsey projected a tech talent gap of 1.4 million to 3.9 million people by 2027 across 27 EU countries (McKinsey Digital, 2023). While Google achieved its Racial Equity Commitment of increasing leadership representation of Black+, Latinx+ and Native American+ employees by 30% (Google, 2023), there is still a considerable journey ahead before representation in the tech industry is equitable for women and underrepresented groups. Employee retention also goes beyond financial compensation: company culture plays a major role. One of the women at Snap commented, “At Snap I feel like I am making a difference and that I am part of a team. There is not one day where I feel like I don’t have people to go to when I am struggling, but more importantly there are always people to support you and cheer for you when you are succeeding. I am constantly learning by observing the people around me and they inspire me every day.”

“I think we need to mentor young girls and women to help show them what they can achieve with technology – not just what technology is, but what they can create with technology.”

Bonnie Ross, Corporate Vice President at Microsoft, Head of 343 Industries, Halo

Starting from a young age, parents can enrol their children in clubs, similar to those offered by Girls Who Code, to cultivate an early interest in technology. Teachers play a crucial role by intentionally sparking young girls’ interest in subjects like maths, physics, and chemistry. Furthermore, encouraging collaboration between young boys and girls fosters a comfortable environment for future teamwork. As they progress to high school and university, attending events like Women In Technology and Women in Data offers opportunities to connect with peers and seek mentorship from experienced women. Mentorship experiences often inspire recipients to pay it forward, creating a cycle of support for future generations of women. The goal is to empower young women to envision themselves succeeding in the tech industry by interacting with those who are currently in those positions. 


Sources

Choney, S. (2019, March 13). Why do girls lose interest in STEM? New research has some answers – and what we can do about it. Microsoft Stories (microsoft.com).

Girls Who Code. (n.d.). We’re on a mission to close the gender gap in tech. About Us.

Department for Education. (2019). Minister calls to dispel girls’ misconceptions of STEM subjects. GOV.UK (www.gov.uk).

Blumberg, S., Krawina, M., Makela, E., & Soller, H. (2023, January 24). Women in tech: The best bet to solve Europe’s talent shortage. McKinsey Digital.

Google. (2023). Strengthening our culture of respect for all. Google Diversity Annual Report (about.google).

Mind the Gap: 4 key actions data engineers can take to help bridge the digital divide

Digital exclusion is a pressing concern. According to the UK Government’s report on the data skills gap, between 2019 and 2022 approximately 46% of businesses struggled to recruit for roles that required basic data skills. Moreover, about 25% of businesses reported a lack of data skills in machine learning, 22% in programming, 23% in knowledge of emerging technologies and solutions, and 22% in advanced statistics within their sectors. It is estimated that by 2030 the UK will face its largest skills gap in basic digital abilities. AI has surged in popularity, but without targeted action its growing use will widen the divide between marginalized communities and those who are digitally connected. While regulatory bodies will lead most of the targeted actions, data engineers can also contribute significantly by making small changes that help ensure everyone has access to the benefits of AI. In this article we will look at what ‘digital exclusion’ means, and how simple changes in data engineering practices can make a difference.

The integration of AI has emerged as a game-changer, enabling businesses to personalize strategies, optimize processes and enhance customer experiences. AI-driven analytics has revolutionized how companies connect with their target audience. However, concerns remain about digital exclusion, which can present itself as the digital divide or as algorithmic bias. As data engineers, it’s essential to recognize these challenges and proactively address the risks, ensuring that AI’s transformative potential benefits all users equitably. Later, I’ll present 4 actions data engineers can take to mitigate the impact of algorithmic bias and help bridge this digital divide.

Digital divide

The digital divide describes the gap between people who have easy access to computers, phones or the internet and those who do not. Access barriers therefore play a major role in widening this gap. Access barriers are obstacles that prevent people from using or benefiting from technology, including high costs, lack of infrastructure, limited digital literacy, and restrictive policies that limit access to devices, the internet and digital services. In 2023, the House of Lords Communications and Digital Committee highlighted that digital exclusion remains a critical issue, with basic digital skills projected to be the UK’s most significant skills gap by 2030. The committee noted that the cost-of-living crisis has worsened the situation, making it even harder for people to afford internet access (Tudor, 2024).

Algorithmic biases

Algorithmic bias refers to discriminatory treatment that stems from biases embedded within algorithms, resulting in advantages or disadvantages for certain groups of people. This bias appears along various dimensions, such as race, gender, ethnicity, age or socioeconomic status. Furthermore, algorithmic biases can make unfair situations worse by leaving out some groups or reinforcing stereotypes as a result of skewed user demographics, leading to inaccurate consumer profiling and discriminatory targeting.

What you can do

Navigating these challenges requires proactive measures to mitigate biases. Data engineers can carefully scrutinize AI algorithms and implement transparent data practices. These include employing bias detection and mitigation algorithms, ensuring diverse and inclusive data collection and model development processes, and enhancing transparency and accountability in AI development and deployment. Scoring datasets is one method that can be used to achieve this. When scoring datasets on diversity properties, the goal is to assess how diverse the data is in terms of representation across different demographic groups or attributes. The 4 key actions for scoring these datasets are listed below, followed by a minimal code sketch:

  1. Defining diversity metrics – Identify the key diversity dimensions or attributes relevant to your use case.
  2. Quantifying diversity – This could involve calculating representation percentages for each group.
  3. Setting thresholds or benchmarks – Base these on organisational goals, industry standards, or regulatory requirements.
  4. Scoring diversity – For example, a dataset with balanced representation across different demographic groups would receive a higher diversity score.
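The sketch below (Python with pandas) walks through steps 1 to 4 for a single diversity dimension. The column name, the 50/50 benchmark and the scoring formula are illustrative assumptions rather than a prescribed standard.

import pandas as pd

# Example dataset with a single diversity attribute (illustrative only)
df = pd.DataFrame({"gender": ["F", "M", "M", "F", "M", "F", "M", "M"]})

# 1. Define the diversity metric: representation by gender
# 2. Quantify diversity: the share of each group in the dataset
representation = df["gender"].value_counts(normalize=True)

# 3. Set a threshold or benchmark, e.g. a 50/50 target for each group
benchmark = pd.Series({"F": 0.5, "M": 0.5})

# 4. Score diversity: 1.0 means the dataset matches the benchmark exactly;
#    lower values indicate greater deviation from balanced representation
deviation = (representation.reindex(benchmark.index, fill_value=0) - benchmark).abs().sum()
diversity_score = 1 - deviation / 2  # halve the sum because each deviation is counted twice

print(representation.to_dict(), round(diversity_score, 2))

The same pattern extends to other dimensions (age bands, ethnicity, socioeconomic indicators) by repeating the calculation per attribute and combining the scores.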

Alternatively, data engineers can conduct a representation analysis paired with a fairness analysis to assess whether different demographic groups are represented equally in both the data and the outcomes produced by the algorithm. Initially, a baseline comparison of the data against the preferred demographics can be conducted. Following this, fairness metrics such as demographic parity, equal opportunity and disparate impact can be used to evaluate how the algorithm treats different groups. From the results, appropriate adjustments can be made to ensure greater representation.
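As a minimal illustration of the fairness side, the sketch below (again pandas, with illustrative column names and data) computes per-group selection rates, the demographic parity gap and the disparate impact ratio for a binary outcome. A ratio below 0.8 is often treated as a flag for further investigation.

import pandas as pd

# Illustrative outcomes: one row per person, with a binary "selected" decision
outcomes = pd.DataFrame({
    "gender":   ["F", "F", "F", "F", "M", "M", "M", "M", "M", "M"],
    "selected": [1,   0,   0,   1,   1,   1,   0,   1,   1,   0],
})

selection_rates = outcomes.groupby("gender")["selected"].mean()
parity_gap = selection_rates.max() - selection_rates.min()        # demographic parity difference
disparate_impact = selection_rates.min() / selection_rates.max()  # ratio of lowest to highest rate

print(selection_rates.to_dict())
print(f"Parity gap: {parity_gap:.2f}, disparate impact ratio: {disparate_impact:.2f}")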

Snap Analytics has progressed from a start-up to a scale-up. While diversity is a priority, formal measurement of diversity has only recently been implemented. By leveraging HR platforms and applicant tracking systems, valuable insights are being gathered. Snap’s approach includes 2 of the 4 key steps: (1) Defining diversity metrics and (3) Setting thresholds or benchmarks. Gender has been identified as the key diversity dimension, with the organisation striving towards a 50/50 gender balance. However, as the company grows, they plan to expand the range of diversity metrics. Currently, diversity is measured through the following methods:

  • Diversity of candidates applying for roles at Snap.
  • Diversity within the organisation, across the different levels.
  • Job satisfaction.
  • Employee retention.
  • Employee engagement.
  • Exit interviews, conducted when someone leaves, with a follow-up survey focusing on inclusivity, culture and diversity.

Businesses must prioritize diverse and representative datasets to mitigate inherent biases and provide users with the best experience possible. Additional ways to mitigate digital exclusion include implementing rigorous testing and validation procedures to identify and rectify any biases present in AI algorithms. Training team members in ethical awareness, and monitoring that it is applied, is also crucial to ensure responsible deployment of AI technologies. Furthermore, ongoing monitoring and adjustment of AI systems are essential to address emerging biases and uphold ethical standards.

Policymakers have recently introduced the EU AI Act, which outlines regulations to ensure ethical AI usage, protect consumer privacy and promote transparency. However, the gap between the well connected and the poorly connected will not close if we leave it to government legislation alone. Socially responsible enterprises must develop and demonstrate plans to reach marginalized communities, using algorithms and datasets that avoid favouring majority groups. Data engineers can take the initiative by employing diversity metrics, or representation analysis paired with fairness analysis, to identify unequal outcomes across different groups.


Sources

Tudor, S. (2024, January 30). Digital exclusion in the UK: Communications and Digital Committee report. House of Lords Library, UK Parliament (parliament.uk).

GOV.UK. (2021, May 18). Quantifying the UK Data Skills Gap – Full report. GOV.UK (www.gov.uk).

SAP licence constraints – explainer

Ever wondered how you can get data out of SAP without violating the license agreement? You’re not alone. Most organisations planning to move SAP data up to a Cloud Data Platform are struggling with that very question. Here is a little explainer which hopefully helps you understand what you can or cannot do. But first:

Disclaimer
The terms and conditions of your contract with SAP are agreed between your company and SAP. I don’t know the specifics of your contract, nor am I able to provide legal advice. This article is based on my observation and interpretation. When you are planning to take data out of SAP, I recommend you consult with your SAP account manager and your legal team to ensure you comply with the terms and conditions of the contract.

Enterprise license vs Runtime license

You might have heard that for certain extraction methods an ‘enterprise license’ is required. This is to do with how the database on which the SAP system runs is licensed. When you run an SAP system, you have to install a database system first. SAP ERP can run on a variety of database systems (Oracle, IBM Db2, MS SQL Server and so on), as well as on SAP HANA (the database). The SAP S/4HANA ERP system only runs on SAP HANA. The license restriction applies regardless of which database system you run the SAP ERP application on.

Runtime license

You can purchase a runtime license for the database with your SAP ERP license. The runtime license allows you to run SAP ERP, but nothing else. The SAP application is the only direct user of the underlying database; all other users and usage are managed through the SAP application. Having a runtime license means you cannot create tables directly in the database, but you can create tables in the SAP application (which in turn results in a table being created in the database, with the SAP system as owner). The license agreement does not let you create stored procedures or extraction programs directly in the database. You are also not allowed to read the database directly, extract data from it directly, or extract data from the database log tables, without going through the SAP application.

Enterprise license

An enterprise license gives you unlimited rights on the database. You can create your own tables and applications on the database as well as running the SAP application. In this case, you are allowed to extract data from tables directly, either by using a 3rd party application which connects to the database or by creating your own extraction processes. An enterprise license will be significantly more expensive than a runtime license. If your company does not have an enterprise license and you want to take data out of SAP, you need to find a way to go through the application layer, instead of the database layer.

Using standard SAP interfaces

SAP has APIs and OData services for getting data out of SAP. Most of these are designed for operational purposes, for example: give me the address of customer abc, or: update the price of material abc. These are not really suitable for data extraction. The exception is the function modules related to the ODP framework: they can be consumed through OData, and this is still allowed by SAP. You can find more information on using OData for data extraction through ODP here.
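For illustration, an ODP OData service can be consumed with any HTTP client. The sketch below uses Python’s requests library; the gateway host, service name, entity set and credentials are placeholders that depend on the OData service you expose for your ODP context and provider.

import requests

# Placeholder values: replace with your gateway host, ODP OData service and entity set
BASE_URL = "https://sap-gateway.example.com/sap/opu/odata/sap/<YOUR_ODP_SERVICE>"
ENTITY_SET = "<YOUR_ENTITY_SET>"

response = requests.get(
    f"{BASE_URL}/{ENTITY_SET}",
    params={"$format": "json", "$top": 100},  # standard OData query options
    auth=("EXTRACT_USER", "********"),        # basic auth shown for simplicity
    timeout=60,
)
response.raise_for_status()

for record in response.json()["d"]["results"]:  # OData V2 JSON payload layout
    print(record)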

Note that it is not permitted to use the ODP function modules through an RFC connection. Please refer to this blog for more info on that.

There is a standard function module which can be used through RFC: RFC_READ_TABLE or, even better, one of its successors (RFC_READ_TABLE can’t handle wide tables). Which versions are available depends on your system version, so it is best to search for them on the SAP system itself. I have the fewest problems with /BODS/RFC_READ_TABLE2. I wouldn’t recommend building a data warehouse solution on this extraction method, not least because I am pretty sure SAP has specified somewhere that these function modules are meant for internal use and might be changed at any time. I wouldn’t be surprised if SAP announces it will forbid the use of these function modules in a similar fashion to the ODP function modules.
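For completeness, this is roughly what a call to RFC_READ_TABLE looks like from Python using the pyrfc library. The connection details, table and filter are placeholders and, as noted above, I would not build a data warehouse on this pattern. My understanding is that the successor modules such as /BODS/RFC_READ_TABLE2 are called in a similar way.

from pyrfc import Connection

# Placeholder connection parameters for the SAP application server
conn = Connection(
    ashost="sap-host.example.com",
    sysnr="00",
    client="100",
    user="EXTRACT_USER",
    passwd="********",
)

result = conn.call(
    "RFC_READ_TABLE",
    QUERY_TABLE="T001",                  # company codes, a small demo table
    DELIMITER="|",
    FIELDS=[{"FIELDNAME": "BUKRS"}, {"FIELDNAME": "BUTXT"}],
    OPTIONS=[{"TEXT": "LAND1 = 'GB'"}],  # WHERE clause, supplied in 72-character chunks
    ROWCOUNT=10,
)

for row in result["DATA"]:
    print(row["WA"].split("|"))

conn.close()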

Third party applications

Third party applications can either use the APIs (Function Modules) mentioned above or create their own application logic to get data out of SAP. If they are using the standard function modules then the same restrictions apply. This means ODP extraction through RFC is not allowed – even if this process is managed by a 3rd party application.

Applications which implement their own interfaces on the SAP system are ‘safe’ – at least for the time being. The small downside of this approach is that you need to implement a piece of code (delivered by the application vendor) in each SAP system you want to connect to. The upside is that the end-to-end process is more robust, better performing and easier to maintain than solutions built on the SAP standard APIs.

Be mindful of third party applications which read the database log or otherwise connect directly to the database layer: you will need an enterprise license for this, and using a 3rd party application does not make a difference from a licensing perspective.

SAP Datasphere

And then there is SAP itself. SAP Datasphere is perfectly capable of getting data out of SAP and onto the cloud data platform of your choice. If this were the only use case you had for SAP Datasphere, I would imagine it is a very pricey solution. Still, I wanted to make sure I cover all the options.

Great expectations – SAP’s announcements for DataSphere at Sapphire

The data & analytics community is anticipating some big announcements about SAP DataSphere at Sapphire (to be held in Orlando, 3-5 June 2024). This could be the ‘coming of age’ year for SAP DataSphere, as it has shaken off some of its teething problems and is starting to become a truly enterprise-grade data platform as a service. I don’t think SAP DataSphere will become as flexible and open to 3rd party applications as the main competitors (Snowflake, Azure, AWS, Google Cloud), but it could still support a wide variety of use cases and I can easily see it doing the job just fine for many customers. Surprisingly, SAP DataSphere seems to be becoming the easiest way to get data out of SAP and onto the cloud platform of your choice, so even if you are running your data platform outside of SAP, you might still consider SAP DataSphere as part of your landscape.

Whatever your opinion is of the current state of SAP DataSphere, it will be interesting to see which of the features SAP promised during previous announcements are now becoming ‘generally available’ and what other carrots SAP will dangle in front of us. In this article I will go through a hand-picked selection of features, existing or announced, and the improvements I hope to see released in the not-too-distant future.

What is SAP DataSphere?

If or when SAP DataSphere delivers on all its promises, it will be the most complete ‘all in one’ data platform money can buy. Features include business and data modelling, ETL and real-time replication capabilities, cloud data platform administration, data protection and data governance and, if used in conjunction with SAP Analytics Cloud (SAC), analytics and planning capabilities as well as a range of advanced analytics features (predictive, scenario planning, natural language processing). Many features across the platform benefit from AI integration, speeding up development time and giving business users a better experience.

All the features above are already available in SAP DataSphere, but they vary in maturity from ‘embryonic’ to ‘enterprise grade’. Herein lies the challenge for customers: they first need to define which features are business critical to them, and then find out whether SAP is serious about each feature or is just trialling something new which it might drop again in a future release.

Killer feature #1: Deeply integrated planning capabilities 

SAC is still the only BI application I know of with integrated planning capabilities. This astonishes me, as it is extremely useful; I had expected the competition to have caught up by now. Many enterprises still run separate BI and planning applications, with tedious processes to keep datasets in sync between the different applications (read: Excel). I’m not a planning expert, but my understanding is that SAC planning is now widely adopted, well received and feature rich. The new features SAP announced last March for simulation (SAC Compass) will no doubt please the planning and predictive specialists. I hope SAP has managed to blur the lines between SAC and SAP DataSphere so users have a truly seamless experience, making it easy to pull any dataset from DataSphere into SAC and to write planning versions, generated predictive outputs and simulations back to DataSphere.

Killer feature #2: Change Data Capture (CDC) and Kafka data streaming

Business users have finally lost patience with the data engineering world and demand data on the data platform in real time, regardless of whether there is a business case that actually requires it. Real-time data integration at scale is still costly and complex, and when cost estimates are shared these demands are often dropped.

SAP DataSphere does support data streaming with Kafka. Kafka is an open standard and widely used for data streaming so it is great to see SAP embracing an open standard instead of trying to push its own standards.  

CDC goes hand in hand with data streaming. They are two different concepts, sometimes confused, and each is a testament to clever engineering, but the magic happens when you combine the two: it allows you to keep very large and complex data systems in sync in real time. SAP DataSphere provides some great features to capture data changes from any table in S/4HANA – even those which don’t have a timestamp or a sequenced index.
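As a generic (not SAP-specific) illustration of the consuming side, change events landing on a Kafka topic can be read with a few lines of Python using the confluent_kafka client. The broker address, topic name and JSON payload shape below are assumptions.

import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",  # assumed broker address
    "group.id": "cdc-sync-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["sales_order_changes"])  # assumed CDC topic

try:
    while True:
        msg = consumer.poll(1.0)             # wait up to one second for the next event
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        change = json.loads(msg.value())     # e.g. {"op": "U", "key": ..., "after": {...}}
        print(change["op"], change.get("key"))
finally:
    consumer.close()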

I have not yet been able to try out CDC and Kafka integration in anger. I do know that SAP is using database triggers to underpin CDC, which raises some concerns for me: change log replication is much more efficient at handling very large volumes of changes. Having said that, for many use cases CDC based on database triggers will work just fine. I do hope, though, that SAP will put change log replication on the roadmap.

Killer feature #3: AI Co-pilot – although I am slightly confused

SAP offers a Natural Query Language (NQL) interface in SAC under the name ‘Just Ask’. It is also rolling out its AI Co-pilot ‘Joule’. This Co-pilot is generally available in SuccessFactors, recently also in DataSphere and SAC, and will soon be rolled out across more SAP products. Why do we need two? My understanding is that Joule does a lot more than ‘just’ NQL, so maybe Just Ask will be phased out once Joule has matured?

Apparently, Joule can utilise knowledge graphs created or generated in SAP DataSphere. Knowledge graphs provide better context for AI, so they improve the answers and suggestions (or so I’ve been told). The knowledge graphs look cool in the preview demos. I don’t need a crystal ball to predict that AI and Co-pilot features will be amongst the biggest announcements this Sapphire. I hope SAP comes up with a consistent approach to AI instead of having multiple disconnected point solutions throughout the data platform features.

Keeping track of the SAP roadmap  

For those lucky enough to join the live event in Orlando, it will be a fantastic opportunity to experience some of the DataSphere innovations firsthand. Like many others, I will have to watch from the sidelines and join the virtual Sapphire experience. There is an overwhelming number of sessions on SAC and DataSphere on offer (both virtually and in person), so I hope you can enjoy at least some of them. Once the event gets under way, I’m sure many more blog posts will follow to share the latest and greatest. I am interested to see what other people have on their wishlist for SAC and SAP DataSphere, so please leave a comment. Once the event is over, we can look back and see whose wishes have been fulfilled!

SAP’s cynical move to keep control of your enterprise data (aka note 3255746)

SAP has rocked the boat. They have issued an SAP note (3255746) declaring a popular method for moving data from SAP into analytics platforms out of bounds for customers. Customers and software vendors are concerned: they have to ensure they operate within the terms and conditions of their license agreement with SAP, and it seems unfair that SAP unilaterally changes these Ts and Cs after organisations have purchased the product. I will refrain from giving legal advice, but my understanding is that SAP notes are not legally binding. I imagine the legal teams will have a field day trying to work this all out. In this article I will explain the context and consequences of this SAP note. I will also get my crystal ball out and try to predict SAP’s next move, as well as giving you some further considerations which may help you decide how to move forward.

What exactly have SAP done this time?

SAP first published note 3255746 in 2022. In the note, SAP explained that 3rd parties (customers, product vendors) could use the SAP APIs for the Operational Data Provisioning (ODP) framework, but that these APIs were not supported: they were meant for internal use, and SAP reserved the right to change their behaviour or remove them altogether. Recently, SAP updated the note (version 4) and, out of the blue, declared that it is no longer permitted to use the ODP APIs. For good measure, SAP threatens to restrict and audit unpermitted use of this feature. With a history of court cases decided in SAP’s favour over license breaches, it is no wonder that customers and software vendors are getting a bit nervous. So, let’s look at the wider context. What is this ODP framework, and what does this actually mean for customers and product vendors?

SAP ODP – making the job of getting data out of SAP somewhat less painful

Getting data out of SAP is never easy, but ODP offered very useful features to take away some of the burden. It enabled external data consumers to subscribe to datasets. Instead of receiving difficult-to-decipher raw data, these datasets contain data which has already been modelled for analytical consumption. Moreover, the ODP framework supports ‘delta enabled’ datasets, which significantly reduces the volume of data to refresh on a day-to-day basis. When the ODP framework was released (around 2011(1)), 3rd party data integration platforms were quick to provide a designated SAP ODP connector. Vendors like Informatica, Talend, Theobald and Qlik have had an ODP connector for many years; more recently, Azure Data Factory and Matillion released theirs as well. SAP also offered a connection to the ODP framework through the open data protocol OData. This means you can easily build your own interface if the platform of your choice does not have an ODP plug-in.

One can imagine that software vendors are not best pleased with SAP’s decision to no longer permit the use of the ODP framework by 3rd parties. Although all the platforms mentioned above have other types of SAP connectors(2), the ODP connector has been the go-to solution for many years. The fact that this solution was not officially supported by SAP has never really scared the software vendors: ODP was and remains deeply integrated in SAP’s own technology stack, and the chances that SAP will change the architecture in current product versions are next to zero.

Predicting SAP’s next move

You might wonder why SAP is doing this. Well, in recent years customers have voted with their feet and moved SAP data to more modern, flexible and open data & analytics platforms. There is no lack of competition: AWS, Google, Microsoft, Snowflake and a handful of other contenders all offer cost-effective data platforms with limitless scalability. On these data platforms you are free to use the data and analytics tools of your choice, or take the data out to wherever you please without additional costs. SAP also has a data & analytics platform, but it is well behind the curve. There are two SAP products to consider: SAP Analytics Cloud (SAC) and SAP DataSphere.
The first is a planning and analytics toolset for business users and was introduced in 2015. For a long time it was largely ignored, but in recent years it has come to maturity and should now be considered a serious contender to Power BI, Tableau, Qlik and so on. I’m not going to do a full-blown comparison here, but the fact that SAC has integrated planning capabilities is a killer feature.
SAP DataSphere is a different story. It is relatively new (introduced as SAP Data Warehouse Cloud in 2020), and seasoned SAP professionals know what to do with new products: if you’re curious, you can do a PoC or innovation project; if not, or if you don’t have the time or means for this kind of experimenting, you just sit and wait until the problems are flushed out. SAP DataSphere is likely to suffer from teething problems for a while longer, and it will take time before it is as feature-rich as the main competitor data platforms. One of the critical features missing until very recently was the ability to offload data to cloud storage (S3/Blob/buckets, depending on your cloud provider). That feature was added in February 2024 – around the same time SAP decided that 3rd parties could no longer use the ODP interface to achieve exactly the same. Coincidence?

So where is SAP going with this? Clearly they want all their customers to embrace SAP DataSphere. SAP charges for storage and compute, so of course they try to contain as many workloads and as much data as they can on their platform. This is no different from the other platform providers. What is different is that SAP deliberately puts up barriers to taking the data out, where other providers let you take your data wherever you want. SAP’s competitors know they offer a great service at a very competitive price. It seems SAP doesn’t want to compete on price or service, but chooses to put up a legal barrier to keep the customer’s data on its platform.

SAP Certification for 3rd party ETL tools no longer available

Blocking the use of ODP by 3rd party applications is only the beginning. SAP has already announced it will no longer certify 3rd party ETL tools for the SAP platform(3). The out-and-out SAP specialists have invested heavily in creating bolt-on features on the SAP platform to replicate large SAP datasets efficiently, often in near real time. The likes of Fivetran, SNP Glue and Theobald have all introduced their own innovative (proprietary) code purely for this function. SAP used to certify this code, but has now stopped doing so. Again, the legal position is unclear and perhaps SAP will do a complete U-turn on this, but for now it leaves these vendors wondering what the future holds for their SAP data integration products.

What do you need to do if you use ODP now through a 3rd party application?

My advice is to start by involving your legal team. In my opinion an SAP note is not legally binding in the way terms and conditions are, but I appreciate that my opinion on legal matters doesn’t count for much.
If you are planning to stay on your current product version for the foreseeable future and you have no contract negotiations with SAP coming up, then you can carry on as normal. If you are planning to move to a new product version, though, or if your contract with SAP is up for renewal, it would be good to familiarise yourself with the alternatives.

As I mentioned before, most 3rd party products have multiple ways of connecting to SAP, so it would be good to understand the impact if you had to start using a different method.
It also makes sense to stay up to date with the SAP DataSphere roadmap. When I put my rose-tinted glasses on, I can see a future where SAP provides an easy way to replicate SAP data to the cloud storage of your choice, in near real time, in a cost-effective way. Most customers wouldn’t mind paying a reasonable price for this. I expect SAP and its customers might have very different expectations of what that reasonable price is, but until the solution is there, there is no point speculating. If you are looking for some inspiration to find the best way forward, come and talk to Snap Analytics. Getting data out of SAP is our core business and I am sure we can help you find a futureproof, cost-effective approach.


Footnotes and references

(1) – The ODP framework version 1.0 was released around 2011, specifically with SAP NetWeaver 7.0 SPS 24, 7.01 SPS 09 and 7.02 SPS 08. The current version of ODP is 2.0, released in 2014 with SAP NetWeaver 7.3 SPS 08, 7.31 SPS 05 and 7.4 SPS 02. See notes 1521883 and 1931427 respectively.

(2) – Other types of SAP connections: One of my previous blog posts discusses the various ways of getting data out of SAP in some detail: Need to get data out of SAP and into your cloud data platform? Here are your options

(3) – Further restrictions for partners on providing solutions to get data out of SAP, see this article: Guidance for Partners on certifying their data integration offerings with SAP Solutions

How to Automatically Shut Down an Azure Matillion Instance After a Schedule Finishes

This blog follows on from the How to Automatically Shut Down an AWS Matillion Instance After a Schedule Finishes blog, but provides the steps relevant for Azure rather than AWS.
I would strongly recommend reading the introduction and The “Death Loop” Issue sections in that blog before proceeding with the steps below. Fortunately, configuring this for Azure is simpler than for AWS, because Azure gives instances managed identities by default, whereas AWS requires the instance to be granted a custom role with a policy that allows it to turn itself off.

Step 1: Installing the Azure CLI

The Azure CLI is a powerful tool for interacting with the Azure Cloud Platform in various ways. Here, we will use a simple CLI command to deallocate an Azure VM.  To begin, you will need to install the Azure CLI on the Matillion VM, which can be done by following this installation guide by Microsoft.

Step 2: Creating a Deallocate Bash Script

SSH into the VM and create a file containing the script below in the following directory, ensuring that the centos user owns the file:

 /home/custom_scripts/deallocate_server

# Give the Matillion schedule time to complete safely before deallocation
sleep 30
# Authenticate with the Azure CLI using the VM's managed identity
az login --identity
# Deallocate this VM (replace the placeholders with your resource group and VM name)
az vm deallocate --resource-group <MY_RESOURCE_GROUP> --name <MY_VM_NAME>

The first command sleeps for 30 seconds to ensure that the Matillion schedule has enough time to complete safely before the VM is deallocated. The second command authenticates with the Azure CLI with the VM’s managed identity. The final command executes the VM deallocation using the Azure CLI.

A couple of things to note:

  • If you have a separate production Matillion instance, the above steps will need to be redone on that instance, and the new resource group and VM name will need to be used in the deallocate_server script.
  • The VM’s Enterprise Application in Azure will need at least the ‘Desktop Virtualization Power On Off Contributor’ role on the VM. Usually, the VM will already have sufficient privileges for this.

Step 3: Implementing in Matillion

From here, we will use a Bash Script component to execute the deallocate_server script created above. A wrapper job will be needed around your main pipeline, with a Bash Script component attached to the end of the pipeline (this wrapper job will be the one run by your Matillion schedule). Important: the flow from the main pipeline (in this case e2e_nightly) will need to be unconditional (grey) so that the server is turned off regardless of whether the pipeline was successful. Otherwise, if the Bash Script is only set to execute when the main pipeline is successful, your VM will stay on in the event of a pipeline failure (unless you have perfect pipelines…).

Within the Bash Script, place the below command which will execute the deallocate_server script that we created on the VM in step 2.

sh /home/custom_scripts/deallocate_server >/tmp/deallocate_server.log &

Crucially, the ampersand symbol (&) at the end of the command enables the command to be executed without waiting for the script to finish. This allows the Bash Script component to immediately flag as completed in the eyes of the Matillion task scheduler, and therefore the schedule will be marked as complete. This avoids the aforementioned “death loop” as there is no dependency on the deallocation commands completing before the Matillion schedule can finish. Additionally, the script exports the output of the deallocate command to a log file for auditing purposes.

Final thoughts

The solution proposed in this blog uses the Azure CLI to deallocate your Matillion VM by simply running a Bash Script component. It should be noted that there are a number of alternative ways to achieve this, such as using message queues to trigger a cloud function to shut down the VM, which is equally valid.

Once you have this deallocation functionality configured, you can rest assured that your Matillion VM will dynamically shut down once your schedule completes. Please feel free to reach out to me on LinkedIn or drop a comment on this blog if you have any further questions.

How to Automatically Shut Down an AWS Matillion Instance After a Schedule Finishes

Matillion customers, in their effort to optimise credit consumption, are eager to reduce unnecessary costs by minimising the uptime of their instances. One particularly tricky aspect of this optimisation is managing instance shutdown after a routine schedule has completed, be it a successful or failed run. Unfortunately, Matillion doesn’t offer a built-in feature to automatically switch off instances as part of a pipeline. Furthermore, the execution duration of these schedules can vary due to factors like data volumes and the day of the week, making a fixed-time shutdown impractical. Consequently, a flexible alternative solution is required. The configuration process for enabling this functionality differs slightly between AWS and Azure.

This blog will cover the steps for AWS; the steps for Azure can be found here.

The “Death Loop” issue discussed below is relevant to any instance: AWS, Azure or other.

The “Death Loop” Issue

Before delving into the steps for enabling this functionality, it is crucial to address an issue concerning VM deallocation during a running job. Consider this scenario: your nightly schedule is running, all jobs complete (regardless of success or failure), and you want the last component in your pipeline to deallocate the VM (we’ll cover how to create a deallocate component in the following sections). Matillion expects the deallocation component to return a success or failure response, like any other component, before it can mark the running task as complete. But the deallocation component will never be seen to complete by the Matillion task manager, because the server deallocates at that instant. Consequently, when the VM is switched back on, the task scheduler detects that the job didn’t fully complete and automatically resumes it from where it left off, which was at the “Deallocate Server” component. As a result, the VM enters what I like to call a “death loop”, where it repeatedly switches itself off every time it’s turned on. Breaking this loop is challenging, but the approach below avoids the problem by decoupling the deallocation from the scheduled job: the deallocation runs as a background script on the VM, rather than as a command Matillion has to wait for inside the Bash Script component. Below are the steps to achieve this.

Step 1: Assigning a Role to the Instance

Firstly, an AWS role with the ability to turn off the instance needs to be created and assigned to the Matillion EC2 instance. The console steps are listed below, with a scripted alternative sketched after the list.

  1. Create a policy in AWS via IAM by selecting ‘Create policy’ in the Policies page.
  2. Select ‘EC2’ as the Service.
  3. Search for and select the ‘StopInstances’ Action.
  4. We will want to restrict this to only work for the specific Matillion instance so select ‘Add ARNs’. In the pop-up choose the appropriate account radio box and enter the resource’s region and ID.
  5. Feel free to add request conditions such as the requester’s IP address being the Matillion IP. Click ‘Next’.
  6. Provide a Policy name, then create the policy.
  7. Next, we need to create a role to assign the policy to. Select ‘Create role’ in the Roles page.
  8. Select the ‘AWS service’ Trusted entity type and ‘EC2’ as the Use case. Click ‘Next’.
  9. Search for and select the Policy created in the previous steps. In my case, this is ‘EC2StopInstancePolicy’. Click ‘Next’.
  10. Provide a Role name, then create the role.
  11. Lastly, we need to assign the newly created role to the Matillion EC2 instance. Head to the EC2 Dashboard, and then to the Instances page.
  12. Select the Matillion instance, in the top right click ‘Actions’ > ‘Security’ > ‘Modify IAM role’.
  13. Select the Role created in the previous steps and click ‘Update IAM role’.
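If you prefer to script the policy creation rather than clicking through the console, here is a minimal boto3 sketch under the same assumptions. The policy name, region, account ID and instance ID are placeholders.

import json
import boto3

iam = boto3.client("iam")

# Illustrative policy document restricting StopInstances to a single instance ARN
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "ec2:StopInstances",
        "Resource": "arn:aws:ec2:eu-west-1:123456789012:instance/i-0abc123def4567890",
    }],
}

response = iam.create_policy(
    PolicyName="EC2StopInstancePolicy",
    PolicyDocument=json.dumps(policy_document),
)
print(response["Policy"]["Arn"])  # attach this policy to the role assigned to the instance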

Step 2: Installing the AWS CLI

The AWS CLI is a powerful tool for interacting with the AWS Cloud Platform in various ways. Here, we will use a simple CLI command to deallocate an EC2 instance. You will need to install the AWS CLI on the Matillion VM, which can be done by following this installation guide.

Step 3: Creating a Deallocate Bash Script

SSH into the VM and create a file containing the script below in the following directory, ensuring that the centos user owns the file:

/home/custom_scripts/deallocate_server

# Give the Matillion schedule time to complete safely before the instance stops
sleep 30
# Stop this EC2 instance using the AWS CLI (replace the placeholder with your instance ID)
aws ec2 stop-instances --instance-ids <Your Instance ID>

The first command sleeps for 30 seconds to ensure that the Matillion schedule has enough time to complete safely before the VM is deallocated. The second command executes the VM deallocation using the AWS CLI. It is worth mentioning that if you have a separate production Matillion instance in a different AWS account, the above steps will need to be redone in that account, and the new instance ID will need to be used in the deallocate_server script.

Step 4: Implementing in Matillion

From here, we will use a Bash Script component to execute the deallocate_server script created above. A wrapper job will be needed around your main pipeline, with a Bash Script component attached to the end of the pipeline (this wrapper job will be the one run by your Matillion schedule). Important: the flow from the main pipeline (in this case e2e_nightly) will need to be unconditional (grey) so that the server is turned off regardless of whether the pipeline was successful. Otherwise, if the Bash Script is only set to execute when the main pipeline is successful, your VM will stay on in the event of a pipeline failure (unless you have perfect pipelines…).

Within the Bash Script, place the below command which will execute the deallocate_server script that we created on the VM in step 3.

sh /home/custom_scripts/deallocate_server >/tmp/deallocate_server.log &

Crucially, the ampersand symbol (&) at the end of the command enables the command to be executed without waiting for the script to finish. This allows the Bash Script component to immediately flag as completed in the eyes of the Matillion task scheduler, and therefore the schedule will be marked as complete. This avoids the aforementioned “death loop” as there is no dependency on the deallocation commands completing before the Matillion schedule can finish. Additionally, the script exports the output of the deallocate command to a log file for auditing purposes.

Final thoughts

The solution proposed in this blog uses the AWS CLI to deallocate your Matillion VM by simply running a Bash Script component. It should be noted that there are a number of alternative ways to achieve this, such as using message queues to trigger a cloud function to shut down the VM, which is equally valid.

Once you have this deallocation functionality configured, you can rest assured that your Matillion VM will dynamically shut down once your schedule completes. Please feel free to reach out to me on LinkedIn or drop a comment on this blog if you have any further questions.

5 main challenges getting data out of SAP and how to overcome them

One of the most common questions I get from clients is: “Why is getting data out of SAP so hard? Isn’t it just another source system?” After pondering this question for a while, I thought I would list out the reasons, based on our numerous projects getting data from SAP into cloud platforms such as Snowflake. Once I’d started, I couldn’t stop – but here are my top five.

1. The data is complex

SAP systems are the nerve centres of global enterprises. Many business-critical processes are managed and controlled with SAP systems. Consequently, SAP systems contain the most treasured information large organisations have – namely their financial and operational data. As a result, these systems are complex both in terms of the number of processes they support and in the volume of the data. A typical SAP system contains hundreds of thousands of tables, with many complex relationships between them.

2. SAP systems and SAP data are heavily governed

As the system holding the most crucial and sensitive data, it is only right that there is a lot of governance and process in place to protect the data and the system itself. That means added time and complexity when trying to get the data out of SAP. Various SAP teams and stakeholders will often need to be involved to ensure the correct access is given, and to ensure operational processes are not jeopardised by the process of ‘getting the data out’.

3. SAP at its core is old technology

SAP is 50 years old this year. When SAP started, memory was precious, to the extent that table names and field names were abbreviated to four and six characters respectively. SAP was originally developed in Germany, so the abbreviations are German abbreviations. Over time, SAP has created several metadata layers which can sometimes help to get more descriptive names in a data model, but when you look at a system today, at its core you still find the incomprehensible abbreviations. This is why you need to be an SAP functional expert to understand how to get to, and make sense of, the data that you need.

4. Lots of different options and frameworks for ‘getting the data’

SAP systems have a wide variety of SAP-specific object types which are not immediately obvious to those without an SAP background – extractors, ABAP reports, BAdIs, iDocs, ADSOs, CompositeProviders and so on.

Lastly, to use features such as the ODP framework and SAP extractors, you will need to set up various things on the SAP system itself. This will often require the help of SAP Basis teams to ensure that you can use delta extraction, enabling incremental loading – a must given the data volumes of some key sources such as GL line items. For more information on the SAP extraction options, please read this excellent blog from the SAP guru Jan van Ansem here.

5. A complex licensing model, which means technically available options may not be permitted under the license agreement

Customers often think that the easiest way to extract the data is simply to connect to the HANA database and replicate the data they need. Whilst this is a relatively simple process, there are licensing constraints that prevent most organisations from doing so. Those with a HANA runtime license (the majority of clients) are not able to extract from the database layer and can only extract from the application layer. SAP has been known to sue for some extremely large sums of money when its licensing constraints are broken by clients, as Anheuser-Busch found out to its cost.

Hopefully the above gives you an idea of why loading from SAP to cloud data platforms is not the same as loading from other source systems, and why it is imperative to have people who are experts in both SAP systems and cloud data platforms and architectures. Luckily, at Snap we are a team of SAP data experts with a focus on modern cloud data technologies such as Snowflake and Matillion, and we have a range of accelerators to simplify the process of extracting your data from SAP systems. Please do reach out to us if you’re interested in maximising the value of your SAP data in the cloud.

Photo credit: Mitchell Luo on Unsplash

Lessons Learned from a Sustainability ESG Reporting Project

What is ESG and EPR?

Organizations in the UK involved in packaging supply or importation must now adhere to the ‘Extended Producer Responsibility’ (EPR) regulation. This regulation carries significant weight: it is legally binding, and non-compliance could cause serious brand damage.

The processes related to managing packaging fall within the realm of ‘Environmental, Social and Governance’ (ESG). Having recently completed an ESG reporting project for a global food manufacturer, I thought it would be useful to share some lessons learned here.

Companies are expected to provide evidence regarding the recyclable and non-recyclable components of their packaging. This requirement has been in effect since 2023, and achieving automation in this process necessitates a verified enterprise data set and a suitable platform for generating the required outputs.

Managers responsible for this task may find it daunting. Manual execution of the work is excessively time-consuming, labour-intensive, and susceptible to errors.

The data is complex and requires subject matter experts throughout the project.

The data required is complex. It entails product master data and bills of materials (BOMs) typically stored in SAP ERP systems. Additional data may be necessary from other packaging specification databases. Multiple versions of BOMs may exist, and packaging specifications reside within the system, incorporating various fields related to weight and dimensions. Some packaging items are composite in nature, consisting of both plastic and cardboard, requiring separation in calculations. Addressing all these factors requires careful consideration and understanding in collaboration with business subject matter experts (SMEs) and data owners.

Requirements will change. Be prepared to adapt.

The reporting output requirements are still unclear and evolving. 2023 marks the inaugural year for the formal collection of EPR data and reporting, but the precise details of what data should be reported and how are yet to be finalized. However, certain agreed classifications include:

  • Packaging activity: how the packaging is supplied.
  • Packaging type: whether the packaging is household or non-household.
  • Packaging class: whether the packaging is primary, secondary, shipment, or tertiary.
  • Packaging material and weight.

This calls for a solution that can swiftly adapt to changes, likely necessitating a platform separated from the strict internal SAP change control process.

Traditional methods of piecing together reports are inadequate. Attempting to manipulate sales volume data at the BOM component level for all products across multiple sites using MS Excel often results in unwieldy and unmanageable files. The ideal solution involves leveraging a cloud-based data warehouse with robust capabilities for handling substantial data volumes, accompanied by an efficient ETL (Extract, Transform, Load) tool capable of seamlessly extracting and loading data from SAP and other databases. Additionally, a versatile toolkit enabling flexible manipulation of the data into reusable assets is crucial. To effectively present the data in diverse report formats, a data visualization tool is essential.

Check the quality of your data.

Jumping directly to final report outputs will lead to frustration. It is crucial to comprehend the state of your data beforehand. Master data may be incomplete or inconsistent. Begin by creating master data reports that allow for comprehensive data analysis and filtering. Generate exception reports highlighting products with missing weight data, for instance. Correct any underlying data issues and then proceed to generate the required reports with specific calculations in a second phase.

Choose the appropriate technology stack.

At Snap, we possess extensive experience in extracting data from SAP environments and combining SAP data with other sources in cloud data warehouses such as AWS Redshift, Snowflake and Google BigQuery. We use best-of-breed cloud data platform tools, such as Matillion, to manage data warehouse processes effectively, and modern BI platforms to provide actionable insights in the ESG context. We also have expertise in data visualization across various tools.

Find a partner with prior experience.

Collaborating with a team that has gone through this process before will expedite your work and minimize risks. ESG responsibilities encompass more than just EPR and often require similar data transformation projects.

I hope this article helps you with your ESG reporting. If you would like to discuss your requirements and how to create a flexible reporting and analytics platform for ESG, please contact us at Snap Analytics.

Be a Data Hero and deliver Net Zero!

The biggest problem in the WORLD!

It is clear that we need radical changes to save our planet. Governments, the private sector and individuals aspire to achieve ‘Net Zero’ – but radically changing the way we operate is not going to be easy.

Achieving this goal is going to be a huge challenge for big, complex organisations. There are so many areas to explore, from reducing travel and fossil fuel consumption and leveraging renewable energy to improving the efficiency of existing equipment or simple behavior change. With so much complexity, the task can be daunting.

Can data save us?…

Starting with data can help you to understand where the quickest and biggest wins are, and therefore what to focus on first. As Peter Drucker once famously said, “You can’t manage what you don’t measure”.

To create a link between desired outcomes and measurable targets you can use a ‘Data Value Map’. Whilst I love technology and data…it’s only useful when it drives actions and creates positive change.  The Data Value Map helps to visualise how data can help you to achieve your goals.  If your goal is Net Zero…it could look something like this:

Data Value Maps can be created using a mind-mapping or collaboration tool (I like MindMeister and Miro) and are best done as a highly collaborative team workshop…don’t forget to bring the coffee and cakes!

Now you have a clear view of what data is required to measure and act on (your “use cases”) to deliver the Net Zero goal. Next you can score these use cases in terms of value and complexity. Something like a prioritisation matrix can help:

By focusing on the ‘high value’, ‘low complexity’ use cases you can deliver quick wins to the business. This will help you to demonstrate that you are a true Data Hero and can help your organisation to fly!

Once you have prioritised your use cases, you can start to map out the underpinning systems and processes that are needed to deliver connected, structured data to drive your Net Zero goals. 

Delivering at lightning speed…

There are numerous technologies out there that can help you connect all of this data, but we love Matillion for being able to easily and quickly connect to almost any source and to transform and join data to make it useful. As a data platform, Snowflake is fantastic: virtually unlimited storage, blistering speed, and strong data warehousing and data science capabilities. These technologies will certainly enable you to hone your capabilities as a true Data Hero! There are also many other fantastic cloud solutions that can help you to supercharge your Net Zero data capabilities.

Join the Data League!

Snap Analytics’ team of Data Heroes is helping one of the UK’s largest food manufacturers to leverage data to drive positive change…but if we’re going to solve humanity’s greatest threat, it’s going to take a whole Justice League of Data Heroes. So join us on this mission to save the planet, and let’s all make sure the decision makers in our organisations have the data they need to drive positive change. Don’t delay…be a Data Hero today!

We believe that businesses have a responsibility to look after our earth…it’s the only one we have!  We will give any organisation a 15% discount on our standard rates for any work directly linked to making a positive change to the environment!