Since SAP HANA Cloud is a multi-model database management system, it can store, process and enrich different kinds of data (e.g. tables, geo-spatial, graph) with its built-in engines. This blog shows how to implement an SAP Business AI project with a time-series forecast by using the embedded Predictive Analysis Library.
Trigger a forecast from your favourite Python environment without having to extract the data from SAP HANA Cloud, thanks to our Python package hana_ml. Avoid data duplication (keeping your architecture lean) and benefit from the in-memory power.
The whole code of this project can be downloaded from this repository.
FYI, the new blogging platform on the SAP Community now allows multiple authors to work on the same document. This is our first attempt, and we ourselves at least like the outcome 😀 @AndreasForster and @YannickSchaper
We have chosen to go through a time-series forecasting example, as this is an extremely common requirement for the most diverse projects. Many standard applications from SAP already have time-series forecasting built in, such as SAP Analytics Cloud or SAP Integrated Business Planning. If you have a requirement that goes beyond a standard application, you can leverage the SAP Business Technology Platform (BTP), e.g. SAP HANA Cloud.
In our case we would like to forecast how many nights foreign tourists or business travellers will spend in Switzerland in the coming months. We are interested in the total number, but we also want to know how many nights different nationalities will spend in the country.
Such forecasts of overnight stays can be useful for many purposes, for instance to help decide where to spend a Marketing budget. It can also help the hotel industry to have the right number of staff (and languages) working through the year.
This might just be an example we invented ourselves, based on data that is publicly available (thanks to the opendata.swiss portal!), but such time-series forecasts are used extremely often in the business world. We have used time-series forecasting in many diverse projects, from forecasting the Net Working Capital for a Treasury department to solar power production at a utility company.
All these use cases have in common that in such a BTP project an expert (you! 😀) creates a forecast that is then shared with the business users, who might not want to know all the technical details of how the forecast was created.
To follow this blog’s instructions, you need to have access to:
So the overall architecture for this implementation is very straightforward.
Currently the Predictive Analysis Library provides all these algorithms, and the list keeps growing.
First things first, you need a SAP HANA Cloud system, in which to execute the Machine Learning. This could be a productive system that you already have access to. You can also use a free-tier or trial instance of SAP HANA Cloud. To allow everyone to follow the steps hands-on, we are using a free-tier option with Pay-As-You-Go account here.
If you wonder what the difference is between free-tier and trial, this tutorial should help you out. With a paid instance of SAP HANA Cloud the setup of the Machine Learning is slightly different: you need to activate the script server.
Remember that productive SAP Datasphere systems also have the same Machine Learning built-in, thanks to the embedded SAP HANA Cloud. We have released a separate blog for this.
So let’s start working with SAP HANA Cloud as free-tier. If you haven’t got a system yet, you can follow the tutorials below. Just be aware that not all BTP components are available in all regions. Make sure to have your BTP subaccount in a region in which the SAP HANA Cloud free-tier is available. See the Discovery Center for options. We are using “Europe (Netherlands)” on Microsoft Azure.
Once SAP HANA Cloud is running, take note of the instance’s SQL endpoint, which is needed to log on to the system from Python.
Now create a database user for our project. You can execute the code for example in the SAP HANA database explorer, as explained in: Tools to manage and access the SAP HANA Cloud, SAP HANA Database.
Execute these commands as DBADMIN in the Database Explorer. Just replace YOURSECRETPASSWORD with literally your secret password (the one you had to enter when creating the instance). This creates a user called AIUSER, whose password doesn’t expire. This user is also assigned the necessary rights to trigger the Predictive Analysis Library.
CREATE USER AIUSER PASSWORD YOURSECRETPASSWORD NO FORCE_FIRST_PASSWORD_CHANGE;
ALTER USER AIUSER DISABLE PASSWORD LIFETIME;
GRANT AFL__SYS_AFL_AFLPAL_EXECUTE TO AIUSER;
GRANT AFLPM_CREATOR_ERASER_EXECUTE TO AIUSER;
Use a Python environment of your choice to be able to use Jupyter Notebooks. There are many options, e.g. with SAP (Business Application Studio) or non-SAP environments. The following steps show how to run Jupyter Notebooks on your local laptop by using the free environment miniconda.
Download and install miniconda. The easiest option is probably the graphical installer.
Open the “Anaconda Prompt” that was installed by miniconda. On Windows just use the Operating System’s “Search” box to find the prompt.
miniconda allows you to have multiple Python environments side by side. Making changes in one of these environments does not affect the others. This can be very useful to run tests in one environment without interfering with your productive code. Enter the following statements into the Anaconda Prompt and execute each line individually. They create a new environment called “sandbox”, activate it, install JupyterLab, register the environment as a Jupyter kernel, and launch JupyterLab:
conda create --name sandbox python=3.12
conda activate sandbox
pip install jupyterlab
python -m ipykernel install --user --name=sandbox
jupyter lab
Executing these commands will open the Jupyter Lab environment, in which individual Jupyter Notebooks can be executed. Notice the “sandbox” icon in the Notebook section. This icon appears because of the earlier “python -m ipykernel install --user --name=sandbox” command.
Click the “sandbox” icon in the Notebook section and a Jupyter Notebook opens up, in which you can script your Python code. At the top right you can see that the Notebook is running in the “sandbox” environment. Eventually you might have a number of Python environments, hence it is good to know for sure in which of these environments the code is executed.
In case you haven’t worked with Jupyter Notebooks yet, this introduction to JupyterLab gets you started. Once you are familiar with the Jupyter interface, install the hana_ml package, which we will use to trigger the embedded Machine Learning. Enter this code into a cell in the Notebook, then execute the code, for example with the “Run” icon on top.
!pip install hana_ml
This downloads and installs the hana_ml package with its dependencies, which can take a few seconds. Once complete, run this code in a new cell to see the version of the freshly installed package.
import hana_ml
print(hana_ml.__version__)
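If later steps depend on newer hana_ml features, you might guard against an outdated install with a small helper. This is just a sketch; the minimum version used in the comparison below is an arbitrary example, not an official requirement of this blog.

```python
def version_tuple(version):
    """Turn a dotted version string like '2.19.1' into a tuple of ints,
    ignoring any non-numeric parts."""
    return tuple(int(part) for part in version.split('.') if part.isdigit())

def meets_minimum(installed, minimum):
    """Compare two dotted version strings numerically."""
    return version_tuple(installed) >= version_tuple(minimum)

# Example: check the installed hana_ml against a made-up threshold.
# meets_minimum(hana_ml.__version__, '2.18.0')
print(meets_minimum('2.19.1', '2.18.0'))  # True
```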
Later on we will also need the matplotlib and flask packages when charting the numbers. Run this command in a new cell.
!pip install matplotlib
And also this command in another cell.
!pip install flask
We are ready for the fun part, everything should be ready to upload some data and create some time-series forecasts. Remember that all code can be downloaded from this repository. The notebooks that you find there are a little more elaborate than the blog.
You find this section’s code in the Notebook “010 Data upload.ipynb”.
Begin by logging on to SAP HANA Cloud, through the hana_ml package in a Jupyter Notebook. Use the credentials of the AIUSER.
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(address='YOURENDPOINT',
port=443,
user='AIUSER',
password='YOURSECRETPASSWORD',
)
conn.connection.isconnected()
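By the way, hard-coding credentials in a notebook is not ideal. One simple alternative (besides the Secure User Store shown at the end of this blog) is to read them from environment variables. This is only a sketch; the variable names HANA_ADDRESS, HANA_USER and HANA_PASSWORD are our own convention, not something hana_ml requires.

```python
import os

def hana_credentials():
    """Collect SAP HANA Cloud logon details from environment variables,
    falling back to placeholders if they are not set."""
    return {
        'address': os.environ.get('HANA_ADDRESS', 'YOURENDPOINT'),
        'port': 443,
        'user': os.environ.get('HANA_USER', 'AIUSER'),
        'password': os.environ.get('HANA_PASSWORD', ''),
    }
```

The dictionary can then be unpacked into the logon, e.g. `dataframe.ConnectionContext(**hana_credentials())`.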
Now bring the data that we want to upload from this CSV file into a Pandas DataFrame in Python. Display a few rows of the data.
import pandas as pd
df_data= pd.read_csv('OVERNIGHTSTAYS.csv')
df_data.MONTH = pd.to_datetime(df_data.MONTH, format='%d/%m/%Y')
df_data.head(5)
The table consists of four columns.
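The format string '%d/%m/%Y' in the load above tells pandas how the CSV encodes its dates (day first). A standalone illustration with two fabricated rows, independent of the actual CSV:

```python
import pandas as pd

# Two made-up dates in the same day/month/year layout as the CSV.
df_demo = pd.DataFrame({'MONTH': ['01/05/2024', '01/06/2024']})
df_demo.MONTH = pd.to_datetime(df_demo.MONTH, format='%d/%m/%Y')

print(df_demo.MONTH.dt.month.tolist())  # [5, 6] - May and June
```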
Then upload the Pandas DataFrame through the hana_ml package into a table in SAP HANA Cloud. The table and its columns are created automatically on the fly.
import hana_ml.dataframe as dataframe
df_remote = dataframe.create_dataframe_from_pandas(connection_context=conn,
pandas_df=df_data,
table_name='OVERNIGHTSTAYS',
force=True,
replace=False)
You can also see the data types of the columns that were created. The df_remote object points to the newly created table, and the dtypes()-method returns the column names and types.
df_remote.dtypes()
You find this section’s code in the Notebook “020 Time-series forecast.ipynb”.
Usually you would first spend some time exploring data. We keep this part in the tutorial very short, but we encourage you to spend sufficient time on data exploration in your own projects.
For now retrieve a few rows of data, to make sure that there are no obvious problems (e.g. empty rows). The head()-method first restricts the dataset to avoid downloading unnecessary data. The collect()-method then downloads the data into Python (as Pandas DataFrame) to display it.
df_remote = conn.table('OVERNIGHTSTAYS')
df_remote.head(5).collect()
Get a quick feel for the content of the table. The describe()-method instructs SAP HANA Cloud to calculate, for instance, how many values are missing per column, or what the min/max/median values are for numerical columns.
df_remote.describe().collect()
Each of the values above was calculated in SAP HANA Cloud. Because the describe()-method was called, hana_ml created the required SELECT statement, which you can access as well. You can execute this SQL-statement also in other environments to retrieve the results.
print(df_remote.describe().select_statement )
You can also get a more graphical overview of the dataset in a canned report.
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(df_remote).build().display()
We move on to the data preparation for our use case. We want to create a single forecast of all overnight stays by month. Hence we need to aggregate all detailed values by month. The hana_ml-DataFrame has a method for that aggregation. Remember that this aggregation is also virtual. Essentially, it adds a GROUP BY clause to the SELECT statement that underlies the hana_ml-DataFrame. Also look at a few rows of the monthly aggregates.
df_rem_agg = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by='MONTH')
df_rem_agg = df_rem_agg.sort('MONTH')
df_rem_agg.head(5).collect()
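For intuition only: the same monthly aggregation, done locally on a small fabricated pandas frame rather than pushed down to SAP HANA Cloud through hana_ml:

```python
import pandas as pd

# Fabricated detailed rows: two entries for April, one for May.
df_demo = pd.DataFrame({
    'MONTH': ['2024-04-01', '2024-04-01', '2024-05-01'],
    'OVERNIGHTSTAYS': [100, 50, 120],
})

# Local equivalent of agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by='MONTH').
df_monthly = (df_demo.groupby('MONTH', as_index=False)['OVERNIGHTSTAYS']
              .sum()
              .rename(columns={'OVERNIGHTSTAYS': 'OVERNIGHTSTAYS_SUM'}))
print(df_monthly)
```

The key difference: hana_ml performs this in the database and only the (small) aggregated result would ever be transferred.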
Plot the monthly numbers to get a feel for how they have progressed over time.
df_data = df_rem_agg.collect()
import matplotlib.pyplot as plt
df_data.plot(x='MONTH')
plt.xticks(rotation='vertical');
Indeed, there is a nicely repeating pattern. Let’s train a time-series model to capture this for a forecast. We use the AdditiveModelForecast algorithm, which is also known as Prophet.
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
amf = AdditiveModelForecast()
amf.fit(data=df_rem_agg)
View the trained model in another canned report.
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(amf).build().display()
We have a trained time-series model, but we have not yet created a prediction. To create a forecast, we first need to create a hana_ml DataFrame that contains the dates/months we would like to forecast for. Hence find out the most recent date of the training history: 2024-05-01 (May 1st 2024).
str_lastdate = df_rem_agg.tail(1, ref_col='MONTH').collect().iloc[0,0]
str_lastdate = str(str_lastdate)[0:10]
print(str_lastdate)
Based on the last known date/month above, create a new hana_ml DataFrame that lists the following 12 months.
from hana_ml.algorithms.pal.random import binomial
months_to_forecast=12
df_rem_future = binomial(conn, n=1, p=1, num_random=months_to_forecast)
df_rem_future = df_rem_future.select('*', (f'''ADD_MONTHS(TO_DATE ('{str_lastdate}', 'YYYY-MM-DD'), ID+1)''', 'MONTH') )
df_rem_future = df_rem_future.select('MONTH', ('0', 'TARGET'))
df_rem_future.head(20).collect()
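Conceptually, ADD_MONTHS simply steps the calendar forward inside SAP HANA. For intuition, the same 12 future months can be derived in plain Python, without any database (stdlib only, dates fabricated to match the example above):

```python
from datetime import date

def add_months(d, n):
    """Step a first-of-month date forward by n months."""
    month_index = d.month - 1 + n
    return date(d.year + month_index // 12, month_index % 12 + 1, 1)

last_known = date(2024, 5, 1)  # most recent month in the training history
future_months = [add_months(last_known, i + 1) for i in range(12)]
print(future_months[0], future_months[-1])  # 2024-06-01 2025-05-01
```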
Now apply the trained time-series model to predict the overnight stays in those 12 months.
df_rem_predicted = amf.predict(data=df_rem_future)
df_rem_predicted.head(5).collect()
The canned report for the trained model now also shows the above forecast. You can clearly see how a repeating pattern was identified and used to predict the future dates. Most overnight stays are in the summer (hiking?) but the winter also has its peaks (skiing and snowboarding!).
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(amf).build().display()
To share the forecast with business users it’s often good to combine the training data and forecast into a single dataset. Here we combine both parts into a single hana_ml DataFrame.
df_rem_predicted = df_rem_predicted.select('MONTH',
                                           ('NULL', 'OVERNIGHTSTAYS_SUM'),
                                           ('YHAT', 'FORECAST'),
                                           ('YHAT_LOWER', 'FORECAST_LOWER'),
                                           ('YHAT_UPPER', 'FORECAST_UPPER'))
df_rem_agg = df_rem_agg.select('*',
                               ('NULL', 'FORECAST'),
                               ('NULL', 'FORECAST_LOWER'),
                               ('NULL', 'FORECAST_UPPER'))
df_rem_all = df_rem_predicted.union(df_rem_agg)
df_rem_all.sort('MONTH').tail(5).collect()
Now save the hana_ml DataFrame to a table in SAP HANA Cloud. SAP Analytics Cloud for instance can now access the data.
df_rem_all.save('OVERNIGHTSTAYS_FORECAST_TOTAL', force=True)
You have created a time-series forecast!
You find this section’s code in the Notebook “30 Multiple time-series forecast.ipynb”.
In the previous example we aggregated all overnight stays into monthly values and forecast the following 12 months. Now we want to create individual forecasts for the countries from which most visitors are coming. Hence let’s find the top 10 countries with the most overnight stays.
df_remote = conn.table('OVERNIGHTSTAYS')
df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE' ).sort('TOTAL', desc=True).head(10).collect()
We want to use that list of countries in a filter on the whole table. One way of achieving this is to create a comma-separated list of values for a WHERE clause.
countries = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE') \
.sort('TOTAL', desc=True).select('COUNTRYOFRESIDENCE').head(10).collect() \
.iloc[:,0].tolist()
countries = str(countries)
countries = countries.replace('[', '(')
countries = countries.replace(']', ')')
df_remote = df_remote.filter(f'''"COUNTRYOFRESIDENCE" IN {countries}''')
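The replace calls above exploit that Python’s str() of a list of strings already looks almost like a SQL IN list. With a made-up list of countries the transformation looks like this:

```python
# Fabricated top countries, standing in for the collected list.
countries = ['Germany', 'United States', 'France']

# str() gives "['Germany', 'United States', 'France']";
# swapping the brackets for parentheses yields valid SQL IN syntax.
in_clause = str(countries).replace('[', '(').replace(']', ')')
print(in_clause)  # ('Germany', 'United States', 'France')
```

Note that this only works because the country names contain no brackets or single quotes; the join-based alternative in the “Going further” section avoids string building altogether.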
The hana_ml DataFrame now only holds information about those 10 countries. There are many options to filter the hana_ml DataFrame on such a Top N criterion; in the section “Going further” you see an alternative, which uses an inner join to avoid having to extract the list of values.
Now we aggregate the data again, but this time by both month and country, so that we can create country-specific forecasts.
df_rem_agg = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by=['MONTH', 'COUNTRYOFRESIDENCE'])
df_rem_agg = df_rem_agg.sort('MONTH')
df_rem_agg.head(5).collect()
Let’s see the historic values of the 10 countries.
df_data_plot = df_rem_agg.pivot_table(values='OVERNIGHTSTAYS_SUM', index='MONTH', columns='COUNTRYOFRESIDENCE', aggfunc='sum').collect()
df_data_plot.plot(x='MONTH').legend(loc='center left', bbox_to_anchor=(1.0, 0.5));
Since the plot is quite full, let’s focus on the 2 most popular foreign countries, Germany and the United States.
df_data_plot[['MONTH', 'Germany', 'United States']].plot(x='MONTH').legend(loc='center left',bbox_to_anchor=(1.0, 0.5));
There are similarities in the overnight stays from the two countries (summers are most popular). However, numbers from the United States seem to be rising, while numbers from Germany seem to be dropping. Individual forecasts for each country can capture such individual patterns.
Train individual models with the same AdditiveModelForecast algorithm as before, but now we specify the country of residence as group key, which produces individual models for each country.
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
amf = AdditiveModelForecast(massive=True)
amf.fit(data=df_rem_agg, group_key='COUNTRYOFRESIDENCE')
As before, we now create a hana_ml DataFrame for the prediction. This DataFrame must list all the countries as well as all the dates for each country. First create the list of dates/months, as done before.
from hana_ml.algorithms.pal.random import binomial
months_to_forecast=12
str_lastdate = df_rem_agg.tail(1, ref_col='MONTH').collect().iloc[0,0]
str_lastdate = str(str_lastdate)[0:10]
df_rem_future = binomial(conn, n=1, p=1, num_random=months_to_forecast)
df_rem_future = df_rem_future.select('*', (f'''ADD_MONTHS(TO_DATE ('{str_lastdate}', 'YYYY-MM-DD'), ID+1)''', 'MONTH'))
df_rem_future = df_rem_future.select('MONTH', ('0', 'TARGET'))
But now we create a cartesian product with the countries.
df_rem_topredict = df_rem_future.add_id().set_index('ID').join(df_rem_countries.add_id('ID').set_index('ID'), how='cross')
df_rem_topredict = df_rem_topredict.drop('ID')
df_rem_topredict.head(10).collect()
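The cross join pairs every future month with every country. Locally, the same cartesian product can be reproduced on fabricated pandas frames (pandas’ merge with how='cross'):

```python
import pandas as pd

# Fabricated stand-ins for the future months and the Top N countries.
df_months = pd.DataFrame({'MONTH': ['2024-06-01', '2024-07-01']})
df_countries = pd.DataFrame({'COUNTRYOFRESIDENCE': ['Germany', 'United States']})

# Cartesian product: every month paired with every country (2 x 2 = 4 rows).
df_cross = df_months.merge(df_countries, how='cross')
print(len(df_cross))  # 4
```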
Apply each individual model to its corresponding dates/months in the prediction DataFrame. Here the predict()-method returns multiple objects. The forecasts we need are in the first (with index 0).
df_rem_pred = amf.predict(data=df_rem_topredict, group_key='COUNTRYOFRESIDENCE')
df_rem_predicted = df_rem_pred[0]
That’s all it took to mass produce individual time-series models and forecasts. To verify that the forecasts are individual, look at the data for Germany…
from hana_ml.visualizers.visualizer_base import forecast_line_plot
forecast_line_plot(pred_data=df_rem_predicted.filter(''' "GROUP_ID" = 'Germany' ''').drop('GROUP_ID').set_index('MONTH'),
confidence=("YHAT_LOWER", "YHAT_UPPER"), enable_plotly=True)
… and the United States.
from hana_ml.visualizers.visualizer_base import forecast_line_plot
forecast_line_plot(pred_data=df_rem_predicted.filter(''' "GROUP_ID" = 'United States' ''').drop('GROUP_ID').set_index('MONTH'),
confidence=("YHAT_LOWER", "YHAT_UPPER"), enable_plotly=True)
Combine the historic data with the predictions and save them to a single table.
df_rem_predicted = df_rem_predicted.select(
'MONTH',
('GROUP_ID', 'COUNTRYOFRESIDENCE'),
('NULL', 'OVERNIGHTSTAYS_SUM'),
('YHAT', 'FORECAST'),
('YHAT_LOWER', 'FORECAST_LOWER'),
('YHAT_UPPER', 'FORECAST_UPPER')
)
df_rem_agg = df_rem_agg.select('*', ('NULL', 'FORECAST'),
('NULL', 'FORECAST_LOWER'),
('NULL', 'FORECAST_UPPER')
)
df_rem_all = df_rem_predicted.union(df_rem_agg)
df_rem_all.save('OVERNIGHTSTAYS_FORECAST_COUNTRIES', force=True)
You have the data ready to be shared with business users, for instance through SAP Analytics Cloud. Of course you can also see the data here in Python. Select a country and put the data into a plot.
import matplotlib.pyplot as plt
df_data = df_rem_all.filter('''"COUNTRYOFRESIDENCE" = 'Germany' ''').collect()
fig, ax = plt.subplots()
ax.fill_between(df_data['MONTH'].values, df_data['FORECAST_LOWER'].values, df_data['FORECAST_UPPER'].values, alpha=0.2)
ax.plot(df_data['MONTH'].values, df_data['OVERNIGHTSTAYS_SUM'].values, '-')
plt.xticks(rotation=45);
Now you are the expert in mass producing time-series forecasts with SAP HANA Cloud!
This final chapter has a few ideas on how to continue from the above.
To avoid having to use the database user and its password in clear text, you can also refer to logon credentials kept in the Secure User Store from the SAP HANA client. This command stores the credentials under the key “MYHANACLOUD”. Because of the “-i” parameter you will be prompted for the password. This ensures that the password does not become visible in the prompt history.
C:\Program Files\SAP\hdbclient>hdbuserstore -i SET MYHANACLOUD "YOURSQLENDPOINT:443" AIUSER
You can use this key to logon to SAP HANA Cloud.
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(userkey='MYHANACLOUD')
conn.connection.isconnected()
You can automate the creation of new forecasts by scheduling the hana_ml logic. One option is described in the blog scheduling Python code on Cloud Foundry.
In the above steps we kept it simple and allowed all IP addresses to connect to SAP HANA Cloud. See the SAP HANA Cloud Administration Guide on allowing only specific IP addresses or ranges.
An easy option to determine your external IP address is a small piece of code like:
from requests import get
ip = get('https://api.ipify.org').text
print(f'My public IP address is: {ip}')
Providing additional context can increase the forecast accuracy. One customer is using SAP’s factory calendar table TFACS to flag working days in a daily forecast. For our monthly data of overnight stays in Switzerland, the exchange rates of the different currencies to the Swiss Franc could be useful for example. See the holiday and exog parameters of the fit()-method to pass such context to the algorithm.
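As a sketch of how such context could be prepared (the exchange rates below are entirely made up), an exogenous column can be joined onto the monthly history before it is handed to fit() via the exog parameter:

```python
import pandas as pd

# Fabricated monthly history and fabricated EUR/CHF exchange rates.
df_history = pd.DataFrame({
    'MONTH': ['2024-03-01', '2024-04-01', '2024-05-01'],
    'OVERNIGHTSTAYS_SUM': [900, 1100, 1300],
})
df_rates = pd.DataFrame({
    'MONTH': ['2024-03-01', '2024-04-01', '2024-05-01'],
    'EURCHF': [0.96, 0.97, 0.98],
})

# One row per month, now carrying the candidate exogenous variable.
df_with_exog = df_history.merge(df_rates, on='MONTH', how='left')
print(df_with_exog.columns.tolist())
```

In the actual project this join would of course happen in SAP HANA Cloud through hana_ml DataFrames, so the data stays in the database.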
In the above example “Forecast multiple time-series” we are filtering the hana_ml DataFrame on the 10 countries with the highest number of overnight stays. To make it easy to follow the logic, we implemented the filter by extracting the list of countries to essentially create a SQL WHERE clause. There are surely more elegant ways that do not require any data extraction. One option would be an inner join between the full hana_ml DataFrame and a hana_ml DataFrame that contains the distinct list of Top N countries.
df_remote = conn.table('OVERNIGHTSTAYS')
df_rem_top10 = df_remote.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE').sort('TOTAL', desc=True).head(10)
df_remote = df_remote.set_index('COUNTRYOFRESIDENCE').join(df_rem_top10.set_index('COUNTRYOFRESIDENCE'), how='inner')
We hope this hands on blog post enables you to supercharge your SAP HANA Cloud with Machine Learning. Transform your SAP data into actionable insights and harness the full potential of HANA ML!
To close off this tutorial, we have a few ideas for you to continue your journey:
📥Get hands-on: If you have scrolled through the blog, take the time to implement the scenario yourself. You can download the notebooks and learn by doing.
📚Learn more:
🔗Join the Discussion: Share your experiences in the comments and connect with us.