

The Ultimate YCSB Benchmark Guide

The Yahoo! Cloud Serving Benchmark (YCSB) is the most well-known NoSQL benchmark suite.

This guide provides you with all relevant and up-to-date information about the YCSB.
We have also included exclusive interviews with YCSB open-source contributors in the article.

The bonus chapters at the end also explain how databases and workloads can be integrated with the YCSB.

Let’s go!

What Is the Yahoo! Cloud Serving Benchmark?

The YCSB is a database benchmark suite. It allows measuring the performance of numerous modern NoSQL and SQL database management systems with simple database operations on synthetically generated data. The YCSB also lends itself to performance comparisons of multi-node database systems on distributed infrastructures such as the public cloud.

The YCSB thus provides an important building block in the analysis and evaluation of modern cloud-based database management systems.

The YCSB can be used to compare many architecturally different databases and to measure the performance of different database configurations under different workloads.

A database benchmark suite such as the YCSB provides a framework that automates essential tasks in a benchmarking process, such as:

  • the definition of a workload with its essential parameters
  • the connectivity to the database via the database drivers
  • the execution of the workload on the database
  • the collection and storage of performance data

However, these tasks are only part of a complete cloud database benchmarking process, as shown in the graphic below.

The History of the YCSB

As the name suggests, the Yahoo! Cloud Serving Benchmark was developed in 2010 at the then Internet giant Yahoo!. The aim was to create a standardized benchmark with which the internal Yahoo! database “PNUTS” and other NoSQL databases could be evaluated. In particular, the KPIs “performance” and “scalability” were to be measured and compared.

The Yahoo! Cloud Serving Benchmark client was made freely available as open source from the beginning.

The associated research paper was also published in 2010 and has since been cited over 3,500 times!

Since 2015, you can download, fork and customize the current YCSB version via GitHub. More than 160 developers have since contributed to expanding the code base, integrating new databases and keeping it up to date. The YCSB is licensed for free use under the Apache License 2.0.

Today, the YCSB supports more than 40 NoSQL databases as well as all JDBC-enabled SQL and NewSQL databases.

The YCSB is used for many benchmarking-based database comparisons in both academia and industry.

Its modular architecture has also led to numerous extensions, such as:

  • YCSB++ with data consistency assessment and more complex, transactional request types
  • YCSB+T with transactional performance measurements
  • geoYCSB with geospatial workload data

The YCSB was the first benchmark suite that benchANT integrated into its cloud database benchmarking platform in 2021 for convenient, automated benchmarking measurements. The performance comparison of AWS EC2 vs. Open Telekom Cloud vs. Ionos, the Cassandra 4.0 performance analysis and other performance analyses were measured with the YCSB.

Why Is the YCSB Important?

The YCSB became the de-facto database benchmark with the rise of NoSQL databases. This is reflected not only in the countless benchmarking publications based on the YCSB, but also in the following GitHub KPIs:

  • 3,810 stars
  • 1,900 forks
  • 160 contributors

Interview Question 1: Why is the YCSB important from your perspective?

filipecosta90 (Performance Engineer @ Redis):
> To give a bit of context I’m a Performance Engineer at Redis, meaning I’m deeply interested in any “standard” benchmark that will allow unbiased benchmarks of our DB solutions and also the ones from the competitors. YCSB pioneered on the cloud workload benchmarks (even though I believe there’s a large effort that needs to be made to keep YCSB up-to-date given it’s a bit stalled and needs a revamp).

lukaszstolarczuk (Software Engineer @ Intel):
> Implementing support for new engines and DB’s is easy and allows comparing results between many databases. There are many workloads that are common in the IT industry. Thanks to YCSB, we could easily check if our performance is any good.

sachin-sinha (BangDB Author):
> The ongoing rapid data trend has been creating opportunities for vendors to innovate and come up with different data platform especially in the converged nosql area which has other elements like AI, streaming etc. along with unstructured data store and query support. It is important for such platforms/tools to benchmark with other existing platforms for various reasons such as to know where they stand in terms of performance and scale, and so on. An industry standard benchmark framework therefore is desired especially which is accepted by the users. At the moment, YCSB is such an option and is rather largely accepted by users especially since most of the known and leading platforms are already using it for benchmarking.

peterzheng98 (PhD student):
> So YCSB represents a series of scenarios that can be abstracted from the real world like write-intensive or read-intensive. For these scenarios, researchers can get multi-dimensional performance for certain database and apply the best fit one to the real world.

Interview Question 2: Why do you contribute to the further development of the YCSB?

filipecosta90 (Performance Engineer @ Redis):
> Mainly to keep it up to date either via improved a specific implementation of the SPEC or by making sure the use-cases it allows to benchmark are still valid/targeting the right questions/use-cases of today’s cloud DB era.

lukaszstolarczuk (Software Engineer @ Intel):
> We’re implementing a new Key-Value storage engine which supports persistent memory devices – pmemkv. Support in YCSB should ease the process of comparing existing solutions with the new ones. Not only for us (developers) but also for our customers when they’re using NVDIMMs (persistent memory).

sachin-sinha (BangDB Author):
> Like the market is evolving due to the data trend, the YCSB framework should also evolve to capture more aspects and scenarios for benchmarking. Hence YCSB has to be a living project which keeps evolving for better, credible and wider benchmark support. Hence people need to participate in development process for better evolution of the the framework. I participate for the same reason and try to add my bit. Community driven approach adds to the credibility and acceptance element.

peterzheng98 (PhD student):
> Since the world is changing fast, maybe something go deprecated. For example, fushsiaOS use all ipv6 for local networking. This means current YCSB test cannot be deployed to it. Therefore, contribution may help following.

Summary

In general, the following reasons are key to its success:

  1. First NoSQL database benchmark.
  2. Easy extensibility with new databases and workloads at the code level.
  3. Focus on standard database operations supported by almost any database technology.
  4. Freely accessible, transparent open-source benchmark.
  5. Active open-source community.
  6. The need for performance figures for new database technologies.

Without the YCSB, most database performance comparisons in academia and even industry would not exist. Many database technologies would not be at the technical level they are now. And many IT companies would be using the wrong database for their software products.

The YCSB enables deep, data-driven decisions between many databases, as well as a standardized optimization of different database configurations with regard to performance and scalability.
In addition, the YCSB is also used to benchmark file systems such as SeaweedFS, REST APIs, data lakes and special storage engines.

Which Databases Does the YCSB Support?

No publicly known database benchmark supports as many databases as the YCSB.
At first glance, the databases can be easily identified from the folder structure of the YCSB code, since as a rule each folder belongs to a DBMS.

However, this is only half the truth, as often enterprise versions and different DBaaS versions of a database are also “benchmarkable”.

The exact information can be found in the pom.xml files of the individual folders or the main directory.
The driver dependencies contained therein identify the possible databases and their versions.
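
For illustration, a driver dependency entry in such a pom.xml looks roughly like the following (a hypothetical example; the exact groupId, artifactId and pinned version depend on the respective binding and YCSB release):

  <dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongo-java-driver</artifactId>
    <!-- placeholder: the version pinned by the binding -->
    <version>x.y.z</version>
  </dependency>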

The following is a complete list of available databases (as of 2021-09-30):

SQL

  • MariaDB
  • Microsoft SQL Server
  • MySQL
  • PostgreSQL
  • Oracle RDBMS (Multi-Model)
  • and in fact all databases accessible via the JDBC binding (version 2.1.1); however, it is often necessary to integrate the specific JDBC driver separately.

NoSQL

  • Aerospike: Key/Value
  • Alibaba TableStore: Column Family
  • Apache Accumulo: Column Family
  • Apache Cassandra: Column Family
  • Apache Geode: Key/Value
  • Apache HBase: Column Family
  • Apache Ignite: Key/Value
  • Apache Solr: Document
  • Apache ZooKeeper: Key/Value
  • ArangoDB: Document, (Graph)*
  • AWS DynamoDB: Document
  • Azure Cosmos DB: Document, Column Family, (Graph)*
  • Azure Table Storage: Document
  • Couchbase: Document
  • Elasticsearch: Document
  • Gemfire: Key/Value
  • Google Bigtable: Column Family
  • Google Cloud Datastore: Document
  • GridDB: Key/Value
  • Hypertable: Column Family
  • Infinispan: Key/Value
  • MapR: Column Family
  • Memcached: Key/Value
  • MongoDB: Document
  • Oracle Autonomous Database: Document
  • Oracle NoSQL: Key/Value
  • OrientDB: Document, Object-Oriented, (Graph)*
  • Redis: Key/Value
  • Riak KV: Key/Value
  • RocksDB: Key/Value
  • ScyllaDB: Column Family
  • Tarantool: Key/Value
  • Voldemort: Key/Value

* For multi-model databases, only key/value-like data models are supported. Graph models are not “benchmarkable” with the YCSB.

NewSQL

  • Apache Kudu: Key/Value & Relational
  • FoundationDB: Key/Value & Relational
  • Google Cloud Spanner: Key/Value & Relational
  • Many other NewSQL databases support the JDBC driver just like classic SQL databases, and can therefore be integrated just as easily as SQL databases.

Note: Some database drivers integrated in the YCSB do not match the latest available database vendor drivers. Currently, no scientific research exists on the impact of outdated database drivers on benchmarks and the resulting performance figures.

Interview Question 3: Which Databases Have You Already Benchmarked with the YCSB?

filipecosta90 (Performance Engineer @ Redis):
> Redis (in several flavors), MongoDB, and ElasticSearch.

lukaszstolarczuk (Software Engineer @ Intel):
> We’ve benchmarked few databases to compare: Redis, Memcached, RocksDB, MongoDB (+ our storage engine PMSE based on reverse-engineered WiredTiger API). These benchmarks helped us with Ethernet and/or Kernel tuning and to find out where the bottlenecks are.

sachin-sinha (BangDB Author):
> I benchmarked BangDB, Redis, Mongodb, Couchbase and Yugabyte using the YCSB.

peterzheng98 (PhD student):
> Memcached. The in-memory kv-store shows much not only the database but also the operating system performance.

Which Workloads Does the YCSB Support?

The YCSB workloads can be configured flexibly in many dimensions, so that almost any “simple” workload can be defined. The key configuration parameters are:

  • maxexecutiontime: maximum runtime of the workload (in seconds)
  • threadcount: number of parallel threads
  • recordcount: number of initial records
  • insertstart: first record to insert (default = 0)
  • operationcount: number of operations (default = 1000)
  • fieldcount: number of database fields per record (default = 10)
  • fieldlength: length of each database field (default = 500)
  • readallfields: true = all fields are read (default); false = only one field is read
  • readproportion: read portion of the workload (0 – 1)
  • insertproportion: insert portion of the workload (0 – 1)
  • updateproportion: update portion of the workload (0 – 1)
  • scanproportion: scan portion of the workload (0 – 1)
  • requestdistribution: request access pattern (UNIFORM, ZIPFIAN, LATEST)
  • readmodifywriteproportion: read-modify-write portion of the workload (0 – 1)
  • insertorder: insert order of the records (default = HASHED; alternative: ORDERED)
  • maxscanlength: maximum number of records per scan (default = 1000)
  • scanlengthdistribution: distribution of the scan length (UNIFORM, ZIPFIAN, LATEST)
These properties are the most important workload dimensions. A complete list can be found on the GitHub wiki. In addition, there are database-specific properties, such as consistency level. However, these must be stored in the DB bindings of the respective database, since these are usually not generalizable.
The access distributions describe which data records at which positions in the database table are read and how often.
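
As a small sketch, a read-heavy workload similar to the bundled workloada template could be defined in a parameter file with these properties (illustrative values):

# read-heavy workload: 95% reads / 5% updates (illustrative values)
workload=site.ycsb.workloads.CoreWorkload
recordcount=1000000
operationcount=5000000
fieldcount=10
fieldlength=500
readproportion=0.95
updateproportion=0.05
insertproportion=0
scanproportion=0
requestdistribution=zipfian
threadcount=32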

Note 1: The configuration of the parameters is error-prone because there are no sanity checks. Many mutual influences are not obvious and can lead to erroneous benchmark measurements. Therefore, a certain amount of caution and logical checking is advisable when configuring your own workloads. In addition, it is often useful to run new workloads only briefly on a test basis before performing longer benchmark runs.

Note 2: The access distribution “LATEST” indicates accesses to the last records of the database. However, this refers only to the last of the initially loaded records, not to the last records written to the database during the benchmark. This behavior does not match the natural behavior of many IoT and e-commerce applications, for example.

In the benchANT benchmarking platform, similar workloads are already implemented. Furthermore, all configuration options are also available via the benchmarking backend API. Feel free to [contact](/contact) us for specific workloads.

What Results Does the YCSB Deliver?

The YCSB has been developed and designed to measure performance and scalability metrics.
As a result, it provides text files with time-series measurement data as well as aggregated values of these measurement series at the end of the file.

Every 10 seconds (configurable), it reports the total number of operations performed so far and the number of operations of the last 10 seconds. These values are additionally broken down by operation class (READ, INSERT, …) and output together with statistical values.

At the end of the measurement, the most important performance KPIs are output:

  • Runtime
  • Throughput [ops/sec]
  • Latency (avg)
  • Latency (min)
  • Latency (max)
  • Latency (95th percentile)
  • Latency (99th percentile)

These aggregated performance KPIs are a good starting point for a performance analysis of the measured setup.
A graphical presentation of the results must be done independently and is necessary for a good understanding of the results and, if necessary, for data cleaning.
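
For orientation, the aggregated block at the end of a YCSB result file looks roughly like this (the numbers are purely illustrative):

[OVERALL], RunTime(ms), 1800000
[OVERALL], Throughput(ops/sec), 5432.1
[READ], Operations, 1086420
[READ], AverageLatency(us), 812.4
[READ], MinLatency(us), 210
[READ], MaxLatency(us), 48500
[READ], 95thPercentileLatency(us), 1650
[READ], 99thPercentileLatency(us), 3900
[INSERT], Operations, 4345680
...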

These last two points – data cleansing and graphical preparation – have also been seamlessly integrated by benchANT into its benchmarking platform, so that no additional manual work is required and the performance KPIs can be evaluated directly.

How Do I Run a YCSB MongoDB Benchmark?

The procedure for performing a benchmark with the YCSB is similar for all databases. As an example, a concrete YCSB benchmark for MongoDB in the AWS Cloud will be described in order to address database-specific subtleties in distributed cloud systems.

The following sections describe the procedure on a technical level to implement the required steps from Figure X. Since benchmarking is a continuous process in which different configurations of cloud resources, DBMS configurations and benchmark configurations are examined iteratively, a special focus is placed on automation here.

1. Allocate AWS Cloud Resources.

First, you can of course create the necessary VM resources to install MongoDB using the EC2 Web Console or the EC2 API. However, this is time-consuming, especially if you want to benchmark many VM flavors.
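
For a single VM, this direct route via the EC2 API looks roughly like the following AWS CLI sketch (placeholder IDs; credentials, key pair and security group are assumed to exist), and every additional flavor or cluster node means repeating such calls:

aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type m5.xlarge \
  --count 1 \
  --key-name my-benchmark-key \
  --security-group-ids sg-xxxxxxxx \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ycsb-mongodb}]'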

Therefore, it is a good idea to use automation tools like Ansible, Chef or Terraform. These allow the declarative specification of a deployment model to generate the desired resources and abstract the direct interaction with the EC2 API. The deployment model can thus be easily extended and reused for additional VM flavors.

2. Install and Configure MongoDB Database

The installation of MongoDB on the created VMs can be done manually via the CLI, or you can again use the selected automation tool. Often there are also ready-made templates for the installation of MongoDB, for example the MongoDB playbook for Ansible.
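
As a minimal sketch of the manual route on Ubuntu (assuming the MongoDB package repository has already been added), the important detail for benchmarking is that mongod listens on the VM's network interface and not only on localhost:

sudo apt-get update && sudo apt-get install -y mongodb-org
# let mongod listen on all interfaces so the YCSB client VM can reach it
sudo sed -i 's/bindIp: 127.0.0.1/bindIp: 0.0.0.0/' /etc/mongod.conf
sudo systemctl enable --now mongod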

3. Install the Benchmark

To install the YCSB, a separate VM is also required. It should be ensured that this VM has enough cores to sustain the desired workload intensity (i.e. the configured thread count) and does not become a bottleneck.

It is also a good idea to constantly monitor the resource utilization of the benchmarking VM, either with an OS tool like top/htop or a comprehensive monitoring framework like Telegraf + InfluxDB.

Java 8 must now be installed on this VM, and the desired YCSB release must be downloaded.
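
A minimal sketch of this step on an Ubuntu VM, using the 0.17.0 release from the official GitHub releases page as an example:

sudo apt-get install -y openjdk-8-jre-headless
curl -L -O https://github.com/brianfrankcooper/YCSB/releases/download/0.17.0/ycsb-0.17.0.tar.gz
tar xzf ycsb-0.17.0.tar.gz
cd ycsb-0.17.0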

4. Configure and Run the Workload

To configure the desired workload, you can either take the predefined workload templates and extend them, or you can define a complete workload of your own via the command line.
It is important to know that the YCSB supports two phases: Load and Run.
In the Load phase, the initial data records are written into the database, i.e. it is a 100% insert workload. In the Run phase, the defined mix of CRUD operations is executed.

Below we show the Load and Run commands to start an IoT-driven workload in which 2,000,000 records are written to the database in the Load phase and then an 80% insert / 20% read workload mix is executed in the Run phase.

Load Phase Command-Line (YCSB Client 0.17.0):

-db site.ycsb.db.MongoDbClient -s -p mongodb.url='mongodb://<ip:port>/ycsb?w=1&j=false' -p workload=site.ycsb.workloads.CoreWorkload -p maxexecutiontime=1800 -threads 64 -p recordcount=2000000 -p operationcount=10000000 -p fieldcount=10 -p fieldlength=500 -p requestdistribution=zipfian -p insertorder=ordered -p readproportion=0.2 -p updateproportion=0.0 -p insertproportion=0.8 -p scanproportion=0.0 -p maxscanlength=1000 -p scanlengthdistribution=uniform -p core_workload_insertion_retry_limit=3 -p core_workload_insertion_retry_interval=3 -load

Run-Phase Command-Line (YCSB Client 0.17.0):

-db site.ycsb.db.MongoDbClient -s -p mongodb.url='mongodb://<ip:port>/ycsb?w=1&j=false' -p workload=site.ycsb.workloads.CoreWorkload -p maxexecutiontime=1800 -threads 64 -p recordcount=2000000 -p operationcount=10000000 -p fieldcount=10 -p fieldlength=500 -p requestdistribution=zipfian -p insertorder=ordered -p readproportion=0.2 -p updateproportion=0.0 -p insertproportion=0.8 -p scanproportion=0.0 -p maxscanlength=1000 -p scanlengthdistribution=uniform -p core_workload_insertion_retry_limit=3 -p core_workload_insertion_retry_interval=3 -p insertstart=2000001 -t

After successfully running the workload, the cloud resources used should be released again to avoid incurring unnecessary costs. Further benchmark runs should repeat the process from the beginning on fresh resources to avoid caching and other artifacts that can affect the measurement results.

5. Analyse and Visualize Data

Since the YCSB itself only provides the results as text, CSV or JSON, further steps are required to merge and visualize the data from several measurement series. For this purpose, it is useful to implement appropriate scripts in R or Python, which parse the YCSB results and convert them into a suitable data format for analysis or visualization, for example Dataframes in Python. In addition, there are a number of tools that enable a standardized visualization of the results from the data frames, for example Seaborn, Bokeh or Plotly.

All the necessary benchmarking process steps, from allocating the cloud resources to the visual preparation of the measurement results, are integrated and automated in the Benchmarking Platform from benchANT. This means that a benchmarking process is now also possible for significantly less specialized IT experts.

Conclusion

The YCSB is a functional and important NoSQL database benchmark. It enables important performance analyses and performance comparisons.

It can be used for manual distributed database benchmarks. It is also available for automated cloud database benchmarks in the benchANT platform.

Which benchmarks do you use?

And which databases would you like to benchmark?

Bonus #1: How Do I Integrate a New Database into the YCSB?

The YCSB already has a large number of DBMS integrations as shown above. This is especially due to the simple integration process of new databases into the YCSB.

Any database that can perform the following 5 basic operations can be integrated into the YCSB:

  • Read a single record
  • Perform a range scan
  • Update a single record
  • Insert a single record
  • Delete a single record

This applies to almost all known databases.

The process of integrating a new database into the YCSB is relatively simple:

  1. Fork the GitHub YCSB repository
  2. Create your own project folder
  3. Extend the com.yahoo.ycsb.DB class with your own constructor
  4. Optional: create an init() method for the DB
  5. Implement the 5 database methods (see the sketch below)
  6. Compile the database interface layer
  7. Test the database layer
  8. Execute the YCSB client
  9. Optional: open a pull request on GitHub
  10. Optional: maintain the GitHub project folder

Detailed instructions of the integration process can be found in the Wiki of the YCSB GitHub repository as well as here.
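
A minimal skeleton of such a binding could look like the following sketch (assuming a YCSB release from 0.17.0 onward, where the base package is site.ycsb; older releases use com.yahoo.ycsb instead, and the class and method bodies here are only illustrative placeholders for a fictional “AcmeDB”):

package site.ycsb.db;

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.Vector;

import site.ycsb.ByteIterator;
import site.ycsb.DB;
import site.ycsb.DBException;
import site.ycsb.Status;

/** Hypothetical binding for a fictional "AcmeDB", for illustration only. */
public class AcmeDbClient extends DB {

  @Override
  public void init() throws DBException {
    // Open the client connection here, e.g. based on properties
    // passed on the command line via -p acmedb.url=... (getProperties()).
  }

  @Override
  public void cleanup() throws DBException {
    // Close the client connection here.
  }

  @Override
  public Status read(String table, String key, Set<String> fields,
      Map<String, ByteIterator> result) {
    // Read a single record and copy the requested fields into "result".
    return Status.OK;
  }

  @Override
  public Status scan(String table, String startkey, int recordcount,
      Set<String> fields, Vector<HashMap<String, ByteIterator>> result) {
    // Perform a range scan over "recordcount" records starting at "startkey".
    return Status.OK;
  }

  @Override
  public Status update(String table, String key, Map<String, ByteIterator> values) {
    // Update the given fields of a single record.
    return Status.OK;
  }

  @Override
  public Status insert(String table, String key, Map<String, ByteIterator> values) {
    // Insert a single new record.
    return Status.OK;
  }

  @Override
  public Status delete(String table, String key) {
    // Delete a single record.
    return Status.OK;
  }
}

Once such a class is compiled into a binding module, the YCSB client can select it via -db site.ycsb.db.AcmeDbClient, analogous to the MongoDB example above.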

If you have any questions about implementing a database in the YCSB, please feel free to contact us. We will be happy to help you with this, or can take over this task.

Bonus #2: How Do I Integrate a New Workload into the YCSB?

While numerous new databases have been added to the YCSB in recent years, not a single new workload has been added since the release of the YCSB.

This is in no way because adding a new workload is complicated or time-consuming. On the contrary, it is relatively simple. The reason for this is that adding a new workload to the GitHub repository also entails a responsibility to integrate and maintain all DB bindings with that workload. Of course, no one wants to go to this trouble voluntarily.

Therefore, the integration of new workloads is only done on (private) forks of the YCSB and only for the databases that are relevant for the individual benchmark. This is why one finds many benchmark results with deviating workloads.

Creating a new YCSB workload is a simple process once you understand the structure and the workload configuration parameters.

In order to define a YCSB workload, the data set on the one hand and the transaction set on the other hand have to be specified. These workload specifications can be integrated into the YCSB either via a new parameter file or via a new Java class, as sketched below.
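
As a sketch of the parameter-file route, a hypothetical scan-heavy workload could be stored, for example, as workloads/scanheavy and then passed to the YCSB client via -P workloads/scanheavy (illustrative values):

workload=site.ycsb.workloads.CoreWorkload
recordcount=5000000
operationcount=1000000
readproportion=0.2
scanproportion=0.7
insertproportion=0.1
updateproportion=0
maxscanlength=100
scanlengthdistribution=uniform
requestdistribution=latest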

A detailed implementation guide is available on the YCSB-GitHub-Wiki.

With benchANT, individual workloads can be configured and executed using a clear GUI.