Everything you need to know about Koalas!



This article was published as a part of the Data Science Blogathon.


Introduction


A key aspect of working with big data is the data frame, and Spark and Pandas are the two most popular kinds. Spark is well suited to handling large, distributed data, whereas Pandas is not; in contrast, the Pandas API and syntax are easy to use. What if the user could get the best of both worlds? A library called Koalas makes that possible, saving users from having to choose between the two. Hence, this article!

This article first explains the motivation for using Koalas and then covers the library variants available with different Spark versions. It then discusses the differences between Koalas and Pandas and walks through tests that verify those differences. Its purpose is to help the reader build a strong foundation in the topic. With that foundation in place, the article discusses what data scientists should consider when using Koalas and ends with a summary and key takeaways. Let's get started.


Why Koalas?

The following issues motivated the introduction of the Koalas library into the Spark ecosystem [2]:

Apache Spark is missing many features that are often required in data science. In particular, creating and drawing charts is an essential capability that almost all data scientists use every day. Data scientists generally like the pandas API, but when they need to scale up their workloads, it is hard to convert their code to the PySpark API. This is because the PySpark API is difficult to learn compared to Pandas and has many limitations.

Koalas Library Variants

To use Koalas in a Spark notebook, you need to import the library, and there are two options available. The first option is "databricks.koalas", and prior to PySpark version 3.2.x this was the only option available. From 3.2.x onward, another library called "pyspark.pandas" was introduced, whose name aligns more closely with the pandas API [3]. Spark recommends using the latter, as the former will soon stop being supported.
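As a minimal sketch, the two import variants look like this (the tiny sample data frame is illustrative, not from the article):

# Prior to PySpark 3.2.x: the standalone Databricks package
import databricks.koalas as ks
kdf = ks.DataFrame({"a": [1, 2, 3]})

# PySpark 3.2.x and later: the pandas API bundled with PySpark
import pyspark.pandas as ps
psdf = ps.DataFrame({"a": [1, 2, 3]})

Both forms expose essentially the same pandas-style API; only the import changes.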

Koalas vs Pandas

Simply put, you can think of Koalas as a PySpark data frame wrapped in a pandas-style interface. In Koalas, you get all the advantages of the Spark data frame together with the interaction style of Pandas. The Koalas API brings together Spark's speed and Pandas' superior usability to create a powerful and versatile API. The concept is illustrated in the figure below.

The main similarity between Pandas and Koalas is that the APIs of the two libraries look alike. That is, where pandas uses pd.DataFrame(), Koalas uses the same call, i.e. ks.DataFrame(). The difference between the two data frames, however, is what makes Koalas really special: a Koalas data frame is distributed, much like a Spark data frame [1]. Pandas, unlike the Spark libraries, runs on a single node on the driver rather than across the worker nodes, so it cannot scale. The pandas API on Spark (aka Koalas), in contrast, works in much the same way as the Spark libraries.
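As a hedged sketch of that similarity (the sample data is illustrative):

import pandas as pd
import pyspark.pandas as ps  # Koalas, i.e. the pandas API on Spark

data = {"id": [1, 2, 3], "name": ["a", "b", "c"]}

pdf = pd.DataFrame(data)  # lives entirely on the driver node
kdf = ps.DataFrame(data)  # same constructor call, but distributed across the cluster

print(pdf.head())
print(kdf.head())  # same API call, executed as Spark jobs under the hood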

To verify these differences, let's use a sample program (available on GitHub) to run a few tests.

The first test examines how count operations behave in a Databricks environment. When the same count operation is run against the Spark, Koalas, and Pandas data frames respectively, something different is observed for each. The output is depicted graphically in the figure below. It confirms that the pandas data frame does not use the worker nodes: no Spark jobs appear when operations are performed on a pandas data frame (see the diagram below). The Spark and Koalas data frames behave differently here: Spark creates jobs to complete the count operation, and those jobs run on two different workers (i.e., machines). This test confirms two things:

Firstly, Spark and Koalas are no different in how they work. Secondly, Pandas is not scalable as the data load increases (it always works on a single node on the driver, regardless of the size of the data), whereas Koalas is distributed in nature and can scale as the size of the data grows. A minimal sketch of this test follows.
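Here is an illustrative version of the count test, assuming a Databricks notebook where spark is predefined (the row count and variable names are assumptions, not the article's notebook):

import pyspark.pandas as ps

sdf = spark.range(10_000_000)  # Spark data frame
kdf = sdf.pandas_api()         # Koalas data frame (pandas_api() in PySpark >= 3.2; the older Koalas package used to_koalas())
pdf = sdf.toPandas()           # plain pandas data frame, collected to the driver

sdf.count()  # triggers Spark jobs on the workers
len(kdf)     # also triggers Spark jobs: Koalas stays distributed
len(pdf)     # no Spark jobs: pandas runs only on the driver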

The second test in the sample program is a performance check for the different data frames: the execution time of the count operation was measured. The table below clearly shows that, unlike Pandas, the record-counting operation takes a noticeable amount of time in Spark and Koalas. This supports the claim that, underneath, both are nothing but a Spark data frame. Another important point to note here is that Pandas' count performance is well ahead of the other two.
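One simple way to reproduce such a timing check (a sketch reusing the illustrative frames from the previous snippet):

import time

for name, count_fn in [("spark", sdf.count), ("koalas", lambda: len(kdf)), ("pandas", lambda: len(pdf))]:
    start = time.time()
    count_fn()
    print(f"{name}: {time.time() - start:.3f}s")

As in the article's table, the pandas count should come back fastest, since it involves no Spark job scheduling at all.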

The third test again confirms that the two are the same: if the two entities share the same underlying structure, there are not many operations Databricks needs to perform to convert between them. To check this, a performance test was run in a Databricks notebook that measured the completion time of converting a Spark data frame to a Koalas and a Pandas data frame, respectively. The output shows that the conversion time for Koalas is negligible compared to that for Pandas. This is because Spark has very little to do, thanks to the similar structure of the Koalas data frame.
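A sketch of the two conversions (again assuming sdf is a Spark data frame as above):

kdf = sdf.pandas_api()  # near-instant: merely wraps the existing Spark plan
pdf = sdf.toPandas()    # expensive: collects every row over the network to the driver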

Evaluating the Koalas Read API with a Complex Data Structure in a Delta Table

The most common way of maintaining data in a modern data lake is the Delta format. Azure Databricks Delta tables support ACID properties similar to transactional database tables. It's worth checking how Koalas (the pandas API) works with Delta tables that hold a complex nested JSON structure. So let's get started.

Sample Data Structure

The sample data shown here has two columns: first, the bank branch ID (simple data), and second, the department description (a complex nested JSON structure). This data is stored as a Delta table.

Sample Data – Code

Sample data can be created using the code below. The complete codebase is available on GitHub.

import json

# create payloads
payload_data1 = {"EmpId": "A01", "IsPermanent": True, "Department": [{"DepartmentID": "D1", "DepartmentName": "Data Science"}]}
payload_data2 = {"EmpId": "A02", "IsPermanent": False, "Department": [{"DepartmentID": "D2", "DepartmentName": "Application"}]}
payload_data3 = {"EmpId": "A03", "IsPermanent": True, "Department": [{"DepartmentID": "D1", "DepartmentName": "Data Science"}]}
payload_data4 = {"EmpId": "A04", "IsPermanent": False, "Department": [{"DepartmentID": "D2", "DepartmentName": "Application"}]}

# create data structure
data = [
    {"BranchId": 1, "Payload": payload_data1},
    {"BranchId": 2, "Payload": payload_data2},
    {"BranchId": 3, "Payload": payload_data3},
    {"BranchId": 4, "Payload": payload_data4}
]

# dump data to json
jsonDataList = []
jsonData = json.dumps(data)
jsonDataList.append(jsonData)

# parallelize json data
jsonRDD = sc.parallelize(jsonDataList)

# store the data in a spark data frame (the original listing was truncated here;
# this line is the usual reconstruction, and the full code is on GitHub)
df = spark.read.json(jsonRDD)

Store the Temporary Data in a Delta Table

Now persist the temporary employee data created above in a Delta table using the code shown here.

table_name = "/testautomation/EmployeeTbl"

(df.write
   .mode("overwrite")
   .format("delta")
   .option("overwriteSchema", "true")
   .save(table_name))

dbutils.fs.ls("/testautomation/")

Read the Complex Nested Data Using the Koalas Data Frame

import pyspark.pandas as ps

pdf = ps.read_delta(table_name)
pdf.head()

On executing the above code, the output below appeared.

Read the Complex Nested Data Using a Spark Data Frame

df = spark.read.load(table_name)
display(df)

Here is the output I see on executing the above code.

Figure 7: Complex JSON data, shown through the display function

The above results show that the pandas API (Koalas) does not render the complex nested JSON structure well when calling the head() function. This goes against the core principle of the Koalas library, whose ultimate purpose is to provide a distributed mechanism through pandas-style functions. However, there is a workaround: you can use the display() function with the Koalas data frame.
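A sketch of that workaround (display() is a Databricks notebook utility, and to_spark() converts the Koalas frame back to a Spark data frame in case display() does not accept the Koalas frame directly on your runtime):

display(pdf)             # renders the nested structure in the notebook
display(pdf.to_spark())  # fallback: convert back to a Spark data frame first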

Conclusion

This article has laid a strong foundation on the Koalas library for the reader. The various tests conducted to verify the difference between Pandas and Koalas indicate that Koalas is nothing but a Spark data frame paired with the pandas API. The post also discussed the limitation that prevents pandas' head() function from properly displaying nested JSON data. In short, Koalas is a good choice as the primary method for analyzing and transforming big data, but be aware of its limitations so that you have a fallback plan in case some APIs don't work as expected.

Key Takeaways

By allowing data engineers and data scientists to interact with big data more efficiently, Koalas improves productivity. Make sure you do some research before using Koalas for complex nested JSON structures, as its API may not produce the desired results. Koalas bridges the gap between the pandas API, which runs on a single node, and distributed data at scale.

The media shown in this article is not owned by Analytics Vidhya and is used at the sole discretion of the author.
