Simplify Grouping with 'Group By All' in Databricks

Author: Muhamamd Umar Amanat
: 28^th September 2023

While analyzing our data we often used "group by" to aggregate our data and get valuable insight from it. "Group by" clause is one of the widely used clauses for aggregating our data.

Why "Group by All"?

The first question that came to mind is why to use "group by all" when there is already a "group by" clause present in Spark.
To understand better let's take a simple scenario where I have to do "sum" aggregation on groups of column names as c1, c2, and c3 from table t1 as shown below.

Figure 1. Group by in Databricks, Source: Author

Above 'group by' can be replaced easily by 'group by all' and this will do the same but with minimal and readable code. This is depicted in Figure 2.

Figure 2. 'Group by all' in Databricks, Source: Author

There is another syntax alternate to the code shown in Figure 2. with the same functionality. This syntax is kind of more explicit as shown in Figure 3.

Figure 3. 'Group by all with except' in Databricks, Source: Author

Advantage

Well 'group by all' has a few advantages as listed below:

- No need to specify every column.
- Remove repetitive columns or fields from query
- The feature automatically aggregates using all non-aggregated fields from the select list ¹

Conclusion

Group by is a clause in Databricks and 'ALL' is a parameter of that clause. This parameter is shorthand to add all fields except the aggregation field. This allows us to reduce the repetition of the same fields from the query and generate the same result. This will not increase or decrease the functionality. It's up to the choice of users whichever syntax the user wants to go with. All syntax works the same and has no difference in functionality.

Reference

Contact Me

Let's setup a call for an expert advice and solutions for your challenges.

Call Me

+923178965910

Email Me

umaramanat66@gmail.com

Location

Lahore - Pakistan