# Analyzing the algorithmic complexity of the Kotlin API’s distinctBy function

A few times when I’ve conducted technical interviews with candidates for *Software Engineer* roles at the company where I was working, I’ve come across people who actually know how they might filter or sort a list, however when I delve into the depth of the problem, I start to find difficulties.

Many times, as software developers, we get used to using libraries and APIs that have most complex solutions already implemented in a single function, making our lives easier. This doesn’t mean that we put aside our curiosity and ability to investigate how things work, at least in its most basic form.

In this post, we are going to specifically analyze a very used function of the Kotlin API, this is `distinctBy`

.

You can find the official documentation by clicking at this link.

# Applying an algorithm in real life

Suppose you have a list of products in which, for reasons not relevant to this post, there are repeated elements. We clearly don’t want to show the user a list of duplicate items. So we know that we should filter the list.

We have a couple of alternatives:

# Filtering the list with a loop

We could instantiate a new empty list and looping through the original list, adding element by element only if the current *item *of the iteration is not found in this new list.

Here we have a slight problem, that “only if it is not found in this new list” implies a search. Let’s assume that we haven’t used a *hash table* and are using a simple *ArrayList*.

We would have something like this:

`fun filterList(products: List): List {`

val newList = emptyList()

products.forEach { product ->

if(findProduct(products, product) == false) {

newList.add(product)

}

}

return newList

}

fun findProduct(list: List, productToFind: Product): Boolean {

list.forEach { product ->

if(product.id == productToFind.id) {

return true

}

}

return false

}

Doing a quick analysis, the `filterList`

function will call the `findProduct`

function N times, where N is the number of elements in the original list.

The `findProduct`

function will take **up to N times** to complete. Assuming that the element to search for is in the last position of the list, the worst case (*worst case scenario*) will be **O(n)**.

This means that the complexity of our algorithm will be **O(n²)**. That is, it could take **n²** units of time to finish. And if we talk about Memory Consumption (space complexity), we will have a new list, that is **O(n)**.

# Filtering the list with the distinctBy function

Great. It turns out that we know of the existence of a function called `distinctBy`

in which we must pass it as a parameter, a function that indicates the “key” which will serve as an indicator of whether an element is repeated or not. This can be a first name, a last name, or a code.

We just need to write:

`val newList = products.distinctBy { product -> product.code }`

¡And, *voilá!*

Ready, one line of code was enough and we already have a new list that does not contain repeated products with the same code.

But… do we know what he does inside?

Let’s analyze the source code of this function.

As we can see, it is an `inline`

extension function applicable to any data structure that implements the `Iterable`

interface.

This function depends on two generic classes that it will need for its operation: one that determines the data types of the list to return (the same as the original list) denoted by the letter **T**, and another that determines the differentiator of the objects, denoted with the letter **K**.

In the first two lines, two new objects are instantiated. The first is a *hash table* or dictionary data structure that at the same time implements the *Set* interface (HashSet): it will not allow repeated elements thanks to the fact that it contains a *hash table* inside. The second is a simple list that will serve as the resulting filtered list.

Iterates over the original list (shown as *this* in the *for* statement) and executes the `selector`

function that is passed as a parameter. This `selector`

function is passed the current element of the iteration.

That means that in each iteration, the function `{ product -> product.code }`

will be returning what we want to serve as a unique identifier, that’s why its value is assigned to a variable named “key”.

Once we get the unique identifier (which in our example is the product code), we proceed to insert it into our *hash table*.

As we can see, this insertion occurs inside a conditional *if*.

The Set interface states that the *add* function will return *true* if, and only if, the element was inserted. And at what point is the element **not** **inserted**? Well, when it already exists (it is verified with a hash table!).

Once we know that this identifier did not exist in our identifier *hash* table (`set.add(key)`

returns *true*), then we proceed to insert the element (the product) in our new list (`list.add(e)`

).

No lookup is performed to see if the iteration item already exists in the list. We only try to insert its identifier (the product code) in the hash table and if this operation is successful (returns *true*) then we just add this product to the filtered list.

Since the whole original list is iterated anyway, the time complexity will be **O(n)**, however, the verification of whether or not an element already exists in the list is returns in constant time **O(1)**, due to our *hash* table inside our *HashSet* which will tell us if the identifier was inserted or not .

Of course, when creating two new objects, the memory space used will be a little larger, but not as relevant since the purpose of the *HashSet* is to store only the identifiers to know if an element has already been created. exists or not in the original list.

# Summary

- The
`distinctby`

function uses a*HashSet*structure to speed up the cost of time with the verification of the existence of an element in our list. - Unlike a
*HashMap*, a*HashSet*needs only one element at insert time and it cannot be repeated, it acts as key and value at the same time (actually, if we look at the internal implementation, a generic static`Object`

object is used as value). - Of course,
`distinctBy`

offers us better performance than if we opted for an implementation of two iterations, one inside the other. However, we could also have used a*HashMap*and inserted on top of it using the*product code*as the key and the*product*as the value. The result would be a clean product structure.

What did you think? Do you also often analyze the functions that we use on a daily basis and that make our lives easier? I would like to know if you know of any other functions or APIs that also make use of these data structures efficiently and simplify our day-to-day developments 😄.