Collections in Scala

Note: The features image of this blog has been borrowed from Scala documentation

A Collection refers to data structures which group together data (values) and allow access to the data. The individual values in a collection are called elements of the collection. The elements are accessed by ‘traversing’ the collection. An analogy could be a physical file folder. A file folder holds individual papers. Thus a file folder could be termed as a collection (of papers). The individual papers are elements of the file folder. To get to a particular paper in the folder, we need to ‘traverse’ the folder by turning from one page to the next till we have got to the paper we want.

Scala has a rich library of collections like Lists, Map, Set, Array etc. These collections are part of a class hierarchy and thus inherit many common methods using which you can traverse the collection (access individual elements), change the collection (add, remove  elements), perform test if the collection contains an element, check if a collection is empty or not etc. This blog will cover some aspects of the class hierarchy and will also touch upon, at a high level, some of the collections available in scala library.

Mutable and Immutable collections in conjunction with val and var

Collections in Scala can be broadly categorised as mutable or immutable. Mutable means that the collection can be changed, most likely by adding, removing or modifying the elements of the collection. Thus a mutable collection can have side effects and you should refer to the documentation to check if your collection will be modified by the method called. An example of such a collection is an Array.

Immutable collections do not allow any changes to the collection. Whenever an operation is performed on a collection which will change the collection then a new collection is created which has the desired changes. The original collection is not affected. The immutable collections are more suitable for functional programming style. A collection would most likely have a mutable and an immutable version (with different names). This allows flexibility to the programmers to write code in their preferred style of programming.

Depending on the collection you use, the collection could be found in one of the following packages scala.collection (root package), scala.collection.mutable, scala.collection.immutable and scala.collection.generic.  Collections in scala.collection.immutable are used by default by scala. If you want to use a mutable version of the collection, you’ll have to import scala.collection.mutable in your code or give fully specified name.

In the blog on functional programming, we discussed that variables declared as vals cannot be reassigned and variables declared as vals can be reassigned. Thus following rules apply when using vals and vars with mutable and immutable collections

//immutable Set and val. cannot reassign sim and cannot change elements of sim
val sim = Set(1,2,3) //you cannot use 'add' or += 

// immutable set and var. Cannot change elements of sim2 but can reassigne sim2
var sim2 = Set(1,2,3)
sim2+= 4 //same as sim2 = sim2+4
println(sim2)
Set(1, 2, 3, 4)

//mutable Set and val. cannot reassign sm1 but can change elements of sm1
val sm1 = scala.collection.mutable.Set(1,2,3)
sm1.add(4)
println(sm1)
Set(1, 2, 3, 4) //sm1 got changed because sm1 is mutable
res5: Unit = ()

//immutable Set and val. cannot reassign sm1 but can change elements of sm1
import scala.collection.mutable
var sm2 = mutable.Set(1,2,3) //mutable Set
sm2+=4 
println(sm2)
Set(1, 2, 3, 4)

The collections in scala.collection package is the usually same as the immutable version. As seen in the following code, if we use scala.collection.Set, we get immutable behaviour

val sc = scala.collection.Set (1,2,3)
sc+4
//sc.add(4) //add (which is available in mutable version) is not available in root package
println(sc)
Set(1, 2, 3)

Note: There is another package, collection.generic but as per Scala documentation, we should not find a need to use it.

Class hierarchy of collections

Scala documentation has provided following diagram which depicts class hierarchy for collections for scala.collection, scala.collection.mutable and scala.collection.immutable packages.

You’ll notice a common pattern for collections. The base class for each collection is Traversable. Traversable is extended by Iterable. After that, the collections could be broadly divided into one of 3 categories – Sequence (Seq), Set and Map. As most collections inherit from Traversable and Iterable, they share much functionality. For example, you will be able to use functions like map, filter, foldLeft, foldRight, count, reduce, view, iterator, zip etc.

Traversable and Iterable are abstract classes. They cannot be instantiated. However, you could do following with them to get a collection:

//this will not compile as Traversable cannot be instantiated
val t1 = new Traversable[Int](1,2,3)

//but this works and gives us a List
val t1 = Traversable[Int](1,2,3)
t: Traversable[Int] = List(1, 2, 3)

In case this confuses you, the difference between the two is that when we use new, we create a new object of type Traversable. This is not allowed because Traversable is an abstract class. In the second scenario, we are not creating an object of type Traversable. We are actually calling ‘apply’ method on Traversable’s companion object which gives us a list. Companion objects and apply are discussed in this blog

Seq, Set and Map are traits which inherit from Iterable. Each of these adds specific methods and override methods from Traversable and Iterable to improve performance. For example, Seq trait add methods for indexing, search, sorting, update (if elements of the collection can be changed), comparison etc. A special category of mutable sequences called Buffers allows addition and removal of elements in the collection. Set adds methods for operations on sets like intersection, union and difference.

You’ll notice that methods names also follow a pattern:

  • element +: collection – element is added in front of collection. Notice the ‘:’. It is right associative.
  • collection +: element – element is added at the end of the collection
  • Mutable collections have methods with ‘=’ sign. This is because mutable collections can be reassigned the new value. Eg +=, ++=, -+

Sequence Collections: These type of collections store data in a specific order. Elements of Seq collection can be  accessed either by specifying an index (IndexedSeq) if you know the location of the element you need or by linearly traversing the Seq till you find the element you need (LinearSeq). Both IndexedSeq and LinearSeq are fixed in size. Their size is specific at the time of creation of the corresponding collection. The mutable version of Seq also contains a third category called Buffers whose size can be changed after initial definition. Following are some common collections of each category

  • IndexedSeq – Array, String
  • LinearSeq – List
  • Buffer – ListBuffer, ArrayBuffer
//create Array (subclass of IndexedSeq). 
val a = Array(1,2,3)
//you can access an element by specifying an index of the element.
a(1)
res0: Int = 2

//Array is immutable but not its elements. 
// once we define an array of a specific size and type, we cannot change it
//but we can change its elements
a(1) = 4

a.map(println _)
1
4 // we changed 2nd element from 2 to 4
3

//String is also a subclass of IndexedSeq and thus support indexing
val s = new String("Hello")
s(0)
res1: Char = H

//create List (subclass of LinearSeq)
val l = List(1,2,3)

//though you can specify index for List, it is not efficient
//because the performance of accessing elements using index in list
//is dependent on size of the list. Accessing 4th element takes more time
//than accessing 1st element
l(1)
res2: Int = 2

//efficient way to access a list is using head and tail methods
l.head
res3: Int = 1
l.tail
res4: List[Int] = List(2, 3)

//a List is immutable both w.r.t. its size, type and elements. 
//You cannot run following operation on list to change its element
//l(1) = 4 // will not compile

//ListBuffer is mutable. It can change the size of the collection
import scala.collection.mutable.ListBuffer
val b = ListBuffer(1,2,3)
//index an element
b(1)
res5: Int = 2
//add a new element in the end in the same collection
b.append(4)
println(b)
ListBuffer(1, 2, 3, 4)

b(3)
res8: Int = 4

Note: Both Array and List are immutable. You cannot change their size or type once they are defined. However, with Arrays, you can change the elements of the array. With Lists, you cannot change the elements of the List. Thus using List is closer to functional programming as you code cannot have the side effect of changing the element of a list (you can write a function passing it an array and your function can change the elements of the array (side effect). This cannot happen with a List)

def printCollection(s:Array[Int]) {
s(1)= 0 //malicious code. Changes our collection
s.map(println _)
}

a.map(println _)
1
4
3

printCollection(a)
1
0 //side effect of using Array whose elements are mutable even though Array itself isn't.
3

Note: In Seq, using a collection like c(1) gives access to the element at the index. As we will discuss later, similar constructs for Map and Set have different results.

Set Collections: A Set stores values like a Sequence but with two important differences. The values are unique in a Set and the values are not stored in order. While in a seqence, if you add value 1 then 2 then 3, then while traversing the list, you’ll get the 1st value as 1, 2nd value as 2 and 3rd value as 3. In Set however, this order is not guaranteed.

//repetition is not allowed
val s = Set(1,2,5,5)
s: scala.collection.immutable.Set[Int] = Set(1, 2, 5)

Map Collections: A map stores values in key-value pairs. The only constraint is that key values need to be unique.  You could create a Map either by using ‘arrow’ (->) or using Tuple2 (a,b)

val m1 = Map((1,"1"), (2,"2"), (3,"3"))
m1: scala.collection.immutable.Map[Int,String] = Map(1 -> 1, 2 -> 2, 3 -> 3)

val m2 = Map((1->"1"), (2->"2"), (3->"3"))
m2: scala.collection.immutable.Map[Int,String] = Map(1 -> 1, 2 -> 2, 3 -> 3)

//keys need to be unique
val m2 = Map((2->"1"), (2->"2"), (3->"3"), (3->"4"))
m23: scala.collection.immutable.Map[Int,String] = Map(2 -> 2, 3 -> 4)

You’ll notice in last example that when keys were repeated, scala compiler picked only one key and the latest value. Thus “2”was picked over “1” for key 2 because “2” was inserted after “1”. Similarly “4”was picked over “3”.  Think of  it as an equivalent of an UPSERT: update if the key exists or insert into the Map if it does not.

‘apply’ method in colletions

In addition to inheriting common methods from Traversable and Iterable, Seq, Set and Map also implement PartialFunction trait. This gives them two functions from the trait, apply and isDefinedAt. This ‘apply’ method is different from the ‘apply’ method which scala compiler uses to use an object as a function call (discussed here). The  ‘apply’ function in PartialFunction trait has the following syntax

abstract def apply(v1: A): B

It takes an argument of type A and returns a result of type B (types A and B could be same)

In Sequence collections, ‘v1’ denotes the position within a collection. If the position is valid, the value at that position is returned. If the position is not valid, an exception is thrown. Thus in Sequence, apply is used for positional indexing

val t = List[Int](1,2,3)

//return element at position 2 (index starts from 0, thus index value 1 for position 2)
t(1)
res0: Int = 2

t(4)
java.lang.IndexOutOfBoundsException

In Set, ‘apply’ is used to check if a value is a member of the set or not. The value we want to test for is specified in ‘v1’. If the value is a member of the set, apply returns true, else false. Thus apply in Set is membership test.

val t = Set[Int](1,2,3)
t(3)
res0: Boolean = true
t(4)
res1: Boolean = false

In Map, the value v1 specifies the key. If the key is valid, the value associated with the key is returned. Else an exception is returned.

val t = Map(1->"1",2->"2",3->"3")
t(3)
res0: String = 3
t(4)
java.util.NoSuchElementException: key not found: 4

Thread safe collections

To create thread-safe maps and sets, you’ll need to mix in SynchronizedMap and SynchronizedSet traits when creating new map or set.

import scala.collection.mutable._

object ThreadSafeMap {
  def giveSafeMap[A,B]:Map[A,B] = {new HashMap[A,B] with SynchronizedMap[A,B]}
}

val m1 = ThreadSafeMap.giveSafeMap[Int,String]
m1: scala.collection.mutable.Map[Int,String] = Map()

m1 + (1->"1")
res0: scala.collection.mutable.Map[Int,String] = Map(1 -> 1)
import scala.collection.mutable._

object ThreadSafeSet {
  def giveSafeMap[A]:Set[A] = {new HashSet[A] with SynchronizedSet[A]}
}

val m1 = ThreadSafeSet.giveSafeMap[Int]
m1: scala.collection.mutable.Set[Int] = Set()

m1 + (1)
res0: scala.collection.mutable.Set[Int] = Set(1)

Note: The recommended way for create concurrent code is to use java.util.concurrent library.

  • View

You could consider view as a lazy twin of a collection. The concept of ‘lazy’ was discussed in blog on functions. Like lazy people, anything ‘lazy’ in Scala  doesn’t like to do the job immediately but rather postpones it.

When we use collections and call a method on them, the methods is executed instantly. In the following example, the map method immediately increments all elements of the list, l1 returning a new List, l2

val l1 = List(1,2,3,4)
l1: List[Int] = List(1, 2, 3, 4)

val l2 = l1.map(_+1)
l2: List[Int] = List(2, 3, 4, 5)

Calling map again on the list l2 returns a new list l3 after applying the function specified in map to elements of l2

val l3 = l2.map(_+2)
l3: List[Int] = List(4, 5, 6, 7)

In general, if we want to apply a function f1 and then function f2 to all elements of a list using map then every time, the 1st map will create a List to store the intermediate result. The creation of this intermediate list is an overhead as we are not interested in the intermediate list.

You could avoid the creation of the intermediate list by using a view. We can convert a collection into a view by calling ‘view’ on the collection. When we do so and call map method for example on the view, the compiler will not instantly execute the function mentioned in the map method but will create an intermediate light weight sequence which tells the compiler that it has  to execute the function in the map on the list.

val l2 =  l1.view.map(_+1)
l2: scala.collection.SeqView[Int,Seq[_]] = SeqViewM(...)

In above example, _+1 has not been executed on the elements of the list yet. SeqViewM is an intermediate sequence telling the compiler that it has to execute _+1 on each element of the list (M in the name stands for map).

We can now call the 2nd map on this sequence. As l2 is already a view (SeqView is a type of view), the map will not be executed but another intermediate sequence will be created which will tell the compiler that it has to execute the function in the 2nd map on this sequence (just like what happened during 1st map).

val l3 = l2.map(_+2)
l3: scala.collection.SeqView[Int,Seq[_]] = SeqViewMM(...)

The 2Ms denote 2 maps need to be executed on this sequence.

Finally, when we have collated everything we need to do on our collection, we can call ‘force’to do the execution (execute two maps) and give us the final result

l3.force
res0: Seq[Int] = List(4, 5, 6, 7)

The advantage of using view is that we avoided creating intermediate collections when we executed _+1 and _+2. This happens because l3 has the information that it needs to execute the two functions in two maps on the elements of the sequence. It can pick individual elements of the list and apply the two functions and store the final value instead of picking an element, executing the first function, creating an intermediate collection and then again picking the element from the intermediate collection, executing the second function and creating the final collection.

View are particularly useful when you do not need all elements of the intermediate collection which gets created (thus creating a proper collection would not be optimal). As an example, say you want to find first 10 even numbers from a list of 1000 random numbers. You could first create a list of 1000 random numbers, then find all even numbers in that list and then take only 10 elements from the new list. In this process, the step in which we find all even numbers in the list is an overhead as we know that we will use only 1st 10 of those even numbers.

val r = scala.util.Random
//create 1000 random numbers
val l:List[Int] = List.tabulate(1000)(x=>r.nextInt())
l: List[Int] = List(1524601704, -699286875, 860762285, 754182362, -1982826258, ...)

//find all even numbers. We know we will use only 1st 10 of these numbers
val l1 = l.filter(x=>(x%2 == 0))
l1: List[Int] = List(1524601704, 754182362, -1982826258, ...)

//take 1st 10 even numbers
val l2 = l1.take(10)
l2: List[Int] = List(1524601704, 754182362, -1982826258, ...)

We could optimise the above by using a view instead

//use a view instead of creating the proper list of even numbers.
//Instead of creating a proper list, a light weight sequence is created 
//print is to explain concept of lazy processing. See explanation below
val lv1 = l.view.filter(x=>{println x; (x%2 == 0)})
lv1: scala.collection.SeqView[Int,List[Int]] = SeqViewF(...)

//now pick only 10 elements. Again we are not indulging by creating proper list
val lv2 = lv1.take(10)
lv2: scala.collection.SeqView[Int,List[Int]] = SeqViewFS(...)

// now use force to get proper list 
lv2.force
1524601704
754182362
-1982826258
res5: List[Int] = List(1524601704, 754182362, -1982826258, ...)

You can check that a view is a collection by using ‘size’ method on the view. In above example

//l1 is list of even numbers. The
l1.size
res1: Int = 485 //your count could be different depending on the random numbers generated 

lv1.size
res3: Int = 485

The above example demonstrate the optimisation we achieve by not creating intermediate Lists and also by not executing the function inside the filter method. Before we discuss explanation of lazy processing, consider another example, this time using map.

val l:List[Int] = List.tabulate(10)(x=>r.nextInt())

//using map without view, all elements get printed
l.map(x=> {println("Iam "+x);(x%2 == 0 )})
Iam -1536815810
Iam -897797241
Iam 1018386965
Iam 1354621000
Iam -1096988436
Iam -1475325747
Iam 668143914
Iam -421197613
Iam 2130521837
Iam -1689853857
res6: List[Boolean] = List(true, false, false, true, true, false, true, false, false, false)

//using view, nothing gets printed
val l3 = l.view.map(x=> {println("Iam "+x);(x%2 == 0 )})
l3: scala.collection.SeqView[Boolean,Seq[_]] = SeqViewM(...)

//print happens when the element is actually used
l3(0)
Iam -1536815810
res7: Boolean = true

l3(1)
Iam -897797241
res8: Boolean = false

When we use a view, the processing of the elements gets postponed until the time we actually use the element as demonstrated both in filter and map methods in which the print statement didnt execute till we called the force method (in filter example) and accessed individual elements (in map method)

So you may term a View as a sequence of delayed operations. It has a reference to the collection, and a bunch of operations to be applied when it is forced. The way operations to be applied are recorded is rather complicated, and not really important.

  • Iterator

Iterator provides the mechanism for traversing a collection. An Iterator itself is not a collection. An object of type Iterator  has two methods hasNext and next which are used to traverse a collection. ‘hasNext’ tells you whether the collection (to which Iterator belongs) has more elements or not. Calling ‘next’ gives you the next element. For example, say you have a file which you want to read and print on console. You could use Source which is an Iterator provided in scala.io package. Source provides a convenient way to read the file. To see how Source works, create a simple text file, greetings.txt with following content “greetings my friends to world of Scala”

Now start scala from the same directory where you saved greeting.txt and type following code on console.

scala> def readFile (f:String) {
| val si = scala.io.Source.fromFile(f)
| while (si.hasNext) print(si.next)
| }
readFile: (f: String)Unit

//call our function

scala> readFile("greeting.txt")
greetings my friends to world of Scala

An important aspect to keep in mind when using iterators is that they get exhausted as they are used and cannot be used again

scala> def readFile (f:String) {
| val si = scala.io.Source.fromFile(f) //create iterator
| println("iterating first time")
| while (si.hasNext) print(si.next)
| println
| println("iterating second time")
| while (si.hasNext) print(si.next) //this will not print anything
| }
readFile: (f: String)Unit

scala> readFile("greeting.txt")
iterating first time
greetings my friends to world of Scala
iterating second time

So if we create an iterator to traverse a collection and use it, we will have to create another new iterator to again traverse the collection

scala> def readFile (f:String) {
| val si = scala.io.Source.fromFile(f) //create iterator
| println("iterating first time")
| while (si.hasNext) print(si.next)
| println
| val si2 = scala.io.Source.fromFile(f) //create iterator again 
| println("iterating second time")
| while (si2.hasNext) print(si2.next)
| }
readFile: (f: String)Unit

scala> readFile("greeting.txt")
iterating first time
greetings my friends to world of Scala
iterating second time
greetings my friends to world of Scala

Here is another example to show that the iterator cannot be traversed more than once.

val l = List(1,2,3,4,1)
val l1 = l.iterator
//as we call size (which traverses the iterator), the iterator gets empty and cannot
//be used again. Thus the foreach will not print anything
l1.size
res0: Int = 5

l1.foreach(println)
res1: Unit = ()


But if we do not use size before foreach, print statements work

val l = List(1,2,3,4,1)
val l1 = l.iterator

l1.foreach(println)
1
2
3
4
1
res0: Unit = ()

//but not size is empty because iterator is exhausted now
l1.size
res1: Int = 0

Another example of Iterator is provided in combination method of Lists in List blog.

Ideally, you shouldn’t use an Iterator directly. Instead, you’ll use other methods like map or filter to traverse a collection. To demonstrate, you could get an instance of the iterator from a collection as follows:

scala> val l = List(1,2,3)
l: List[Int] = List(1, 2, 3)

//not the right way

scala> val it = l.iterator
it: Iterator[Int] = non-empty iterator

scala> while (it.hasNext) println(it.next)
1
2
3

//correct way is to use methods like map

scala> l.map(x=>println(x))
1
2
3
res1: List[Unit] = List((), (), ())

 

3 thoughts on “Collections in Scala

Leave a comment