Happy Bear Software

Implementing a dynamically sized array in C

One of the biggest stumbling blocks on moving to C from a high-level scripting language comes when you realize you need to know how big an array needs to be before you use it. In this article I'm going to go through a quick implementation of a dynamically sized array in C.

Not only will this give us a useful bit of library code in C, it will also help to illustrate the sort of performance/memory overhead in higher level languages have to pay down to give users access to built-in variable-length arrays.

Basic C arrays

Here's a quick refresher on what a basic C array looks like, allocated on the stack:

int main() {
  // Declare an array of 3000 integers
  int my_array[3000];
}

This bit of code does the following:

Based on the above, it's clear that C arrays are for the most part syntactic sugar working with its underlying memory handling operations. Some other notable factoids:

There's no easy way to automatically resize an array as its contents expand. You can call realloc on an array you've allocated on the heap to make it larger (more on this later) but ideally the resizing of the array would be automatic.

You can index out of array bounds. There is no bounds checking in C arrays. In other words, there's nothing stopping you from reading a value out of my_array[5000]. Since all this code does is read the memory 5000 elements2 after the location of the pointer my_array, indexing out of an arrays bounds will simply retrieve some as-yet unallocated location in virtual memory. This is most likely not what you intended, and can have potentially dangerous consequences.

When we can afford a few extra clock cycles and the requisite space, it would be nice to abstract these problems away in some sort of underlying data structure. A dynamic array a.k.a a vector does exactly that.

N.B. A vector doesn't solve all of the problems that we have when dealing with collections. It's quite good for lists which are append-only, but if insertion/deletion is a common operation then it's possible that a linked list is a better choice. Linked lists come with their own set of problems that won't quite fit into this article.

Defining a Vector interface

In this example we're just going to create a dynamically sized array of integers. Here's what the definition of a vector interface might look like:

// vector.h

#define VECTOR_INITIAL_CAPACITY 100

// Define a vector type
typedef struct {
  int size;      // slots used so far
  int capacity;  // total available slots
  int *data;     // array of integers we're storing
} Vector;

void vector_init(Vector *vector);

void vector_append(Vector *vector, int value);

int vector_get(Vector *vector, int index);

void vector_set(Vector *vector, int index, int value);

void vector_double_capacity_if_full(Vector *vector);

void vector_free(Vector *vector);

Blow by blow:

This is a very simple plan for the implementation and glosses over a lot of issues like error handling, but it should be enough to get us to a basic working version of our vector.

Implementing the Vector

Here's what an implementation of the interface we defined above might look like:

// vector.c

#include <stdio.h>
#include <stdlib.h>
#include "vector.h"

void vector_init(Vector *vector) {
  // initialize size and capacity
  vector->size = 0;
  vector->capacity = VECTOR_INITIAL_CAPACITY;

  // allocate memory for vector->data
  vector->data = malloc(sizeof(int) * vector->capacity);
}

void vector_append(Vector *vector, int value) {
  // make sure there's room to expand into
  vector_double_capacity_if_full(vector);

  // append the value and increment vector->size
  vector->data[vector->size++] = value;
}

int vector_get(Vector *vector, int index) {
  if (index >= vector->size || index < 0) {
    printf("Index %d out of bounds for vector of size %d\n", index, vector->size);
    exit(1);
  }
  return vector->data[index];
}

void vector_set(Vector *vector, int index, int value) {
  // zero fill the vector up to the desired index
  while (index >= vector->size) {
    vector_append(vector, 0);
  }

  // set the value at the desired index
  vector->data[index] = value;
}

void vector_double_capacity_if_full(Vector *vector) {
  if (vector->size >= vector->capacity) {
    // double vector->capacity and resize the allocated memory accordingly
    vector->capacity *= 2;
    vector->data = realloc(vector->data, sizeof(int) * vector->capacity);
  }
}

void vector_free(Vector *vector) {
  free(vector->data);
}

Using our Vector

In this usage example, we keep things simple and just pass around the pointer to a variable called vector allocated on the stack:

// vector-usage.c

#include <stdio.h>
#include "vector.h"

int main() {
  // declare and initialize a new vector
  Vector vector;
  vector_init(&vector);

  // fill it up with 150 arbitrary values
  // this should expand capacity up to 200
  int i;
  for (i = 200; i > -50; i--) {
    vector_append(&vector, i);
  }

  // set a value at an arbitrary index
  // this will expand and zero-fill the vector to fit
  vector_set(&vector, 4452, 21312984);

  // print out an arbitrary value in the vector
  printf("Heres the value at 27: %d\n", vector_get(&vector, 27));

  // we're all done playing with our vector, 
  // so free its underlying data array
  vector_free(&vector);
}

Tradeoffs

Working at this low a level helps to highlight a number of tradeoffs that we don't necessarily see when coding at break-neck speed in high-level languages like Ruby. For example:

Resizing the vector (specifically, calling realloc()) might be expensive. From the man page:

The realloc() function tries to change the size of the allocation pointed to by ptr to size, and returns ptr. If there is not enough room to enlarge the memory allocation pointed to by ptr, realloc() creates a new allocation, copies as much of the old data pointed to by ptr as will fit to the new allocation, frees the old allocation, and returns a pointer to the allocated memory.

If we ever find ourselves in the situation where there's no room for the data array to expand into, we may be hit with the expensive copy operation. In order to avoid excessive calls to realloc then, we double the capacity of the array each time we expand it.

This strategy has the downside of potentially allocating space we never use, so there's a trade-off to be made between wasted space and performance overhead.

Also, Vector only ever holds integers. If we were to hold an array of void pointers instead of integers, our vector could be used with values of arbitrary types. This would of course incur the overhead of dereferencing a pointer every time we wanted to read a value out of the array, so there's a trade-off there between flexibility and an (albeit small) performance hit.

Related Posts

Footnotes

1 "Allocating memory on the stack" really just means taking the current position of the stack pointer, assigning it to the variable name and then incrementing the stack pointer by the size of the type of the variable. It's an extremely quick operation compared with allocating memory on the heap with malloc and friends. Unlike with malloc however you lose any stack allocated memory once you pop your current stack frame (i.e. execution reaches the end of the current function).

2 By elements we mean the size of the type of the array. If an array holds elements of type int which each occupy four bytes, then integer_array[5000] means the location in memory of integer_array offset by (5000 * 4) bytes.

EDIT: Corrections based on feedback.