
Building a word counter efficiently

Learn how to efficiently read and process information from files using Go standard library utilities


In this short article, I dig into a few details that make your software better when working with files. We look at how file reading can be done in Go using buffered reads to speed up data access and improve the overall behaviour of our program.

I will show you a basic example of file reading based on a real-world utility.

A real example of file reading

Most Unix-based systems ship a small utility called wc, which stands for word count and whose man documentation states:

WC(1)                            User Commands                           WC(1)

NAME
       wc - print newline, word, and byte counts for each file

SYNOPSIS
       wc [OPTION]... [FILE]...
       wc [OPTION]... --files0-from=F

DESCRIPTION
       Print newline, word, and byte counts for each FILE, and a total line if
       more than one FILE is specified.  A word is a non-zero-length  sequence
       of characters delimited by white space.

       With no FILE, or when FILE is -, read standard input.

       The  options  below  may  be  used  to select which counts are printed,
       always in the following order: newline, word, character, byte,  maximum
       line length.

       -c, --bytes
              print the byte counts

       -m, --chars
              print the character counts

       -l, --lines
              print the newline counts

       --files0-from=F
              read  input  from the files specified by NUL-terminated names in
              file F; If F is - then read names from standard input

       -L, --max-line-length
              print the maximum display width

       -w, --words
              print the word counts

       --help display this help and exit

       --version
              output version information and exit

AUTHOR
       Written by Paul Rubin and David MacKenzie.

REPORTING BUGS
       GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
       Report wc translation bugs to <http://translationproject.org/team/>

COPYRIGHT
       Copyright © 2017 Free Software Foundation, Inc.   License  GPLv3+:  GNU
       GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
       This  is  free  software:  you  are free to change and redistribute it.
       There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       Full documentation at: <http://www.gnu.org/software/coreutils/wc>
       or available locally via: info '(coreutils) wc invocation'

GNU coreutils 8.28               January 2018                            WC(1)

A small utility that, given a file, counts the number of words it contains.

Running wc over Moby Dick.

To test the output of the wc command, we will use the book Moby Dick by Herman Melville, downloaded from The Project Gutenberg as a txt file. You can download the text file from here.

To count the number of words in our moby.txt file, we run the following command:

wc -w moby.txt

The result of executing the wc command is

214112 moby.txt

This means that, according to wc's word counting algorithm, the file moby.txt contains $214112$ words. We also measure the execution time, so that we can later compare it with our own version.

Execution time

time wc -w moby.txt

wc -w moby.txt 0,03s user 0,00s system 99% cpu 0,034 total

The time wc needed to count all the words was 0.034 seconds.

Building a simple wc in Go

To understand how to efficiently read and process information from files in Go, we will implement a simple word counter using the following algorithm:

  1. Read the file in chunks using a buffer-based reader.
  2. On each chunk, detect how many words it contains, considering a word to be a whitespace-separated text string.
  3. Increment the word counter for each word found.
  4. Move to the next data chunk.
  5. Go back to step 2 until there are no more chunks.

Open the file for reading

The initial step in every program that reads data from files is to get read access to the required file. In Go, we can open a file for reading with the following code snippet.

// open the file for reading
f, err := os.Open(name)
if err != nil {
    log.Fatal(err)
}
defer f.Close()

Create a buffer reader

The next step is to create a buffered reader to handle our input data efficiently.

// create a buffered reader
reader := bufio.NewReader(f)
var part chunk

The value of the chunksize parameter defines the number of bytes that will be read into each data chunk. To choose a good value, we could try arbitrary sizes and measure the performance. In this case, however, we will set it according to our system's memory page size, which allows the program to fill whole memory pages with our chunk data. On my current laptop, the page size is 4096 bytes.
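
If you want to check the page size on your own machine from Go, a tiny sketch like the following (not part of the final counter) can query it through os.Getpagesize:

package main

import (
    "fmt"
    "os"
)

func main() {
    // print the memory page size reported by the operating system;
    // this is the value we use for chunksize instead of hard-coding 4096
    fmt.Println("memory page size:", os.Getpagesize())
}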

We also define a custom data type called chunk as follows

const (
    chunksize = 4096
)

type chunk [chunksize]byte

Loop while chunks exist

Next, we need to handle the incoming data chunk by chunk until the file has no more data to feed us. The code block we use for this processing is as follows

for {
    var totalread int
    if totalread, err = reader.Read(part[:]); err != nil {
        break
    }
    // TODO word count algorithm here
}
if err != io.EOF {
    log.Fatal("error reading ", name, ": ", err)     
} else {
    err = nil
}

Implementing a word count algorithm

Now that we know how to read all the file chunks, the most important part is the algorithm that detects and counts how many words the data contains. For simplicity, the word count algorithm we implement is:

previousWord := 0
for i := 0; i < totalread-1; i++ {
    c := part[i]   // current byte
    n := part[i+1] // the byte that follows it
    // a new word starts wherever a separator is followed by a non-separator
    isSep := c == ' ' || c == '\n' || c == '\r' || c == '\t'
    isWord := n != ' ' && n != '\n' && n != '\r' && n != '\t'
    if isSep && isWord {
        wordCount++
        previousWord = i
    }
}
// the loop above never counts a word that starts at the very beginning
// of the chunk, so count one more word here
if previousWord != totalread {
    wordCount++
}
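
Note that the Go standard library already ships a word splitter: bufio.Scanner combined with bufio.ScanWords. A minimal sketch of an alternative counter built on top of it (countWordsScanner is just an illustrative name, it is not part of the program above) could look like this:

// countWordsScanner counts whitespace-separated words using bufio.Scanner,
// which takes care of words that span internal buffer boundaries for us.
func countWordsScanner(name string) (wordCount int) {
    f, err := os.Open(name)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    scanner.Split(bufio.ScanWords) // split the input into words instead of lines
    for scanner.Scan() {
        wordCount++
    }
    if err := scanner.Err(); err != nil {
        log.Fatal("error reading ", name, ": ", err)
    }
    return
}

We stick to the manual chunk-based version in this article, because the goal is to show how the buffered reads themselves work.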

The complete word count example

package main

import (
    "bufio"
    "fmt"
    "io"
    "log"
    "os"
)

const (
    chunksize = 4096
)

type chunk [chunksize]byte

func countWords(name string) (wordCount int) {
    // open the file for reading
    f, err := os.Open(name)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    // create a buffered reader
    reader := bufio.NewReader(f)
    var part chunk
    wordCount = 0
    for {
        var totalread int
        if totalread, err = reader.Read(part[:]); err != nil {
            break
        }
        previousWord := 0
        for i := 0; i < totalread-1; i++ {
            c := part[i]
            n := part[i+1]
            isSep := c == ' ' || c == '\n' || c == '\r' || c == '\t'
            isWord := n != ' ' && n != '\n' && n != '\r' && n != '\t'
            if isSep && isWord {
                wordCount++
                previousWord = i
            }
        }
        if previousWord != totalread {
            wordCount++
        }
    }
    if err != io.EOF {
        log.Fatal("error reading ", name, ": ", err)
    } else {
        err = nil
    }
    return
}

func main() {
    // count the words of the file passed as the first argument, e.g. ./wc moby.txt
    fmt.Println(countWords(os.Args[1]))
}

As an extra, we can create a simple test to run the countWords function we just designed. Since it uses the testing package, it goes in a separate test file (for example a wc_test.go next to wc.go) that imports fmt and testing:

func TestExample(t *testing.T) {
    count := countWords("moby.txt")
    fmt.Println(count)
}
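
Assuming that layout, the test can be run with the standard test tool:

go test -v -run TestExample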

Compile the counter

As always, we compile it by running the go build command.

go build -o wc wc.go

Run the counter

And to execute the binary on our sample moby.txt file, we run

./wc moby.txt

which returns the result

214364

with an execution time of 0.015 seconds.

time ./wc moby.txt
./wc moby.txt  0,01s user 0,01s system 103% cpu 0,015 total

Is there any major difference?

Without checking the source code of the wc utility, and just by looking at the syscalls made by both applications, this is the information we can extract.
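
The traces below come from the strace tool; commands along these lines (the exact flags originally used may differ) produce the raw trace and the per-syscall summary table respectively:

strace wc -w moby.txt
strace -c wc -w moby.txt
strace ./wc moby.txt
strace -c ./wc moby.txt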

strace of wc

openat(AT_FDCWD, "moby.txt", O_RDONLY)                = 3
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL)             = 0
read(3, "**The Project Gutenberg Etext of"..., 16384) = 16384
.
.
trimmed read(3...) calls
.
write(1, "214112 moby.txt\n", 16214112 moby.txt)      = 16
close(3)                                              = 0
exit_group(0)                                         = ?
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0        78           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         7           close
  0.00    0.000000           0         5           fstat
  0.00    0.000000           0         7           mmap
  0.00    0.000000           0         4           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         3         3 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           fadvise64
  0.00    0.000000           0         5           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                   117         3 total

strace of our custom wc

openat(AT_FDCWD, "moby.txt", O_RDONLY|O_CLOEXEC)      = 3
read(3, "**The Project Gutenberg Etext of"..., 4096)  = 4096
.
.
trimmed read(3...) calls
.
close(3)                                              = 0
write(1, "214364\n", 7214364)                         = 7
exit_group(0)                                         = ?
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 89.33    0.000159           2        78           read
  6.74    0.000012          12         1           write
  3.93    0.000007           4         2           close
  0.00    0.000000           0        19           mmap
  0.00    0.000000           0       113           rt_sigaction
  0.00    0.000000           0         8           rt_sigprocmask
  0.00    0.000000           0         3           clone
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         3           fcntl
  0.00    0.000000           0         2           sigaltstack
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           gettid
  0.00    0.000000           0         1           futex
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         3         2 epoll_ctl
  0.00    0.000000           0         2           openat
  0.00    0.000000           0         1           readlinkat
  0.00    0.000000           0         1           epoll_create1
  0.00    0.000000           0         1           pipe2
------ ----------- ----------- --------- --------- ----------------
100.00    0.000178                   242         2 total 

Apart from the syscalls made by each application, the major difference with respect to our code is that wc uses a read buffer of 16384 bytes instead of our chosen buffer size of 4096 bytes.
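
If we wanted to mirror that, bufio lets us pick the buffer size explicitly with bufio.NewReaderSize instead of bufio.NewReader. A minimal sketch, assuming the rest of countWords stays unchanged and that we also grow the chunk so each Read can ask for that many bytes:

const chunksize = 16384 // match the read size observed in wc's trace

type chunk [chunksize]byte

// f is the *os.File opened earlier; bufio.NewReader would give us the
// default buffer size (currently 4096 bytes), NewReaderSize lets us choose.
reader := bufio.NewReaderSize(f, chunksize)

Whether this actually makes the counter faster is something to measure rather than assume.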

Conclusion

We have shown how to efficiently read and process file input data using bufio.NewReader, with a very tiny toy word count algorithm as an example. Our implementation returns a result of 214364 words compared to wc's 214112, an error of 252 extra words, most likely caused by how our simple algorithm counts words at chunk boundaries; however, the goal of this article was to introduce good file reading techniques, not to focus on result accuracy. I hope you start using this approach when processing file data.




Thanks for checking this out and I hope you found the info useful! If you have any questions, don't hesitate to write me a comment below. And remember, if you would like to see more content on this topic, just let me know and share this post with your colleagues, co-workers, FFF, etc.

You are free to use the content, but mention the author (@MrSergioAnguita) and link the post!
Last updated on Oct 14, 2021 17:52 CEST