Learn how to efficiently read and process information from files using Go standard library utilities
In this small article, I dig into some caveats that can make your software better when working with files. We will look at how file reading can be done in Go using buffered reads to speed up data access and improve the overall behaviour of our programs. I will walk you through a basic file-reading example based on a real use case.
A real example of file reading
Most Unix-based systems contain a small utility called wc, which stands for word count. Its man documentation describes it as:
A small utility that, given a file, counts the number of words in it.
Running wc over Moby Dick.
To test the output of the wc command, we will use the book Moby Dick by Herman Melville, downloaded from The Project Gutenberg as a txt file. You can download the text file from here.
To count the number of words in our moby.txt file, we run the following command:
```shell
wc -w moby.txt
```
The result of executing the wc command is
```
214112 moby.txt
```
This means wc reports that the file moby.txt contains 214112 words, according to wc's word-counting algorithm.
We also measure the execution time, so that we can later compare it with our own version.
Execution time
```shell
time wc -w moby.txt
wc -w moby.txt  0,03s user 0,00s system 99% cpu 0,034 total
```
The time required to count all words by wc was 0.034 seconds.
Building a simple wc in Go
To understand how to efficiently read and process information from files in Go, we will implement a simple word count in Go using the following algorithm:
1. Read the file in chunks using a buffer-based reader.
2. On each chunk, detect how many words there are, considering a word to be a whitespace-separated text string.
3. Increment the word counter accordingly.
4. Move to the next data chunk.
5. Go to step 2 until no chunks remain.
Open the file for reading
The initial step in any program that reads data from files is to get read access to the required file. In Go, we can open a file with the following code snippet.
```go
// open the file for reading
f, err := os.Open(name)
if err != nil {
	log.Fatal(err)
}
defer f.Close()
```
Create a buffer reader
The next step is to create a buffered reader to handle our input data efficiently.
```go
// create a buffered reader
reader := bufio.NewReader(f)
var part chunk
```
The value of the chunksize parameter defines the number of bytes that will be filled into each data chunk. In order to choose a good value, we could try arbitrary values and measure the performance. However, in this case we will set it according to our system's memory page size, which allows the program to fill whole memory pages with our chunk data. On my current laptop, the page size is 4096 bytes.
We also define a custom data type called chunk as follows
```go
const (
	chunksize = 4096
)

type chunk [chunksize]byte
```
Loop while chunks exists
Next, we need to handle the incoming data of each chunk until no more data is fed from the file. The code block we use for this processing is as follows:
```go
for {
	var totalread int
	if totalread, err = reader.Read(part[:]); err != nil {
		break
	}
	// TODO word count algorithm here
}
if err != io.EOF {
	log.Fatal("error reading ", name, ": ", err)
} else {
	err = nil
}
```
Implementing a word count algorithm
Now that we know how to read all the file chunks, the most important part is the algorithm that detects and counts how many words are in the data file. For simplicity, the word count algorithm we implement is:
```go
package main

import (
	"bufio"
	"io"
	"log"
	"os"
)

const (
	chunksize = 4096
)

type chunk [chunksize]byte

func countWords(name string) (wordCount int) {
	// open the file for reading
	f, err := os.Open(name)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// create a buffered reader
	reader := bufio.NewReader(f)
	var part chunk
	wordCount = 0
	for {
		var totalread int
		if totalread, err = reader.Read(part[:]); err != nil {
			break
		}
		previousWord := 0
		for i := 0; i < totalread-1; i++ {
			c := part[i]
			n := part[i+1]
			isSep := c == ' ' || c == '\n' || c == '\r' || c == '\t'
			isWord := n != ' ' && n != '\n' && n != '\r' && n != '\t'
			if isSep && isWord {
				wordCount++
				previousWord = i
			}
		}
		if previousWord != totalread {
			wordCount++
		}
	}
	if err != io.EOF {
		log.Fatal("error reading ", name, ": ", err)
	} else {
		err = nil
	}
	return
}
```
As an extra, we can create a simple execution example to run the designed countWords function.
Apart from the syscalls made by each application, the major difference with respect to our code is that wc uses a buffer size of 16384 bytes instead of our selected 4096 bytes.
Conclusion
We have shown how you can efficiently read and process file input data using bufio.NewReader, with a very tiny toy word count algorithm as an example. Even though our implementation returns 214364 words compared to wc's 214112, an excess of 252 words, the goal of this article was to introduce good file reading techniques, not to focus on result accuracy. I hope you can start using this approach when processing file data.
Thanks for checking this out, and I hope you found the info useful! If you have any questions, don't hesitate to leave me a comment below. And remember, if you would like to see more content like this, just let me know and share this post with your colleagues, co-workers, FFF, etc.