Back
Featured image of post How to build a markov chain in Python

How to build a markov chain in Python

Part 2 of the series about Markov chain in where we explain how to build and code a markov chain using Python and analyze the results

Table of Content

Introducing markov chains in Python

So far, we read about how a Markov Chain works, the concept of transition matrix and how we can calculate a future state probability. However, we need to be able to create our own Markov Chains from our input data. This post will show you, how you can create your own markov chain using Python 3+

Working with Markov Chains, our first approach

So, what markov chain implementation should I use to build my own chain?. To be honest, I have no idea of most used frameworks, so lets find them in Github: https://github.com/search?q=markov+chain

most relevant opensource projects related to markov chains
most relevant opensource projects related to markov chains

As you can see there is a good candidate for Python called markovify. Lets try to use it and see how it works by running the provided example with a custom corpus

Downloading a NLP corpus for training

After a rapid search on Internet looking for some Spanish free text training data, I found a small corpus at https://www.corpusdata.org/spanish.asp

Instead of downloading full corpus, I just downloaded a sample $1/100$ for testing purposes. The sample itself contains 2 Million words according to corpusdata.org. If you want to use this same corpus text, you can download it from https://www.corpusdata.org/span/samples/text.zip

Creating a virtualenv for our Python project

As a recommended way of development, we first create a virtualenv for Python and set it as active.

1
2
python3 -m venv venv
source venv/bin/activate

In this case, the created virtualenv is called venv.

Installing Markovify

To install markovify we just need to create a virtualenv and install it as a project dependency with pip.

1
pip install markovify
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
Collecting markovify
  Using cached https://files.pythonhosted.org/packages/80/d2/e773267ac227a500d17224dc60c4a3b8e7015f843c467ebab925a5aa15c3/markovify-0.9.3.tar.gz
Collecting unidecode (from markovify)
  Using cached https://files.pythonhosted.org/packages/9e/25/723487ca2a52ebcee88a34d7d1f5a4b80b793f179ee0f62d5371938dfa01/Unidecode-1.2.0-py2.py3-none-any.whl
Building wheels for collected packages: markovify
.
.
.
Installing collected packages: unidecode, markovify
  Running setup.py install for markovify ... done
Successfully installed markovify-0.9.3 unidecode-1.2.0

A successful install must show you the message Successfully installed markovify-0.9.3 unidecode-1.2.0.

Building our markov chain with Markovify

To build our markov chain, we need to write some code (obviously). In following script, I am telling that:

  • I want to create a markov chain
  • I want to use as input data the content of the file called corpus.txt
  • I want to create an order 2 markov chain
  • I want to measure the allocated memory while running
  • I want to create a random sentence
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import markovify
import tracemalloc
import os

trace_enabled = False
if trace_enabled:
    # starting the monitoring
    tracemalloc.start()

# Get raw text as string.
with open("markov-test/data/corpus.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text, retain_original=True, state_size=2)
text_model = text_model.compile()

# Print a randomly-generated sentence
print(text_model.make_sentence())

if trace_enabled:
    # displaying the memory
    print(tracemalloc.get_traced_memory())
    # stopping the library
    tracemalloc.stop()

After building our script, its time to get some measurements and see if the markov chain works as expected or not. To do so, we will run our script example.py

1
python3 example.py

The output of running example.py is

1
Mejoras continuas 360 - El grado en que acá no pasa nada ; que quieren cobrar medio dignamente el largo y cerdoso mucho más grave cuando esas tierras permanecen indebidamente cultivadas . Y ciertamente a veces 5 en las acciones emprendidas para lograr una meta a . Relato b.

So far, it seems it works and generates some meaninful arbitrary length texts.

Measuring execution time of markovify

Now that we check, that markovify generates some output from input data, lets see how long does it take to generate the markov chain and compute an output.

To do so, we use time command of Unix system, and we run our test program several times. (remember the more times the more significant data you get)

1
2
3
4
5
python3 example.py  4,57s user 0,06s system 99% cpu 4,645 total
python3 example.py  4,59s user 0,06s system 99% cpu 4,644 total
python3 example.py  4,52s user 0,06s system 99% cpu 4,583 total
python3 example.py  4,56s user 0,07s system 99% cpu 4,639 total
python3 example.py  4,64s user 0,02s system 99% cpu 4,664 total

So, measuring markovify we discover that markov chain building time (data reading + generation + predicting) is about $4,55$ seconds in average for our example corpus text, taking into account that example corpus text has $2344418$ words according to wc and weights 12Mb

Markovify CPU usage

During execution, we also take a look to CPU usage and we realize that only a single core was used while computing. However, this single core was set at $100\%$ during execution, overloading the core instead of distributing the work among all available cores.

CPU usage of markovify
CPU usage of markovify

markovify conclusions

After testing markovify as the most starred library on Github for building markov chains, here are my own conclusions about it

AdvantagesDisadvantages
Easy to download and install
Configurable state size (chain order)
results can be exported as JSON
easy to create random sentences
does not return next probable state
does not return transition probability information
cannot get most probable state
performance bottleneck to a single core processing
no multithread support
no recommended for high performance demanding applications

Conclusions

At the end, we learn how to build our own markov chain for our tiny toy projects. However, as seen by our simple analysis this is not enough for high performance demanding applications or high data use environments.

In the Part 3 of this series, I will design a Markov Chain implementation in Go to fulfill the disadvantages found in this example project and see if there is some room for improvements. Stay tuned!

Continue reading

You can continue reading more post about markov chains:

References



💬 Share this post in social media

Thanks for checking this out and I hope you found the info useful! If you have any questions, don't hesitate to write me a comment below. And remember that if you like to see more content on, just let me know it and share this post with your colleges, co-workers, FFF, etc.

Please, don't try to hack this website servers. Guess why...