Table of Content
Introducing markov chains in Python
So far, we read about how a Markov Chain works, the concept of transition matrix and how we can calculate a future state probability. However, we need to be able to create our own Markov Chains from our input data. This post will show you, how you can create your own markov chain using Python 3+
Working with Markov Chains, our first approach
So, what markov chain implementation should I use to build my own chain?. To be honest, I have no idea of most used frameworks, so lets find them in Github: https://github.com/search?q=markov+chain
As you can see there is a good candidate for Python called markovify
. Lets try to use it and see how it works by running the provided example with a custom corpus
Downloading a NLP corpus for training
After a rapid search on Internet looking for some Spanish free text training data, I found a small corpus at https://www.corpusdata.org/spanish.asp
Instead of downloading full corpus, I just downloaded a sample $1/100$ for testing purposes. The sample itself contains 2 Million words according to corpusdata.org. If you want to use this same corpus text, you can download it from https://www.corpusdata.org/span/samples/text.zip
Creating a virtualenv for our Python project
As a recommended way of development, we first create a virtualenv for Python and set it as active
.
|
|
In this case, the created virtualenv is called venv
.
Installing Markovify
To install markovify
we just need to create a virtualenv and install it as a project dependency with pip.
|
|
|
|
A successful install must show you the message Successfully installed markovify-0.9.3 unidecode-1.2.0
.
Building our markov chain with Markovify
To build our markov chain, we need to write some code (obviously). In following script, I am telling that:
- I want to create a markov chain
- I want to use as input data the content of the file called
corpus.txt
- I want to create an order 2 markov chain
- I want to measure the allocated memory while running
- I want to create a random sentence
|
|
After building our script, its time to get some measurements and see if the markov chain works as expected or not. To do so, we will run our script example.py
|
|
The output of running example.py
is
|
|
So far, it seems it works and generates some meaninful arbitrary length texts.
Measuring execution time of markovify
Now that we check, that markovify generates some output from input data, lets see how long does it take to generate the markov chain and compute an output.
To do so, we use time
command of Unix system, and we run our test program several times. (remember the more times the more significant data you get)
|
|
So, measuring markovify we discover that markov chain building time (data reading + generation + predicting) is about $4,55$ seconds in average for our example corpus text, taking into account that example corpus text has $2344418$ words according to wc
and weights 12Mb
Markovify CPU usage
During execution, we also take a look to CPU usage and we realize that only a single core was used while computing. However, this single core was set at $100\%$ during execution, overloading the core instead of distributing the work among all available cores.
markovify conclusions
After testing markovify as the most starred library on Github for building markov chains, here are my own conclusions about it
Advantages | Disadvantages |
---|---|
Easy to download and install | |
Configurable state size (chain order) | |
results can be exported as JSON | |
easy to create random sentences | |
does not return next probable state | |
does not return transition probability information | |
cannot get most probable state | |
performance bottleneck to a single core processing | |
no multithread support | |
no recommended for high performance demanding applications |
Conclusions
At the end, we learn how to build our own markov chain for our tiny toy projects. However, as seen by our simple analysis this is not enough for high performance demanding applications or high data use environments.
In the Part 3 of this series, I will design a Markov Chain implementation in Go to fulfill the disadvantages found in this example project and see if there is some room for improvements. Stay tuned!
Continue reading
You can continue reading more post about markov chains:
References
Subscribe, donate or become premium
💬 Share this post in social media
Thanks for checking this out and I hope you found the info useful! If you have any questions, don't hesitate to write me a comment below. And remember that if you like to see more content on, just let me know it and share this post with your colleges, co-workers, FFF, etc.