Crawling Ethereum Blockchain for automated Smart Contract Extraction

Summarized conclusions about the process of crawling a complete Blockchain network in order to download and collect all existing DApps

In this article I’m going to show you how automated data extraction can be done for massive data analysis processes. The goal is to download all the interesting Blockchain information as a processable file such as a CSV. I’ll guide you through the process of making it possible.

Requirements

To be able to crawl the Blockchain data, we must first have a ledger peer synchronized and running. If you don't have one, you can check this guide about how to set one up.

  • Linux/Ubuntu computer
  • Python 3 Installed
  • A working Ethereum node
  • Basic Git knowledge
  • Some brain

Download Blockchain ETL tool

Step 1: Install PIP and configure the virtualenv

python3 -m pip install --upgrade pip

Collecting pip
  Downloading https://files.pythonhosted.org/packages/a4/6d/6463d49a933f547439d6b5b98b46af8742cc03ae83543e4d7688c2420f8b/pip-21.3.1-py3-none-any.whl (1.7MB)
    100% |████████████████████████████████| 1.7MB 538kB/s 
Installing collected packages: pip
Successfully installed pip-21.3.1

Once pip is up to date, we need to figure out the exact location of the virtualenv application. We can find the exact path with the which virtualenv command:

which virtualenv
/usr/local/bin/virtualenv

Create the virtualenv

virtualenv -p /usr/bin/python3.6 venv

Activate the virtualenv

To activate the new virtual environment, run the following:

source venv/bin/activate

The name of the current virtual environment appears to the left of the prompt. For example: (venv)

Version check

To verify the correct version of Python, run the following:

python -V

Any package you install using pip is now located in the virtual environment project folder, isolated from the global Python installation.
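
If you want to double-check from Python itself that the virtual environment is the one in use, a minimal sketch could look like the following. The helper name in_virtualenv is made up for this example; the check relies on sys.prefix and sys.base_prefix, plus the sys.real_prefix attribute that older virtualenv releases set.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases make sys.base_prefix differ from sys.prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtualenv())
```

Run it with the python binary of the activated environment; outside of it the function should report False.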

Deactivate

When you finish your work in your virtual environment, you can deactivate it by running the following:

deactivate

Delete your virtual environment

To delete your virtual environment, simply delete the project folder. Using the above example, run the following command:

rm -rf venv

Installing Ethereum ETL

To install the tool that downloads the data from the Blockchain in CSV format, we ask pip to install it with the pip install command.

pip3 install ethereum-etl

A successful installation should install the following packages:

Collecting ethereum-etl
  Downloading ethereum-etl-1.10.1.tar.gz (334 kB)
     |████████████████████████████████| 334 kB 2.1 MB/s            
  Preparing metadata (setup.py) ... done
Collecting web3==4.7.2
  Downloading web3-4.7.2-py3-none-any.whl (126 kB)
     |████████████████████████████████| 126 kB 41.7 MB/s            
Collecting eth-utils==1.10.0
  Downloading eth_utils-1.10.0-py3-none-any.whl (24 kB)
Collecting eth-abi==1.3.0
  Downloading eth_abi-1.3.0-py3-none-any.whl (21 kB)
Collecting python-dateutil<3,>=2.8.0
  Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
     |████████████████████████████████| 247 kB 32.0 MB/s            
Collecting click==7.1.2
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
     |████████████████████████████████| 82 kB 685 kB/s             
Collecting ethereum-dasm==0.1.4
  Downloading ethereum_dasm-0.1.4-py3-none-any.whl (44 kB)
     |████████████████████████████████| 44 kB 793 kB/s             
Collecting base58
  Downloading base58-2.1.1-py3-none-any.whl (5.6 kB)
Collecting requests
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
     |████████████████████████████████| 63 kB 732 kB/s             
Collecting parsimonious<0.9.0,>=0.8.0
  Downloading parsimonious-0.8.1.tar.gz (45 kB)
     |████████████████████████████████| 45 kB 1.8 MB/s             
  Preparing metadata (setup.py) ... done
Collecting eth-typing<3.0.0,>=2.0.0
  Downloading eth_typing-2.3.0-py3-none-any.whl (6.2 kB)
Collecting eth-hash<0.4.0,>=0.3.1
  Downloading eth_hash-0.3.2-py3-none-any.whl (8.8 kB)
Collecting cytoolz<1.0.0,>=0.10.1
  Downloading cytoolz-0.11.2.tar.gz (481 kB)
     |████████████████████████████████| 481 kB 47.4 MB/s            
  Preparing metadata (setup.py) ... done
Collecting colorama
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting tabulate
  Downloading tabulate-0.8.9-py3-none-any.whl (25 kB)
Collecting evmdasm
  Downloading evmdasm-0.1.8-py3-none-any.whl (15 kB)
Collecting websockets<7.0.0,>=6.0.0
  Downloading websockets-6.0-cp36-cp36m-manylinux1_x86_64.whl (88 kB)
     |████████████████████████████████| 88 kB 3.5 MB/s             
Collecting lru-dict<2.0.0,>=1.1.6
  Downloading lru-dict-1.1.7.tar.gz (10 kB)
  Preparing metadata (setup.py) ... done
Collecting eth-account<0.4.0,>=0.2.1
  Downloading eth_account-0.3.0-py3-none-any.whl (18 kB)
Collecting hexbytes<1.0.0,>=0.1.0
  Downloading hexbytes-0.2.2-py3-none-any.whl (6.1 kB)
Collecting six>=1.5
  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting urllib3<1.27,>=1.21.1
  Downloading urllib3-1.26.8-py2.py3-none-any.whl (138 kB)
     |████████████████████████████████| 138 kB 45.9 MB/s            
Collecting certifi>=2017.4.17
  Downloading certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
     |████████████████████████████████| 149 kB 23.5 MB/s            
Collecting charset-normalizer~=2.0.0
  Downloading charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
Collecting idna<4,>=2.5
  Downloading idna-3.3-py3-none-any.whl (61 kB)
     |████████████████████████████████| 61 kB 3.1 MB/s             
Collecting toolz>=0.8.0
  Downloading toolz-0.11.2-py3-none-any.whl (55 kB)
     |████████████████████████████████| 55 kB 1.7 MB/s             
Collecting eth-rlp<1,>=0.1.2
  Downloading eth_rlp-0.2.1-py3-none-any.whl (5.0 kB)
Collecting eth-keys<0.3.0,>=0.2.0b3
  Downloading eth_keys-0.2.4-py3-none-any.whl (24 kB)
Collecting eth-keyfile<0.6.0,>=0.5.0
  Downloading eth_keyfile-0.5.1-py3-none-any.whl (8.3 kB)
Collecting attrdict<3,>=2.0.0
  Downloading attrdict-2.0.1-py2.py3-none-any.whl (9.9 kB)
Collecting pycryptodome<4,>=3.6.6
  Downloading pycryptodome-3.14.1-cp35-abi3-manylinux2010_x86_64.whl (2.0 MB)
     |████████████████████████████████| 2.0 MB 44.7 MB/s            
Collecting rlp<3,>=0.6.0
  Downloading rlp-2.0.1-py2.py3-none-any.whl (20 kB)
Building wheels for collected packages: ethereum-etl, cytoolz, lru-dict, parsimonious
  Building wheel for ethereum-etl (setup.py) ... done
  Created wheel for ethereum-etl: filename=ethereum_etl-1.10.1-py3-none-any.whl size=436719 sha256=c2c9639fc6cb24b60320fe794eb62f7db5ff9c9f5951a92f73222e63b73d9bcf
  Stored in directory: /home/sergio/.cache/pip/wheels/d0/d1/45/da3b3e227bd0e30cc39940703976d27ffd759669b68bd5093f
  Building wheel for cytoolz (setup.py) ... done
  Created wheel for cytoolz: filename=cytoolz-0.11.2-cp36-cp36m-linux_x86_64.whl size=1233238 sha256=5b6969b4f5403d57a4e079d56d453e5a07382537d9c09c2ced9c62c88bef7121
  Stored in directory: /home/sergio/.cache/pip/wheels/83/c8/80/3663b26cb65ea0add681ebbf422874089a085bd2bff6d97b25
  Building wheel for lru-dict (setup.py) ... done
  Created wheel for lru-dict: filename=lru_dict-1.1.7-cp36-cp36m-linux_x86_64.whl size=27493 sha256=d1a6574bd8c9134cde5ddc2c801238425300eef52e434102f4d28734069f454b
  Stored in directory: /home/sergio/.cache/pip/wheels/ae/61/6d/2c1544021f8e787b602ed799d88e0d1ab4437ffb09a04102a0
  Building wheel for parsimonious (setup.py) ... done
  Created wheel for parsimonious: filename=parsimonious-0.8.1-py3-none-any.whl size=42723 sha256=50f7e8c9189d8f09faadcb04ffdbcb5a2126cdbe71d8e1e2ae74f737484dc587
  Stored in directory: /home/sergio/.cache/pip/wheels/43/95/c9/c9f7a3f9dc34ebd851739148bd5b42ab35618ea0808388647c
Successfully built ethereum-etl cytoolz lru-dict parsimonious
Installing collected packages: toolz, eth-typing, eth-hash, cytoolz, eth-utils, six, rlp, pycryptodome, hexbytes, eth-keys, urllib3, parsimonious, idna, eth-rlp, eth-keyfile, charset-normalizer, certifi, attrdict, websockets, tabulate, requests, lru-dict, evmdasm, eth-account, eth-abi, colorama, web3, python-dateutil, ethereum-dasm, click, base58, ethereum-etl
Successfully installed attrdict-2.0.1 base58-2.1.1 certifi-2021.10.8 charset-normalizer-2.0.12 click-7.1.2 colorama-0.4.4 cytoolz-0.11.2 eth-abi-1.3.0 eth-account-0.3.0 eth-hash-0.3.2 eth-keyfile-0.5.1 eth-keys-0.2.4 eth-rlp-0.2.1 eth-typing-2.3.0 eth-utils-1.10.0 ethereum-dasm-0.1.4 ethereum-etl-1.10.1 evmdasm-0.1.8 hexbytes-0.2.2 idna-3.3 lru-dict-1.1.7 parsimonious-0.8.1 pycryptodome-3.14.1 python-dateutil-2.8.2 requests-2.27.1 rlp-2.0.1 six-1.16.0 tabulate-0.8.9 toolz-0.11.2 urllib3-1.26.8 web3-4.7.2 websockets-6.0

Checking Ethereum ETL tool is successfully installed

To check that the ethereumetl tool is successfully installed, we just print the tool version to stdout.

ethereumetl --version

which reports the tool version number, as expected.

ethereumetl, version 1.10.1

Data extraction process

Now that the crawling tool is working, we can start the data extraction process. To extract the contract (DApp) information, a few steps need to be done:

  1. Fetch all existing Blocks.
  2. For each block, extract transaction information.
  3. For each transaction, check whether it contains a contract creation (deployment) instruction.
  4. Store to CSV all found contract data for further analysis.
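
To make step 3 more concrete, here is a small sketch of how contract-creation transactions could be spotted in an exported transactions file. It is an illustration under assumptions, not the method used by the commands in this article (those read the contract address from the receipts): it assumes the CSV has hash and to_address columns and that to_address is left empty when a transaction creates a contract. The function name and the sample rows are made up.

```python
import csv
import io

# Transaction input data can be long; raise the csv module's per-field limit.
csv.field_size_limit(2**31 - 1)


def contract_creation_hashes(csv_file):
    """Yield hashes of transactions whose to_address is empty.

    Assumption: the file has 'hash' and 'to_address' columns and leaves
    to_address blank for contract-creation transactions.
    """
    for row in csv.DictReader(csv_file):
        if not (row.get("to_address") or "").strip():
            yield row["hash"]


# Tiny made-up sample, for illustration only.
sample = io.StringIO(
    "hash,from_address,to_address\n"
    "0xaa,0x01,0x02\n"
    "0xbb,0x01,\n"
)
print(list(contract_creation_hashes(sample)))
```

With a real export you would pass an open file handle, for example open("transactions.csv", newline=""), instead of the in-memory sample.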

1. Crawling Ethereum blocks and transaction information

I configured my node to accept connections through the IPC file located at file://$HOME/.ethereum/rinkeby/geth.ipc. If your *.ipc file is located in a different path, update the --provider-uri value accordingly.

ethereumetl export_blocks_and_transactions \
   -w 2 -b 3 \
   --end-block 10337980 \
   --start-block 0 \
   --provider-uri file://$HOME/.ethereum/rinkeby/geth.ipc \
   --blocks-output blocks.csv \
   --transactions-output transactions.csv
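
Exporting more than ten million blocks in a single run can take a long time, so one option is to split the range into smaller chunks that can be run (and re-run) independently. The sketch below only builds and prints the command lines; the chunk size, the helper names and the per-chunk output file names are assumptions made for this example, while the flags are the ones used in the command above.

```python
def block_ranges(start, end, size):
    """Yield inclusive (first, last) block ranges covering start..end."""
    if size <= 0:
        raise ValueError("size must be positive")
    first = start
    while first <= end:
        last = min(first + size - 1, end)
        yield first, last
        first = last + 1


def export_command(first, last, provider_uri):
    """Build the argument list for one chunk, using the flags shown above."""
    return [
        "ethereumetl", "export_blocks_and_transactions",
        "--start-block", str(first),
        "--end-block", str(last),
        "--provider-uri", provider_uri,
        "--blocks-output", f"blocks_{first}_{last}.csv",
        "--transactions-output", f"transactions_{first}_{last}.csv",
    ]


# Print one command per chunk; $HOME is left for the shell to expand.
for first, last in block_ranges(0, 250_000, 100_000):
    print(" ".join(export_command(first, last, "file://$HOME/.ethereum/rinkeby/geth.ipc")))
```

Each printed line can be pasted into the shell, and a failed chunk can be repeated without starting over from block 0.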

2. Crawling Ethereum transactions and logs

The next step is to extract the transaction hashes from the transactions exported in the previous step.

Note: if you encounter issues when requesting transaction information, make sure your node is running with the --txlookuplimit=0 flag, which makes the node index the hashes of all transactions instead of only the most recent ones.

ethereumetl extract_csv_column \
   --input transactions.csv \
   --column hash \
   --output transaction_hashes.txt
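
Conceptually, this command reads one column of a CSV file and writes its values to a text file, one per line. A rough Python equivalent, written as a sketch rather than as the tool's actual implementation, is shown below; the function name extract_column and the sample data are made up.

```python
import csv
import io

# Some exported fields can be very long; raise the csv module's per-field limit.
csv.field_size_limit(2**31 - 1)


def extract_column(csv_in, column, text_out):
    """Write every value of `column` to text_out, one per line; return the row count."""
    count = 0
    for row in csv.DictReader(csv_in):
        text_out.write(row[column] + "\n")
        count += 1
    return count


# Made-up sample standing in for transactions.csv.
source = io.StringIO("hash,block_number\n0xaa,1\n0xbb,2\n")
target = io.StringIO()
print(extract_column(source, "hash", target), "values written")
print(target.getvalue(), end="")
```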

Then export receipts and logs:

ethereumetl export_receipts_and_logs \
   --transaction-hashes transaction_hashes.txt \
   --provider-uri file://$HOME/.ethereum/rinkeby/geth.ipc \
   --receipts-output receipts.csv \
   --logs-output logs.csv

3. Crawling Ethereum contract addresses

First extract contract addresses from receipts.csv

ethereumetl extract_csv_column \
   --input receipts.csv \
   --column contract_address \
   --output contract_addresses.txt
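
Most transactions do not create a contract, so the contract_address column is expected to be empty in many receipts, and the resulting text file may contain blank lines. If you want to inspect or reuse the list outside the tool, a small cleaning step like this sketch can help. The function name is made up, and lower-casing the addresses is a choice made for this example.

```python
def clean_addresses(lines):
    """Return unique, non-empty, lower-cased addresses in first-seen order."""
    seen = set()
    result = []
    for line in lines:
        address = line.strip().lower()
        if address and address not in seen:
            seen.add(address)
            result.append(address)
    return result


# Made-up sample standing in for the lines of contract_addresses.txt.
sample_lines = ["\n", "0xAbC1\n", "\n", "0xabc1\n", "0xdef2\n"]
print(clean_addresses(sample_lines))
```

With the real file, pass the open file object directly, since iterating over it yields its lines.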

3.1 Crawling Ethereum contract bytecode

Once we have all contract addresses in our contract_addresses.txt file, we can crawl their bytecode.

ethereumetl export_contracts \
   --contract-addresses contract_addresses.txt \
   --provider-uri file://$HOME/.ethereum/rinkeby/geth.ipc \
   --output contracts.csv

Remember that you can tune --batch-size and --max-workers for performance.

4. Store all contract data as CSV

At this point, you should have all existing contracts downloaded and stored into contracts.csv.

Conclusion

We have seen a way to fetch data from the Ethereum ledger (the Rinkeby network in this case) that is also applicable to other testnets and networks. Consider it as another option when looking for data for your projects, instead of relying on third-party provider APIs like Etherscan. You can get faster and cheaper results if you know how to handle them.

Drawbacks

After running the entire process, these are the disadvantages I have seen:

  • Synchronizing a node takes time and a lot of SSD space.
  • blockchain-etl is a very slow tool, so the data extraction also takes a long time.
  • Duplicate contracts may exist in the results if they are found in different transactions, so you must take this into account.
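
One way to read the duplicate-contracts drawback is that the same bytecode can be deployed several times at different addresses. Under that reading, a sketch like the following groups the exported contracts by a hash of their bytecode so that repeated deployments become visible. It assumes contracts.csv has address and bytecode columns; the function name and the sample rows are made up.

```python
import csv
import hashlib
import io
from collections import defaultdict

# Bytecode fields can be long; raise the csv module's per-field limit.
csv.field_size_limit(2**31 - 1)


def group_by_bytecode(csv_file):
    """Map the SHA-256 digest of each bytecode string to the addresses using it."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        digest = hashlib.sha256(row["bytecode"].encode("utf-8")).hexdigest()
        groups[digest].append(row["address"])
    return dict(groups)


# Made-up sample standing in for contracts.csv.
sample = io.StringIO("address,bytecode\n0xa1,0x6001\n0xa2,0x6002\n0xa3,0x6001\n")
for digest, addresses in group_by_bytecode(sample).items():
    if len(addresses) > 1:
        print(digest[:12], addresses)
```

Keeping one address per digest gives a deduplicated set of contracts for later analysis.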

Advantages

  • Installing and using blockchain-etl is very easy.
  • blockchain-etl seems to generate reasonably good results.
  • It can work with any Geth compatible network.

Thanks for checking this out, and I hope you found the info useful! If you have any questions, don't hesitate to write me a comment below. And remember that if you would like to see more content like this, just let me know and share this post with your colleagues, co-workers, FFF, etc.