MongoDB to Apache Drill to Apache Superset
When it comes to BI tools, only a few open-source projects can help you with data visualization. Most of them, even the free ones, are of limited use, and you may end up paying for a premium version eventually. These premium versions are not cheap, and enterprise plans can cost a fortune.
In this situation, Apache Superset comes to the rescue. It is an open-source project and therefore completely free to use. It can easily be hosted on-premises or on cloud servers. The stable version was released in late 2020. It provides almost all modern visualization types and dashboards, and it is very intuitive to use. Superset works with SQL queries and connects to many kinds of databases through its drivers, for example Postgres, MySQL, Redshift and Elasticsearch.
However, for some reason, Apache Superset does not provide a driver for MongoDB, one of the most popular document-oriented NoSQL databases. In this blog we will address this gap with Apache Drill. So what is Apache Drill? It is a simple, low-latency SQL query engine that is also suited to large datasets. It can easily be scaled out to multiple nodes and can handle petabytes of data comfortably.
Installation of Apache Drill
- Follow the instructions step by step as described in the Apache Drill docs.
- Start Drill in embedded mode, which also brings up the Apache Drill web UI:
  $ bin/drill-embedded
  # To run it as a background process instead:
  $ nohup sh -c "bin/drill-embedded" >/dev/null 2>&1 &
- In the web UI, go to the "Storage" section.
- Enable "mongo" from the list of disabled storage plugins.
- Update "mongo" with your MongoDB connection URL.
- Go to "Options" for optional settings:
  - drill.exec.functions.cast_empty_string_to_null : true
  - store.mongo.all_text_mode : true
  - store.mongo.bson.record.reader : false
- Troubleshooting in embedded mode:
  - If you get the error "Failure in connecting to Drill: org.apache.drill.exec.rpc.RpcException", go to conf/drill-env.sh and uncomment or edit the line export DRILL_HOST_NAME=localhost .
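To make the storage and options steps above more concrete, here is a minimal Python sketch. It only builds text: the JSON you would paste into the "mongo" plugin editor on the Storage page, and an ALTER SYSTEM statement for each optional setting, which is an alternative to flipping the values on the Options page. The address mongodb://localhost:27017/ is an assumption for a local MongoDB, so replace it with your own URL. The function names are made up for this example.

```python
import json

# Assumed MongoDB address; replace it with your own connection URL.
MONGO_URL = "mongodb://localhost:27017/"

# The optional settings listed in the steps above.
DRILL_OPTIONS = {
    "drill.exec.functions.cast_empty_string_to_null": True,
    "store.mongo.all_text_mode": True,
    "store.mongo.bson.record.reader": False,
}


def mongo_plugin_config(connection_url):
    """Return the JSON config for the "mongo" storage plugin."""
    return {"type": "mongo", "connection": connection_url, "enabled": True}


def alter_system_statements(options):
    """Return one ALTER SYSTEM statement per boolean option."""
    statements = []
    for name, value in options.items():
        literal = "true" if value else "false"
        statements.append(f"ALTER SYSTEM SET `{name}` = {literal}")
    return statements


if __name__ == "__main__":
    # Text to paste into the plugin editor on the Storage page.
    print(json.dumps(mongo_plugin_config(MONGO_URL), indent=2))
    # Statements to run from the Drill shell or the Query page.
    for statement in alter_system_statements(DRILL_OPTIONS):
        print(statement + ";")
```

Running the script prints the plugin config followed by three statements. Nothing is sent to Drill, so it is safe to run before the server is up.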
Installation of Apache Superset
- Follow the instructions step by step as described in the Apache Superset docs.
- Install or reinstall any Python libraries that are reported as missing in error messages (for example PyJWT and Pillow).
- Start Superset.
- Go to "Data", then "Databases", and add a database using the following URL: drill+sadrill://localhost:<port e.g. 8047>/mongo?use_ssl=False
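The database URL above has a fixed shape: host, Drill web port, storage plugin name, and an SSL flag. The sketch below assembles it from those parts so the placeholders are easier to see. The function name and its defaults are assumptions for illustration. The drill+sadrill scheme itself comes from the sqlalchemy-drill package, which I assume is installed in the same Python environment as Superset.

```python
from urllib.parse import urlencode


def drill_sqlalchemy_uri(host="localhost", port=8047, storage_plugin="mongo", use_ssl=False):
    """Build the URL that goes into Superset's "add database" form."""
    # str(False) gives "False", which matches the flag used in the step above.
    query = urlencode({"use_ssl": str(use_ssl)})
    return f"drill+sadrill://{host}:{port}/{storage_plugin}?{query}"


if __name__ == "__main__":
    print(drill_sqlalchemy_uri())
    # prints: drill+sadrill://localhost:8047/mongo?use_ssl=False
```

If Drill runs on another machine or port, pass host and port explicitly, for example drill_sqlalchemy_uri(host="drill.internal", port=8047), where drill.internal is a hypothetical host name.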
That's it. Now you can run your queries directly from your new BI tool and create new dashboards based on your MongoDB database.
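As a quick sanity check, it helps to know what such a query looks like. Drill addresses a MongoDB collection as plugin.database.collection, so a first query is usually a small SELECT with a LIMIT. The sketch below builds one and wraps it in the JSON body that, as far as I know, Drill's REST endpoint /query.json accepts. The database name shop and the collection name orders are hypothetical, and the helper names are made up for this example.

```python
import json


def preview_query(database, collection, limit=10, plugin="mongo"):
    """Return a SELECT that previews the first rows of a MongoDB collection."""
    for name in (plugin, database, collection):
        # Backticks quote identifiers in Drill, so refuse names containing them.
        if "`" in name:
            raise ValueError(f"invalid identifier: {name!r}")
    return f"SELECT * FROM `{plugin}`.`{database}`.`{collection}` LIMIT {int(limit)}"


def rest_query_payload(sql):
    """Wrap a SQL string in the JSON body for a POST to /query.json."""
    return json.dumps({"queryType": "SQL", "query": sql})


if __name__ == "__main__":
    sql = preview_query("shop", "orders", limit=5)  # hypothetical names
    print(sql)
    print(rest_query_payload(sql))
```

The same SELECT can be pasted into Superset's SQL editor once the database connection has been saved.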