In this second post of our Hadoop series, we focus on the link between Hadoop and BIRT, the Eclipse-based reporting system. We first present Hive, a component of the Hadoop ecosystem, then show how it can be connected to BIRT, and finish with an example that illustrates more concretely how things work. If you missed the first episode, an introduction to Hadoop, its file system, and MapReduce, you can follow this link to read the first part of this Hadoop blog post series.
1. Introduction to Hive :
Hive is a Hadoop sub-project that lets you store data and query it with a SQL-like language, HQL (Hive Query Language). From data stored on HDFS, or from local data, we can create tables through Hive that are themselves stored on HDFS (this underlines Hive's « data warehouse » role), and then run HQL queries against them. One of Hive's fundamental advantages is MapReduce : Hive is part of the Hadoop ecosystem, so it uses MapReduce to execute HQL queries. Some queries do not require any reduce task, but in every case the processing is suited to distributed applications. Furthermore, Hive lets you write your own custom queries, as long as you provide the corresponding Mappers and Reducers, much as we did in the previous part of this blog series. Here is a simplified diagram of the way Hive is integrated into Hadoop :
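Separately from the diagram, the difference between a query that needs only a map phase and one that also needs a reduce phase can be sketched in plain Python. This is a conceptual illustration only: it does not use Hive or Hadoop, the sample rows and the function names (`map_filter`, `map_first_letter`, `reduce_count`) are invented for the example, and real Hive execution plans are more involved.

```python
from collections import defaultdict

# Toy stand-in for the rows of a Hive table (hypothetical sample data).
rows = ["John_F._Kennedy", "Kennedy_Space_Center", "Apache_Hadoop", "Eclipse_(software)"]


def map_filter(row):
    """Map-only job: emit the row if it matches, nothing otherwise."""
    if "Kennedy" in row:
        yield row


def map_first_letter(row):
    """Map step of an aggregation: emit one (key, 1) pair per row."""
    yield row[0], 1


def reduce_count(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Roughly what a SELECT ... WHERE ... filter needs: a map phase only.
filtered = [out for row in rows for out in map_filter(row)]

# Roughly what a COUNT(*) ... GROUP BY ... needs: a map phase, then a reduce phase.
counts = reduce_count(pair for row in rows for pair in map_first_letter(row))

print(filtered)
print(counts)
```

The filter never has to combine information from several rows, so each mapper can work on its own share of the data; the count has to bring together all the pairs that share a key, which is the job of the reducers.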
2. The link between Hive and BIRT :
BIRT (« Business Intelligence and Reporting Tools ») is the Eclipse project for reporting. Since version 3.7.2, it can be linked to Hive, which means that we can now run HQL queries directly from BIRT on tables stored on HDFS (see here for the official release statement).
Thus, we no longer run the query from Hive but from BIRT : we can then process the imported data and, in the end, export the final report in one of the file formats BIRT supports (.pdf, .html, .doc, etc.). The connection uses JDBC and the Hive server, which listens on port 10000 by default, as we can see in this expanded diagram :
3. A use case : wikipedia dumps :
To illustrate how this BIRT connector works, we are going to use the dumps provided by Wikipedia. We downloaded the complete list of the titles of wikipedia.com articles from the American website (a 200MB text file, which we will call « wikidata.txt »), in order to process this data with BIRT and then export it as a report.
Let’s detail the different stages of the process :
First, we have to put the data file on the HDFS, in order to create the data tables afterwards :
hadoop fs -put /local_directory/wikidata.txt /hdfs_directory
Then, we can launch Hive, create the table and load it with the wikipedia dump :
hive> CREATE TABLE donnees (nom STRING);
hive> LOAD DATA INPATH '/hdfs_directory/wikidata.txt' INTO TABLE donnees;
Next, we start the Hive server :
#hive --service hiveserver
We can now connect to the Hive server through JDBC by creating a new BIRT data source. As this example runs locally, the URL is : jdbc:hive://localhost:10000/default
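The URL above packs together the driver name, the host, the port, and the Hive database. As a small sketch of how it breaks down, the following Python snippet splits it with the standard library; the helper name `parse_hive_jdbc_url` is hypothetical and is not part of BIRT or Hive.

```python
from urllib.parse import urlsplit


def parse_hive_jdbc_url(url):
    """Split a jdbc:hive://host:port/database URL into its parts (illustrative helper)."""
    prefix = "jdbc:"
    if not url.startswith(prefix):
        raise ValueError("not a JDBC URL: " + url)
    # Drop the "jdbc:" prefix so the rest parses like an ordinary URL.
    parts = urlsplit(url[len(prefix):])
    return {
        "driver": parts.scheme,                 # "hive"
        "host": parts.hostname,                 # machine running the Hive server
        "port": parts.port,                     # 10000 is the default mentioned above
        "database": parts.path.lstrip("/"),     # Hive database to use
    }


info = parse_hive_jdbc_url("jdbc:hive://localhost:10000/default")
print(info)
```

If the Hive server runs on another machine or port, only the host and port parts of the URL should need to change.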
We now need to create a new data set, which is done with an HQL query : for example, we can look for all the articles containing « Kennedy » in our table :
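A query of this kind could plausibly be written as `SELECT nom FROM donnees WHERE nom LIKE '%Kennedy%'`, assuming the `donnees` table and `nom` column created earlier; the exact statement used for the report may have differed. The Python sketch below runs that statement against an in-memory SQLite table, used here purely as a local stand-in for Hive, with a few invented sample titles.

```python
import sqlite3

# Assumed form of the data set query, based on the table created earlier.
QUERY = "SELECT nom FROM donnees WHERE nom LIKE '%Kennedy%'"

# SQLite stands in for Hive here so that the example runs without a cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donnees (nom TEXT)")

# Hypothetical sample rows; the real table holds the Wikipedia title dump.
sample_titles = [("John_F._Kennedy",), ("Kennedy_Space_Center",), ("Apache_Hive",)]
conn.executemany("INSERT INTO donnees (nom) VALUES (?)", sample_titles)

matches = [row[0] for row in conn.execute(QUERY)]
conn.close()

print(matches)
```

In BIRT, the same statement would go into the query text of the data set, and the resulting column would then be available to the report.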
BIRT can then be used as usual to perform all the required processing, aggregations, etc. While previewing or exporting, we can follow the MapReduce tasks in real time with the JobTracker provided by Hadoop :
When the process is over, the report is ready to be published, in one of the BIRT supported file formats.
4. Conclusion :
In a nutshell, we have been able to establish the connection between Hive and BIRT, and we can see two major advantages :
– This connector is very useful in the import process : all we have to do is connect to Hive, and everything else is done directly from BIRT, as long as the tables have already been created.
– The MapReduce implementation, which follows the Hadoop
However, some limitations remain :
– Some operations, like aggregation, are processed by BIRT and do not use MapReduce : if you really want to exploit distributed processing, the best solution is to reduce the data first with a precise HQL query, which runs as MapReduce, and then process the result in BIRT.
– The tables have to be created with Hive before the BIRT import process.
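The first limitation can be illustrated by comparing two ways of counting the matching titles. The sketch below again uses an in-memory SQLite table as a local stand-in for Hive, with invented sample rows; the point is only the shape of the two queries, not their performance on this toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donnees (nom TEXT)")
# Hypothetical sample rows standing in for the Wikipedia title dump.
conn.executemany(
    "INSERT INTO donnees (nom) VALUES (?)",
    [("John_F._Kennedy",), ("Kennedy_Space_Center",), ("Apache_Hive",)],
)

# Option 1: fetch every matching row and count on the client,
# which is what aggregating on the BIRT side amounts to.
rows = conn.execute("SELECT nom FROM donnees WHERE nom LIKE '%Kennedy%'").fetchall()
client_side_count = len(rows)

# Option 2: push the aggregation into the query, so that only one row comes back.
# With Hive, this is the part that would run as a distributed job.
(server_side_count,) = conn.execute(
    "SELECT COUNT(*) FROM donnees WHERE nom LIKE '%Kennedy%'"
).fetchone()

conn.close()
print(client_side_count, server_side_count)
```

Both options give the same number, but with the second one the report only receives the aggregated result instead of every matching row.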
I hope this has shown you the value of this feature, and I look forward to presenting another Hadoop-related article in the next post of this series.