Solr on top of HBase for dashboards

July 27th, 2016 by mltrampi

The purpose of this post is to show how to integrate Solr's near-real-time indexing (through the Lily HBase Indexer) on top of HBase, in order to provide infrastructure for dashboards through other tools such as Cloudera Search or Kibana.


Let's grasp the concept!

Here is one interesting answer from Quora about Elasticsearch on top of HBase for real-time analytics:

HBase runs on top of HDFS, which is typically used for very large data and batch processing. Elasticsearch is for real-time online search. Data in HBase can be synced with Elasticsearch to provide real-time search, and the functionality can be extended to charting using the ELK stack.

Well, that's nice! But that solution is commercial, and we are interested in open-source solutions! One replacement is the proposed stack composed of the Lily HBase Indexer, Solr as the backbone for dashboards, and, for visualization, open-source Kibana or Cloudera Search through Hue.

Ok, where does Apache Phoenix fit in this story?! Phoenix is a SQL layer between applications and HBase; it provides ad-hoc queries in real time, is a very useful tool for data scientists and SQL developers (SQL-like syntax in the world of NoSQL), and brings additional features to HBase.

Ok, so we have the following tools to integrate:

  • Apache HBase
  • Apache Phoenix
  • Apache Solr
  • Lily HBase Indexer (Key-Value Store Indexer)
  • Kibana, or Cloudera Search through Hue, for visualization

Where does Cloudera fit inside this list of tools?

These tools are part of the Big Data world that help us build solutions. Cloudera is one of the distributors of Big Data solutions. In this post we will focus on Cloudera's distribution of Apache Hadoop, for two reasons.

  1. The Cloudera distribution provides almost all of these tools. Almost!? Well, Apache Phoenix is still one of Cloudera's incubating projects, available through cloudera-labs. It is not yet supported by Cloudera for production environments.
  2. It is open source!? (The tools which are not are free to use with specific limitations, such as Cloudera Manager Express.)

In this post, I don't mean to underestimate or reduce the significance of other Hadoop (I would say Big Data) distributions, such as Hortonworks, MapR, IBM and Pivotal. Each one is specific in its own way. What I mean by calling them Big Data distributions is that not all of them are based on Apache Hadoop, e.g. MapR.

Currently MapR is rising very fast and offers some benefits over plain Apache Hadoop, coming from its MapR-FS, which we can imagine as HDFS on steroids: it supports random read/write, while HDFS doesn't, and it is written in C, while Hadoop is written in Java. MapR is patented, not open source, but there is a community license which makes it free to use.

Relevant to this tutorial is version CDH 5.5.x, aka Cloudera's distribution of Hadoop version 5.5.x (with Cloudera Manager Express, the free version).

In this post, I won't go into the details of installing and configuring a cluster from scratch. We consider the scenario where we have a running, stable cluster without the following services: Apache HBase, Apache Phoenix, Apache Solr.

(For the purposes of this post, in order to follow this tutorial, we can use Cloudera's VM. You are free to pick your preference, VMware or VirtualBox, it makes no difference; the link leads to a virtual machine with a preconfigured CDH version 5.5, which is relevant to this post.)

Cloudera's product Cloudera Manager helps us administer the cluster: adding services, configuring them and controlling them.

At the time of writing this post, I am working on a PoC cluster (Cloudera distribution) of the Data Science Lab of Engineering Group's Big Data Competency Center. It is a virtualized cluster hosted in the company server farm, based on Azure.

Assuming you are interested in following this tutorial using the virtual machine, import it, boot up your super uber VM 🙂 and let's get going!

A little heads-up. In order to use Cloudera Manager and parcel installations on your virtual machine (cluster):

(If you have a cluster already installed and are using parcels to install CDH, skip this part of the tutorial regarding the VM steps. But if your cluster was installed with packages, you will face difficulties using this quickstart tutorial, since the Phoenix provided by Cloudera is only available as a parcel, and mixing parcels and packages is not recommended. I believe everything is possible, but we have to stick to some standards and recommendations; besides, something that looks simple could turn into a multi-day, headache-inducing task, aka wasted time. Bottom line: if your cluster was installed from packages, just download the VM and follow the steps below, or migrate the cluster from packages to parcels (yes, it's a link 🙂) if you are allowed to mess with the cluster.)

  • Your VM should have at least 8 GB of RAM and 2 vCPUs for the Express version, or 10 GB of RAM and 2 vCPUs for the Enterprise-trial version (i.e. a computer capable of giving at least 8 GB of RAM and 2 vCPUs to the virtual machine).

**If your computer doesn't allow dedicating 8 GB of RAM to the virtual machine, you can manually force the script to start by typing in the shell of the virtual machine: sudo /home/cloudera/cloudera-manager --pause --express --force. The machine will struggle and fight, but for a test it should work (can't guarantee, though, I've never tried).

  • After booting the virtual machine, on the desktop there is an icon (script) that you need to run in order to start the Cloudera Manager service; there are two options: Express (free) or Enterprise (trial).
image_tutorial_Mladen_1
  • Again, if you are more into shell commands and scripts :-), for the Express version: sudo /home/cloudera/cloudera-manager --pause --express, or for the Enterprise version: sudo /home/cloudera/cloudera-manager --pause --enterprise
  • Next, after starting the Cloudera Manager service, in order to simplify adding and installing services, let's use the parcel way of installation instead of the package way. Parcels bring benefits in a distributed environment and simplify life. Cloudera was thinking of us and provided a simple script to migrate this virtual machine from packages to parcels, and yes, miraculously, the shortcut is again on your desktop 🙂 (Migrate to Parcels), or, again, the shell command sudo /home/cloudera/parcels --pause. Keep in mind, you have to do the Cloudera Manager step first! This step will take some time to download the 1.3 GB CDH parcel, distribute it, delete the packages and activate the parcels. Arm yourself with patience and use this time to get your coffee, read a book, or do whatever makes you happy :).

Ok, our cluster/VM is ready. What next?!

#1# Let's add the Phoenix parcel!

Open a web browser inside your VM and log in to Cloudera Manager:

http://quickstart.cloudera:7180, username cloudera, password cloudera.

**Again, a small hint :). If you used the default settings when you imported the virtual machine, your guest machine will use a NAT network interface, and with the default settings the ports are mapped. So on your host system you can also log in to Cloudera Manager by typing http://localhost:7180 in the browser. To simplify everything, we can edit the hosts file, which the operating system uses for DNS resolving, mapping quickstart.cloudera to localhost as well. My OS is Windows; the hosts file is located at C:\Windows\System32\drivers\etc\hosts. Run Notepad++ as administrator, open that file, and replace the line:

127.0.0.1       localhost

with:

127.0.0.1       localhost       quickstart.cloudera
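On a Linux or macOS host, the same mapping can be appended from a shell instead of edited by hand. This is only a sketch: by default it dry-runs against a local copy named hosts.example (a hypothetical file, so you can inspect the result first); point HOSTS_FILE at /etc/hosts and run as root to apply it for real.

```shell
# Append the quickstart.cloudera mapping if it is not already present.
# Dry-runs against ./hosts.example by default; set HOSTS_FILE=/etc/hosts (as root) to apply.
HOSTS_FILE="${HOSTS_FILE:-./hosts.example}"
touch "$HOSTS_FILE"   # create the dry-run copy if it does not exist yet
if ! grep -q "quickstart.cloudera" "$HOSTS_FILE"; then
    printf '127.0.0.1\tlocalhost\tquickstart.cloudera\n' >> "$HOSTS_FILE"
fi
cat "$HOSTS_FILE"
```

The grep guard makes the snippet idempotent, so running it twice does not duplicate the entry.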

Once logged in to Cloudera Manager, in the upper right corner there is a parcel icon.

image_tutorial_Mladen_2

Then we edit the settings for parcels.

image_tutorial_Mladen_3

In the Remote Parcel Repository URLs section, we press + and add the following URL: https://archive.cloudera.com/cloudera-labs/phoenix/parcels/{latest_supported}/

image_tutorial_Mladen_4

Save the changes, and return to the parcel status page by clicking the parcel icon or typing http://quickstart.cloudera:7180/cmf/parcel/status.

Download the CLABS_PHOENIX parcel, distribute it, and activate it.

#2# Let's add services!!

  1. HBase
  2. Solr
  3. Key-Value Store Indexer
  4. No, we don't need to add Phoenix; it is not a Cloudera Manager service, we just need to make sure that the parcel is there! 🙂

If you are working on the virtual machine, all of these services are probably already there, so you don't have to add them, but you will have to check some configurations in order to make them work together. Just jump ahead in this tutorial to part #3#, configuring services.

If a service is by some chance not present, e.g. if you have just installed a new cluster, then follow the next steps on how to add services.

On the Cloudera Manager home page, there is a drop-down menu next to our cluster name, from which we select Add Service.

image_tutorial_Mladen_5

From the services menu, we select the HBase service. In the first step, if we are asked to choose dependencies, we select HDFS and ZooKeeper. In the second step, we need to specify the HBase Master and RegionServer locations; if it is a single-node cluster, it is very simple, we have only one option :-). In step 3 of adding the HBase service, we have to select the HDFS root directory, usually /hbase, which, as stated, is the root directory for HBase on HDFS. On this page it is also important for Phoenix that we enable hbase.replication and indexing. Voilà, click next, and our configuration will be deployed.

The next service we add is Solr!

Repeat the process: add the service, then in step one, same as for HBase (if it asks; sometimes it just skips step 1), we select HDFS and ZooKeeper as dependencies. In step two, we select the nodes on which we plan to run Solr servers. In step three, we state the ZooKeeper znode directory, by default /solr (ZooKeeper is important for coordinating the service and is not the point of this tutorial, so let's stick to the defaults), and the default HDFS data directory, also /solr. If by some chance creating this service fails due to permissions on HDFS, simply create the directory on HDFS with shell and HDFS commands, change its owner to the user solr, and then re-add the service.

e.g. sudo -u hdfs hdfs dfs -mkdir /solr && sudo -u hdfs hdfs dfs -chown -R solr:solr /solr

The final service we add is the Key-Value Store Indexer! This one coordinates NRT indexing between HBase and Solr!

Same old procedure: from the CM home, add the service and select the Key-Value Store Indexer service, and voilà. Step one will probably be skipped again; if not, we select HBase, HDFS, Solr and ZooKeeper as dependencies. In step two, we select the nodes on which we plan to run the service, and after the following steps the service is added.

Now that we have successfully added these services and installed Phoenix, let's jump to the next step to make sure they are configured properly.

#3# Let's configure the services!

1. HBase!

From the Cloudera Manager home, select the HBase service, and then from the menu select Configuration.

image_tutorial_Mladen_6

image_tutorial_Mladen_7

In the search field, type replic, and then look for the following. Make sure that both Enable Indexing and Enable Replication are checked.

image_tutorial_Mladen_8

After that, in the search field, type HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml. Then, in the safety valve, insert the following code:

<property>
  <name>hbase.regionserver.wal.codec</name>
  <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
</property>

image_tutorial_Mladen_9

Save the changes, and you are basically ready!

2. Solr!

There is nothing specific within Cloudera Manager that we need to configure for Solr. But we do need to generate a schema and a collection in order for Solr to index. For this we need to open the shell of the VM, or ssh to one of the nodes in the cluster where we have access to the solrctl command.

As a first step, we generate a directory with some default configurations by issuing the following command in the shell:

solrctl instancedir --generate $HOME/hbase-collection1  # we can of course name the collection anything else; this is just an example 🙂

Then we need to modify the file schema.xml inside the conf directory, also through the shell. Solr will behave according to it.

vi $HOME/hbase-collection1/conf/schema.xml

In our case, let's add the following fields:

<field name="addr" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="order" type="string" indexed="true" stored="true" multiValued="true"/>

One thing to note: in this simple example we will use the HBase row key as id, which will be the unique key in Solr. In more complex scenarios this can be changed, but that is not the point of this simple quickstart.
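For reference, the generated default configuration already declares this unique key; the relevant schema.xml fragment looks roughly like this (nothing you need to add, just the part the row-key-to-id mapping relies on):

```xml
<!-- Fragment of the generated schema.xml: the id field that receives the HBase row key -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>
```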

After modifying schema.xml, we need to "upload" the configuration to Solr and, using these configuration files, create the Solr collection.

1. solrctl instancedir --create hbase-collection1 $HOME/hbase-collection1

2. solrctl collection --create hbase-collection1  # optionally, if we have more Solr servers, we can run the collection on multiple shards by adding the parameter -s followed by the number of shards, e.g. -s 3

Now that we have configured Solr, we can move on to configuring the HBase Indexer.

3. HBase Indexer

Two files are important for the HBase Indexer: an .xml file, which holds the configuration parameters for the service, and a .conf file, which is the Morphline configuration file, a sort of "ETL" that converts an HBase cell into something Solr can understand (metaphorically speaking). Again, we could put the morphline config inside Cloudera Manager and save ourselves some permission and missing-file trouble. But in our case we are using the VM, so we know the file will be there; if you are in a cluster environment, ssh to the machine where the Key-Value Store Indexer service is installed.

Let's make the HBase Indexer .xml file. Again in the shell (the VM shell, or ssh to a node with the hbase-indexer command available)!

vi morphline-hbase-mapper.xml

with following content:

<?xml version="1.0"?>
<indexer table="CUSTOMER" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">
  <param name="morphlineFile" value="/tmp/morphlines.conf"/>
</indexer>

*Notice that the table name is CUSTOMER, in our scenario with capital letters, since when we create a table with Phoenix, the table name is uppercased unless we state it in quotes, e.g. "customer". We will get there!

Now, let's create the file morphlines.conf in the /tmp directory.

vi /tmp/morphlines.conf

with following content:

morphlines : [
  {
    id : morphline
    importCommands : ["org.kitesdk.**", "com.ngdata.**", "com.cloudera.**", "org.apache.solr.**"]

    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "addr:*"
              outputField : "addr"
              type : string
              source : value
            }

            {
              inputColumn : "order:*"
              outputField : "order"
              type : string
              source : value
            }
          ]
        }
      }

      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

Let's make sure that the file is accessible by everyone!! e.g. chmod 777 /tmp/morphlines.conf !!
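If chmod 777 feels too broad, world-readable is all the indexer actually needs to read the file; a minimal sketch (the MORPHLINE_FILE variable is just for illustration):

```shell
# Make the morphline file readable by everyone without making it world-writable.
MORPHLINE_FILE="${MORPHLINE_FILE:-/tmp/morphlines.conf}"
touch "$MORPHLINE_FILE"        # no-op if the file already exists
chmod 644 "$MORPHLINE_FILE"    # owner rw, everyone else read-only
ls -l "$MORPHLINE_FILE"        # expect -rw-r--r--
```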

After creating these config files, all that is left is to register the indexer with the Key-Value Store Indexer service. We can do that by issuing the following command:

hbase-indexer add-indexer -n myIndexer -c morphline-hbase-mapper.xml -cp solr.zk=localhost/solr -cp solr.collection=hbase-collection1 -z localhost:2181

To verify that the indexer has been registered:

hbase-indexer list-indexers -z localhost:2181

and we should see output with the name myIndexer, stating whether the process is running or has failed! If by some chance the process has failed, the first thing to verify is whether you have made the file /tmp/morphlines.conf accessible to the service (permissions).

**This command works assuming you are on the virtual machine; on a real cluster you need to specify the ZooKeeper quorum instead of localhost.

We have configured the HBase service, the Solr service and the HBase Indexer service! What are we missing!? We are missing the HBase table that we want to index!! 🙂

For this we will use Phoenix!!

Apache Phoenix is a layer on top of HBase that allows you to use "SQL" syntax! *With some limitations, of course! 🙂

#4# Creating an HBase table through Phoenix, and verifying that the services work!

Solr's schema.xml and morphlines.conf are preconfigured for HBase columns with the qualifiers addr and order, so we need to create a table with those column qualifiers if we want to see some results in Solr :-). For complex scenarios and your own use cases, I refer you to the Solr documentation and the Morphlines documentation.
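As an illustration only (not needed for this tutorial), narrowing one of the morphline mappings to a single qualifier would look roughly like this; the addr_city output field is hypothetical and would also have to be declared in schema.xml:

```
# Hypothetical extractHBaseCells mapping: index only the addr:city cell
# into a dedicated Solr field, instead of the whole addr column family.
{
  inputColumn : "addr:city"
  outputField : "addr_city"
  type : string
  source : value
}
```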

Let's get started creating the table!

Apache Phoenix:

To start the Phoenix shell and connect to HBase, we type the following in the command-line interface of the VM, or over ssh 🙂

phoenix-sqlline.py localhost:2181/hbase

*Remember, on a real cluster localhost is the ZooKeeper quorum! If all goes well, in your shell you should see the following:

jdbc:phoenix:localhost:2181/hbase>

and you are ready to type your "SQL"-like syntax! Hmmm, remember! Limitations! But of course there is good documentation, and the Phoenix grammar book (yet another link)! OK, let's create the table CUSTOMER, since we declared in the indexer .xml config file that it should look for the table CUSTOMER.

CREATE TABLE CUSTOMER ( rowKey VARCHAR PRIMARY KEY, "addr" VARCHAR, "addr"."city" VARCHAR, "order" VARCHAR, "order"."quantity" VARCHAR);

Now, I don't know about you, but I would wonder: hey, how can I verify that the table was created!?? The thing that came to my mind is the HBase shell! Don't be afraid, I was also very unfamiliar with HBase syntax, but there are plenty of examples, plus I will give you the answer 🙂 it's a quickstart tutorial! Let me also give you a hint: we will need the HBase shell very soon!

We can open a new tab in the shell on our VM, or another ssh connection, and in the CLI type hbase shell. You will see warnings that it is deprecated etc., but we just need it to verify that our table CUSTOMER is there, and to alter something later on, so it is good to have the HBase shell and the Phoenix shell in two separate windows.

Ok, we are inside the HBase shell. Simply type list; the output of that command should be all the tables inside HBase.

image_tutorial_Mladen_10

Let's describe the table CUSTOMER (describe 'CUSTOMER') to make sure we have the column qualifiers that we need for our Morphline configuration ("addr", "order")!

image_tutorial_Mladen_11

Ok, simple commands, all should be working! Whew, that was easy! We are ready to put in some data!!

Let's get back into the Phoenix shell, a much simpler syntax! According to the created table, let's insert:

upsert into CUSTOMER values ('Mladen Trampic','Corso Stati Uniti 23/c','Padova','some long boring tutorial','1');

Verify the row is there: select * from customer;

image_tutorial_Mladen_12

We can also verify this in Hue, another Cloudera tool! (Panic attack!) Point your virtual machine browser to http://quickstart.cloudera:8888/hbase/#HBase/CUSTOMER.

image_tutorial_Mladen_13

Ok, the row is there, so we should now see it indexed in Solr! 🙂 (Browser: http://quickstart.cloudera:8983/solr/; from the core selector menu, select hbase-collection1, select the query field and execute the query.)

image_tutorial_Mladen_14

You did everything, the row is in HBase, but it is not indexed in Solr! Why? Let's first verify that our indexer configuration and Morphline file are configured properly!

In the shell, we issue the following command:

hadoop --config /etc/hadoop/conf jar /opt/cloudera/parcels/CDH/jars/hbase-indexer-mr-*job.jar --conf /etc/hbase/conf/hbase-site.xml --hbase-indexer-file /morphline-hbase-mapper.xml --hbase-indexer-zk localhost:2181 --hbase-indexer-name myIndexer --dry-run

**On a real cluster, replace localhost with the ZooKeeper quorum.

image_tutorial_Mladen_15

The output of the command should look something like the above!

Ok, I must admit, I cheated: I knew what the problem was from the beginning, but I wanted to make sure that everything else was configured properly. You will notice some fields are missing; that is expected, since we declared the fields addr and order, not the subfields.

The problem is the following: when we created the table through Phoenix, we needed to alter the table's column families and enable REPLICATION_SCOPE in order to allow the Lily HBase Indexer to be notified about changes and new inserts. So we need to go back to the HBase shell:

disable 'CUSTOMER'

alter 'CUSTOMER', {NAME => 'addr', REPLICATION_SCOPE => 1}, {NAME => 'order', REPLICATION_SCOPE => 1}

enable 'CUSTOMER'

Now we can re-run the upsert query in the Phoenix shell and verify that the HBase cell is indexed in Solr.

image_tutorial_Mladen_16

**Another hint: this way the hbase-indexer only picks up new inserts or upserts. If, for example, you want to index a table that is already populated with data, you can run an hbase-indexer batch job, similar to the one I used to verify that my configuration was proper, but instead of the final option --dry-run we use --go-live. Keep in mind this is a MapReduce job writing to HDFS; therefore it needs to be submitted by a user with HDFS permissions, e.g.

sudo -u hdfs hadoop --config /etc/hadoop/conf jar /opt/cloudera/parcels/CDH/jars/hbase-indexer-mr-*job.jar --conf /etc/hbase/conf/hbase-site.xml --hbase-indexer-file /morphline-hbase-mapper.xml --hbase-indexer-zk localhost:2181 --hbase-indexer-name myIndexer --reducers 0 --go-live

In this post/tutorial I've shown you how to integrate these tools, using some additional tools to test and verify configurations. In some parts, I instructed you to make configurations without deeply explaining them, since they were not relevant to this post. I sincerely hope it was useful to you and that it was straightforward. If you are facing any difficulties or looking for advice, don't hesitate to contact me at mladen.trampic@eng.it.


Categories: Big Data
