Companies are increasingly adopting graph databases in their respective domains. Graph databases such as Amazon Neptune, JanusGraph, Neo4j, IBM Graph, and AnzoGraph suit applications built on highly connected data sets, such as recommendations based on a social graph, fraud detection, and knowledge-graph-based product recommendations. These are the cases where traditional SQL joins over huge datasets become inefficient in a relational database system.
Real-world entities and the relations between them map to nodes (i.e., vertices) and edges, respectively.
I believe that just by looking at the two images above, you can identify what vertices, edges, and properties are, so I won't go into detail here.
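If you prefer code to pictures, here is a minimal sketch of the property graph idea in plain Python. It is not how JanusGraph stores anything; the class names, the "knows" label, and the sample people are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """A node: an id, a label (its type), and key/value properties."""
    id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, labeled relation from out_v to in_v with its own properties."""
    label: str
    out_v: int
    in_v: int
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical in-memory graph, for illustration only."""

    def __init__(self):
        self.vertices = {}   # vertex id -> Vertex
        self.out_edges = {}  # vertex id -> list of outgoing Edge objects

    def add_vertex(self, vid, label, **properties):
        self.vertices[vid] = Vertex(vid, label, properties)
        self.out_edges.setdefault(vid, [])
        return self.vertices[vid]

    def add_edge(self, label, out_v, in_v, **properties):
        edge = Edge(label, out_v, in_v, properties)
        self.out_edges[out_v].append(edge)
        return edge

    def out(self, vid, label):
        """Follow outgoing edges with the given label, one hop."""
        return [self.vertices[e.in_v] for e in self.out_edges[vid] if e.label == label]


graph = PropertyGraph()
graph.add_vertex(1, "person", name="alice")
graph.add_vertex(2, "person", name="bob")
graph.add_vertex(3, "person", name="carol")
graph.add_edge("knows", 1, 2, since=2018)
graph.add_edge("knows", 2, 3, since=2020)

# Friends of friends of alice: two hops along "knows" edges, no table join.
friends_of_friends = [
    fof.properties["name"]
    for friend in graph.out(1, "knows")
    for fof in graph.out(friend.id, "knows")
]
print(friends_of_friends)  # ['carol']
```

Each hop only touches the edges attached to the current vertex, which is the intuition behind why traversals stay cheap on highly connected data.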
Because JanusGraph is an open-source, scalable 🙂 transactional database that supports the property graph model, we can build a social graph or a knowledge graph that represents data in context, in a way both machines and humans can readily understand. That gives us enough reason to get started with JanusGraph.
Let’s talk about JanusGraph :)
JanusGraph is an open-source, distributed graph database with pluggable storage and indexing backends.
Architecture for JanusGraph
JanusGraph's modular architecture supports third-party adapters.
In the Storage Backends section, we can plug in Cassandra, HBase, Google Cloud Bigtable, and others to suit our needs. For example, Apache Cassandra is generally used for real-time use cases that need scalability and high availability without compromising performance, while HBase is more common for analytics workloads. In this tutorial, we will set up Cassandra.
As the index backend, we will go with Lucene. In an upcoming tutorial, we will switch to an Elasticsearch setup, which is typically used for indexing on multiple properties, full-text search, geo queries, and string search.
The next question is how we are going to interact with the graph. The easiest way is the Gremlin Console. We can connect with Python too, but that's not our goal now. Connecting JanusGraph with Python usually comes later, once a basic setup is in place and we want to develop a Python application that executes queries against JanusGraph.
Think of the Gremlin Console as a tool that works with any TinkerPop-enabled server. JanusGraph is a TinkerPop-enabled database engine. The Gremlin Console is an interactive shell that gives you access to the data managed by the JanusGraph server, also commonly known as Gremlin Server.
First, we will run Apache Cassandra and JanusGraph on our local machine. After that, we will use the Gremlin Console to connect to the JanusGraph server running on the local machine.
Here we will go with Cassandra 3.11.0. Download it from this link. Extract the tar file and go inside the /apache-cassandra-3.11.0/bin directory. Open a terminal window in that directory and execute the command below (typically ./cassandra -f, which keeps Cassandra running in the foreground). The Cassandra server will then be up and listening on port 9042.
To download JanusGraph, you can go directly to this link and grab it. Extract the zip file into a directory, then go inside the /janusgraph-0.5.3/ directory, which contains folders such as bin, conf, data, and db.
To configure Cassandra as the storage backend and Lucene as the index backend, we need to change the gremlin-server.yaml file, which sits inside the /janusgraph-0.5.3/conf/gremlin-server directory.
Inside the /janusgraph-0.5.3/conf/ directory you will see files with the .properties extension, which are used to configure the storage backend and the index backend. For example, the janusgraph-cassandra-es.properties file configures JanusGraph to talk to Cassandra over the cassandrathrift storage backend protocol and to use Elasticsearch for indexing. cassandrathrift is now outdated. We will use cql, Cassandra's newer communication protocol (at the time of writing).
We will work with a janusgraph-cql-lucene.properties configuration file. Unfortunately, this file is not available there, so let’s create it and place it in the /janusgraph-0.5.3/conf/ directory.
gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=cql
# The hostname or comma-separated list of hostnames of storage
# backend servers. This is only applicable to some storage backends,
# such as cassandra and hbase.
storage.hostname=127.0.0.1
# This is the keyspace name where janusgraph will store the tables
# and if this keyspace does not exist janusgraph will create it
storage.cql.keyspace=janusgraph
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
index.search.backend=lucene
index.search.directory=../db/searchindex
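A typo in this file usually only surfaces when the server starts, so a quick sanity check can save a restart. The sketch below is a simplified reader for key=value lines (it ignores Java properties features such as ":" separators and line continuations). The REQUIRED list is my own pick of the keys this tutorial relies on, not an official JanusGraph requirement.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings, required):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not settings.get(key)]


# Shortened copy of the file above, inlined so the sketch runs on its own.
SAMPLE = """\
gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=cql
# comment lines are ignored
storage.hostname=127.0.0.1
storage.cql.keyspace=janusgraph
cache.db-cache = true
index.search.backend=lucene
index.search.directory=../db/searchindex
"""

# Assumption: the keys this tutorial depends on; adjust to your own setup.
REQUIRED = ["gremlin.graph", "storage.backend", "storage.hostname", "index.search.backend"]

settings = parse_properties(SAMPLE)
print(missing_keys(settings, REQUIRED))  # []
print(settings["cache.db-cache"])        # true
```

To check the real file, read conf/janusgraph-cql-lucene.properties from disk and pass its text to parse_properties instead of SAMPLE.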
The JanusGraph server does not use the janusgraph-cql-lucene.properties file directly. Instead, it reads the gremlin-server.yaml configuration file, which has to point to the newly added janusgraph-cql-lucene.properties file. So we have to edit gremlin-server.yaml: the graph entry under graphs should reference conf/janusgraph-cql-lucene.properties.
Edited section of gremlin-server.yaml
Now open another terminal window (Cassandra is still running in the first one) and execute the following command (typically ./bin/gremlin-server.sh ./conf/gremlin-server/gremlin-server.yaml from the /janusgraph-0.5.3/ directory).
JanusGraph will now be up and listening on port 8182. We now have the Cassandra and JanusGraph servers running on the same machine.
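Before moving on, it can help to confirm that both servers are really accepting connections. Here is a small sketch using only Python's standard library. It assumes the default ports mentioned in this tutorial (9042 for Cassandra, 8182 for JanusGraph) and that both run on 127.0.0.1. The function name is my own.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Ports taken from this tutorial's defaults; change them if yours differ.
SERVICES = {"Cassandra": 9042, "JanusGraph": 8182}

for name, port in SERVICES.items():
    state = "listening" if is_listening("127.0.0.1", port) else "not reachable"
    print(f"{name} on port {port}: {state}")
```

An open port only tells you that something is listening there. It does not prove that JanusGraph has connected to Cassandra, which is what the next steps check.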
Connect to the JanusGraph Server
In a new terminal window, start a Gremlin Console (typically ./bin/gremlin.sh from the /janusgraph-0.5.3/ directory).
This opens a Gremlin Console. Now connect it to Gremlin Server by executing the following commands in the same console.
:remote connect tinkerpop.server conf/remote.yaml session
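To check that everything is good so far, type “graph” in the Gremlin Console and hit Enter (this assumes the console is forwarding input to the server, for example after :remote console). It will show the following details.
This means the graph uses cql as the storage backend protocol and the Cassandra server is running on 127.0.0.1.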
Explore JanusGraph a bit
Let’s create a vertex to check whether JanusGraph can communicate with Cassandra: when you create a vertex, JanusGraph stores it in a Cassandra table. Execute the following command in the Gremlin Console.
After executing the command, a node has been stored in the Cassandra database. To see how Cassandra stores the data, open another terminal window, go inside the /apache-cassandra-3.11.0/bin directory, and execute the following command in the terminal (typically ./cqlsh).
It will open the Cassandra client.
PS: If it shows an error, make sure Python is installed, because cqlsh is a Python tool.
Now execute the next command in the client console to see the available keyspaces (in cqlsh, DESCRIBE KEYSPACES; lists them).
If everything has gone fine so far, you will find a keyspace named janusgraph among them. Let’s use this keyspace (USE janusgraph;).
Then we can see all the tables created under the janusgraph keyspace (DESCRIBE TABLES; lists them). One of the most important is edgestore, where graph info, vertices, and edges are stored. Executing the command below gives you an idea of how Cassandra stores the data, though this falls under Cassandra expertise.
SELECT * FROM edgestore;
PS: We configured Lucene as the index backend here but have not done any indexing yet. Once data is loaded into the graph, we will do the indexing.
I originally planned to show batch-loading practices in this tutorial as well, but it became too lengthy, so I will add more articles covering topics like batch-loading practices, setting up Elasticsearch as an indexing backend, and the use of ConfiguredGraphFactory.
Oh yeah! You made it this far. Kudos to you 😎!
If you found this article helpful, please hit the clap button, and feel free to reach out if you need help with this topic.