A Full JavaScript Architecture, Part Three - MongoDB


This is the last part in a series of three showing how to build a simple JavaScript architecture.

After seeing how to create a NodeJS application to track tweets and send them to a Google Chrome Extension in real time, we are going to see how to store and query them using a MongoDB database.

For a better understanding, this post starts by explaining the basics of MongoDB and then deals with its integration in a NodeJS application.


Introduction to MongoDB



Presentation

MongoDB presents itself as a scalable, high-performance, open source and document-oriented database written in C++.

The first three characteristics are well known, so let's focus on the fourth one: document-oriented.

To clearly understand this concept we need some basic MongoDB terminology. Since we are going to see a lot of JSON, let's start right away with a mapping from relational terms to MongoDB terms:

{
	"terminology" : {
		"Database" : "Database",
		"Table" : "Collection",
		"Row" : "Document"
	}
}

A document-oriented database is a database where each document in the same collection may have a totally different structure. With this orientation, any number of fields of any length can be added to a document, even after its creation.

Because all the information about an entity can be stored dynamically within a single document, join operations are no longer needed in this type of database.

Join operations are expensive: they require strong consistency and a fixed schema. Avoiding them makes horizontal scaling much easier.
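To make this more concrete, here is a minimal sketch in plain JavaScript, with no database involved. It models a collection as an array whose documents do not share a structure, and embeds related data instead of joining it. The names and values are hypothetical and only serve the illustration.

```javascript
// Hypothetical data: a "collection" modeled as a plain array of documents.
var users = [
  { name: "alice", lang: "en" },
  // Same collection, different structure: no "lang" field, but an embedded
  // array that a relational schema would usually keep in a separate table.
  { name: "bob", tweets: [ { text: "hello" }, { text: "world" } ] }
];

// A field can be added to a document after its creation.
users[0].location = "Paris";

// Reading bob's tweets needs no join: they live inside his document.
var bobTweets = users[1].tweets.map(function (t) { return t.text; });

console.log(Object.keys(users[0])); // [ 'name', 'lang', 'location' ]
console.log(bobTweets);             // [ 'hello', 'world' ]
```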

Data Types

MongoDB uses BSON as its data storage and network transfer format. BSON stands for "Binary JSON" and is a binary-encoded serialization of JSON-like documents. According to bsonspec.org, it was designed to be lightweight, traversable and efficient.

Its key advantage over XML and JSON is efficiency in terms of space and compute time.

BSON documents can store several data types, such as string, integer, boolean, double, null, array, object, date, binary data, regular expression and source code. You can find more by exploring the specification.
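As a small illustration of why these extra types matter, the sketch below builds a hypothetical document in plain JavaScript and sends it through a JSON round trip. It does not use BSON itself; it only shows that plain JSON has no date or regular expression type, two of the types listed above.

```javascript
// Hypothetical document using JavaScript values that map to the listed types.
var sample = {
  text: "hello",                    // string
  retweet_count: 3,                 // integer
  ratio: 0.5,                       // double
  favorited: false,                 // boolean
  place: null,                      // null
  hashtags: ["wsnparis"],           // array
  user: { lang: "en" },             // embedded object
  created: new Date(1306400548416), // date
  pattern: /^At @/                  // regular expression
};

// Plain JSON has no date or regular expression type.
var roundTrip = JSON.parse(JSON.stringify(sample));

console.log(sample.created instanceof Date); // true
console.log(typeof roundTrip.created);       // string
console.log(roundTrip.pattern);              // {}
```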

Advanced Features

MongoDB has a set of advanced features such as full index support, replication, sharding and map/reduce. They are not covered in this post.

Why MongoDB?


Choosing MongoDB for our use case was motivated by a set of features that characterize this database.

JavaScript

Requests in MongoDB are written in JavaScript, which is a perfect fit for our architecture. In addition, these requests follow an RDBMS-style query model that reduces the gap for developers used to traditional SQL.

JSON Document-Oriented

MongoDB uses parsed JSON documents (BSON) as its data structure. Since the Twitter API sends us data in JSON, tweets can be stored immediately, without any preprocessing. The key/value approach of other NoSQL databases like Redis is not suitable for us because we need to store more than a simple value.

Schemaless

It is essential for our application to store data in a flexible manner, because the tweets we receive can have different JSON structures. For instance, a retweet can have more fields than a regular tweet.

Real-Time

MongoDB is very good at real-time inserts because it keeps transaction support extremely simple.

NodeJS driver

MongoDB has a native, open source NodeJS driver written by Christian Amor Kvalheim.

Getting started

Installation

To install MongoDB, check the latest recommended version on http://www.mongodb.org/downloads and download it with wget. Debian and Ubuntu users can try the apt packages instead.

$ wget http://fastdl.mongodb.org/linux/mongodb-linux-x86_64-1.8.1.tgz

When the download is done, unpack the archive:

$ tar xzf mongodb-linux-x86_64-1.8.1.tgz

Before starting the database we need to create the folder where the data will be stored and set its permissions. By default the MongoDB configuration uses /data/db.

$ mkdir -p /data/db/
$ chown `id -u` /data/db

Then we can start the database by calling Mongo's daemon. The path below assumes the archive was unpacked into /opt/mongodb:

$ /opt/mongodb/bin/mongod

The MongoDB Interactive Shell

MongoDB comes with a JavaScript shell used to manage and query the database.

To start the shell, use the mongo executable:

$ /opt/mongodb/bin/mongo
MongoDB shell version: 1.8.1
connecting to: test

As we can see, once in the shell we are automatically connected to the test database.

Basic Commands

  • help

We can get help at any time by using this command.

> help

help() is also a method we can call on database or collection objects to list the functions associated with them. The db object always refers to the database currently in use, and collections are its attributes.

> db.help()
> db.my_collection.help()
> db.another_collection.help()

  • show

With show you can list all the databases or collections.

To list all the databases use:

> show dbs
admin   (empty)
local   (empty)
test    (empty)

To list all the collections in a database use:

> show collections

  • use

use switches to another database:

> use database_name
switched to db database_name

There is no need to create a database or a collection before connecting to it. MongoDB creates it automatically when the first insert is made.

JavaScript Functions

Since we are in JavaScript, we can enhance objects in the shell. For example we can add a method to a collection that we will use later.

This method prints the percentage of the tweets collection that a given number of documents represents, rounded to two decimals.

> db.tweets.percentage = function(value){ print(((value * 100) / this.count()).toFixed(2) + "%") }
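To check the arithmetic without a running database, here is a small sketch in plain JavaScript. The fakeTweets object is a hypothetical stand-in for the collection that only implements count(), and this version returns a string rounded to two decimals instead of printing it. The figures are the document counts used in this post.

```javascript
// Hypothetical stand-in for the tweets collection: only count() is needed.
var fakeTweets = {
  count: function () { return 1699; }
};

// Same idea as the shell method, but the result is returned, not printed.
fakeTweets.percentage = function (value) {
  return ((value * 100) / this.count()).toFixed(2) + "%";
};

console.log(fakeTweets.percentage(242)); // prints "14.24%"
```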

Requests


Requests are made using a set of methods on the database's collection objects.

To understand how these methods work, we will go through a specific example for each of them.

Inserting

To feed our collection with data, the insert() method is used.

The document we want to insert needs to be JSON formatted and passed as a parameter.

As an example we can create a collection containing all the speakers who took part in the What's Next event.

db.speakers.insert({ firstname : "Shay"         , lastname : "Banon"        , twitter : "kimchy" })
db.speakers.insert({ firstname : "Neal"         , lastname : "Gafter"       , twitter : "gafter" })
db.speakers.insert({ firstname : "Adrian"       , lastname : "Colyer"       , twitter : "adriancolyer" })
db.speakers.insert({ firstname : "Boris"        , lastname : "Bokowski"     , twitter : "bokowski" })
db.speakers.insert({ firstname : "Jonas"        , lastname : "Bonér"        , twitter : "jboner" })
db.speakers.insert({ firstname : "Rob"          , lastname : "Harrop"       , twitter : "robertharrop" })
db.speakers.insert({ firstname : "Kohsuke"      , lastname : "Kawaguchi"    , twitter : "kohsukekawa" })
db.speakers.insert({ firstname : "Howard Lewis" , lastname : "Ship"         , twitter : "hlship" })
db.speakers.insert({ firstname : "Jevgeni"      , lastname : "Kabanov"      , twitter : "ekabanov" })
db.speakers.insert({ firstname : "Theo"         , lastname : "Schlossnagle" , twitter : "postwait" })
db.speakers.insert({ firstname : "Michaël"      , lastname : "Chaize"       , twitter : "mchaize" })
db.speakers.insert({ firstname : "Brad"         , lastname : "Drysdale" })
db.speakers.insert({ firstname : "Jags"         , lastname : "Ramnarayan" })

As you can see, we can insert a document without worrying about its structure or generating an id.

In fact, when we insert a document into a collection without providing an explicit id (in the "_id" field), Mongo automatically generates a unique ObjectId and stores it in the "_id" field. Ids must be unique for each document in a collection. When the program provides its own ids, it is its job to make sure they are unique, otherwise an error will be thrown.

Updating

The update() method, as its name suggests, allows us to update a document or a set of documents depending on the parameters used.

It can take up to 4 parameters:

  • criteria : The query which selects the document to update
  • newObj : The updated object
  • upsert : A boolean that indicates whether newObj is inserted if the document does not exist (default : false)
  • multi : A boolean that indicates whether all documents matching the criteria should be updated (default : false)

Let's update a speaker to add the name of the talk he gave.

> db.speakers.update({ firstname : "Brad" }, { firstname : "Brad" , lastname : "Drysdale", talk : "HTML5 WebSockets" })
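Note that the call above passes a complete object as newObj: the matched document is replaced as a whole, which is why firstname and lastname are repeated. The sketch below imitates this behaviour in memory with plain JavaScript. It is a simplified model, not the server's implementation: criteria only support equality, the numeric _id values are made up, and the real server applies further rules that are not modeled here.

```javascript
// True when every key in criteria has the same value in doc (equality only).
function matches(doc, criteria) {
  return Object.keys(criteria).every(function (key) {
    return doc[key] === criteria[key];
  });
}

// Shallow copy of source's fields into target.
function copy(source, target) {
  Object.keys(source).forEach(function (key) { target[key] = source[key]; });
  return target;
}

// In-memory imitation of update(criteria, newObj, upsert, multi) on an array.
function update(docs, criteria, newObj, upsert, multi) {
  var updated = 0;
  for (var i = 0; i < docs.length; i++) {
    if (!matches(docs[i], criteria)) continue;
    // The matched document is replaced as a whole; only its _id is kept.
    docs[i] = copy(newObj, { _id: docs[i]._id });
    updated++;
    if (!multi) break; // without multi, only the first match is touched
  }
  // With upsert, newObj is inserted when nothing matched.
  if (updated === 0 && upsert) docs.push(copy(newObj, {}));
  return updated;
}

var speakers = [
  { _id: 1, firstname: "Brad", lastname: "Drysdale" },
  { _id: 2, firstname: "Jags", lastname: "Ramnarayan" }
];

update(speakers, { firstname: "Brad" },
       { firstname: "Brad", lastname: "Drysdale", talk: "HTML5 WebSockets" });
console.log(speakers[0].talk); // HTML5 WebSockets

// Fields missing from newObj are lost: lastname disappears here.
update(speakers, { firstname: "Jags" }, { firstname: "Jags" });
console.log(speakers[1]); // { _id: 2, firstname: 'Jags' }

// Nothing matches, so with upsert set the object is added.
update(speakers, { firstname: "Theo" }, { firstname: "Theo" }, true);
console.log(speakers.length); // 3
```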

If you need to update a single document identified by its _id, you can use the shorthand method save(). It takes a single parameter, the object to update. The _id field of this object must be set correctly for the update to happen.

Let's update another speaker using this method:

> var theo = db.speakers.findOne({ firstname : "Theo" })
> theo.talk = "Service Decoupling in Carrier-Class"
> db.speakers.save(theo)

Since we don't know the speaker's _id, we start by fetching the document. The findOne() method queries the collection and retrieves the first document (_id included) with "Theo" as its firstname.

Querying

With MongoDB you can query the data using several methods. Each one takes a JSON formatted query parameter in which you can use operators to fetch the data you need. Always remember that queries are case sensitive.

To illustrate our queries we are going to use a collection named tweets. It contains all the tweets tracked by our NodeJS application during the two days of the What's Next event. How they were recorded is explained later in this post.

Here is what a tweet stored in the collection looks like (a lot of fields are omitted here on purpose):

{
   "_id":ObjectId("4dde17241b1dc31f5a0000f5"),
   "date":NumberLong("1306400548416"),
   "tweet":{
      "text":"At @adriancolyer's talk #wsnparis  http://t.co/UO3jCBU",
      "retweet_count":0,
      "source":"Twitter for iPhone",
      "entities":{
         "urls":[
            {
               "url":"http://t.co/UO3jCBU"
            }
         ],
         "user_mentions":[
            {
               "screen_name":"adriancolyer",
               "name":"Adrian Colyer"
            }
         ],
         "hashtags":[
            {
               "text":"wsnparis"
            }
         ]
      },
      "user":{
         "screen_name":"jawher",
         "lang":"en",
         "name":"Jawher Moussa"
      }
   }
}

find()

To search for documents in a collection run :

> db.tweets.find()

A call to the find() method displays only the first 20 documents in the shell. Use the it keyword after the request to display the next results.

By default all the fields of the documents are returned. You can choose which fields you want with the second parameter of the method. For example, this request only returns the username and the text field of the documents:

> db.tweets.find( {} , { "tweet.user.screen_name" : true, "tweet.text" : true } )

To limit the number of documents returned, use the limit() method. Calling limit() with 1 as its parameter selects the same document as findOne(), except that find() returns a cursor while findOne() returns the document itself.

> db.tweets.find().limit(100)

To skip the first documents returned, use the skip() method:

> db.tweets.find().skip(100)
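Combined, skip() and limit() give a simple form of pagination. As a sketch in plain JavaScript, the hypothetical helpers below compute the two arguments for a given page and apply them to an in-memory array. In the shell, page 3 with 20 documents per page would correspond to db.tweets.find().skip(40).limit(20).

```javascript
// Computes the skip() and limit() arguments for a 1-based page number.
function pageWindow(page, pageSize) {
  return { skip: (page - 1) * pageSize, limit: pageSize };
}

// In-memory equivalent of skipping then limiting, applied to an array.
function paginate(docs, page, pageSize) {
  var w = pageWindow(page, pageSize);
  return docs.slice(w.skip, w.skip + w.limit);
}

// 45 hypothetical documents numbered from 1 to 45.
var docs = [];
for (var i = 1; i <= 45; i++) docs.push({ n: i });

var third = paginate(docs, 3, 20);
console.log(pageWindow(3, 20));        // { skip: 40, limit: 20 }
console.log(third.length, third[0].n); // 5 41
```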

To count the number of documents returned, use the count() method:

> db.tweets.find().count()
1699

To filter your request, add a query parameter to the find() method. A query parameter is a simple JSON object. For example, let's count the number of tweets that were created from twitter.com and not from a mobile or desktop client application.

> db.tweets.find( { "tweet.source" : "web" } ).count()
242

Let's use this query to try the percentage() method added earlier :

> db.tweets.percentage( db.tweets.find( { "tweet.source" : "web" } ).count() )
14.24%

Operators

These are the operators you can use in your queries:

  • $lt (<), $lte (<=), $gt (>) and $gte (>=)

Let's count how many tweets have been retweeted at least twice.

> db.tweets.find({ "tweet.retweet_count" : { $gte : 2 }}).count()
125

  • $ne (!=)

How many tweets were not posted by the official What's Next Paris twitter account (@WsN_Paris)?

> db.tweets.find({ "tweet.user.screen_name" : { $ne : "WsN_Paris" }}).count()
1688

  • $exists

This operator tests whether a field exists in a document's structure. We can for example count how many tweets are retweets.

> db.tweets.find({ "tweet.retweeted_status" : { $exists : true }}).count()
448

  • $size

The $size operator checks the size of an array. For example, let's see how many tweets contain a single link:

> db.tweets.find({ "tweet.entities.urls" : { $size : 1 }}).count()
341

  • $in , $nin

The $in operator takes an array of possible matches. With it we can count how many tweets were created with an official Twitter client on Apple devices.

> db.tweets.find({ "tweet.source" : { $in : ["Twitter for Mac", "Twitter for iPhone", "Twitter for iPad"] }}).count()
485

  • $or , $nor

The $or operator combines a list of boolean expressions. Let's count how many tweets were posted by @WsN_Paris or mention it:

> db.tweets.find({ $or : [ { "tweet.user.screen_name" : "WsN_Paris" } , { "tweet.entities.user_mentions.screen_name" : "WsN_Paris"} ] }).count()
41
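To clarify what these operators test, here is a simplified matcher written in plain JavaScript and applied to a trimmed copy of the sample tweet. It is only a sketch of the semantics, not how MongoDB evaluates queries: it assumes that object-valued conditions contain only the operators listed above, and it does not look inside arrays along a dotted path (so it would not handle "tweet.entities.user_mentions.screen_name").

```javascript
// Resolves a dotted path such as "tweet.user.screen_name" inside a document.
function getPath(doc, path) {
  return path.split(".").reduce(function (value, key) {
    return value == null ? undefined : value[key];
  }, doc);
}

// One function per operator: v is the field value, x the operand.
var operators = {
  $lt:     function (v, x) { return v < x; },
  $lte:    function (v, x) { return v <= x; },
  $gt:     function (v, x) { return v > x; },
  $gte:    function (v, x) { return v >= x; },
  $ne:     function (v, x) { return v !== x; },
  $exists: function (v, x) { return (v !== undefined) === x; },
  $size:   function (v, x) { return Array.isArray(v) && v.length === x; },
  $in:     function (v, x) { return x.indexOf(v) !== -1; },
  $nin:    function (v, x) { return x.indexOf(v) === -1; }
};

// True when the document satisfies every condition of the query.
function matches(doc, query) {
  return Object.keys(query).every(function (field) {
    var condition = query[field];
    if (field === "$or") {
      return condition.some(function (q) { return matches(doc, q); });
    }
    if (field === "$nor") {
      return !condition.some(function (q) { return matches(doc, q); });
    }
    var value = getPath(doc, field);
    if (condition !== null && typeof condition === "object") {
      // Every operator in the condition must hold.
      return Object.keys(condition).every(function (op) {
        return operators[op](value, condition[op]);
      });
    }
    return value === condition; // plain equality
  });
}

// Trimmed copy of the sample tweet shown earlier.
var sample = {
  tweet: {
    retweet_count: 0,
    source: "Twitter for iPhone",
    entities: { urls: [ { url: "http://t.co/UO3jCBU" } ] },
    user: { screen_name: "jawher" }
  }
};

console.log(matches(sample, { "tweet.retweet_count": { $gte: 2 } }));           // false
console.log(matches(sample, { "tweet.user.screen_name": { $ne: "WsN_Paris" } })); // true
console.log(matches(sample, { "tweet.retweeted_status": { $exists: true } }));  // false
console.log(matches(sample, { "tweet.entities.urls": { $size: 1 } }));          // true
console.log(matches(sample, { "tweet.source": { $in: ["Twitter for Mac", "Twitter for iPhone"] } })); // true
console.log(matches(sample, { $or: [ { "tweet.user.screen_name": "WsN_Paris" }, { "tweet.source": "web" } ] })); // false
```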

distinct()

Like in standard SQL, we can ask for the list of distinct values of a given key. To do that we use the distinct() method on the collection.

As an example, let's see how many different twitter clients have been used:

> db.tweets.distinct("tweet.source").length
52

As with find(), you can filter the documents taken into account by passing a query, here as the second parameter.

Let's see how many users have posted a tweet mentioning Jevgeni Kabanov (@ekabanov):

> db.tweets.distinct("tweet.user.screen_name", { "tweet.entities.user_mentions.screen_name" : "ekabanov"}).length
45

You may have noticed that we used the length property instead of the count() method. That is simply because distinct() returns an array, whereas find() returns a cursor.
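The following plain JavaScript sketch imitates distinct() on an in-memory array to show why the result is an ordinary array with a length property. The data is hypothetical, and as a simplification the optional filter is a predicate function rather than a query object.

```javascript
// Resolves a dotted path such as "tweet.source" inside a document.
function getPath(doc, path) {
  return path.split(".").reduce(function (value, key) {
    return value == null ? undefined : value[key];
  }, doc);
}

// In-memory imitation of distinct(key, query); filter is a predicate here.
function distinct(docs, key, filter) {
  var seen = [];
  docs.forEach(function (doc) {
    if (filter && !filter(doc)) return;
    var value = getPath(doc, key);
    if (value !== undefined && seen.indexOf(value) === -1) seen.push(value);
  });
  return seen;
}

// Hypothetical tweets.
var tweets = [
  { tweet: { source: "web", user: { screen_name: "a" } } },
  { tweet: { source: "Twitter for iPhone", user: { screen_name: "b" } } },
  { tweet: { source: "web", user: { screen_name: "c" } } }
];

var sources = distinct(tweets, "tweet.source");
var webUsers = distinct(tweets, "tweet.user.screen_name", function (doc) {
  return doc.tweet.source === "web";
});

console.log(sources.length); // 2
console.log(webUsers);       // [ 'a', 'c' ]
```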

group()

The group() method is used to return an array of grouped items. It takes an object as parameter with the following 3 required properties:

  • key : The fields to group by
  • initial : The initial value of the aggregation counter object.
  • reduce : The reduce function takes two arguments, the current document being iterated over and the aggregation counter object.

To illustrate this function, let's write a request that groups tweets by user and counts them. We sort the results by count in descending order. For readability only the first 10 results are shown:

> db.tweets.group({
	key : { "tweet.user.screen_name" : true },
	initial : { count : 0 },
	reduce : function(obj, prev){ prev.count++; }
}).sort(function(a, b){ return b.count - a.count })
 
{ "tweet.user.screen_name" : "LostInBrittany" , "count" : 93 }
{ "tweet.user.screen_name" : "bcourtine" , "count" : 60 }
{ "tweet.user.screen_name" : "antoine_sd" , "count" : 53 }
{ "tweet.user.screen_name" : "samklr" , "count" : 40 }
{ "tweet.user.screen_name" : "dadoonet" , "count" : 38 }
{ "tweet.user.screen_name" : "slemesle" , "count" : 37 }
{ "tweet.user.screen_name" : "jawher" , "count" : 33 }
{ "tweet.user.screen_name" : "framiere" , "count" : 31 }
{ "tweet.user.screen_name" : "obazoud" , "count" : 30 }
{ "tweet.user.screen_name" : "nmartignole" , "count" : 27 }

The group() method should only be used in non-sharded MongoDB configurations and for results smaller than 10,000 keys. For large scale aggregation, Map/Reduce is the answer.
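To see how key, initial and reduce work together, here is an in-memory imitation of group() in plain JavaScript, run on a few hypothetical tweets. It is a sketch of the idea only: it copies initial shallowly and says nothing about how the server performs the grouping.

```javascript
// Resolves a dotted path such as "tweet.user.screen_name" inside a document.
function getPath(doc, path) {
  return path.split(".").reduce(function (value, key) {
    return value == null ? undefined : value[key];
  }, doc);
}

// In-memory imitation of group({ key, initial, reduce }) on an array.
function group(docs, spec) {
  var buckets = {}; // serialized key values -> aggregation object
  var order = [];   // keeps groups in order of first appearance
  docs.forEach(function (doc) {
    var agg = {};
    Object.keys(spec.key).forEach(function (field) {
      agg[field] = getPath(doc, field);
    });
    var id = JSON.stringify(agg);
    if (!buckets[id]) {
      // A new group starts from its key values plus a copy of "initial".
      Object.keys(spec.initial).forEach(function (k) {
        agg[k] = spec.initial[k];
      });
      buckets[id] = agg;
      order.push(id);
    }
    // reduce receives the current document and the group's aggregation object.
    spec.reduce(doc, buckets[id]);
  });
  return order.map(function (id) { return buckets[id]; });
}

// Hypothetical tweets.
var tweets = [
  { tweet: { user: { screen_name: "b" } } },
  { tweet: { user: { screen_name: "a" } } },
  { tweet: { user: { screen_name: "a" } } }
];

var result = group(tweets, {
  key: { "tweet.user.screen_name": true },
  initial: { count: 0 },
  reduce: function (obj, prev) { prev.count++; }
}).sort(function (a, b) { return b.count - a.count; });

console.log(result);
// [ { 'tweet.user.screen_name': 'a', count: 2 },
//   { 'tweet.user.screen_name': 'b', count: 1 } ]
```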

Deleting

To delete documents, the remove() method is used. It takes a JSON query object that defines which documents in the collection will be removed. A call to remove() without any query removes all the documents in the collection.

> db.speakers.remove({ _id : ObjectId("4d9db8c5cac30037d92e39fc") });
> db.speakers.remove({ firstname : "Jonas"});
> db.speakers.remove()

If needed, we can delete a collection from a database by calling its drop() method:

> db.speakers.drop();
[conn1] CMD: drop wsn.speakers
true

Full-Text Search

At the time of writing, the MongoDB production release is 1.8.1 and it has no real full-text search support.

The documentation describes a way to do some basic full-text search using arrays, but you will have to code some logic on your own to implement it.
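As a rough idea of what that logic could look like, the sketch below splits a text into an array of unique lowercase keywords stored next to the document. The _keywords field name and the tokenizer are assumptions made for this example, and the tokenizer is deliberately naive (it treats every character outside a-z and 0-9 as a separator and ignores words of fewer than three characters).

```javascript
// Splits a text into unique lowercase keywords (deliberately naive).
function extractKeywords(text) {
  var words = text.toLowerCase().split(/[^a-z0-9]+/);
  var unique = [];
  words.forEach(function (word) {
    if (word.length > 2 && unique.indexOf(word) === -1) unique.push(word);
  });
  return unique;
}

// Hypothetical document: the keywords are computed before insertion.
var doc = { text: "MongoDB and NodeJS: storing tweets in MongoDB" };
doc._keywords = extractKeywords(doc.text);

console.log(doc._keywords);
// [ 'mongodb', 'and', 'nodejs', 'storing', 'tweets' ]
```

Since a query on an array field matches when any element equals the given value, such a document could then be found in the shell with something like db.tweets.find({ _keywords : "mongodb" }).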

Many discussions show that today's best option for full-text search is to use your database together with another tool like Sphinx, ElasticSearch or Solr.

The status of the full-text search implementation in MongoDB can be followed here.

MongoDB & NodeJS

Driver

To establish a connection with our database in NodeJS we need to install the node-mongodb-native driver.

Clone it from the git repository and build it:

$ git clone https://github.com/christkv/node-mongodb-native.git
$ cd node-mongodb-native
$ make

Database Module

As we did in the first post, we are going to write a simple module that deals with the database and call it from the main program.

Our module will use the driver to connect to the database and get a reference to the collection in which tweets will be stored. To add a tweet to the collection, our module will expose an insertTweet() method.

Create a new file called database.js and add the following content:

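// Loading required libraries
var mongodb = require('/path/to/mongodb'),
        DB_NAME = "wsn",
        DB_HOST = "localhost",
        DB_PORT = 27017,
        COL_NAME = "tweets";
 
// Opens a connection and passes the tweets collection to the callback
var getCollection = function(callback){
        var db = new mongodb.Db(DB_NAME, new mongodb.Server(DB_HOST, DB_PORT, {}), {});
        // Database connection
        db.open(function(error, client){
                if (error) return callback(error);
                // Getting a reference to the tweets collection
                db.collection(COL_NAME, function(error, collection) {
                        // The db handle is passed along so that the caller can close it when done
                        callback(error, collection, db);
                })
        })
}
 
// Called by server.js, the callback is optional
exports.insertTweet = function(tweet, callback){
        // Opening the connection and getting the collection
        getCollection(function(error, collection, db){
                if (error) {
                        console.error("-- Error with database : " + error);
                        if (db) db.close();
                        if (callback) callback(error);
                        return;
                }
                // Tweet insertion, the connection is only closed once the insert has been issued
                collection.insert(tweet, function(error, docs){
                        db.close();
                        if (callback) callback(error, docs);
                });
        })
}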

Finally, the only thing left is to load our module in server.js and store each new tweet after sending it to the clients.

Edit server.js to add the two new lines:

var http = require('http'),
        io = require('/path/to/socket.io'),
        twitter = require('./twitter'),
        // Loading our database module
        database = require('./database');
 
[ ... ]
 
tracker.track().on('tweet', function(tweet){
        console.log('New tweet from :"' + tweet.user.screen_name + '" -> ' + tweet.text);
        socket.broadcast( JSON.stringify( [ ... ] ) );
        // Tweet insertion, timestamp is added to simplify time based query
        database.insertTweet( { tweet : tweet, date : new Date().getTime() } );
});
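Because each stored document carries a date field in milliseconds, time based queries come down to a numeric range. As a sketch, the hypothetical helper below builds such a query object in plain JavaScript. The one hour window is chosen for the example and happens to contain the sample tweet shown earlier.

```javascript
// Builds a query selecting documents whose "date" field (milliseconds
// since the epoch) falls between start (included) and end (excluded).
function dateRangeQuery(start, end) {
  return { date: { $gte: start.getTime(), $lt: end.getTime() } };
}

// Example window: 26 May 2011, from 09:00 to 10:00 UTC.
var start = new Date(Date.UTC(2011, 4, 26, 9, 0, 0));
var end = new Date(Date.UTC(2011, 4, 26, 10, 0, 0));
var query = dateRangeQuery(start, end);

console.log(JSON.stringify(query));
// {"date":{"$gte":1306400400000,"$lt":1306404000000}}
```

In the shell, the same object could be passed to find(), for example db.tweets.find({ date : { $gte : 1306400400000, $lt : 1306404000000 } }).count().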

Conclusion

MongoDB

In this introduction we only scratched the surface of the features offered by MongoDB, but I hope it made you curious enough to learn more about this database.

MongoDB, and more generally the NoSQL movement, makes us look at databases differently. In Mongo, replacing SQL with JavaScript can be unsettling at first, but it quickly proves easy and flexible.

Thanks to its JavaScript shell you can start practicing MongoDB very quickly. In fact you can even start without installing it, as try.mongodb.org offers an interactive shell tutorial.

As the large and growing number of available drivers (Java, Scala, C#, Erlang, C, C++, etc.) shows, MongoDB has been evolving quickly since it was released by 10gen back in 2009.

Using node-mongodb-native is pretty simple, but it can become harder to work with on more complex projects because nested callbacks pile up. To address this issue you can take a look at Mongoose, an object mapper for MongoDB.

Architecture

While use cases for this type of architecture are still a minority on the market, it is very interesting to see that the same language can be used for every component. JavaScript is now a widely used language and no longer has to be confined to web browsers.

The mobile internet, which has grown dramatically in recent years, is also a good fit for this kind of architecture, with smartphone and tablet applications that have to handle a large number of continuously connected users.

NodeJS and MongoDB are both built with scalability as a main goal and are today one of the best pairings for building real-time web applications.

Feedback from companies shows that they help reduce the complexity of existing architectures and that they will be used even more in the future.
