A Full Javascript Architecture, Part Three – MongoDB
This is the last part in a series of three showing how to build a simple JavaScript architecture.
After seeing how to create a NodeJS application to track tweets and send them to a Google Chrome Extension in real time, we are going to see how to store and query them using a MongoDB database.
For a better understanding, this post starts by explaining the basics of MongoDB and then deals with its integration in a NodeJS application.
Introduction to MongoDB
Presentation
MongoDB presents itself as a scalable, high-performance, open source and document-oriented database written in C++.
The concepts behind the first three characteristics are well known, so let’s focus on the fourth one : document-oriented.
To clearly understand this concept we need some basic MongoDB terminology. Since we are going to see a lot of JSON let’s start right away :
{ "terminology" : { "Database" : "Database", "Table" : "Collection", "Row" : "Document" } }
A document-oriented database is a database where each document in the same collection may have a totally different structure. With this orientation, any number of fields of any length can be added to a document even after its creation.
Because all the information about an entity can be stored dynamically within a single document, join operations are no longer needed in this type of database.
Join operations are really expensive : they require strong consistency and a fixed schema. Avoiding them gives the database a great capacity for horizontal scaling.
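As a rough illustration of these two ideas, the plain JavaScript sketch below models a collection as an array. No database is involved, and the speaker names and the talks field are made up for this example.

```javascript
// A "collection" modelled as a plain array (illustration only).
const speakers = [
  // Related data is embedded in the document itself, so no join is needed.
  { name: "Alice", talks: [{ title: "Intro to NoSQL", room: "A" }] },
  // A document in the same collection can have a different structure.
  { name: "Bob", twitter: "bob_example" }
];

// Add a field to an existing document after its creation.
speakers[1].talks = [];

// Read everything about an entity from a single document.
function talkTitles(speaker) {
  return (speaker.talks || []).map(function (talk) { return talk.title; });
}

console.log(talkTitles(speakers[0])); // [ 'Intro to NoSQL' ]
```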
Data Types
MongoDB uses BSON as its data storage and network transfer format. BSON stands for “Binary JSON” and is a binary-encoded serialization of JSON-like documents. According to bsonspec.org, it was designed to be lightweight, traversable and efficient.
Its key advantage over XML and JSON is efficiency in terms of space and compute time.
BSON documents can store several data types, such as string, integer, boolean, double, null, array, object, date, binary data, regular expression and source code. You can find more by exploring the specification.
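To give an idea of what “binary-encoded” means, here is a minimal sketch of an encoder that handles only string fields. It follows the layout described on bsonspec.org (a little-endian int32 total size, then one element per field, then a terminating zero byte). The function name encodeStringDoc is chosen for this example; a real driver covers every type and should be used instead.

```javascript
// Encode a flat document whose values are all strings into BSON bytes.
function encodeStringDoc(doc) {
  const parts = [];
  Object.keys(doc).forEach(function (key) {
    const value = Buffer.from(doc[key] + "\0", "utf8"); // string bytes + trailing NUL
    const length = Buffer.alloc(4);
    length.writeInt32LE(value.length, 0);               // string length, NUL included
    parts.push(
      Buffer.from([0x02]),                              // element type 0x02 = UTF-8 string
      Buffer.from(key + "\0", "utf8"),                  // field name as a C string
      length,
      value
    );
  });
  const body = Buffer.concat(parts);
  const total = Buffer.alloc(4);
  total.writeInt32LE(4 + body.length + 1, 0);           // size prefix + elements + final 0x00
  return Buffer.concat([total, body, Buffer.from([0x00])]);
}

console.log(encodeStringDoc({ hello: "world" }).toString("hex"));
// 160000000268656c6c6f0006000000776f726c640000
```

The 22 bytes printed here match the {"hello": "world"} example given on bsonspec.org.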
Advanced Features
MongoDB has a set of advanced features like full index support, replication, sharding and map/reduce. Unfortunately they won’t be covered in this post.
Why MongoDB ?
Choosing MongoDB for our use case was motivated by a set of features that characterize this database.
JavaScript
Requests in MongoDB are written in JavaScript, which is a perfect fit for our architecture. In addition, these requests follow an RDBMS-style query model, which narrows the gap for developers used to traditional structured query languages.
JSON Document-Oriented
MongoDB uses binary-encoded JSON documents (BSON) as its data structure. Since the Twitter API sends us data in JSON, tweets can be stored immediately, without any preprocessing. The key/value approach of other NoSQL databases like Redis is not suitable for us because we need to store more than a simple value.
Schemaless
It’s essential for our application to store data in a flexible manner, because the tweets we are going to receive can have different JSON structures. For instance, a retweet can have more fields than a regular tweet.
Real-Time
MongoDB handles real-time inserts very well because it keeps transaction support extremely simple.
NodeJS driver
MongoDB has a native, open source NodeJS driver written by Christian Amor Kvalheim.
Getting started
Installation
To install MongoDB, check the latest recommended version on http://www.mongodb.org/downloads and download it with wget. Debian and Ubuntu users can try the apt packages; check out this page for more information.
$ wget http://fastdl.mongodb.org/linux/mongodb-linux-x86_64-1.8.1.tgz
When the download is done unpack the archive :
$ tar xzf mongodb-linux-x86_64-1.8.1.tgz
Before starting the database we need to create the folder where the data will be stored and set its permissions. By default, the MongoDB configuration uses /data/db.
$ mkdir -p /data/db/
$ chown `id -u` /data/db
Then we can start the database by calling Mongo’s daemon :
$ /opt/mongodb/bin/mongod
The MongoDB Interactive Shell
MongoDB comes with a JavaScript shell used to manage and query the database.
To start the shell use the mongo executable :
$ /opt/mongodb/bin/mongo
MongoDB shell version: 1.8.1
connecting to: test
As we can see, once in the shell we are automatically connected to the test database.
Basic Commands
- help
We can get help at any time by using this command.
> help
Help is also a method we can call on database or collection objects to list all the functions associated with them. The db object always refers to the database currently in use, and collections are its attributes.
> db.help()
> db.my_collection.help()
> db.another_collection.help()
- show
With show you can list all the databases or collections.
To list all the databases use :
> show dbs
admin (empty)
local (empty)
test (empty)
To list all the collections in a database use :
> show collections
- use
use switches to a database :
> use database_name
switched to db database_name
There is no need to create a database or a collection before connecting to it. MongoDB will create it automatically when the first insert is made.
JavaScript Functions
Since we are in JavaScript, we can enhance objects in the shell. For example we can add a method to a collection that we will use later.
This method prints the percentage of the tweets collection that the provided number of documents represents.
> db.tweets.percentage = function(value){
    print((value*100) / (this.count()) + "%")
  }
Requests
Requests are made using a set of methods on the database’s collection objects.
To better understand how to use these methods we will look at specific examples for each of them.
Inserting
To feed our collection with data the insert() method is used.
The document we want to insert needs to be JSON formatted and passed as a parameter.
As an example we can create a collection containing all the speakers who participated in the What’s Next event.
db.speakers.insert({ firstname : "Shay" , lastname : "Banon" , twitter : "kimchy" })
db.speakers.insert({ firstname : "Neal" , lastname : "Gafter" , twitter : "gafter" })
db.speakers.insert({ firstname : "Adrian" , lastname : "Colyer" , twitter : "adriancolyer" })
db.speakers.insert({ firstname : "Boris" , lastname : "Bokowski" , twitter : "bokowski" })
db.speakers.insert({ firstname : "Jonas" , lastname : "Bonér" , twitter : "jboner" })
db.speakers.insert({ firstname : "Rob " , lastname : "Harrop" , twitter : "robertharrop" })
db.speakers.insert({ firstname : "Kohsuke" , lastname : "Kawaguchi" , twitter : "kohsukekawa" })
db.speakers.insert({ firstname : "Howard Lewis" , lastname : "Ship" , twitter : "hlship" })
db.speakers.insert({ firstname : "Jevgeni" , lastname : "Kabanov" , twitter : "ekabanov" })
db.speakers.insert({ firstname : "Theo" , lastname : "Schlossnagle" , twitter : "postwait" })
db.speakers.insert({ firstname : "Michaël" , lastname : "Chaize" , twitter : "mchaize" })
db.speakers.insert({ firstname : "Brad" , lastname : "Drysdale" })
db.speakers.insert({ firstname : "Jags" , lastname : "Ramnarayan" })
As you can see, we can insert a document without worrying about its structure or generating an id.
In fact, when we insert a document in a collection and do not provide an explicit id (with the field name “_id”), Mongo automatically generates a unique ObjectId and stores it in the “_id” field. Ids must be unique for each document in a collection. When the program provides its own ids, it is the program’s job to make sure they are unique, otherwise an error will be thrown.
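The generated ObjectId is not an arbitrary value: according to the MongoDB documentation, its first four bytes hold the creation time as a Unix timestamp in seconds. Here is a minimal sketch that decodes it in plain JavaScript. The helper name objectIdTimestamp is chosen for this example, and the id is the one from the sample tweet shown in the Querying section.

```javascript
// The first 4 bytes (8 hex characters) of an ObjectId are a timestamp in seconds.
function objectIdTimestamp(hexId) {
  return parseInt(hexId.substring(0, 8), 16);
}

const seconds = objectIdTimestamp("4dde17241b1dc31f5a0000f5");
console.log(seconds);                                // 1306400548
console.log(new Date(seconds * 1000).toISOString()); // 2011-05-26T09:02:28.000Z
```

The decoded value agrees, to the second, with the date field stored alongside that tweet (1306400548416 milliseconds).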
Updating
The update() method, as its name suggests, allows us to update a document or a set of documents depending on the parameters used.
It can take up to 4 parameters :
- criteria : The query which selects the document to update
- newObj : The updated object
- upsert : A boolean indicating whether newObj is inserted if the document does not exist (default : false)
- multi : A boolean indicating whether all documents matching the criteria should be updated (default : false)
Let’s update a speaker to add the name of the talk he gave.
> db.speakers.update(
    { firstname : "Brad" },
    { firstname : "Brad" , lastname : "Drysdale", talk : "HTML5 WebSockets" }
  )
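Note that when newObj is a plain document, as above, it replaces the matched document entirely, which is why firstname and lastname are repeated. The sketch below is a simplified in-memory model of that behaviour, not the real implementation: it handles only exact-match criteria, ignores the multi parameter, and does not generate an _id on upsert.

```javascript
// Simplified in-memory model of update(criteria, newObj, upsert).
function matchesAll(doc, criteria) {
  return Object.keys(criteria).every(function (key) {
    return doc[key] === criteria[key];
  });
}

function update(docs, criteria, newObj, upsert) {
  const index = docs.findIndex(function (doc) { return matchesAll(doc, criteria); });
  if (index === -1) {
    // No match: insert newObj only when upsert is true.
    if (upsert) docs.push(Object.assign({}, newObj));
    return;
  }
  // The whole document is replaced; only _id is carried over.
  docs[index] = Object.assign({}, newObj, { _id: docs[index]._id });
}

const speakers = [{ _id: 1, firstname: "Brad", lastname: "Drysdale" }];
update(speakers, { firstname: "Brad" }, { talk: "HTML5 WebSockets" }, false);
console.log(speakers[0]); // the names are gone, only talk and _id remain
```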
If you need to update a single document using its _id you can use the shorthand method save(). It takes only one parameter, the object to update. The _id field of this object must be set correctly for the update to happen.
Let’s update another speaker using this method :
> var theo = db.speakers.findOne({ firstname : "Theo" })
> theo.talk = "Service Decoupling in Carrier-Class"
> db.speakers.save(theo)
Since we don’t know the speaker’s _id, we start by finding it. The findOne() method queries the collection and retrieves the first document ( _id included ) with “Theo” as a firstname.
Querying
With MongoDB you can query the data by using several methods. Each one takes a JSON formatted query parameter in which you can use some operators to fetch the data you need. You should always remember that queries are case sensitive.
To illustrate our queries we are going to use a collection named tweets. This collection contains all the tweets that were tracked by our NodeJS application during the two days of the What’s Next event. How they were recorded is explained later in this post.
Here is what a tweet stored in the collection looks like (a lot of fields are omitted here on purpose):
{
  "_id":ObjectId("4dde17241b1dc31f5a0000f5"),
  "date":NumberLong("1306400548416"),
  "tweet":{
    "text":"At @adriancolyer's talk #wsnparis http://t.co/UO3jCBU",
    "retweet_count":0,
    "source":"Twitter for iPhone",
    "entities":{
      "urls":[ { "url":"http://t.co/UO3jCBU" } ],
      "user_mentions":[ { "screen_name":"adriancolyer", "name":"Adrian Colyer" } ],
      "hashtags":[ { "text":"wsnparis" } ]
    },
    "user":{
      "screen_name":"jawher",
      "lang":"en",
      "name":"Jawher Moussa"
    }
  }
}
find()
To search for documents in a collection run :
> db.tweets.find()
A call to the find() method displays only the first 20 documents in the shell. Type the it keyword after the request to display the next results.
By default all the fields of the documents are returned, but you can choose which fields you want with the second parameter of the method. For example this request only returns the username and the text fields of the documents :
> db.tweets.find( {} , { "tweet.user.screen_name" : true, "tweet.text" : true } )
To limit the number of documents returned use the limit() method. Using limit() with 1 as parameter is comparable to using the findOne() method.
> db.tweets.find().limit(100)
To skip the first documents returned use the skip() method :
> db.tweets.find().skip(100)
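skip() and limit() are typically combined to page through results. The sketch below models that combination on a plain array so it can run without a database. The page function and its parameters are names chosen for this example.

```javascript
// Model of db.tweets.find().skip(skip).limit(pageSize) on a plain array.
function page(docs, pageNumber, pageSize) {
  const skip = (pageNumber - 1) * pageSize;  // what skip() would receive
  return docs.slice(skip, skip + pageSize);  // what limit() would keep
}

const docs = [1, 2, 3, 4, 5, 6, 7].map(function (n) { return { n: n }; });
console.log(page(docs, 2, 3)); // [ { n: 4 }, { n: 5 }, { n: 6 } ]
console.log(page(docs, 3, 3)); // [ { n: 7 } ]
```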
To count the number of documents returned use the count() method :
> db.tweets.find().count()
1699
To filter your request add a query parameter to the find() method. A query parameter is a simple JSON object, for example let’s count the number of tweets that have been created from twitter.com and not from a mobile or desktop client application.
> db.tweets.find( { "tweet.source" : "web" } ).count()
242
Let’s use this query to try the percentage() method added earlier :
> db.tweets.percentage( db.tweets.find( { "tweet.source" : "web" } ).count() )
14.24%
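As written, percentage() concatenates the raw result of the division, so the shell would most likely print many more decimals than the rounded figure shown above. Here is a small sketch of a variant that rounds to two decimals with the standard toFixed() method. It is written as a standalone function that takes the total explicitly, an adaptation made so it can run outside the shell.

```javascript
// Percentage of `value` out of `total`, rounded to two decimals.
function percentage(value, total) {
  return ((value * 100) / total).toFixed(2) + "%";
}

console.log(percentage(242, 1699)); // 14.24%
```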
Operators
These are the operators you can use in your queries :
- $lt (<), $lte (<=), $gt (>) and $gte (>=)
Let’s count how many tweets have been retweeted at least twice.
> db.tweets.find({ "tweet.retweet_count" : { $gte : 2 }}).count()
125
- $ne (!=)
How many tweets were not posted by the official What’s Next Paris twitter account (@WsN_Paris) ?
> db.tweets.find({ "tweet.user.screen_name" : { $ne : "WsN_Paris" }}).count()
1688
- $exists
This operator tests whether a field exists in a document’s structure. We can for example count how many tweets are retweets.
> db.tweets.find({ "tweet.retweeted_status" : { $exists : true }}).count()
448
- $size
The $size operator checks the size of an array. For example let’s see how many tweets contain a single link :
> db.tweets.find({ "tweet.entities.urls" : { $size : 1 }}).count()
341
- $in , $nin
The $in operator takes an array of possible matches. With it we can count how many tweets have been created by an official Twitter client on Apple devices.
> db.tweets.find({ "tweet.source" : { $in : ["Twitter for Mac", "Twitter for iPhone", "Twitter for iPad"] }}).count()
485
- $or , $nor
The $or operator combines a list of boolean expressions. Let’s count how many tweets were made by @WsN_Paris or mention it :
> db.tweets.find({ $or : [ { "tweet.user.screen_name" : "WsN_Paris" } , { "tweet.entities.user_mentions.screen_name" : "WsN_Paris"} ] }).count()
41
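To make the behaviour of dotted field names and operators more concrete, here is a rough in-memory sketch of how such a query object could be evaluated against a single document. It is an approximation written for this post, not MongoDB’s actual algorithm, and it supports only the operators listed above. The function names resolve, equals, matchesField and matches are chosen for this example.

```javascript
// Collect the values found at a dotted path, descending into arrays on the way.
function resolve(value, parts) {
  if (parts.length === 0) return [value];
  if (Array.isArray(value)) {
    return value.flatMap(function (item) { return resolve(item, parts); });
  }
  if (value === null || typeof value !== "object") return [];
  return parts[0] in value ? resolve(value[parts[0]], parts.slice(1)) : [];
}

function equals(value, expected) {
  return value === expected || (Array.isArray(value) && value.includes(expected));
}

const comparators = {
  $lt: function (v, x) { return v < x; },
  $lte: function (v, x) { return v <= x; },
  $gt: function (v, x) { return v > x; },
  $gte: function (v, x) { return v >= x; },
  $in: function (v, xs) { return xs.some(function (x) { return equals(v, x); }); },
  $size: function (v, n) { return Array.isArray(v) && v.length === n; }
};

function matchesField(doc, path, condition) {
  const values = resolve(doc, path.split("."));
  const isOperatorObject =
    condition !== null && typeof condition === "object" && !Array.isArray(condition);
  if (!isOperatorObject) {
    return values.some(function (v) { return equals(v, condition); });
  }
  return Object.keys(condition).every(function (op) {
    const arg = condition[op];
    if (op === "$exists") return (values.length > 0) === arg;
    if (op === "$ne") return !values.some(function (v) { return equals(v, arg); });
    if (op === "$nin") return !values.some(function (v) { return comparators.$in(v, arg); });
    return values.some(function (v) { return comparators[op](v, arg); });
  });
}

function matches(doc, query) {
  return Object.keys(query).every(function (key) {
    if (key === "$or") return query.$or.some(function (q) { return matches(doc, q); });
    if (key === "$nor") return !query.$nor.some(function (q) { return matches(doc, q); });
    return matchesField(doc, key, query[key]);
  });
}

// A trimmed copy of the sample tweet shown earlier.
const sample = { tweet: {
  retweet_count: 0,
  source: "Twitter for iPhone",
  entities: { urls: [{ url: "http://t.co/UO3jCBU" }],
              user_mentions: [{ screen_name: "adriancolyer" }] },
  user: { screen_name: "jawher" }
} };

console.log(matches(sample, { "tweet.entities.urls": { $size: 1 } }));                      // true
console.log(matches(sample, { "tweet.retweet_count": { $gte: 2 } }));                       // false
console.log(matches(sample, { "tweet.entities.user_mentions.screen_name": "adriancolyer" })); // true
```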
distinct()
As in standard SQL, we can make a request that returns the list of distinct values for a given key. In order to do that we use the distinct() method on the collection.
As an example let’s see how many different twitter clients have been used :
> db.tweets.distinct("tweet.source").length
52
Like the find() method, you can add a second parameter to filter your query.
Let’s see how many users have posted a tweet mentioning Jevgeni Kabanov (@ekabanov).
> db.tweets.distinct("tweet.user.screen_name", { "tweet.entities.user_mentions.screen_name" : "ekabanov"}).length
45
You may have noticed that we used the length property instead of the count() method. That is simply because distinct() returns an array, whereas find() returns a cursor.
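The array-returning behaviour can be pictured with a small in-memory model. The distinct function below is a sketch for illustration only: it takes a getter function and an optional filter function instead of MongoDB’s key string and query object, and the sample data is made up.

```javascript
// Return the distinct values produced by getKey, optionally filtering first.
function distinct(docs, getKey, filter) {
  const selected = filter ? docs.filter(filter) : docs;
  return Array.from(new Set(selected.map(getKey)));
}

const tweets = [
  { user: "alice", source: "web" },
  { user: "bob", source: "web" },
  { user: "alice", source: "Twitter for iPhone" }
];

const sources = distinct(tweets, function (t) { return t.source; });
console.log(sources.length); // 2

const webUsers = distinct(
  tweets,
  function (t) { return t.user; },
  function (t) { return t.source === "web"; }
);
console.log(webUsers); // [ 'alice', 'bob' ]
```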
group()
The group() method returns an array of grouped items. It takes an object as parameter with the following 3 required properties :
- key : The fields to group by
- initial : The initial value of the aggregation counter object
- reduce : The reduce function, which takes two arguments : the current document being iterated over and the aggregation counter object
To illustrate this method, let’s write a request that groups by user and counts their tweets. We sort the results by count in descending order. For readability only the first 10 results are shown :
> db.tweets.group({
    key : { "tweet.user.screen_name" : true },
    initial : { count : 0 },
    reduce : function(obj, prev){ prev.count++; }
  }).sort(function(a, b){ return b.count - a.count })
{ "tweet.user.screen_name" : "LostInBrittany" , "count" : 93 }
{ "tweet.user.screen_name" : "bcourtine" , "count" : 60 }
{ "tweet.user.screen_name" : "antoine_sd" , "count" : 53 }
{ "tweet.user.screen_name" : "samklr" , "count" : 40 }
{ "tweet.user.screen_name" : "dadoonet" , "count" : 38 }
{ "tweet.user.screen_name" : "slemesle" , "count" : 37 }
{ "tweet.user.screen_name" : "jawher" , "count" : 33 }
{ "tweet.user.screen_name" : "framiere" , "count" : 31 }
{ "tweet.user.screen_name" : "obazoud" , "count" : 30 }
{ "tweet.user.screen_name" : "nmartignole" , "count" : 27 }
The group() method should only be used in non sharded MongoDB configurations and for a result smaller than 10,000 keys. For large scale aggregation, Map/Reduce is the answer.
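The roles of key, initial and reduce may be easier to see in a small in-memory model. The sketch below is not MongoDB’s implementation: here key is a function returning the grouping value (an assumption made to keep the example short), and the sample documents are made up.

```javascript
// In-memory model of group(): one counter object per distinct key value.
function group(docs, options) {
  const groups = new Map();
  docs.forEach(function (doc) {
    const keyValue = options.key(doc);
    if (!groups.has(keyValue)) {
      // Each group starts from its own copy of `initial`.
      groups.set(keyValue, Object.assign({ key: keyValue }, options.initial));
    }
    // reduce receives the current document and the group's counter object.
    options.reduce(doc, groups.get(keyValue));
  });
  return Array.from(groups.values());
}

const tweets = [{ user: "alice" }, { user: "bob" }, { user: "bob" }, { user: "bob" }];

const counts = group(tweets, {
  key: function (doc) { return doc.user; },
  initial: { count: 0 },
  reduce: function (doc, prev) { prev.count++; }
}).sort(function (a, b) { return b.count - a.count; });

console.log(counts);
// [ { key: 'bob', count: 3 }, { key: 'alice', count: 1 } ]
```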
Deleting
To delete documents the remove() method is used. It takes a JSON query object that defines which documents in the collection will be removed. A call to remove() without any query removes all the documents in the collection.
> db.speakers.remove({ _id : ObjectId("4d9db8c5cac30037d92e39fc") });
> db.speakers.remove({ firstname : "Jonas"});
> db.speakers.remove()
If needed, we can delete a collection from a database by calling its drop() method:
> db.speakers.drop();
[conn1] CMD: drop wsn.speakers
true
Full-Text Search
As of this writing, the MongoDB production release is 1.8.1 and does not have real full-text search support.
The documentation describes a way to do some basic full-text search using arrays, but you will have to code some logic on your own to implement it.
Many discussions show that today’s best option for full-text search is to pair your database with another tool like Sphinx, ElasticSearch or Solr.
The status of the full-text search implementation in MongoDB can be followed here.
MongoDB & NodeJS
Driver
To establish a connection with our database in NodeJS we need to install the node-mongodb-native driver.
Clone it from the git repository and install :
$ git clone https://github.com/christkv/node-mongodb-native.git
$ cd node-mongodb-native
$ make
Database Module
As we did back in the first post, we are going to make a simple module that deals with the database and call it from the main program.
Our module will use the driver to connect with the database and get a reference to the collection in which tweets will be stored. To add a tweet to the collection our module will have an insertTweet() method.
Create a new file called database.js and add the following content :
// Loading required libraries
var mongodb = require('/path/to/mongodb'),
    DB_NAME = "wsn",
    DB_HOST = "localhost",
    DB_PORT = 27017,
    COL_NAME = "tweets";

var getCollection = function(callback){
  var db = new mongodb.Db(DB_NAME, new mongodb.Server(DB_HOST, DB_PORT, {}, {}));
  // Database connection
  db.open(function(error, client){
    // Stop here if the connection failed
    if (error) return console.error("-- Error with database : " + error);
    // Getting a reference to the tweets collection
    db.collection(COL_NAME, function(error, collection) {
      if (error) return console.error("-- Error with collection : " + error);
      callback(collection);
      db.close();
    })
  })
}

// Called by server.js
this.insertTweet = function(tweet, callback){
  // Opening the connection and getting the collection
  getCollection(function(collection){
    // Tweet insertion
    collection.insert(tweet);
  })
}
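The module above logs connection errors but never reports them to the caller, and the callback parameter of insertTweet() is unused. One possible way to propagate errors is Node’s error-first callback convention. The sketch below shows the idea with a stand-in collection so it can run without a database. createTweetStore, fakeCollection and the insert(doc, callback) signature of the stand-in are assumptions of this example, not the driver’s documented API.

```javascript
// getCollection is injected so the sketch runs without a database.
function createTweetStore(getCollection) {
  return {
    insertTweet: function (tweet, callback) {
      getCollection(function (error, collection) {
        // Report connection problems to the caller instead of only logging them.
        if (error) return callback(error);
        collection.insert(tweet, callback);
      });
    }
  };
}

// Stand-in collection that stores documents in an array.
const stored = [];
const fakeCollection = {
  insert: function (doc, callback) { stored.push(doc); callback(null, doc); }
};

const store = createTweetStore(function (callback) { callback(null, fakeCollection); });
store.insertTweet({ tweet: { text: "hello" }, date: Date.now() }, function (error) {
  console.log(error ? "insert failed" : "stored " + stored.length + " tweet(s)");
});
```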
Finally, the only thing left is to load our module in server.js and store each new tweet after sending it to the clients.
Edit server.js to add the two new lines :
var http = require('http'),
    io = require('/path/to/socket.io'),
    twitter = require('./twitter'),
    // Loading our database module
    database = require('./database');

[ ... ]

tracker.track().on('tweet', function(tweet){
  console.log('New tweet from :"' + tweet.user.screen_name + '" -> ' + tweet.text);
  socket.broadcast( JSON.stringify( [ ... ] ) );
  // Tweet insertion, timestamp is added to simplify time based query
  database.insertTweet( { tweet : tweet, date : new Date().getTime() } );
});
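Storing the timestamp as a number of milliseconds makes time-based queries a matter of comparing numbers with the $gte and $lt operators seen earlier. Here is a minimal sketch that builds such a query object. The helper name tweetsBetween is chosen for this example, and the one-day window is picked only because it contains the sample tweet shown in the Querying section.

```javascript
// Build a query matching documents whose date falls in [start, end).
function tweetsBetween(start, end) {
  return { date: { $gte: start.getTime(), $lt: end.getTime() } };
}

const query = tweetsBetween(
  new Date(Date.UTC(2011, 4, 26)),  // months are zero-based: 4 = May
  new Date(Date.UTC(2011, 4, 27))
);

console.log(JSON.stringify(query));
// {"date":{"$gte":1306368000000,"$lt":1306454400000}}
```

The resulting object can then be passed to find() or count() like any other query parameter.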
Conclusion
MongoDB
In this introduction we only scratched the surface of the features offered by MongoDB, but I hope that it will make you curious enough to learn more about this database.
MongoDB, and more generally the NoSQL movement, makes us look at databases differently. In Mongo, replacing SQL with JavaScript can be unsettling at first sight, but it quickly proves easy and flexible.
Thanks to its JavaScript shell you can start practicing MongoDB very quickly. In fact you can even start without installing it, as try.mongodb.org offers an interactive shell tutorial.
As the large and growing number of drivers available (Java, Scala, C#, Erlang, C, C++, etc.) shows, MongoDB has been evolving quickly since it was released by 10gen back in 2009.
Using node-mongodb-native is pretty simple, but it can be harder to work with on more complex projects because of the accumulation of nested callbacks. To solve this issue you can take a look at the Mongoose ORM.
Architecture
While use cases for this type of architecture are still a minority on the market, it’s very interesting to see that the same language can be used for every component. JavaScript is now a widely used language and should no longer be confined to web browsers.
The mobile internet, which has grown dramatically in recent years, is also concerned by this kind of architecture, with smartphone and tablet applications that must handle a large number of continuously connected users.
NodeJS and MongoDB are both built with scalability as the main goal and are today one of the best pairings for building real-time web applications.
Feedback from companies shows that they are useful for reducing the complexity of existing architectures and that they will be used even more in the future.
Resources
- MongoDB Official : http://www.mongodb.org/
- Documentation : http://www.mongodb.org/display/DOCS…
- Companies using MongoDB : http://www.mongodb.org/display/DOCS…
- 10 gen reference cards : http://www.10gen.com/reference
- MongoDB & NodeJS : http://www.mongodb.org/display/DOCS…
- node-mongodb-native : https://github.com/joyent/node