ElasticSearch, interview with Shay Banon
Shay Banon did us the honor of answering a few questions before his talk at the What’s Next Paris conference, on May 26th – 27th at Le Grand Rex in Paris, France. Shay is the founder of ElasticSearch, an open-source, distributed, RESTful search engine. Previously, Shay was Director of Technology at GigaSpaces Technologies, a leading provider of a new generation of application platforms for Java and .NET environments. He is passionate about data search and indexing in the cloud and blogs at ElasticSearch.org. The interview was conducted by Nicolas Martignole, the brains behind the well-known and influential Touilleur Express blog.
Nicolas: Thanks, Shay, for taking the time to answer my questions. To begin, a little challenge: how would you describe ElasticSearch and your current job to someone who is not a geek? Let’s say you have to describe your everyday job to your best friend or your neighbor.
Shay: ElasticSearch aims to provide search capabilities to any application or web site easily, with the ability to scale to very large data sets. When building a web app, or any application for that matter, the knowledge of what to search on, what matters more than the rest (your ranking of data), and how to integrate it into your app is usually very specific. ElasticSearch aims to be a general-purpose search solution that can scale from one machine to hundreds.
N: If I understand correctly, we create JSON metadata that we store in our ElasticSearch space. On the query side, what kind of system do you offer? For instance, if I search for a book that was published in 2010, is it possible to search by date?
S: Yes. The basic search you do is a search on text, which is then ranked and returned with the best-matching results. But, using the power of Lucene, which is the internal search engine elasticsearch uses, one can certainly query/filter by range. Beyond that, one can run more interesting queries like fuzzy search and “more like this” queries.
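To make the range query concrete, here is a minimal sketch of a request body for "books published in 2010". The `published` field name is hypothetical, and the body is only built and printed, not sent anywhere. The `range` clause with `gte`/`lte` bounds follows the Elasticsearch query DSL, but the exact syntax may differ between versions.

```python
import json


def books_published_in(year: int) -> dict:
    """Build a search body matching documents whose date falls in `year`."""
    return {
        "query": {
            "range": {
                "published": {  # hypothetical date field on the book documents
                    "gte": f"{year}-01-01",  # inclusive lower bound
                    "lte": f"{year}-12-31",  # inclusive upper bound
                }
            }
        }
    }


body = books_published_in(2010)
print(json.dumps(body, indent=2))
```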
N: I’d like to ask you a question about how you index the information. As a developer, is it my job to create a JSON structure from my data, and then push it to my ElasticSearch? How do we maintain the index? (River of data)
S: In general, yes. ElasticSearch has a very simple API to index data: all you need to do is represent the data you want to index in JSON and send it to elasticsearch. This has become very simple for applications, as JSON is so popular these days and it’s quite simple to convert your domain model to JSON.
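As an illustration of "convert your domain model to JSON and send it", the sketch below serializes a small domain object and prepares the HTTP request that would index it. The `Book` class, the `library` index and the `book` type are made-up names, and the `/{index}/{type}/{id}` path reflects the REST layout of ElasticSearch at the time of this interview; treat it as an assumption for other versions. Nothing is actually sent over the network.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Book:
    """Hypothetical domain object we want to make searchable."""
    title: str
    author: str
    published: str


def index_request(index: str, doc_type: str, doc_id: str, book: Book):
    """Return the (method, path, body) of an HTTP request that would index `book`."""
    path = f"/{index}/{doc_type}/{doc_id}"  # assumed REST layout, see lead-in
    body = json.dumps(asdict(book))         # the domain model becomes plain JSON
    return "PUT", path, body


method, path, body = index_request(
    "library", "book", "1",
    Book(title="Lucene in Action", author="Erik Hatcher", published="2010-07-01"),
)
print(method, path)
print(body)
```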
N: The ElasticSearch web site describes a Time Machine feature. Can you explain what kind of services it offers?
S: When writing distributed systems, one of the biggest questions is how to store the data for long-term persistence. By default, elasticsearch can recreate the whole cluster state and indices from the local state of each node in the cluster (stored on the machine’s local file system).
Another option is to store the data in shared storage. ElasticSearch can persist the cluster state into shared storage, like Amazon S3, and can then recover it when needed. This process is highly optimized: it only persists changes made to the index, and it is done asynchronously, so it does not interfere with indexing operations.
N: Can you tell us a little bit more about the network technology behind it? For instance, what do you use as the network stack: Apache Mina, JBoss Netty or SimpleFramework?
S: Sure, ElasticSearch uses Netty as the underlying network library. It is completely built using async/event IO, meaning that, for example, when doing a distributed search across several nodes, there are no threads waiting/blocking on the network to receive a response.
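The non-blocking fan-out Shay describes can be sketched by analogy. ElasticSearch itself is written in Java on top of Netty; the Python `asyncio` code below is only an illustration of the same idea, and the node names and the `search_node` coroutine are invented for the example.

```python
import asyncio


async def search_node(node: str, query: str) -> list[str]:
    """Hypothetical stand-in for a network call to one node."""
    await asyncio.sleep(0.01)  # simulates network latency without blocking a thread
    return [f"{node}:{query}:hit"]


async def distributed_search(nodes: list[str], query: str) -> list[str]:
    # Fan the request out to every node at once, then merge the partial results.
    partials = await asyncio.gather(*(search_node(n, query) for n in nodes))
    return [hit for part in partials for hit in part]


results = asyncio.run(distributed_search(["node1", "node2", "node3"], "paris"))
print(results)
```

All three simulated requests are in flight at the same time on a single thread, so the total wait is roughly one round trip rather than three.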
N: Does ElasticSearch support long-polling/WebSocket clients?
S: No, requests in elasticsearch are not long-lived by nature. One executes a search request and receives a response, or an index request to index new data.
Where it makes sense, such as scrolling through a large result set, this is handled by a scroll-like API: you get back a “scroll id” that lets you continue scrolling the same request.
Actually, one place where long polling, or “pinging back”, might make sense is the percolator support. The percolator in elasticsearch lets you register queries and, when indexing docs, find out which of those queries matched each doc.
Currently, one gets it as a response to the index request, but imagine that as part of the registration, one could say: notify me when a document matching a specific query is indexed.
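The percolator idea, register queries first and then match each incoming document against them, can be shown with a small in-memory toy. This is a conceptual sketch only: the `ToyPercolator` class and the query names are hypothetical, and plain Python predicates stand in for real search queries. It is not the elasticsearch percolator API.

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]


class ToyPercolator:
    """Stores named queries and reports which ones match an indexed document."""

    def __init__(self) -> None:
        self._queries: Dict[str, Callable[[Doc], bool]] = {}

    def register(self, name: str, predicate: Callable[[Doc], bool]) -> None:
        self._queries[name] = predicate

    def index(self, doc: Doc) -> List[str]:
        # As in the interview, matches come back as part of the index response.
        return [name for name, matches in self._queries.items() if matches(doc)]


percolator = ToyPercolator()
percolator.register("paris_jobs", lambda d: d.get("city") == "Paris")
percolator.register("java_jobs", lambda d: "java" in d.get("title", "").lower())

matched = percolator.index({"title": "Java developer", "city": "Paris"})
unmatched = percolator.index({"title": "Python developer", "city": "Lyon"})
print(matched, unmatched)
```

The notification Shay imagines would amount to calling a subscriber-supplied callback inside `index` instead of only returning the names.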
N: Can you explain what a facet is in ElasticSearch?
S: I love facets. Facets let you get aggregated data on a search request, relating to the documents that matched the query (and only them). They are computed in real time; for example, one can get the most popular categories among the documents matching the query.
ElasticSearch has support for several types of facets; my favorite, by far, is the histogram-based one. Imagine indexing time-related data, and getting back, on top of the search results (the top 10 hits matching the query entered), information on how many hits occur each month. You can even go a step further and get statistical information per month (either on the time field you use to “bucket” the data, or on another field).
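What a per-month histogram facet computes can be reproduced on a handful of dates. The sketch below is a plain Python illustration of the bucketing, with invented sample data; in elasticsearch this computation is done by the engine over the documents that matched the query.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable


def monthly_histogram(dates: Iterable[date]) -> Dict[str, int]:
    """Count how many hits fall into each calendar month (the "bucket")."""
    counts = Counter(d.strftime("%Y-%m") for d in dates)
    return dict(sorted(counts.items()))


# Hypothetical timestamps of the documents that matched a query.
hit_dates = [date(2011, 1, 5), date(2011, 1, 20), date(2011, 2, 3), date(2011, 4, 11)]
histogram = monthly_histogram(hit_dates)
print(histogram)
```

Months with no hits (March here) simply do not appear in the result.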
N: I launched a job board a year ago. Let’s say I want to search for a new job located near Paris. Would it be possible to do a geolocation search?
S: Yes, elasticsearch has some quite nice geo capabilities. Those include the ability to filter the results based on a location and distance, for example, show me only results that are within 50km of the center of Paris. There is also a bounding box filter. Another related capability is sorting the data based on distance from a specific location.
And, of course, there is the geo distance facet, which lets you get aggregated statistical data within each “distance range”. For example, give me the count of docs that matched within 10km from me, 30km from me, and beyond.
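The distance filter can be illustrated with the haversine great-circle formula. This is a sketch of the underlying arithmetic, not of how elasticsearch implements it; the job entries and their coordinates are sample data chosen for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within(jobs, center, max_km):
    """Keep only the titles of jobs located at most `max_km` from `center`."""
    return [j["title"] for j in jobs
            if haversine_km(center[0], center[1], j["lat"], j["lon"]) <= max_km]


PARIS = (48.8566, 2.3522)
jobs = [
    {"title": "Developer in Versailles", "lat": 48.8049, "lon": 2.1204},
    {"title": "Developer in Lyon", "lat": 45.7640, "lon": 4.8357},
]
nearby = within(jobs, PARIS, 50)
print(nearby)
```

Counting the jobs that fall within 10km, within 30km and beyond, instead of filtering them, gives the kind of result the geo distance facet returns.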
N: Last question. As you know, “What’s Next” asks each speaker to give an original presentation that hasn’t been shown before. So, can you tell us a little bit more about what you’d like to present in May?
S: Sure. The things I would like to focus on are a quick intro to elasticsearch, talking a bit about how it’s architected and its distributed nature, and how the work was done to take Lucene and make it distributed. I will also cover some of the NoSQL integration elasticsearch has; for example, I will explain how easily CouchDB can be indexed into elasticsearch using the CouchDB river.
N: Thanks for your time, and I look forward to seeing you soon in Paris.
S: Same here, very excited about the conference, Paris is certainly one of my favorite cities in the world.
Thanks to Nicolas Martignole for this interview! You can find the French version on the Touilleur Express blog.