Yes, I will agree with this one fact: the percolator is a wonderful, powerful tool provided by Elasticsearch. Here is a simple scenario that should help you understand and absorb its power.
PROBLEM
There is a need to automatically map content pushed in by a user to a set of categories. For each category, I have a set of keyword rules defined. Traditionally, one would do it the old-fashioned way: once the content has been pushed by the user, pick up all the keywords, find the ones that exist in the content, and allocate the content to the categories those keywords belong to. Fair enough logic. But real-life scenarios are not so easy, and there are plenty of complexities along the way. First of all, imagine there are 10,000+ keyword records to scan through: how fast will the processing be for each piece of content pushed by the user?
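The traditional approach described above can be sketched as follows. This is a minimal Python illustration, not the author's code: `categorize_naively` and the sample `rules` are hypothetical, and it handles plain keywords only, with no AND / OR / NOT logic.

```python
from typing import Dict, List, Set


def categorize_naively(content: str, rules: Dict[int, List[str]]) -> Set[int]:
    """Return the ids of categories that have at least one keyword in content.

    Every keyword of every category is checked against the content, so the
    work grows with the total number of keywords for each incoming document.
    """
    text = content.lower()
    matched = set()
    for category_id, keywords in rules.items():
        # Case-insensitive substring check for each keyword.
        if any(keyword.lower() in text for keyword in keywords):
            matched.add(category_id)
    return matched


# Hypothetical rules: category id -> keywords.
rules = {1: ["Nasscom", "Passport services"], 2: ["Lenovo"]}
print(categorize_naively("A meeting for Nasscom India", rules))
```

With 10,000+ keywords, this loop runs in full for every piece of content pushed in, which is exactly the cost in question.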
Many will argue that processing power and memory are cheaply available these days. Good to know, but it still takes considerable time to process each piece of content. Now let's add another layer of complexity. Let's say it isn't just keywords, but also conditions (AND / OR / NOT) that need to be processed. For example:
"Tata Power" AND ("Vijayant Ranjan" OR "Mahesh Paranjpe" OR "Hydro")
Another real-life example:
"National Association of Software" OR "Nasscom" OR "Ministry of Electronics and IT" OR "Indian Information Technology" OR "H1B1 Visa" OR "E governance" OR "E-governance" OR "Online visa services" OR "Passport services" OR "Sewa Kendras" OR "Commercial and employment perspective" OR "Ethical perspective" OR "Indian Software Firms" OR ("IT services" AND "IT services IT services"~10000) OR ("IT Industry" AND "IT Industry IT Industry"~10000) OR ("IT Company" AND "IT Company IT Company"~10000) OR ("IT companies" AND "IT companies IT companies"~10000) OR ("Indian IT" AND "Indian IT Indian IT"~10000) OR ("IT Sector" AND "IT Sector IT Sector"~10000) OR ("IT Sectors" AND "IT Sectors IT Sectors"~10000) OR ("IT implications" AND "IT implications IT implications"~10000) OR ("IT enterprises" AND "IT enterprises IT enterprises"~10000) OR ("IT firms" AND "IT firms IT firms"~10000) OR ("IT firm" AND "IT firm IT firm"~10000) OR ("IT organizations" AND "IT organizations IT organizations"~10000)
And one more:
"Lenovo" NOT ("Smart phones" OR "Smartphones")
Now if I have to deal with this using regular parsing logic, then first of all, writing that logic correctly is going to be a nasty, time-consuming effort. On top of that, it will definitely hurt performance.
The question is: is there a simpler solution for the above?
SOLUTION
Well, good news for folks who have been looking around for a simpler solution: yes, Elasticsearch provides one. What we are performing here is essentially a reverse search, mapping a document to categories based on the keywords found in it. In a regular search, we have a set of documents and we run keywords / criteria against them to get results. Here, we have a set of stored queries and we run a document against them.
This reverse search mechanism is known as the percolator. A very powerful tool indeed for those who understand and use it. Let me share an example of how to use it for the scenario above.
Step 1. Create an index with a percolator mapping.
PUT /keywords
{
  "mappings": {
    "queries": {
      "properties": {
        "query": {
          "type": "percolator"
        }
      }
    },
    "keyword": {
      "properties": {
        "content": {
          "type": "text"
        }
      }
    }
  }
}
That is the definition for my percolator index, where I will add all the keyword rules that I want to match documents against. If you notice, this mapping is a bit different from a regular mapping. The additional part is the queries type, whose query field has the type percolator. That is what does the trick for us. The keyword type has one property, content, which describes the structure of the document the user is expected to push in to be matched against the stored queries. Note that defining two types in one index like this applies to Elasticsearch 5.x; indices created in 6.0 or later can hold only a single mapping type.
Step 2. Now let's see how to add entries to the percolator.
Method 1.
PUT keywords/queries/4?refresh
{
  "query": {
    "query_string": {
      "query": """ "National Association of Software" OR "Nasscom" OR "Ministry of Electronics and IT" OR "Indian Information Technology" OR "H1B1 Visa" OR "E governance" OR "E-governance" OR "Online visa services" OR "Passport services" OR "Sewa Kendras" OR "Commercial and employment perspective" OR "Ethical perspective" OR "Indian Software Firms" OR ("IT services" AND "IT services IT services"~10000) OR ("IT Industry" AND "IT Industry IT Industry"~10000) OR ("IT Company" AND "IT Company IT Company"~10000) OR ("IT companies" AND "IT companies IT companies"~10000) OR ("Indian IT" AND "Indian IT Indian IT"~10000) OR ("IT Sector" AND "IT Sector IT Sector"~10000) OR ("IT Sectors" AND "IT Sectors IT Sectors"~10000) OR ("IT implications" AND "IT implications IT implications"~10000) OR ("IT enterprises" AND "IT enterprises IT enterprises"~10000) OR ("IT firms" AND "IT firms IT firms"~10000) OR ("IT firm" AND "IT firm IT firm"~10000) OR ("IT organizations" AND "IT organizations IT organizations"~10000)""",
      "fields": ["content"]
    }
  }
}
Method 2.
PUT keywords/queries/4?refresh
{
  "query": {
    "term": {
      "content": "National Association of Software"
    }
  }
}
In the above, notice the pattern we are following.
- Instead of the type keyword, here we are using the type queries.
- Along with it, we are also specifying 4 as the id. This is very important; I will come back to why I recommend setting it explicitly.
- Next, we specify how the incoming content should be queried. There can be many more, but I have come across these two ways of querying the incoming content. The first one is for more complex scenarios; the other is for simpler ones, a straightforward search using a term query. Keep in mind that a term query is not analyzed, so against an analyzed text field a multi-word value like the one above will not match; a match_phrase query is the usual choice for phrases.
- One more important thing: the HTTP method used here is PUT and not POST, so be careful with this. I have not experimented with POST; feel free to go ahead and try it.
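If the rules live in a relational table, the registration requests can be generated rather than typed by hand. The following is a minimal Python sketch, not the author's code: `build_registration` is a hypothetical helper that only builds the request path and JSON body in the shape of Method 1, leaving the actual HTTP call to whatever client you use.

```python
import json
from typing import Tuple


def build_registration(category_id: int, rule: str) -> Tuple[str, str]:
    """Build the path and JSON body for storing one rule as a percolator query.

    Hypothetical helper: the index name `keywords`, the type `queries` and the
    field `content` are taken from the mapping in Step 1.
    """
    # The category id becomes the document id of the stored query.
    path = f"keywords/queries/{category_id}?refresh"
    body = {
        "query": {
            "query_string": {
                "query": rule,
                "fields": ["content"],
            }
        }
    }
    return path, json.dumps(body)


path, body = build_registration(4, '"Lenovo" NOT ("Smart phones" OR "Smartphones")')
print("PUT", path)
print(body)
```

Letting `json.dumps` serialize the rule also takes care of escaping the double quotes inside it.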
Now, suppose we want to find the matching keywords for a given piece of content. How do we go about it?
POST keywords/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document_type": "keyword",
      "document": {
        "content": "In Taj Lake Palace there is going to be the meeting for Nasscom India"
      }
    }
  }
}
If you notice the pattern above, we use the POST method with the _search action. The query we are running is percolate. There we specify the document_type, the same one we defined in the mapping, and in document we supply the same structure, with content placed in it.
What Elasticsearch will do is run the document against all the keyword queries stored in the percolator and return the list of those that match.
Now, how is this going to be helpful to us? If you remember, I had asked you to specify the document id, and that is going to be very handy. Like me, many will still have their master data in an RDBMS like MySQL, with keywords / categories mapped through the primary key. That same key is what we need to set here as the id. So when we get the search result, we get the list of ids of the matching queries. These are none other than the category ids that we want to map the content to. There, now you've got it all.
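Pulling the category ids out of the search result can be sketched like this in Python. It is only an illustration: `category_ids` is a hypothetical helper, and `sample_response` is a trimmed, made-up response that keeps just the fields the helper reads.

```python
from typing import Any, Dict, List


def category_ids(response: Dict[str, Any]) -> List[int]:
    """Return the ids of the stored queries that matched, as integers.

    Assumes the ids were set to numeric primary keys when the queries
    were stored, as recommended above.
    """
    return [int(hit["_id"]) for hit in response["hits"]["hits"]]


# Trimmed, made-up response: only the fields used here are shown.
sample_response = {
    "hits": {
        "total": 1,
        "hits": [
            {"_index": "keywords", "_type": "queries", "_id": "4"}
        ],
    }
}

print(category_ids(sample_response))  # prints [4]
```

The resulting integers can go straight into the table that links content to categories.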
NOTE
One very important thing you need to keep in mind is the size of the result. By default, only 10 hits are returned by a search query. If you are expecting more than 10, you need to specify the size param along with the query.
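As a rough sketch, the search body with an explicit size could be built like this in Python. `build_percolate_search` is a hypothetical helper, and the default of 100 is an arbitrary assumption; pick a value that covers the number of categories you expect to match.

```python
import json
from typing import Any, Dict


def build_percolate_search(content: str, size: int = 100) -> Dict[str, Any]:
    """Build a percolate search body that returns up to `size` matches.

    Hypothetical helper: the field and type names follow the mapping above,
    and the default size of 100 is an arbitrary choice for illustration.
    """
    return {
        "size": size,  # without this, only 10 hits come back
        "query": {
            "percolate": {
                "field": "query",
                "document_type": "keyword",
                "document": {"content": content},
            }
        },
    }


body = build_percolate_search("A meeting for Nasscom India", size=50)
print(json.dumps(body, indent=2))
```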
USE CASES
Now, where can this be useful? Let me share a few instances.
- Let's say users are uploading articles related to legal cases. We want to automatically identify which category of law these articles fit into. Rather than having the user specify it, we can automatically scan through the content and identify which category it fits into, which court the case was fought in, or which sections were mentioned in it.
- Let's say there is an e-commerce system offered as a SaaS application. The system wants to automatically assign tags / search criteria / keywords to each uploaded product. The reason: if everything is tagged properly, data can be retrieved lightning fast rather than by scanning through all the documents.