Query Expansion – Autosuggest, Keyphrases extraction, Ontologies – Solution




query-wizard


On Github wolny / query-wizard

Query Expansion

Autosuggest, Keyphrases extraction, Ontologies

Background: working on the idea of intelligent term suggestions and query exclusions for several FF; what if, instead of asking the user a lot of questions, we could provide the answers right away, given just a single brand or product term?

Query Autosuggest

We want to suggest interesting things as the user types: popular or very recent topics. We also want the suggestions to be ranked, since there is no point in showing a suggestion that occurs in only a handful of documents.
  • Based on historical queries from Analytics
  • Suggest popular or recent topics
  • Beyond keyword suggestion (phrases, query operators)
1. We want our suggestions to be based on past queries created by users; we can use the collective intelligence of our users to execute the most likely query as the user types. The value to the user is fewer keystrokes and intelligent suggestions of popular or recent topics.
2. For me, recency is more important than frequency/popularity of the topic, so more recent suggestions receive a better score.
3. Last but not least, we want to suggest not only single terms but also phrases.

phrases / operators

various flavors of the autosuggest feature

Solution

  • Export queries from the past N days (e.g. 365 days)
  • Parse only terms and phrases
  • Measure query popularity (terms frequency)
  • Create a separate Solr Index from the extracted queries
  • Boost / penalize results based on recency
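The export-parse-count steps above can be sketched as a small ETL job. This is a hypothetical, simplified illustration (the regex-based parser and `build_docs` helper are not the real query-wizard code, and a real Lucene query parser would also keep operator queries such as country:it, as the sample output below shows):

```python
import re
from collections import defaultdict

# Hypothetical simplified parsing; real Lucene query parsing is more involved.
FIELD_OP = re.compile(r'\b\w+:\S+')   # field operators, e.g. country:eg, site:twitter.com
PHRASE = re.compile(r'"([^"]+)"')     # quoted phrases
BOOLEAN = {"AND", "OR", "NOT"}

def extract_terms(query):
    """Keep bare terms and quoted phrases; drop field operators and booleans."""
    phrases = PHRASE.findall(query)
    rest = PHRASE.sub(" ", query)
    rest = FIELD_OP.sub(" ", rest)
    words = [w for w in re.findall(r"[\w.']+", rest) if w not in BOOLEAN]
    return [p.lower() for p in phrases] + [w.lower() for w in words]

def build_docs(rows):
    """rows: (timestamp_ms, query) pairs -> one JSON-style doc per term."""
    count = defaultdict(int)
    last_seen = {}
    for ts, query in rows:
        for term in extract_terms(query):
            count[term] += 1
            last_seen[term] = max(ts, last_seen.get(term, 0))
    return [{"query": t, "creationDate": last_seen[t], "count": c}
            for t, c in sorted(count.items(), key=lambda kv: -kv[1])]
```

Each emitted dict then becomes one document in the separate Solr index.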

Sample Query

                    
2016-06-02T02:14:05Z,"country:eg AND (site:twitter.com OR site:facebook.com)
\\r\\nAND (\\r\\n\"Al Ahly\" OR \"Zamalek\" OR \"Champions League\" OR
\"EURO\" OR \"Football\" OR \"Mido\" OR \"Hazim Imam\" OR \"Imad Mit3eb\"
OR \"CAF\" OR \"LA LIGA\" OR \"Barcelona\" OR \"Real Madrid\" )"
                     
                

Sample output

                    
{"query":"insurance","creationDate":1463062131000,"count":1471},
{"query":"shampoo","creationDate":1462911742000,"count":1460},
{"query":"country:it","creationDate":1463145673000,"count":1456},
{"query":"watch","creationDate":1463132426000,"count":1438},
{"query":"for sale","creationDate":1462558474000,"count":1432},
{"query":"baby","creationDate":1462980572000,"count":1428},
{"query":"powder","creationDate":1462906762000,"count":1423},
{"query":"newspaper","creationDate":1463041488000,"count":1411},
{"query":"beauty","creationDate":1462908186000,"count":1375},
{"query":"country:cn","creationDate":1462901757000,"count":1370},
{"query":"spray","creationDate":1462911175000,"count":1365},
{"query":"amazon.co.uk","creationDate":1461691137000,"count":1347},
                    
                
This is the output after parsing each query and running a word count on the results. Each line in this JSON file is treated as a separate doc that is sent to Solr for indexing. CreationDate is not an accurate name for this field; it should be named last_executed_on, since it records *when* the query was last executed.

Solr Index

                    
...
<field name="query" type="string" indexed="true" stored="true"
                        required="true"/>
<field name="query_ngram" type="tquery_ngram" indexed="true"
                        stored="false"/>
...
<copyField source="query" dest="query_ngram"/>
...
<fieldType name="tquery_ngram" class="solr.TextField">
 <analyzer type="index">
  ...
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
                        maxGramSize="10"/>
 </analyzer>
 ...
</fieldType>
                    
                
The important thing to note here is the *query_ngram* field, for which we defined a separate type with its own analyzer. During indexing, n-grams of size 2 to 10 are created from the query terms and added to the index.

n-grams

The analysis of the phrase: "quick brown fox"

All those n-grams are indexed by Solr, so when the user types, for example, "bro", Solr will match this particular phrase, because the n-gram "bro" was associated with it at index time.
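A minimal sketch of what the EdgeNGramFilter above does (sizes 2–10, matching the minGramSize/maxGramSize in the schema); the function names here are illustrative, not Solr's:

```python
def edge_ngrams(term, min_size=2, max_size=10):
    """Mimic Solr's EdgeNGramFilter: emit prefixes of a term, sizes 2..10."""
    return [term[:n] for n in range(min_size, min(max_size, len(term)) + 1)]

def analyze(phrase):
    """Lowercase, split on whitespace, and expand every token into edge n-grams."""
    grams = []
    for term in phrase.lower().split():
        grams.extend(edge_ngrams(term))
    return grams
```

For "quick brown fox" this yields qu, qui, quic, quick, br, bro, brown, ..., which is why the prefix "bro" matches the phrase at query time.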

Query Boosting

boost(query) = count(query) * recency_boost(query)

The suggestions coming from Solr are scored with tf-idf by default, but we want to amend this scoring to boost by popularity and recency: give a positive boost to recent phrases and a negative boost to older terms. Let's assume we want a boost of 10 for a very recent document, and a boost of less than 1 for documents older than, say, 60 days.

Boost recent / penalize old
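One way to realize the boost formula above is an exponential decay calibrated so a brand-new query gets the maximum boost and the boost drops below 1 past the cutoff (the 10x maximum and 60-day cutoff are the assumptions stated above; in Solr itself a similar effect is usually achieved with a function query such as recip() over document age):

```python
import math

def recency_boost(age_days, max_boost=10.0, cutoff_days=60.0):
    """Exponential decay with boost(0) = max_boost and boost(cutoff_days) = 1.

    Older than the cutoff -> boost < 1, i.e. a penalty."""
    k = math.log(max_boost) / cutoff_days
    return max_boost * math.exp(-k * age_days)

def score(count, age_days):
    """boost(query) = count(query) * recency_boost(query)."""
    return count * recency_boost(age_days)
```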

DEMO!

Context terms / Keyphrases

What happens when the user clicks enter?

Problem

Extract interesting terms and phrases given just a single (brand/product) name

A single term is usually the only information we have at the beginning of the Brand Wizard. The question is: how can we use this single term, together with our massive Solr index of mentions, to extract as much as possible from it?

Solution

1. Get results from the single-term query
2. Build the text corpus
3. Extract (tf-idf scored) terms using MoreLikeThis

This is one approach: when the user enters a brand/product name (or chooses one of our suggestions), we treat it as a very simple query, execute it against our index, and build a text corpus from the query results. We then use this corpus to extract interesting terms by leveraging some of Solr's core features, namely tf-idf scoring and MoreLikeThis.

Grab some data first...

Consume tweets from Gnip Decahose (Kafka topic)

Single terms are not enough 😢

There is a problem with extracting interesting phrases based on the tf-idf score in Solr: by default, the field analysis extracts single terms only, so the tf-idf score is assigned to single terms rather than phrases. Is it possible to trick Solr into treating phrases as terms?

How do we index?

  • Create "pseudo-phrases" with Shingles
  • Store termVectors to speed up MoreLikeThis
                    
<fields>
  ...
  <field name="exact" type="shingle" indexed="true" stored="true"
                        required="true" termVectors="true"/>
  ...
</fields>
<types>
  ...
  <fieldtype name="shingle" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
      ...
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3"/>
    </analyzer>
  </fieldtype>
</types>
                    
                
It is possible to tell Solr to index not only terms but also phrases. For this we use so-called *shingles*. A shingle is simply a word-based n-gram, as opposed to the character-based n-grams we've seen before. Again, the important things to note here are the new type that uses the shingle filter, and the fact that we store *termVectors*. Storing term vectors can be quite heavy for large indexes with big documents, but in the case of text search it's not that bad.

Sooo... What does the Shingle look like?

The analysis of the mention: The quick brown fox jumps over the lazy dog

This is what the analysis of a simple body of text looks like when the ShingleFilter is applied. We use these shingles to create "pseudo-phrases" during indexing. And since each shingle ends up as a single token in the index, it is subject to the normal tf-idf scoring used in Lucene, so we get all of Lucene's scoring goodies for free.
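A rough sketch of the shingling idea (maxShingleSize=3, as in the schema above). Note this is an illustration only: Solr's ShingleFilterFactory runs after the tokenizer, emits tokens in position order, and has more options than this toy version:

```python
def shingles(text, max_size=3):
    """Word n-grams ("shingles") of sizes 1..max_size over a whitespace-split text."""
    tokens = text.lower().split()
    out = []
    for n in range(1, max_size + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out
```

For "The quick brown fox jumps ..." this produces single words plus two- and three-word pseudo-phrases such as "quick brown" and "quick brown fox", each of which is indexed as one token.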

How do we extract interesting keyphrases?

Configure MoreLikeThis Handler in your solrconfig.xml

                            
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler"/>
                            
                        

Request

http://localhost:8983/solr/query_context_terms/mlt?
	mlt.fl=exact&
	mlt.interestingTerms=details&
	mlt.mintf=2&
	mlt.minwl=3&
	mlt.mindf=1&
	mlt.boost=true&
	rows=0&
	stream.body=Brandwatch is a social media monitoring company
    headquartered in Brighton, England. Brandwatch is a software
    as a service, which archives social media data in order to provide
    companies with information and the means to track specific segments
    to analyse their brands' online presence
                

Response

                    
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">21</int>
  </lst>
<result name="response" numFound="59094" start="0"/>
  <lst name="interestingTerms">
    <float name="exact:media">1.1228108</float>
    <float name="exact:social">1.1294001</float>
    <float name="exact:social media">1.2617687</float>
    <float name="exact:brandwatch">2.1803792</float>
    <float name="exact:brandwatch is">2.2959332</float>
  </lst>
</response>
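The interestingTerms list in the XML response above can be pulled out with a few lines of client code; this is a hedged sketch (the function name is illustrative, and any HTTP client can be used to issue the /mlt request itself):

```python
import xml.etree.ElementTree as ET

def interesting_terms(response_xml):
    """Parse an MLT XML response into a {field:term -> score} dict."""
    root = ET.fromstring(response_xml)
    terms = {}
    for lst in root.findall("lst"):
        if lst.get("name") == "interestingTerms":
            for f in lst.findall("float"):
                terms[f.get("name")] = float(f.text)
    return terms
```

Sorting the resulting dict by score gives the ranked keyphrases, e.g. "brandwatch is" above "social media" in the sample response.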
                    
                
DEMO!

Ontologies

Ontology is a formal term used in the Semantic Web, but in this context it's quite simple:
  1. define the attributes of a single term
  2. characterize the possible relationships between terms

Goals

  • Give me terms that are conceptually close to a given term -> word2vec
  • Show me what was the context in which the term was used by the users -> query mining
word2vec computes distributed vector representations of words; the main advantage is that similar words are close in the vector space.
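"Close in the vector space" is usually measured with cosine similarity. A minimal sketch of the lookup, using made-up toy vectors (real word2vec vectors have 100+ dimensions and come from training, e.g. with gensim):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-d vectors for illustration only; not real trained embeddings.
vectors = {
    "shampoo": [0.9, 0.1, 0.2],
    "conditioner": [0.8, 0.2, 0.3],
    "insurance": [0.1, 0.9, 0.1],
}

def most_similar(word, vectors):
    """Return the other word whose vector has the highest cosine similarity."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))
```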

Term attributes

                
> db.terms.find().pretty()
{ "_id" : "mimmy", "attributes" : [ "exact" ] }
{ "_id" : "DBM", "attributes" : [ "raw" ] }
{ "_id" : "djrlcontact", "attributes" : [ "author" ] }
{ "_id" : "whotelsnyc", "attributes" : [ "exact" ] }
{
	"_id" : "kyliecosmetics",
	"attributes" : [
		"hashtags",
		"links",
		"at_mentions"
	]
}
{ "_id" : "paty_melo_", "attributes" : [ "author", "at_mentions" ] }
{ "_id" : "dremmanash", "attributes" : [ "exact", "author" ] }
{ "_id" : "vanessa_rodrgs", "attributes" : [ "author", "at_mentions" ] }
                
            
These are sample term attributes extracted from the queries exported from our system; again, the queries were parsed and the operators extracted.

word2vec and the CPU suffering

DEMO!

Summary

  • Use the wisdom of the crowd and play with our data
  • If someone did something clever in our system, suggest it to others
  • Dev tips
    • Do simple ETL in Scala/Java instead of Spark
    • Make sure your libraries were compiled by the same Scala version (Kafka, Spark)
    • Improve query parsing
    • Extracted keyphrases very similar to each other
1. Lots of user-generated data: queries, rules, and categories. We can extract a lot of knowledge from it and use the wisdom of our users to help new users.
2. Nowadays it's all about user behaviour and personalized recommendations. If you look at the query expansion problem again, it is also a recommendation problem. These days it's not only about being able to understand the content itself; it's about having a deeper understanding of what our users are doing with that content. It's about learning from how users behave, and if many users did something clever, our responsibility is to surface that cleverness to other users, especially those who are new to the system. There is a term for this, *reflected intelligence*: learn from user behaviour, then reflect it back to new users.

?

GitHub

autosuggest + keyphrases

word2vec + query mining
