On GitHub: wolny / query-wizard
2016-06-02T02:14:05Z,"country:eg AND (site:twitter.com OR site:facebook.com) \\r\\nAND (\\r\\n\"Al Ahly\" OR \"Zamalek\" OR \"Champions League\" OR \"EURO\" OR \"Football\" OR \"Mido\" OR \"Hazim Imam\" OR \"Imad Mit3eb\" OR \"CAF\" OR \"LA LIGA\" OR \"Barcelona\" OR \"Real Madrid\" )"
{"query":"insurance","creationDate":1463062131000,"count":1471},
{"query":"shampoo","creationDate":1462911742000,"count":1460},
{"query":"country:it","creationDate":1463145673000,"count":1456},
{"query":"watch","creationDate":1463132426000,"count":1438},
{"query":"for sale","creationDate":1462558474000,"count":1432},
{"query":"baby","creationDate":1462980572000,"count":1428},
{"query":"powder","creationDate":1462906762000,"count":1423},
{"query":"newspaper","creationDate":1463041488000,"count":1411},
{"query":"beauty","creationDate":1462908186000,"count":1375},
{"query":"country:cn","creationDate":1462901757000,"count":1370},
{"query":"spray","creationDate":1462911175000,"count":1365},
{"query":"amazon.co.uk","creationDate":1461691137000,"count":1347},

This is the output after parsing each query and running a word count on the results. Each line in this JSON file is treated as a separate document that is sent to Solr for indexing. creationDate is not the right name for this field; it should be named last_executed_on, because it shows *when* the query was last executed.
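As a minimal sketch of that per-line conversion, the snippet below turns one line of the word-count output into a document with the suggested field name. The function name `to_solr_doc` and the ISO date format are assumptions for illustration; the original indexing code is not shown in the slides.

```python
import json
from datetime import datetime, timezone


def to_solr_doc(line):
    """Turn one line of the word-count output into a dict ready for indexing."""
    # Lines in the sample end with a comma, so strip it before parsing.
    record = json.loads(line.strip().rstrip(","))
    # creationDate is in epoch milliseconds.
    executed = datetime.fromtimestamp(record["creationDate"] / 1000, tz=timezone.utc)
    return {
        "query": record["query"],
        "count": record["count"],
        # Hypothetical rename, following the suggestion in the text.
        "last_executed_on": executed.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


doc = to_solr_doc('{"query":"insurance","creationDate":1463062131000,"count":1471},')
print(doc)
```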
...
<field name="query" type="string" indexed="true" stored="true" required="true"/>
<field name="query_ngram" type="tquery_ngram" indexed="true" stored="false"/>
...
<copyField source="query" dest="query_ngram"/>
...
<fieldType name="tquery_ngram" class="solr.TextField">
  <analyzer type="index">
    ...
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10"/>
  </analyzer>
  ...
</fieldType>

The important thing to note here is the *query_ngram* field, for which we defined a separate type with its own analyzer. During indexing, edge n-grams (prefixes) of 2 to 10 characters are created from the query terms and added to the index.
The analysis of the phrase: "quick brown fox"
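A rough Python approximation of that analysis is sketched below. It assumes whitespace tokenization and lowercasing, since the tokenizer and the other filters are elided in the schema; only `minGramSize=2` and `maxGramSize=10` come from the configuration itself.

```python
def edge_ngrams(token, min_gram=2, max_gram=10):
    """Return the prefixes of `token` with lengths from min_gram to max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(phrase):
    """Assumed analysis chain: lowercase, split on whitespace, edge n-gram each token."""
    grams = []
    for token in phrase.lower().split():
        grams.extend(edge_ngrams(token))
    return grams


grams = analyze("quick brown fox")
print(grams)
# ['qu', 'qui', 'quic', 'quick', 'br', 'bro', 'brow', 'brown', 'fo', 'fox']
```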
All those n-grams will be indexed by Solr, so when the user types, for example, "bro", Solr will match this particular phrase, because the n-gram "bro" was associated with it at index time.

Extract interesting terms and phrases given just a single (brand/product) name
A single term is usually the only information we have at the beginning of the Brand Wizard. The question is: how can we use our massive Solr index of mentions to extract as much as possible from this single term or phrase?

Consume tweets from Gnip Decahose (Kafka topic)
<fields>
  ...
  <field name="exact" type="shingle" indexed="true" stored="true" required="true" termVectors="true"/>
  ...
</fields>
<types>
  ...
  <fieldtype name="shingle" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
      ...
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3"/>
    </analyzer>
  </fieldtype>
</types>

It is possible to tell Solr to index not only terms but also phrases. For this we use so-called *shingles*. A shingle is just a word-based n-gram, as opposed to the character-based n-grams we've seen before. Again, the important things to note here are the new type that uses the shingle filter and the fact that we store *termVectors*. This can be quite heavy for large indexes with big documents, but in the case of *test search* it's not that bad.
The analysis of the mention: The quick brown fox jumps over the lazy dog
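The shingling step can be approximated in a few lines of Python. This sketch assumes lowercased, whitespace-split tokens (the real chain uses `UAX29URLEmailTokenizerFactory` plus elided filters) and relies on the `ShingleFilterFactory` defaults of `minShingleSize=2` and `outputUnigrams=true`, with `maxShingleSize=3` taken from the schema.

```python
def shingles(tokens, min_size=2, max_size=3, output_unigrams=True):
    """Emit unigrams plus word n-grams of min_size..max_size starting at each position."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


# Assumed tokenization: lowercase and split on whitespace.
tokens = "The quick brown fox jumps over the lazy dog".lower().split()
result = shingles(tokens)
print(result[:6])
# ['the', 'the quick', 'the quick brown', 'quick', 'quick brown', 'quick brown fox']
```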
This is what the analysis of a simple body of text looks like when the ShingleFilter is applied. We use those shingles to create "pseudo-phrases" during indexing. Since each shingle ends up as a single token in the index, it is subject to the normal TF-IDF scoring used in Lucene, so we get all of Lucene's scoring goodies for free.

Configure MoreLikeThis Handler in your solrconfig.xml
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler"/>
http://localhost:8983/solr/query_context_terms/mlt?
  mlt.fl=exact&
  mlt.interestingTerms=details&
  mlt.mintf=2&
  mlt.minwl=3&
  mlt.mindf=1&
  mlt.boost=true&
  rows=0&
  stream.body=Brandwatch is a social media monitoring company headquartered in Brighton, England. Brandwatch is a software as a service, which archives social media data in order to provide companies with information and the means to track specific segments to analyse their brands' online presence
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">21</int>
  </lst>
  <result name="response" numFound="59094" start="0"/>
  <lst name="interestingTerms">
    <float name="exact:media">1.1228108</float>
    <float name="exact:social">1.1294001</float>
    <float name="exact:social media">1.2617687</float>
    <float name="exact:brandwatch">2.1803792</float>
    <float name="exact:brandwatch is">2.2959332</float>
  </lst>
</response>
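The request and response above could be handled from client code roughly as follows. This is a sketch under assumptions: the helper names `mlt_url` and `interesting_terms` are made up for illustration, the parameters are copied from the request shown above, and the sample XML is a shortened copy of the response. No HTTP call is made here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def mlt_url(text, base="http://localhost:8983/solr/query_context_terms/mlt"):
    """Build a properly encoded MoreLikeThis request URL for a body of text."""
    params = {
        "mlt.fl": "exact",
        "mlt.interestingTerms": "details",
        "mlt.mintf": 2,
        "mlt.minwl": 3,
        "mlt.mindf": 1,
        "mlt.boost": "true",
        "rows": 0,
        "stream.body": text,
    }
    return base + "?" + urlencode(params)


def interesting_terms(xml_text):
    """Return (term, score) pairs from the interestingTerms list, best first."""
    root = ET.fromstring(xml_text)
    terms = []
    for lst in root.iter("lst"):
        if lst.get("name") == "interestingTerms":
            for item in lst.findall("float"):
                # Names look like "exact:social media"; drop the field prefix.
                _field, _, term = item.get("name").partition(":")
                terms.append((term, float(item.text)))
    return sorted(terms, key=lambda pair: pair[1], reverse=True)


sample = """<response>
  <lst name="interestingTerms">
    <float name="exact:social media">1.2617687</float>
    <float name="exact:brandwatch">2.1803792</float>
    <float name="exact:brandwatch is">2.2959332</float>
  </lst>
</response>"""

url = mlt_url("Brandwatch is")
ranked = interesting_terms(sample)
print(ranked[0])
```

Sorting by score puts the strongest candidate terms and phrases first, which is the order a suggestion UI would most likely want.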
> db.terms.find().pretty()
{ "_id" : "mimmy", "attributes" : [ "exact" ] }
{ "_id" : "DBM", "attributes" : [ "raw" ] }
{ "_id" : "djrlcontact", "attributes" : [ "author" ] }
{ "_id" : "whotelsnyc", "attributes" : [ "exact" ] }
{ "_id" : "kyliecosmetics", "attributes" : [ "hashtags", "links", "at_mentions" ] }
{ "_id" : "paty_melo_", "attributes" : [ "author", "at_mentions" ] }
{ "_id" : "dremmanash", "attributes" : [ "exact", "author" ] }
{ "_id" : "vanessa_rodrgs", "attributes" : [ "author", "at_mentions" ] }

Those are sample term attributes extracted from the queries exported from our system. Again, the queries were parsed and the operators extracted.
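One way such term-to-attribute documents might be produced is sketched below. The `field:value` operator syntax and the regex are assumptions made for illustration; the actual query parser is not shown, and bare terms (which appear to map to attributes such as "exact") are not handled here.

```python
import re
from collections import defaultdict

# Hypothetical operator syntax: field:value or field:"quoted value".
OPERATOR = re.compile(r'(\w+):(?:"([^"]*)"|([^\s()"]+))')


def term_attributes(queries):
    """Collect, for every term, the set of operators it was used with."""
    attributes = defaultdict(set)
    for query in queries:
        for match in OPERATOR.finditer(query):
            field, quoted, bare = match.groups()
            term = quoted if quoted is not None else bare
            attributes[term].add(field)
    # Shape the result like the documents in the terms collection.
    return [
        {"_id": term, "attributes": sorted(fields)}
        for term, fields in sorted(attributes.items())
    ]


docs = term_attributes([
    "author:paty_melo_ OR at_mentions:paty_melo_",
    "(hashtags:kyliecosmetics AND links:kyliecosmetics)",
])
print(docs)
```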