SOLR Pro Stemming Filter

The Solr Pro plugin does not include a stemming filter in its default search configuration, and there is no webtop config for it at this time. Stemming can only be added by manually editing the Solr schema.xml.

We’ve encountered this on a couple of deployments where the search engine cannot find ‘plural’ words based on the ‘singular’ word, or match other word transformations (for example, ‘running’ when searching for ‘run’).

There are several stemming algorithms available for Solr, including Snowball, Porter and KStem. Snowball supports languages other than English but is slower than the other two. If you are only indexing plain English, go with Porter.
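
For reference, here is a rough sketch of what each of the three looks like as a filter line. The class names are the standard Solr factories, but whether all of them are available depends on the Solr version bundled with your install, so treat that as an assumption to verify. Use only one stemmer per analyzer.

```xml
<!-- Option 1: Porter (English only) -->
<filter class="solr.PorterStemFilterFactory"/>

<!-- Option 2: KStem (English only, less aggressive than Porter) -->
<filter class="solr.KStemFilterFactory"/>

<!-- Option 3: Snowball; the language attribute selects the stemmer -->
<filter class="solr.SnowballPorterFilterFactory" language="English"/>
```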

You can add a Stemming filter as follows:

  1. Open your Solr schema.xml; for example, /projects/[your project name]/solr/conf/schema.xml
  2. Locate the text fieldType
  3. Add your specific stemming filter (example below)
  4. Reload the Solr service, then reindex your content so that existing documents are stemmed as well.

For example, with Porter you would add <filter class="solr.PorterStemFilterFactory"/> to your schema:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter
      class="solr.WordDelimiterFilterFactory"
      generateWordParts="1"
      generateNumberParts="1"
      catenateWords="1"
      preserveOriginal="1"
      splitOnCaseChange="1"
      protected="protwords.txt"
    />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- added: the stemmer runs last, after lowercasing -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Enjoy!

Thanks Ken. We tried to keep the plugin as simple as possible. We also tend to modify schema.xml for different purposes and for different clients (usually with Solr plugins).

Keep in mind that there are times when you don’t want to stem (for example, when searching for exact strings such as a product number), although you likely wouldn’t use the text fieldType for that scenario anyway.
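
As a rough sketch of that scenario: an exact-match value such as a product number could live in a non-analyzed string field, so it never passes through the stemmer. The field name productNumber below is hypothetical, and whether your schema.xml already defines a string fieldType is an assumption to check.

```xml
<!-- StrField is not analyzed: no tokenizing, lowercasing, or stemming -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- hypothetical field holding exact-match values -->
<field name="productNumber" type="string" indexed="true" stored="true"/>
```

If only a few terms in a text field need protecting, another option (assuming a Solr version that ships it) is to place <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> ahead of the stemming filter, which marks the listed words so the stemmer leaves them alone.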

We’ve used stemming in the past for keyword searches, but we were on the fence about whether or not to include it as part of the default setup for the plugin. Do you feel it should be? Does anyone else?

I’d vote for it being the default for text fields.

Stemming was on by default in the old days of Verity, not that this should be seen as a benchmark :wink: