SOLRPro failing with long URL

Hi,

After wasting a good part of a day trying to work out why various searches wouldn’t work with SOLRPro, I’ve discovered that it was because the URL calling the SOLR service is too long. Various bits of the fq parameter were being dropped.

After a bit of research I found that in Jetty it is recommended to increase “headerBufferSize” (in jetty.xml) as it is only set to 8K by default. I noticed that the value in the jetty.xml from the SOLRPro plugin was already much larger. Regardless, I tried doubling it and then doubling it again (with service restarts in between) but it is still stripping the URL. I’m seeing the shortened URL in the request log file.

Is there somewhere else I should be increasing the buffer size?

Regards
Mark

Hmmm, I wonder if SolrJ (the Java library it uses to talk to Solr) is truncating it.

Where are you seeing the long URLs? In a log file somewhere? Do you have an example of a long one you can share, showing where it’s being truncated? As you noted, the jetty.xml we shipped handles GET requests up to 256k, which is pretty damn long. Something tells me it’s not Jetty but something before Jetty, so either the plugin itself or SolrJ before the request gets to Jetty.

OK found something…

So, SolrJ supports either GET or POST requests. You can try to override packages/types/solrProContentType.cfc in your project and override the “search” function. The default “method” argument for this is “GET”. Either change the default to “POST” or perhaps check URL length and conditionally do a post, or if you’re using your own custom search and you’re manually calling .search() then you can set the method argument to “POST” instead of “GET”. That should get you around any long URL issues. The only thing you would lose is the fact that the URLs sent to Solr get logged in Jetty’s access logs.

Yeah, I had run into this same problem for a client who was indexing millions of records. So for v1.2.5 of the plugin I added the POST option, but for backwards compatibility I kept GET as the default and increased the GET request limit to 256k (I also increased the maxFormContentSize from 1000000 to 1000000000). I then set my client to use POST, which solved their problems. In all honesty I probably should have changed the default to POST instead of GET in the plugin.
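
For reference, those two settings look roughly like this in a Jetty 6 style jetty.xml. This is a sketch, not the plugin’s actual file: the connector class and port are assumptions, and only the headerBufferSize and maxFormContentSize values come from the numbers mentioned above.

```xml
<!-- Sketch only: connector class, port and layout are assumptions,
     not copied from the plugin's jetty.xml -->
<Call name="addConnector">
  <Arg>
    <New class="org.mortbay.jetty.bio.SocketConnector">
      <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
      <!-- Buffer for the request line and headers; a GET URL has to fit in here (256k) -->
      <Set name="headerBufferSize">262144</Set>
    </New>
  </Arg>
</Call>

<!-- Upper limit, in bytes, on form-encoded POST bodies -->
<Call class="java.lang.System" name="setProperty">
  <Arg>org.mortbay.jetty.Request.maxFormContentSize</Arg>
  <Arg>1000000000</Arg>
</Call>
```

Jetty needs a restart after either value is changed.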

If you want to do a quick test without having to override the .search() method, try changing the default to POST in the plugin (and do an updateapp) and see if it solves your problem (it most likely will). Maybe I’ll just change the default to POST in the next plugin update (from everything I had read online back then, pretty much everyone had switched to POST for their environments).

Note: The reason I had left it at GET instead of POST was because I was worried that if anyone was using separate servers for Solr and CF and set some kind of server xss/form posting protection on it, that Solr would cease to function once they updated the plugin. I doubt people would have that locked down though as they would most likely have the solr server locked down to a private connection anyway.

Hi Sean,

I did spot the GET/POST option in .search() and changed the call to pass the METHOD type as POST. SOLRPro did correctly change to POST as I noticed the SOLR log file change from GET (with all the details of the request) to just POST. However this didn’t solve my problem, still wasn’t getting records returned (that do get returned if I manually do the query via the SOLR Admin).

I had wondered if it had nothing to do with SOLR settings but more with the Java files being used (I noticed them via JavaLoader). But my experience playing around with Java is extremely limited and I wasn’t sure where to even start tinkering. I had a look at some of the Tomcat and Jetty settings in CF but didn’t know if they would have any impact on the plugin requests.

I got pretty frustrated with this on Friday night and gave up. I’ll revisit it today and have another crack at it. I first need to do a little bit of work to get the test and production SOLR environments completely separated before I do any more playing.

I’ll have a dig around and find previous logs (or generate new ones) to give you an example of the URL. One I still have is this (which doesn’t work - notice it finishes part way through a field name):

http://xxxx.xxxxx.xxxx.xxxxx:8983/solr/test_wi_yyyyyyyyyy/select?q=(chc30312)+AND+(typename%3AwiCourseInfo)+AND+(fcsp_sitename%3Atest_yyyyyyyyyy)+AND+fcsp_benablesearch%3Atrue&start=0&rows=500&qf=specialisation_string_stored+coursedescription_text_stored+bitsmart_int_stored+bitfeehelp_int_notstored+bitmainstream_int_notstored+bitapprenticeship_int_stored+qualification_string_stored+metakeywords_string_notstored+bitenrolmentsallowed_int_notstored+fcsp_rulecontent_phonetic+bitcomm_string_stored+coursename_string_stored+bitisgovsubsidcourse_int_stored+bittwcmobile_int_notstored+bittafeplus_int_stored+bittwconline_int_notstored+bitcomm_int_stored+bitapprenticeship_string_stored+careerpathways_text_notstored+bittwcworkshop_int_notstored+nationalcode_string_stored+careeropportunities_text_notstored+objectid+bitschooltrainee_int_stored+bittwc_string_stored+bittvet_int_stored+bittwcblock_int_notstored+bittraineeship_string_stored+purposeskillsets_text_notstored+bittwcposted_int_not

I think there are approx. 8 or 9 more fields that are missing.

As far as the log file goes, I have the following in the jetty.xml file so that I could see what was going on:

<Ref id="RequestLog">
  <Set name="requestLog">
    <New id="RequestLogImpl" class="org.mortbay.jetty.NCSARequestLog">
      <Set name="filename"><SystemProperty name="jetty.logs" default="./logs"/>/yyyy_mm_dd.request.log</Set>
      <Set name="filenameDateFormat">yyyy_MM_dd</Set>
      <Set name="retainDays">90</Set>
      <Set name="append">true</Set>
      <Set name="extended">false</Set>
      <Set name="logCookies">false</Set>
      <Set name="LogTimeZone">GMT</Set>
    </New>
  </Set>
</Ref>

<!-- =============================================================== -->
<!-- Configure stderr and stdout to a Jetty rollover log file        -->
<!-- =============================================================== -->

<New id="StdErr" class="java.io.PrintStream">
  <Arg>
    <New class="org.mortbay.util.RolloverFileOutputStream">
      <Arg>./logs/stderr-yyyy_mm_dd.log</Arg>
      <Arg type="boolean">true</Arg>
      <Arg type="int">14</Arg>
      <Get id="ServerLogName" name="datedFilename"/>
    </New>
  </Arg>
</New>
<New id="StdOut" class="java.io.PrintStream">
  <Arg>
    <New class="org.mortbay.util.RolloverFileOutputStream">
      <Arg>./logs/stdout-yyyy_mm_dd.log</Arg>
      <Arg type="boolean">true</Arg>
      <Arg type="int">14</Arg>
      <Get id="ServerLogName" name="datedFilename"/>
    </New>
  </Arg>
</New>   

Regards
Mark

Hey Jeff,

I had tried the change to POST (see reply to Sean) but I’ll give it another go today.

While POST should in theory be better, one advantage I noticed with GET is that you can see what is being passed to the SOLR service (in the log file). But obviously if it breaks with large URLs, working correctly is always better than advantages with logging :smile:

—would most likely have the solr server locked down to a private connection anyway

Good point - however we don’t have it locked down. Do you know if there is any way to tell the SOLR service to only answer requests from specific addresses? Might be time for me to dig out the SOLR manual again.

Side note: every time I see your picture in the discourse avatar, it cracks me up. Nothing wrong with the picture - just that every time I see your face I think of the video you guys did for SOLRPro (especially Yuri).

Regards
Mark

It sounds like your problem might be something else then. For my exact issue when I needed to use POST, it was just for doing massive id list comparisons, so it’s not exactly the same scenario you’re having.

I don’t know of a way to tell Solr to only accept requests from specific locations - that’s a security thing that (IMO) should be handled at the hardware level (either locking it down on the server or on the physical network).
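
One partial measure at the Jetty level, if Solr and ColdFusion run on the same machine or share a private interface, is to bind the connector to that address so Jetty does not listen on public interfaces at all. This is a sketch under assumptions: the connector class and port are guesses at a Jetty 6 style jetty.xml, not the plugin’s shipped file, and it only limits where Jetty listens rather than filtering by client address.

```xml
<Call name="addConnector">
  <Arg>
    <New class="org.mortbay.jetty.bio.SocketConnector">
      <!-- Listen on loopback only; use a private NIC address if CF is on another server -->
      <Set name="host">127.0.0.1</Set>
      <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
    </New>
  </Arg>
</Call>
```

A firewall rule on the server or network is still the more complete answer.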


Regarding Yuri, I forgot all about that - been a few years already. Thanks for the laugh. I had filmed all of that in one take (minus the quick interviews I gave myself). There are a bunch of little hidden funny text things in that video as well.

I think I’ve just come across something else that isn’t helping my testing. One of the really important fields, called “National Code”, contains values like CHC30122, CHC30312 etc. I’ve just discovered that this is case sensitive. The value in the field is stored in uppercase, and if you search for “CHC30312” it finds it but if you search for “chc30312” it returns zero results.

Is that normal? I thought this would be case insensitive.

When creating/editing your configurations for your types in the SolrPro admin, using the type string will be case-sensitive. You might want to instead use phonetic or text for your type.

For a more detailed explanation, see the Information and Tips section at the bottom of the type editing screen.

Thanks Jeff - was just coming back to edit my post :slight_smile: The Information and Tips section isn’t displaying in IE10 or FireFox 33.1. Just had to use Firebug to make the Information and Tips section visible. Once I did that I saw the section about using Text rather than String. Will change these now and see how I go.

I’ve asked this in another thread that Sean responded to, but wondering if you had seen any issues using more than one Boolean field? Search works fine with just one Boolean field but as soon as more than one, searching returns zero results (even using the test search form that comes with the plugin).

On a side note, pagination doesn’t appear to be working on FC7 when looking at the search logs.

Regards
Mark

I will have to look into the javascript bug and pagination issues some time when I get a chance.

Regarding the boolean issue you’re having: I personally haven’t used more than one boolean per type (at least I don’t recall any). Is this for a general stewed/keyword search or a custom search you built? One thing you can try is maybe using integer for it and see if that helps for now.

Hi Jeff,

Just a general search - in fact I don’t search on the Boolean fields but instead use the stored value when displaying results.

I have been using integer (found some older projects where I used string for unknown reason…) but was just asking in case you had seen it before and knew what was wrong. Will keep using integer for now.

Thank you for all your (and Sean’s) help !!! I still have more testing to do but it appears that a combination of needing POST rather than GET and then using String rather than Text was causing my issues.

If your only reason for using those fields is for storing the data in solr, maybe instead consider just pulling the data from FarCry. Keep in mind that FarCry’s object broker does a pretty good job of caching (depending on your needs, traffic, and record storage you likely will never notice the difference). We have many cases where we just pull the data from FarCry for much of the teaser information that is displayed.

Thanks Jeff. I had thought about that before but I have a feeling at some point soon I’m going to need to filter based on the Boolean fields (project I’m working on has major feature creep).

Sorry to keep bothering you Jeff.

So I thought this was fixed but now just noticed another weird thing. Case sensitivity seems to be fixed by switching type to text but noticing that part word searches aren’t returning correct results.

e.g. do a search for “chc30312” and it returns six results:

CHC30312
CHCSS000013
CHCSS000016
CHCSS000027
161-10966V03
CHCSS00009

Change the search word from chc30312 to chc30 or chc303 or even chc3031 and everything BUT chc30312 is returned. The only common thing about almost all the other search results is the text “CHC” in the main ID field (national code). I’ve checked all fields related to each record and they contain no other reference to CHC. The 161-10966V03 record mentions a CHC number in one of its description fields.

So besides not understanding why anything other than “chc30312” will find chc30312, it also appears that SOLR is using just the CHC text to search for matches. Any pointers?

You can modify the schema.xml and use a field type with a filter. So, index that data with a lowercase filter and then send your query in as lower case. That way the case will always match. The Solr docs are your best bet here.
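
A minimal sketch of that idea, assuming hypothetical names (`string_lowercase` for the field type, `nationalcode_lc` for the field) that are not part of the plugin’s shipped schema. KeywordTokenizerFactory keeps the whole value as a single token, so the field still matches like a string, just without the case differences:

```xml
<!-- Hypothetical field type: whole value kept as one token, then lowercased -->
<fieldType name="string_lowercase" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field using it; the name is illustrative only -->
<field name="nationalcode_lc" type="string_lowercase" indexed="true" stored="true"/>
```

Because the analyzer has no type attribute it applies at both index and query time, so ordinary queries get lowercased too. Wildcard or prefix queries may bypass analysis on some Solr versions, so lowercasing the search term before sending it is still the safe option. Existing records need a reindex after a schema change.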

I have your test case for the boolean issue, I just haven’t had a chance to investigate it yet. I will try to do it soon, thanks for your patience.

As far as your other issues go, you might want to build a custom search that specifies the exact fields for the qf parameter. The built-in one uses all the fields you specify during setup, and might change them depending on whether you use an exact phrase search or an “AND” or “OR” search, for example. That works well for most apps, but you might need to write a custom one that pulls in only those fields you really need to search against.

We tried to write a plugin that worked for 80 percent of use cases, and tried to provide places for people to override functionality for that extra 20 percent that it doesn’t cover.

Thanks Sean. I did start looking at the schema.xml file last night but decided it looked like it might become a time sinkhole, so I’ll probably leave it as is for the moment as the project launch is looming (<week). Once the project is launched I’ll include this in remediation work.

“I will try to do it soon, thanks for your patience.” - dude! open source product so I’m already amazed at the support you and Jeff do. I only asked Jeff as I was already talking with him and thought he might have come across it before.

I did start playing around with a more custom search last night but again it looked like I was going to spend ages on it and I’m running out of time so I stopped. I thought about it overnight and decided that I will probably take a look at this particular issue again today/tomorrow though as I think it’s important this part is working correctly before launch.

Thanks again to Jeff and yourself for the help!