One of my favourite things about Launchpad’s bug tracker is the dupe finder: when you report a new bug, it’ll search to see if there’s already a similar bug report. It’s the same for questions in Launchpad Answers, too.
Getting to see possible dupes before you file a bug or question is a great time saver for you and the people on the other end. However, the dupe finder has been timing out a lot lately.
Rob Collins, Launchpad’s new Technical Architect, has introduced some changes that should make the dupe finder more reliable.
Other than fewer timeouts, here’s what you might notice:
- the dupe finder now returns fewer matches — three or four rather than ten or more
- the results should be more relevant.
We want to know how this works in practice. Let us know how you get on with the new dupe finder. Either leave a comment here, mail firstname.lastname@example.org or join us on the launchpad-users mailing list.
How Rob did it
The previous dupe finder had a number of problems, not least that the search engine it’s built on is less efficient than we need. We’re planning to replace the search engine but not straight away, so Rob looked for a temporary solution that would work for the next five or six months.
I’ll hand over to Rob to explain what he actually did:
The old search did a pre-pass over every possible hit, which is 400,000 items for Ubuntu bugs and very slow to do. It then did a search matching any document that had a rare search term in it.
By rare, we mean that the term showed up in fewer than half of the possible hits.
For example, if you searched for “firefox crashes on <website> in flash” on /ubuntu/+filebug, it would search for any bug with any of “firefox” (< 50% of bugs are on firefox), “crash” (< 50% of bugs say “crash”), “<website>” (< 50%...), “flash” (< 50%...).
However, many, many bugs mention "firefox" and many, many bugs mention "crash" and many, many mention "flash".
So, the total return from the search could be 10,000 or 100,000 quite easily and — unlike other search engines — the more terms you typed in, to make it more precise, the less precise it became.
That sounds odd but here's why: it started bringing back bugs from anywhere that happened to mention any search term, and the relevance weighting we had only added to the confusion.
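To make that concrete, here is a rough sketch of the old behaviour in Python. It is not Launchpad's actual code: the function names are made up for illustration, each bug is reduced to a plain set of words, and the real search engine's stemming and relevance ranking are left out. It only shows the two steps described above: a pre-pass to find the rare terms, then a match on any one of them.

```python
def rare_terms(query_terms, documents):
    """Pre-pass: keep the terms found in fewer than half of the documents.

    `documents` is a list of sets of words, a stand-in for the real
    bug index (an assumption made for this sketch).
    """
    total = len(documents)
    rare = []
    for term in query_terms:
        hits = sum(1 for doc in documents if term in doc)
        if hits * 2 < total:  # strictly fewer than half
            rare.append(term)
    return rare


def old_style_search(query_terms, documents):
    """Return the index of every document containing ANY rare term."""
    rare = rare_terms(query_terms, documents)
    return [
        index
        for index, doc in enumerate(documents)
        if any(term in doc for term in rare)
    ]
```

With a matcher like this, every extra rare term can only add bugs to the result, never remove them, which is why typing more words made the old search less precise.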
What we do now is: if you search for "firefox crashes on <website> in flash", we search for any bug containing three of the four non-stopwords, i.e. "firefox", "crashes", "<website>" and "flash". If a bug mentions any three, it will be returned.
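The new rule can be sketched the same way. Again, this is an illustration and not the code Rob wrote: the stopword list, the function names and the `allowed_misses` parameter are all assumptions, bugs are plain sets of words, and there is no stemming. The example above only tells us that three of four terms must match; treating that as "all terms but one" for other query lengths is a guess.

```python
# Hypothetical stopword list; the real one is not given in this post.
STOPWORDS = frozenset({"a", "an", "and", "in", "of", "on", "the", "to"})


def significant_terms(query):
    """Split a query into lowercase words and drop the stopwords."""
    return [word for word in query.lower().split() if word not in STOPWORDS]


def new_style_search(query, documents, allowed_misses=1):
    """Return the index of every document matching most of the query.

    A document matches when it contains all but `allowed_misses` of the
    non-stopword terms (three of four in the example above).
    """
    terms = set(significant_terms(query))
    needed = max(1, len(terms) - allowed_misses)
    return [
        index
        for index, doc in enumerate(documents)
        if len(terms & doc) >= needed
    ]
```

Here a bug that mentions only "firefox" and "crashes" no longer comes back for the four-term query, while one that mentions "firefox", "crashes" and "flash" does. Adding terms now narrows the results instead of widening them.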
We can switch this off easily if we have to, so we do want feedback about how people find this.