hn-search's Introduction

HN Search powered by Algolia

This is the Rails 5 application behind HN Search. It uses React on the frontend, algoliasearch-rails for search, and wkhtmltoimage to crawl and render thumbnails.

Development/Contributions

We love pull-requests :)

Setup

# clone the repository
git clone https://github.com/algolia/hn-search
cd hn-search

# install dependencies
bundle install

# setup credentials
cp config/database.example.yml config/database.yml # feel free to edit, default configuration is OK for search-only
cp config/application.example.yml config/application.yml # feel free to edit, default configuration is OK for search-only

# setup your (sqlite3) database
bundle exec rake db:migrate

# start contributing with Guard (watchers, livereload, notifications, ...)
bundle exec guard

# done!
open http://localhost:3000

Code

If you want to contribute to the UI, the only directory you need to look at is app/assets. This directory contains all the JS, HTML & CSS code.

Deployment

We deploy with Capistrano, so you need SSH access to the underlying machines. Run the following from your own computer:

bundle exec cap deploy

There is currently (December 2018) a bug with bluepill that stops the deployment. To work around it, force a restart with the following command instead:

bundle exec cap deploy:restart

There also seems to be an issue with the thin server: orphaned thin processes are not killed after a deployment. As a result, the server keeps trying to serve the previous version of the app, which causes ChunkLoadErrors because the manifest points to files that no longer exist. To fix these intermittent errors, SSH into both servers, check for orphaned thin processes, and kill them manually.

ps aux | grep thin
kill <PIDs of the old thin processes>
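
The lookup step can also be scripted. The following is a minimal Ruby sketch, assuming the usual ps aux column layout (PID in the second column, command in the eleventh). The helper name thin_pids is hypothetical and not part of this repository. It only lists candidate PIDs; deciding which ones are orphaned is still a manual check.

```ruby
# Hypothetical helper: extract the PIDs of thin processes from `ps aux` output.
# Assumes the usual 11-column `ps aux` layout (USER PID ... COMMAND).
def thin_pids(ps_output)
  ps_output.each_line.filter_map do |line|
    fields = line.split(" ", 11)   # the 11th field keeps the whole command line
    next if fields.size < 11

    command = fields[10]
    next unless command.match?(/\bthin\b/)
    next if command.start_with?("grep") # skip the grep process itself

    Integer(fields[1], 10)
  end
end
```

Called as thin_pids(`ps aux`), it returns an array of integers that can be compared against the PID of the freshly deployed process before anything is killed.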

Indexing Configuration

The indexing is configured using the following algoliasearch block:

class Item < ActiveRecord::Base
  include AlgoliaSearch

  algoliasearch per_environment: true do
    # the list of attributes sent to Algolia's API
    attribute :created_at, :title, :url, :author, :points, :story_text, :comment_text, :num_comments, :story_id, :story_title
    attribute :created_at_i do
      created_at.to_i
    end

    # `title` is more important than `{story,comment}_text`, `{story,comment}_text` more than `url`, `url` more than `author`
    # btw, do not take into account position in most fields to avoid first word match boost
    attributesToIndex ['unordered(title)', 'unordered(story_text)', 'unordered(comment_text)', 'unordered(url)', 'author', 'created_at_i']

    # list of attributes to highlight
    attributesToHighlight ['title', 'story_text', 'comment_text', 'url', 'story_url', 'author', 'story_title']

    # tags used for filtering
    tags do
      [item_type, "author_#{author}", "story_#{story_id}"]
    end

    # use associated number of HN points to sort results (last sort criteria)
    customRanking ['desc(points)', 'desc(num_comments)']

    # controls the way results are sorted, applying the following 4 criteria one after another
    # I removed the 'exact' match criterion (improves 1-word query relevance, doesn't fit HNSearch needs)
    ranking ['typo', 'proximity', 'attribute', 'custom']

    # google+, $1.5M raises, C#: we love you
    separatorsToIndex '+#$'
  end

  def story_text
    item_type_cd != Item.comment ? text : nil
  end

  def story_title
    comment? && story ? story.title : nil
  end

  def story_url
    comment? && story ? story.url : nil
  end

  def comment_text
    comment? ? text : nil
  end

  def comment?
    item_type_cd == Item.comment
  end
end
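
To make the derived fields concrete, here is a plain-Ruby sketch of what the block above computes for one comment. It assumes nothing about Rails: SketchItem is a hypothetical stand-in for the ActiveRecord model, and the sample values are made up.

```ruby
# Hypothetical stand-in for the Item model, without ActiveRecord or AlgoliaSearch.
SketchItem = Struct.new(:item_type, :author, :story_id, :text, :created_at,
                        keyword_init: true) do
  def comment?
    item_type == "comment"
  end

  # Only non-comment items expose their text as story_text.
  def story_text
    comment? ? nil : text
  end

  # Only comments expose their text as comment_text.
  def comment_text
    comment? ? text : nil
  end

  # Same shape as the `tags do ... end` block: type, author and parent story.
  def tags
    [item_type, "author_#{author}", "story_#{story_id}"]
  end

  # Integer timestamp, mirroring the created_at_i attribute.
  def created_at_i
    created_at.to_i
  end
end

item = SketchItem.new(item_type: "comment", author: "pg", story_id: 1,
                      text: "Nice.", created_at: Time.at(1_395_440_948))
puts item.tags.inspect
```

The tags are what make filters such as tags=story,author_felipe possible, and created_at_i is the field that numeric date filters compare against.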

Credits

hn-search's People

Contributors

ant-hem, antoinegauvain, dessaigne, dstein64, elpicador, gwern, haroenv, hidroh, jstrieb, julienpa, jzck, kokliko, leavjenn, petasittek, peterdavehello, pixelastic, powerpak, pragmatictester, rahimnathwani, redox, rgaidot, rpozarickij, ryanwi, sapek, sfriquet, shipow, shreevatsa, speedblue, timmutton, vvo

hn-search's Issues

Add ability for whole string search

On the old search, when I looked for 'ios', it returned items that contained the full match.

Currently, when I search for 'ios', it returns all results that contain 'io'.

It would be nice if there were a way to specify whether you want to match the string exactly or whether a partial match is OK.

Indexing delay

We're currently experiencing an indexing delay (for ~20h now) because the update stream seems to be flooded with very old items. @kogir / @jamestamplin, any chance you are aware of this and know what's happening?

https://hacker-news.firebaseio.com/v0/updates.json?print=pretty

{
  "items" : [ 4081111, 4081293, 4081227, 4081299, 4081198, 4081203, 9143078, 4081244, 4081245, 4081213, 4081131, 4081263, 4081152, 4081114, 4081112, 4081118, 4081212, 4081186, 4081135, 4081164, 4081237, 4081247, 4081225, 4081169, 4081194, 4081171, 4081126, 4081268, 4081197, 4081123, 4081170, 4081142, 4081116, 4081297, 4081257, 4081133, 4081183, 4081113, 4081303, 4081150, 4081153, 4081282, 4081176, 4081179, 4081269, 4081218, 4081236, 4081177, 4081160, 4081117, 4081110, 4081250, 4081284, 4081165, 4081151, 4081143, 4081146, 4081302, 4081266, 4081162, 4081154, 4081294, 4081128, 4081276, 4081279, 4081289, 4081230, 4081298, 4081280, 4081246, 4081239, 4081275, 4081122, 4081265, 4081238, 4081259, 4081159, 4081217, 4081254, 4081149, 4081202, 4081107, 4081173, 4081304, 4081281, 4081105, 4081196, 4081106, 4081285, 4081180, 4081155, 4081121, 4081277, 4081174, 4081195, 4081140, 4081200, 4081172, 4081201, 4081296, 4081157, 4081305, 4081120, 4081156, 4081233, 4081231, 4081273, 4081145, 4081272, 4081240, 4081241, 4081167, 4081147, 4081291, 4081130, 4081214, 4081235, 4081139, 4081206, 4081222, 4081163, 4081288, 4081290, 4081270, 4081215, 4081287, 4081161, 4081253, 4081190, 4081208, 4081168, 4081211, 4081216, 4081209, 4081189, 4081256, 4081283, 4081251, 4081181, 4081278, 4081205, 4081210, 4081220, 4081127, 4081138, 4081184, 4081260, 4081115, 4081252, 4081255, 4081234, 4081108, 4081223, 4081125, 4081141, 4081136, 4079393, 4077256, 4079620, 4081226, 4080836, 4081137, 4079567, 4081228, 4064867, 4080268, 4081148, 4081248, 4081193, 4081129, 4081301, 4080074, 4081204, 4080537, 4079837, 4079615, 4081286, 4079737, 4080373, 4080201, 4081292, 4081119, 4079588, 4081185, 4081224, 4081267, 4081191, 4081243, 4081038, 4080240, 4078509, 4081221, 4079183, 4081258, 4077431, 4081132, 4080330, 4081097, 4079977, 4078635, 4081124, 4081249, 4077645, 4078554, 4081199, 4081242, 4081109, 4081182, 4081144, 4078334, 4079572, 4081264, 4078309, 4081053, 4081134, 4080522, 4081219, 4081271, 4079206, 4069914, 
4079862, 4080817, 9142819, 4077891, 4081232, 4081188, 4078288, 4079934, 4081262, 4081274, 4081300, 4078483, 4080451, 4081261, 4076834, 4080320, 4081158, 4081178, 4081187, 4081207, 4081295, 4081166, 4081229, 4081175, 4081192 ],
  "profiles" : [ "Argorak", "shrikrishna", "nexneo" ]
}
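
One way to quantify the flood is to split the IDs in such a payload into old and recent ones. The Ruby sketch below assumes the payload shape shown above; the method name split_stale_updates and the gap threshold are hypothetical choices for illustration, not part of the indexer.

```ruby
require "json"

# Hypothetical helper: partition update IDs into [stale, recent], where "stale"
# means the ID is more than `gap` below the newest ID in the payload.
def split_stale_updates(payload, gap: 1_000_000)
  ids = JSON.parse(payload).fetch("items")
  newest = ids.max
  ids.partition { |id| newest - id > gap }
end

sample = '{"items": [4081111, 9143078, 4081293, 9142819], "profiles": []}'
stale, recent = split_stale_updates(sample)
puts "#{stale.size} stale, #{recent.size} recent"
```

In the payload quoted above, only two IDs (9143078 and 9142819) are in the 9.1M range, so nearly every entry would land in the stale group.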

Data is not complete in places?

Hey guys,

I have noticed a few issues with the data as compared to the site:

  1. There are multi-month gaps in the data, for example Nov 2007 - Dec 2007 (date>1193875200&date<1196467200) and Aug 2009 - Dec 2009 (date>1249084800&date<1259625600). I spot-checked a few IDs that are supposed to fall in those ranges and found them on Hacker News but not on the search engine. Any ideas why?
  2. Past items don't seem to be updated (much?) anymore. For example, a fairly recent item, https://news.ycombinator.com/item?id=7787384, has 1469 points on the site but only 1377 points on the search site. It would be nice to include a "cached" field in the API to indicate exactly when an item was pulled, if older items are no longer updated :)

Thanks a lot for any insights into these issues :)

Jason

Users API lacks data for some usernames

These accounts return results in story/comment searches, but no user profile is found for them. I am in the process of compiling a complete list. Here are the first few:

kidb
STCPI
yolosolo
wworried
lakshyabazaar
kittykat04
dolfelt
dbunkah
zigger
dancapo
experimentsin
jmorgan84

Concrete Example:
https://hn.algolia.com/api/v1/users/kidb
https://news.ycombinator.com/user?id=kidb
https://hn.algolia.com/#!/all/forever/prefix/0/author:kidb

Note also that these usernames don't show up within the instant search (likely due to this problem.)

More query operators

Add new operators in the query syntax:

  • date
  • points

Example:

"search engine" -algolia points>42 date>1395440948
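
A rough Ruby sketch of how such a query string could be split into free text and numeric filters. It assumes that date maps to the created_at_i attribute and points to points, matching the index configuration in the README; parse_query and the constants are hypothetical names, not existing code.

```ruby
# Hypothetical mapping from operator names to indexed numeric attributes.
FIELD_MAP = { "date" => "created_at_i", "points" => "points" }.freeze
OPERATOR_PATTERN = /\A(date|points)(<=|>=|<|>|=)(\d+)\z/

# Split a raw query into the text part and a list of numeric filters.
def parse_query(raw)
  text_tokens = []
  numeric_filters = []

  # Tokenize on whitespace while keeping quoted phrases together.
  raw.scan(/"[^"]*"|\S+/).each do |token|
    if (match = OPERATOR_PATTERN.match(token))
      numeric_filters << "#{FIELD_MAP[match[1]]}#{match[2]}#{match[3]}"
    else
      text_tokens << token
    end
  end

  { query: text_tokens.join(" "), numeric_filters: numeric_filters }
end

parsed = parse_query('"search engine" -algolia points>42 date>1395440948')
puts parsed.inspect
```

The text part keeps the quoted phrase and the exclusion untouched, while the two operators become filter strings of the form accepted by a numericFilters parameter.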

[REQ] Option to exclude username from search

For certain searches (e.g. when looking for comments about the Julia programming language) the search string matches a number of usernames.

It would be great to have options to restrict the search to matching the comment body, as unless I'm searching for a particular person (for which there's already a search modifier) usernames aren't of relevance.

Fuzzy match too fuzzy

It used to be that I could search for things like [spacex] and [castar], now I am getting tons of non-SpaceX and non-CastAR results because it's considering "space" to be a match, etc etc.

When I use quotes I get what I wanted, but it's dropping the quotes in some circumstances, like:

  • add quotes to query
  • note that the URL now has quotes
  • reload the page
  • quotes are gone

All recent comments have 1 point

It looks like older comments have the correct number of points, but comments younger than about 7 days have 1 point. For example, check out what I pulled on the thread about Project Naptha (which gets its info from this API call).

Is this a known limitation of the API, or a bug? If it's a limitation of the API, it would be ideal to be able to pull the comments in the order that they appear on Hacker News, but I'm not sure how your export works.

I noticed that this was mentioned in #14, but it looks like 3260ff0 might not have solved this completely. Thanks!

API JSON slightly different than the old HNSearch

Is this intentional? (It means that apps using the HNSearch API will need to be updated a bit.) Specifically I have noticed that for users, create_ts is now called created_at, but other differences also exist.

Custom date range

Feature request here, I'd love to be able to filter stories and comments by date on a more granular level.

Search behavior is inconsistent depending on how search is executed

I discovered this when realizing I missed a previous front-page story about "Input Fonts" because I searched for "Input Font". (See: https://news.ycombinator.com/item?id=8173181)

If a search is executed from the text box on HN or by typing the search and pressing enter, exact matching is used. If the search is executed by typing in the search and NOT pressing enter (letting the live results come up), then prefix matching is used.

Example:

If I type "test search" in the box, I get this URL: https://hn.algolia.com/#!/story/forever/prefix/0/test%20search

If I then press enter, I get this URL: https://hn.algolia.com/#!/story/forever/0/test%20search

The latter will not find "tests" or "searched" or other close matches.

Support human-readable timestamps as well as UNIX ones

The “Advanced search syntax” help dialog says this:

Use date>TIMESTAMP or date<TIMESTAMP to filter by date.

However, when I tried searching by date using the ISO format YYYY-MM-DD, e.g. clojure date>2014-08-09 because I was curious about the number of recent Clojure stories, I got a bunch of stories with the wrong dates, or no stories were returned even though there should have been some. It didn’t work when I removed the hyphens either, like 20140809. It took me a while to realize that you only supported UNIX timestamps, e.g. clojure date>1404878400. I had to go find an online tool that converts human-readable timestamps to UNIX timestamps in order to do my search.

It would be easier for users of the website (as opposed to users of the API) to use the date filter if you supported the ISO 8601 date and time format too. You could look for the presence of hyphens in the date to distinguish UNIX timestamps from ISO ones. Also, the “Advanced search syntax” help dialog should be updated to state what format the “TIMESTAMP” is expected to be in – just a short note like “(TIMESTAMP can be UNIX timestamp or ISO 8601 formatted)” is fine.
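
The hyphen heuristic suggested above could look like the following Ruby sketch. It assumes dates are interpreted as midnight UTC, which is a choice made here for illustration; to_unix_timestamp is a hypothetical name.

```ruby
require "date"

# Hypothetical helper: accept either a UNIX timestamp or an ISO 8601 date
# (YYYY-MM-DD) and always return a UNIX timestamp.
def to_unix_timestamp(value)
  if value.include?("-")
    date = Date.iso8601(value)                      # raises on malformed dates
    Time.utc(date.year, date.month, date.day).to_i  # midnight UTC (assumption)
  else
    Integer(value, 10)                              # raises on non-numeric input
  end
end

puts to_unix_timestamp("2014-08-09")
puts to_unix_timestamp("1404878400")
```

Raising on malformed input, rather than silently returning unrelated results, would also address the confusing behavior described in this issue.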

Users API should include created_at_i field

I generally find it difficult to work with created_at fields. (Is there a built-in way to parse these in Ruby? I could write some custom string-tokenizing code, but it seems like there should be a better solution.)

I find it much easier to work with created_at_i fields. The users API is missing this, though. Can it be added?
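
On the parsing question: if the created_at values are ISO 8601 strings (an assumption here; the sample value below is made up rather than taken from an API response), Ruby's standard library can parse them without custom tokenizing. A minimal sketch, with created_at_to_i as a hypothetical helper name:

```ruby
require "time" # adds Time.iso8601 to the core Time class

# Hypothetical helper: convert an ISO 8601 created_at string into the
# integer form that a created_at_i field would carry.
def created_at_to_i(created_at)
  Time.iso8601(created_at).to_i
end

puts created_at_to_i("2014-08-09T00:00:00.000Z")
```

Time.iso8601 raises ArgumentError on strings it cannot parse, which makes unexpected formats easy to spot.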

Breaking change in Access-Control-Allow-Headers?

I wrote a prototype in November, which doesn't seem to work anymore.

XMLHttpRequest cannot load http://hn.algolia.io/1/404?tags=story&restrictSearchableAttributes=url&query=%22https://github.com/blog/1986-announcing-git-large-file-storage-lfs%22. Request header field Accept-Encoding is not allowed by Access-Control-Allow-Headers.

I'm making this request from the client with JavaScript. I suppose requesting this from the server would let me omit the Accept-Encoding header field. Is that the intended use case?

Points for comments items

Hi guys,

I'm building a small Android app (the n-th HN reader, of course!) and using your search API to retrieve items. However, I'm a bit confused about comments: it seems like the points (upvotes / downvotes) are always 1. Am I missing something?

Migrating hackernews-button to hn-search API

First off, kudos for the great service! We're looking into migrating to your API to power the HN button (see igrigorik/hackernews-button#14), and ran into a few questions (igrigorik/hackernews-button#15)

Also, for context, the current button gets ~20 QPS, but we cache the API results in memcache, so only a small fraction of that would hit your API. I think we'd be OK with the 10K/day limit.

Missing stories in Algolia

It appears that Algolia is not returning some of the stories that are valid and accessible on HN's main website.

Example:
https://news.ycombinator.com/item?id=90945

I converted the story's post time to a Unix epoch timestamp and then ran the following API query to get posts by the author that are older than this time (I added a couple of days to the timestamp just to be safe around UTC).

https://hn.algolia.com/api/v1/search_by_date?tags=story,author_felipe&hitsPerPage=10&numericFilters=created_at_i%3C1199080820

The result does include older stories but not the story above, which indicates that Algolia itself doesn't have it. The story is accessible on HN without any issues.
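
For anyone reproducing this check for other authors, the query URL can be assembled as in the Ruby sketch below. The endpoint and parameter names are the ones used in the query above; search_by_date_url is a hypothetical helper name.

```ruby
require "uri"

# Hypothetical helper: build a search_by_date URL that lists an author's
# stories created before a given UNIX timestamp.
def search_by_date_url(author:, before:, hits_per_page: 10)
  query = URI.encode_www_form(
    tags: "story,author_#{author}",
    hitsPerPage: hits_per_page,
    numericFilters: "created_at_i<#{before}"
  )
  "https://hn.algolia.com/api/v1/search_by_date?#{query}"
end

url = search_by_date_url(author: "felipe", before: 1_199_080_820)
puts url
```

URI.encode_www_form percent-encodes the comma and the < sign, so the printed URL differs slightly in spelling from the hand-written one but decodes to the same parameters.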

Home page typo

The README has an extra ", :" at the end of this line:

attribute :created_at, :title, :url, :author, :points, :story_text, :comment_text, :author, :num_comments, :story_id, :story_title, :

Searching by exact query match?

Hi, I don't know if there's a parameter for this currently, but I want to query results by the URL attached to a story. This seemed possible with the old API, but I cannot seem to query by URL only.

I have tried query matches with a URL but the matching seems to be quite fuzzy and I want exact matches. Querying http://hn.algolia.io/api/v1/search_by_date?tags=story&query=http://skimfeed.com/ seems to offer results with a matchedWord that doesn't appear in the text at all. I'm assuming this is a fuzzy-match feature and not a bug, but is there a way to perform a more exact query?

XSS on https://news.ycombinator.com/item?id=1154379

A search that views this post ends up actually navigating to "htt://hostile.com".

I see commit 066f152 seems to address XSS issues. I do not know if this commit is deployed to the search application or not, but the approach is fundamentally flawed. You can not filter out badness; you must correctly encode the text such that HTML entities are correctly used. HN posts do not permit HTML to be used within them, so somewhere you need to call an HTML encoding function on the text, which will encode everything that looks like a tag.

The linked post is a decent test case... it really ought to come up in the search engine exactly as it displays on HN, no additional encoding, no removed encoding, nothing filtered out.

I don't know Ruby, but I'm very familiar with this problem in general; if you need some more help, let me know. I often help my coworkers with this.
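
For reference, the encode-on-output approach described above can be sketched in Ruby with the standard library's CGI.escapeHTML. This is only an illustration of the technique, not the code from the commit mentioned above; render_comment_text is a hypothetical name.

```ruby
require "cgi"

# Hypothetical helper: encode user-supplied text so that anything resembling
# markup is displayed literally instead of being interpreted by the browser.
def render_comment_text(raw_text)
  CGI.escapeHTML(raw_text)
end

puts render_comment_text('<a href="htt://hostile.com">click</a>')
```

CGI.escapeHTML replaces &, <, >, double quotes and single quotes with their HTML entities, so nothing has to be filtered out of the text.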

Malformed <p> tags

Hey, sorry to open two issues in one day – I've been really digging into the output of the API and doing some comment bugfixes on my end. Hacker News has had some screwy paragraph tags in comment text for a long time, and the bug is present in your API. For example, a comment that should look like this:

<p>This is my first paragraph.</p>
<p>This is my second paragraph.</p>
<p>This is my third paragraph.</p>

Instead, looks like this:

This is my first paragraph.
<p>This is my second paragraph.
<p>This is my third paragraph.

A fix for this would be great, but I wouldn't mind this being marked as a "wontfix". I just wanted to at least make an issue about it so that others can find it in the future if they run into the same problem. I wrote solutions in Ruby and Javascript for anyone else trying to solve this problem quickly:

# ruby
fixed_text = '<p>' + text.gsub("<p>", "</p><p>") + "</p>"
// javascript
fixedText = '<p>' + text.split("<p>").join("</p><p>") + "</p>";

favicon in opensearch uses http

The favicon is not showing up in Firefox when I add hn.algolia as a search provider. I think it is due to mixed http/https content: everything else uses //link or https://link, but the favicon uses http://link.

Optimization of loading time

Currently, we perform the prefetch JavaScript query in all situations.

This prefetch query is useful when the page has no query parameter, because it resolves the DNS and initializes the keep-alive HTTP session. But when the URL carries a query parameter, we should not perform this prefetch query (by passing false as the 4th parameter of AlgoliaSearch).

Btw, we should also upgrade to version 2.5.4 of the JavaScript client to remove the need for the OPTIONS CORS request.

Improve spam deletion

Since there is no push notification, it turns out that we're missing a few item deletions while polling the HN API. Could we improve the refreshing rate/cycle?

Add "Past Year" to date range

I often find myself reaching for posts from the last year. Could we add that option to the date range drop-down menu?

Internal Server Error reported when deleted stories are queried

Recently (less than 2 days ago) I started to notice a lot of Internal Server Error messages like this:

   curl https://hn.algolia.com/api/v1/items/8189743
  {"status":"500","error":"Internal Server Error"}

Item 8189743 is marked as [deleted] on the official site (as of 2014-08-17 17:07 UTC) but it still receives comments which are returned by /api/v1/search_by_date. In the past a query like this used to return a json object.

Issue searching for comments with points filter

Hi,

It appears the combination of tags=comment and numericFilters=points>X is no longer returning new results.

The latest results are from early October 2014: https://hn.algolia.com/api/v1/search_by_date?tags=comment&numericFilters=points%3E1

It seems to just be this combination. Searching for comments and filtering on created_at_i works fine. Searching for stories and filtering on points, num_comments, or created_at_i works fine.

It looks like a switch was made to the official HN API around early-to-mid October 2014 (#29). My best guess is this issue is related to that switch.

Thanks!
