algolia / hn-search

Hacker News Search

License: Other

Ruby 21.22% JavaScript 1.80% HTML 0.66% Shell 0.40% TypeScript 53.77% SCSS 20.37% Haml 1.77%

hn-search's Introduction

HN Search powered by Algolia

This is the Rails 5 application providing HN Search. It uses React on the frontend and algoliasearch-rails for search, and relies on wkhtmltoimage to crawl and render thumbnails.

Development/Contributions

We love pull-requests :)

Setup

# clone the repository
git clone https://github.com/algolia/hn-search
cd hn-search

# install dependencies
bundle install

# setup credentials
cp config/database.example.yml config/database.yml # feel free to edit, default configuration is OK for search-only
cp config/application.example.yml config/application.yml # feel free to edit, default configuration is OK for search-only

# setup your (sqlite3) database
bundle exec rake db:migrate

# start contributing with Guard (watchers, livereload, notifications, ...)
bundle exec guard

# done!
open http://localhost:3000

Code

If you want to contribute to the UI, the only directory you need to look at is app/assets. This directory contains all the JS, HTML & CSS code.

Deployment

We deploy with Capistrano, so you need SSH access to the underlying machines. Run the following from your own computer:

bundle exec cap deploy

There is currently (December 2018) a bug with bluepill that stops the deployment. To work around it, you need to force a restart with the following command instead:

bundle exec cap deploy:restart

There also seems to be an issue with the thin server: after deployment, orphaned thin processes are not killed. The server then tries to serve the previous version of the app, which causes ChunkLoadErrors because the manifest points to files that no longer exist. To fix these intermittent errors, you need to SSH to both servers, check for any orphaned thin processes, and kill them manually.

ps aux | grep thin
kill <PIDs of the old thin processes>

Indexing Configuration

The indexing is configured using the following algoliasearch block:

class Item < ActiveRecord::Base
  include AlgoliaSearch

  algoliasearch per_environment: true do
    # the list of attributes sent to Algolia's API
    attribute :created_at, :title, :url, :author, :points, :story_text, :comment_text, :num_comments, :story_id, :story_title
    attribute :created_at_i do
      created_at.to_i
    end

    # `title` is more important than `{story,comment}_text`, `{story,comment}_text` more than `url`, `url` more than `author`
    # btw, do not take into account position in most fields to avoid first word match boost
    attributesToIndex ['unordered(title)', 'unordered(story_text)', 'unordered(comment_text)', 'unordered(url)', 'author', 'created_at_i']

    # list of attributes to highlight
    attributesToHighlight ['title', 'story_text', 'comment_text', 'url', 'story_url', 'author', 'story_title']

    # tags used for filtering
    tags do
      [item_type, "author_#{author}", "story_#{story_id}"]
    end

    # use associated number of HN points to sort results (last sort criteria)
    customRanking ['desc(points)', 'desc(num_comments)']

    # controls how results are sorted, applying the following 4 criteria one after another
    # I removed the 'exact' match criterion (improves one-word query relevance, doesn't fit HNSearch needs)
    ranking ['typo', 'proximity', 'attribute', 'custom']

    # google+, $1.5M raises, C#: we love you
    separatorsToIndex '+#$'
  end

  def story_text
    item_type_cd != Item.comment ? text : nil
  end

  def story_title
    comment? && story ? story.title : nil
  end

  def story_url
    comment? && story ? story.url : nil
  end

  def comment_text
    comment? ? text : nil
  end

  def comment?
    item_type_cd == Item.comment
  end
end
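
As a rough illustration of how the tags block above relates to the tags= parameter seen in API queries (for example tags=story,author_felipe), the sketch below rebuilds the same tag list for a plain Ruby object. FakeItem is a hypothetical stand-in for the model and is not part of the application; the real item_type value may be represented differently.

```ruby
# Hypothetical stand-in for the Item model; it only carries the
# fields that the tags block reads.
FakeItem = Struct.new(:item_type, :author, :story_id) do
  # Mirrors the tags block from the configuration above.
  def tags
    [item_type, "author_#{author}", "story_#{story_id}"]
  end
end

item = FakeItem.new("comment", "pg", 1)
puts item.tags.inspect
```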

Credits

HackerNews
Firebase for the real-time crawling API
wkhtmltoimage to back the thumbnails' crawl+rendering

hn-search's Issues

API JSON slightly different than the old HNSearch

Is this intentional? (It means that apps using the HNSearch API will need to be updated a bit.) Specifically I have noticed that for users, create_ts is now called created_at, but other differences also exist.

Job type post doesn't exist

I'm not sure if it's technically a HN "story", but some job posts appear on the frontpage as part of the stories.

HN: https://news.ycombinator.com/item?id=7205858
API: http://hn.algolia.io/api/v1/items/7205858

Searching by exact query match?

Hi, I don't know if there's a parameter for this currently, but I want to query results by the URL attached to a story. This seemed possible with the old API, but I cannot find a way to query by URL only.

I have tried query matches with a URL but the matching seems to be quite fuzzy and I want exact matches. Querying http://hn.algolia.io/api/v1/search_by_date?tags=story&query=http://skimfeed.com/ seems to offer results with a matchedWord that doesn't appear in the text at all. I'm assuming this is a fuzzy-match feature and not a bug, but is there a way to perform a more exact query?

Update user profile (karma, bio) 24 hrs after submission of new item (story or comment)

When a user submits a new item, there is a chance that it will get voted on by users, in which case the karma will change. It would be nice to have the karma updated 24 hours after the submission, or the profile will be horribly out-of-sync for a potentially long time.

Add JSONP Support

/items/[ID] API doesn't return poll information

For example this HN story contains a poll: https://news.ycombinator.com/item?id=7028714

The API response unfortunately doesn't return any information of the poll: http://hn.algolia.io/api/v1/items/7028714 😦

XSS on https://news.ycombinator.com/item?id=1154379

A search that views this post ends up actually navigating to "htt://hostile.com".

I see commit 066f152 seems to address XSS issues. I do not know if this commit is deployed to the search application or not, but the approach is fundamentally flawed. You can not filter out badness; you must correctly encode the text such that HTML entities are correctly used. HN posts do not permit HTML to be used within them, so somewhere you need to call an HTML encoding function on the text, which will encode everything that looks like a tag.

The linked post is a decent test case... it really ought to come up in the search engine exactly as it displays on HN, no additional encoding, no removed encoding, nothing filtered out.

I don't know Ruby, but I'm very familiar with this problem in general; if you need some more help, let me know. I often help my coworkers with this.
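
A minimal sketch of the encoding approach described above, using CGI.escapeHTML from Ruby's standard library. The helper name encode_for_html and the sample string are made up for illustration; this is not the code used by the application.

```ruby
require 'cgi'

# Encode text so that anything resembling markup is displayed literally
# instead of being interpreted by the browser.
# (Hypothetical helper name, for illustration only.)
def encode_for_html(text)
  CGI.escapeHTML(text)
end

puts encode_for_html('<a href="htt://hostile.com">click</a>')
```

The point of encoding rather than filtering is that no input needs to be rejected or stripped: every character is kept, and the markup-significant ones are replaced by entities.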

Migrating hackernews-button to hn-search API

First off, kudos for the great service! We're looking into migrating to your API to power the HN button (see igrigorik/hackernews-button#14), and ran into a few questions (igrigorik/hackernews-button#15)

Is it possible to search by URL? It seems like you index URL as part of the overall search text, but is it possible to restrict search by URL only?
How come some hits don't have story ID assigned to them? E.g. http://hn.algolia.com/api/v1/search_by_date?query=%22https://www.igvita.com/2014/05/05/minimum-viable-block-chain/%22

Also, for context, the current button gets ~20 QPS, but we cache the API results in memcache, so only a small fraction of that would hit your API. I think we'd be OK with the 10K/day limit.
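
For the URL-only question, one possible shape of such a request is sketched below. It only builds the URL and does not call the API. The restrictSearchableAttributes=url parameter is an assumption taken from a query string quoted in another report in this tracker; whether the endpoint honors it should be checked against the API documentation. The helper name url_search_uri is hypothetical.

```ruby
require 'uri'

# Build a search_by_date URL that asks for matches on the story URL only.
# ASSUMPTION: `restrictSearchableAttributes=url` is accepted by this endpoint.
def url_search_uri(story_url)
  query = URI.encode_www_form(
    tags: 'story',
    restrictSearchableAttributes: 'url',
    query: %("#{story_url}") # wrap in double quotes to ask for a phrase match
  )
  URI::HTTPS.build(host: 'hn.algolia.com', path: '/api/v1/search_by_date', query: query)
end

puts url_search_uri('https://www.igvita.com/2014/05/05/minimum-viable-block-chain/')
```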

Users API lacks data for some usernames

These accounts return results in stories/comments searches but no user profile was found. I am in the process of getting a complete list of these. Here are the first few:

kidb
STCPI
yolosolo
wworried
lakshyabazaar
kittykat04
dolfelt
dbunkah
zigger
dancapo
experimentsin
jmorgan84

Concrete Example:
https://hn.algolia.com/api/v1/users/kidb
https://news.ycombinator.com/user?id=kidb
https://hn.algolia.com/#!/all/forever/prefix/0/author:kidb

Note also that these usernames don't show up within the instant search (likely due to this problem.)

Users API should include created_at_i field

I generally find it difficult to work with created_at fields. Is there a built-in way to parse these in Ruby? I could write some custom string-tokenizing code, but it seems like there should be a better solution.

I find it much easier to work with created_at_i fields. The users API is missing this, though. Can it get added?
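
On the parsing question above: Ruby's standard library provides Time.iso8601 (after require 'time'). The sketch below assumes that created_at is an ISO 8601 string such as the sample shown; the sample value and the helper name created_at_to_i are made up for illustration.

```ruby
require 'time'

# Convert an ISO 8601 timestamp string into a UNIX timestamp
# (integer seconds since the epoch).
# ASSUMPTION: the API returns created_at in ISO 8601 form.
def created_at_to_i(created_at)
  Time.iso8601(created_at).to_i
end

puts created_at_to_i('2014-05-05T12:30:00.000Z')
```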

Indexing delay

We're currently experiencing an indexing delay (for the past ~20h) because the update stream seems to be flooded by super-old items. @kogir / @jamestamplin any chance you guys are aware of that and know what's happening?

https://hacker-news.firebaseio.com/v0/updates.json?print=pretty

{
  "items" : [ 4081111, 4081293, 4081227, 4081299, 4081198, 4081203, 9143078, 4081244, 4081245, 4081213, 4081131, 4081263, 4081152, 4081114, 4081112, 4081118, 4081212, 4081186, 4081135, 4081164, 4081237, 4081247, 4081225, 4081169, 4081194, 4081171, 4081126, 4081268, 4081197, 4081123, 4081170, 4081142, 4081116, 4081297, 4081257, 4081133, 4081183, 4081113, 4081303, 4081150, 4081153, 4081282, 4081176, 4081179, 4081269, 4081218, 4081236, 4081177, 4081160, 4081117, 4081110, 4081250, 4081284, 4081165, 4081151, 4081143, 4081146, 4081302, 4081266, 4081162, 4081154, 4081294, 4081128, 4081276, 4081279, 4081289, 4081230, 4081298, 4081280, 4081246, 4081239, 4081275, 4081122, 4081265, 4081238, 4081259, 4081159, 4081217, 4081254, 4081149, 4081202, 4081107, 4081173, 4081304, 4081281, 4081105, 4081196, 4081106, 4081285, 4081180, 4081155, 4081121, 4081277, 4081174, 4081195, 4081140, 4081200, 4081172, 4081201, 4081296, 4081157, 4081305, 4081120, 4081156, 4081233, 4081231, 4081273, 4081145, 4081272, 4081240, 4081241, 4081167, 4081147, 4081291, 4081130, 4081214, 4081235, 4081139, 4081206, 4081222, 4081163, 4081288, 4081290, 4081270, 4081215, 4081287, 4081161, 4081253, 4081190, 4081208, 4081168, 4081211, 4081216, 4081209, 4081189, 4081256, 4081283, 4081251, 4081181, 4081278, 4081205, 4081210, 4081220, 4081127, 4081138, 4081184, 4081260, 4081115, 4081252, 4081255, 4081234, 4081108, 4081223, 4081125, 4081141, 4081136, 4079393, 4077256, 4079620, 4081226, 4080836, 4081137, 4079567, 4081228, 4064867, 4080268, 4081148, 4081248, 4081193, 4081129, 4081301, 4080074, 4081204, 4080537, 4079837, 4079615, 4081286, 4079737, 4080373, 4080201, 4081292, 4081119, 4079588, 4081185, 4081224, 4081267, 4081191, 4081243, 4081038, 4080240, 4078509, 4081221, 4079183, 4081258, 4077431, 4081132, 4080330, 4081097, 4079977, 4078635, 4081124, 4081249, 4077645, 4078554, 4081199, 4081242, 4081109, 4081182, 4081144, 4078334, 4079572, 4081264, 4078309, 4081053, 4081134, 4080522, 4081219, 4081271, 4079206, 4069914, 
4079862, 4080817, 9142819, 4077891, 4081232, 4081188, 4078288, 4079934, 4081262, 4081274, 4081300, 4078483, 4080451, 4081261, 4076834, 4080320, 4081158, 4081178, 4081187, 4081207, 4081295, 4081166, 4081229, 4081175, 4081192 ],
  "profiles" : [ "Argorak", "shrikrishna", "nexneo" ]
}
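
To get a feel for how many entries in such a batch are old, the IDs can be split around a cutoff. This is only a sketch: the cutoff value is a hypothetical choice, and the sample uses a handful of IDs copied from the response above.

```ruby
# Split a batch of item IDs from the updates feed into "old" and "recent"
# groups around a cutoff ID. The cutoff is a hypothetical value picked
# for this example, not something the feed provides.
def split_updates(item_ids, cutoff_id)
  old, recent = item_ids.partition { |id| id < cutoff_id }
  { old: old, recent: recent }
end

sample = [4081111, 4081293, 9143078, 4081244, 9142819]
result = split_updates(sample, 9_000_000)
puts "old: #{result[:old].size}, recent: #{result[:recent].size}"
```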

Support human-readable timestamps as well as UNIX ones

The “Advanced search syntax” help dialog says this:

Use date>TIMESTAMP or date<TIMESTAMP to filter by date.

However, when I tried searching by date using the ISO format YYYY-MM-DD, e.g. clojure date>2014-08-09 because I was curious about the number of recent Clojure stories, I got a bunch of stories with the wrong dates, or no stories were returned even though there should have been some. It didn’t work when I removed the hyphens either, like 20140809. It took me a while to realize that you only supported UNIX timestamps, e.g. clojure date>1404878400. I had to go find an online tool that converts human-readable timestamps to UNIX timestamps in order to do my search.

It would be easier for users of the website (as opposed to users of the API) to use the date filter if you supported the ISO 8601 date and time format too. You could look for the presence of hyphens in the date to distinguish UNIX timestamps from ISO ones. Also, the “Advanced search syntax” help dialog should be updated to state what format the “TIMESTAMP” is expected to be in – just a short note like “(TIMESTAMP can be UNIX timestamp or ISO 8601 formatted)” is fine.
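
The suggestion above could be sketched as follows: values made only of digits are treated as UNIX timestamps, anything else is parsed as an ISO 8601 date or date-time. The helper name to_unix_timestamp is hypothetical, and interpreting a bare date as midnight UTC is an assumption of this sketch.

```ruby
require 'date'
require 'time'

# Accept either a UNIX timestamp ("1404878400") or an ISO 8601 value
# ("2014-08-09" or "2014-08-09T12:00:00Z") and return integer seconds.
# ASSUMPTION: a bare date means midnight UTC on that day.
def to_unix_timestamp(value)
  return Integer(value, 10) if value.match?(/\A\d+\z/)

  if value.include?('T')
    Time.iso8601(value).to_i
  else
    date = Date.iso8601(value)
    Time.utc(date.year, date.month, date.day).to_i
  end
end

puts to_unix_timestamp('2014-08-09')
```

Note that an all-digit value such as 20140809 is still read as a UNIX timestamp here, which matches the hyphen-based distinction proposed in the report.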

Help dialog listing syntax/query operators

Add ability for whole string search

On the old search, when I looked for 'ios', it would return only items that contained a full match.

Currently, when I search for 'ios', it returns all results that contain 'io'.

It would be nice if there were a way to specify whether you want an exact match or whether a partial match is OK.

Points for comments items

Hi guys,

I'm building a small Android app (the n-th HN reader, of course!) and using your search API to retrieve items. However, I'm a bit confused about comments: it seems like the points (upvotes / downvotes) are always 1. Am I missing something?

Code moved to the right place, building URL with encodeURI. + Reset on empty inputfield.

Add "Past Year" to date range

I often find myself reaching for posts in the last year. Could we add that option from the date range drop down menu?

Monitor indexing/crawling status

To ensure both our crawler & indexer stay up-to-date.

Strings like php, html, asp, uk... return false positives due to URL

The first 3 results are good matches for php but the others are here just because their URL contains .php . The same happens with :

html
asp(x)
jsp
uk
...

2 clicks needed to open a link from the results page

It takes 2 clicks to open a link from the results page.
This issue only happens the first time. Once a result link has been opened, the issue

Schedule full-reindexing

Some items were not detected as "dead" while importing the initial dump.

Avg is zero for some users when using the /users API

I have noticed it on at least a couple of occasions while testing a program I am writing that makes use of the API. One such example (cannot remember the other at the moment) is:

https://hn.algolia.com/api/v1/users/dweekly

(reports average as zero while HN reports it as 4.)

[REQ] Option to exclude username from search

For certain searches (e.g. when looking for comments about the Julia programming language) the search string matches a number of usernames.

It would be great to have options to restrict the search to matching the comment body, as unless I'm searching for a particular person (for which there's already a search modifier) usernames aren't of relevance.

Custom date range

Feature request here, I'd love to be able to filter stories and comments by date on a more granular level.

Internal Server Error reported when deleted stories are queried

Recently (less than 2 days ago) started to notice a lot of Internal Server Error messages like this:

   curl https://hn.algolia.com/api/v1/items/8189743
  {"status":"500","error":"Internal Server Error"}

Item 8189743 is marked as [deleted] on the official site (as of 2014-08-17 17:07 UTC) but it still receives comments which are returned by /api/v1/search_by_date. In the past a query like this used to return a json object.

Sort comments thread by points

HN: https://news.ycombinator.com/item?id=7207506
API: http://hn.algolia.io/api/v1/items/7207506

Just wondering if it's possible to pass in another parameter to sort comments by points (highest to lowest)? I'm not sure it's really the points; my intention is just to sort them the same way as the comments on HN. Currently the API seems to sort them by date by default.
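
Until such a parameter exists, a client could reorder the thread itself. The sketch below assumes each node of the parsed response is a Hash with "points" and "children" keys, which is an assumption about the response shape; it orders by points only and does not try to reproduce HN's own ranking.

```ruby
# Recursively sort a comment tree by points, highest first.
# ASSUMPTION: nodes are Hashes with "points" and "children" keys;
# a missing or nil "points" value is treated as 0.
def sort_by_points(node)
  children = (node['children'] || []).map { |child| sort_by_points(child) }
  sorted = children.sort_by { |child| -(child['points'] || 0) }
  node.merge('children' => sorted)
end

thread = {
  'id' => 1,
  'children' => [
    { 'id' => 2, 'points' => 1, 'children' => [] },
    { 'id' => 3, 'points' => 5, 'children' => [] }
  ]
}
puts sort_by_points(thread)['children'].map { |c| c['id'] }.inspect
```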

Possible to get an item's hacker news parent id and story id added to the API?

Is there any way to get the Hacker News parent_id and story_id added to the API?

Userfeed returns "Unknown."

Following these docs, I am expecting a feed of pg's undeleted comments: http://hn.algolia.com/userfeed/?username=pg

However, I get a response of "Unknown." from the server.

It's the same error trying to visit just userfeed: http://hn.algolia.com/userfeed

https://github.com/algolia/hn-search/blob/master/app/controllers/home_controller.rb#L38-#L43

Is this functionality no longer supported?

Fuzzy match too fuzzy

It used to be that I could search for things like [spacex] and [castar], now I am getting tons of non-SpaceX and non-CastAR results because it's considering "space" to be a match, etc etc.

When I use quotes I get what I wanted, but it's dropping the quotes in some circumstances, like:

add quotes to query
note that the URL now has quotes
reload the page
quotes are gone

Android's virtual keyboard being closed at each keystroke with Dolphin browser

Need to investigate that

Comments sorting doesn't match

HN: https://news.ycombinator.com/item?id=7253711
API: http://hn.algolia.io/api/v1/items/7253711

The sorting of the comments thread doesn't match. All comments have 1 point but I guess HN seems to put more weight on comments with more replies?

Data is not complete in places?

Hey guys,

I have noticed a few issues with the data as compared to the site:

There are multi-month gaps in the data, for example Nov 2007 - Dec 2007 (date>1193875200&date<1196467200) and Aug 2009 - Dec 2009 (date>1249084800&date<1259625600). I spot-checked a few IDs that are supposed to fall within those ranges and found them on Hacker News but not on the search engine. Any ideas why?
Past items don't seem to be updated (much?) anymore. For example, a fairly recent item, https://news.ycombinator.com/item?id=7787384, has 1469 points on the site but only 1377 points on the search site. It would be nice to include a "cached" field in the API to signify exactly when an item was pulled, if older items are no longer updated :)

Thanks a lot for any insights into these issues :)

Jason

favicon in opensearch uses http

The favicon is not showing up in Firefox when I add hn.algolia as a search provider. I think it is due to mixed http/https: everything else uses //link or https://link, but the favicon uses http://link.

Add parameter to filter result set by character length

Hi, I am interested in getting sets of the longer Hacker News posts using the API.

Is it possible to have added to the search API a filter something like LengthFilters=CharLength>8000,WordLength>1000 similar to NumericFilters here? https://www.algolia.com/doc/rest_api#Indexes

If that's not the appropriate syntax/format/location for such a filter, I apologize. Hopefully my intention is clear.
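
In the meantime, the same effect can be approximated on the client after fetching results. The sketch below filters already-parsed hits by character length; it reads comment_text and story_text, the two text attributes listed in the indexing configuration, and the helper name long_hits is hypothetical.

```ruby
# Keep only hits whose text is at least `min_chars` characters long.
# Reads `comment_text` first, then `story_text`; hits with neither
# are treated as empty.
def long_hits(hits, min_chars)
  hits.select do |hit|
    text = hit['comment_text'] || hit['story_text'] || ''
    text.length >= min_chars
  end
end

hits = [
  { 'objectID' => 'a', 'comment_text' => 'short' },
  { 'objectID' => 'b', 'story_text' => 'x' * 9000 }
]
puts long_hits(hits, 8000).map { |h| h['objectID'] }.inspect
```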

Breaking change in Access-Control-Allow-Headers?

I wrote a prototype in November, which doesn't seem to work anymore.

XMLHttpRequest cannot load http://hn.algolia.io/1/404?tags=story&restrictSearchableAttributes=url&query=%22https://github.com/blog/1986-announcing-git-large-file-storage-lfs%22. Request header field Accept-Encoding is not allowed by Access-Control-Allow-Headers.

I'm making this request from the client with JavaScript. I suppose requesting this from the server would let me omit the Accept-Encoding header field. Is that the intended use case?

Search behavior is inconsistent depending on how search is executed

I discovered this when realizing I missed a previous front-page story about "Input Fonts" because I searched for "Input Font". (See: https://news.ycombinator.com/item?id=8173181)

If a search is executed from the text box on HN or by typing the search and pressing enter, exact matching is used. If the search is executed by typing in the search and NOT pressing enter (letting the live results come up), then prefix matching is used.

Example:

If I type "test search" in the box, I get this URL: https://hn.algolia.com/#!/story/forever/prefix/0/test%20search

If I then press enter, I get this URL: https://hn.algolia.com/#!/story/forever/0/test%20search

The latter will not find "tests" or "searched" or other close matches.

Add HN's home page RSS feed

Based on https://news.ycombinator.com/rss with username, points, number of comments, and comments link.

Tag Show/Ask HN stories

Replies to deleted comments are not included in the response

HN post: https://news.ycombinator.com/item?id=7219840
API: http://hn.algolia.com/api/v1/items/7219840

Basically two things here:

Deleted comment is not in the API response, which I think is okay.
Reply/replies to a deleted comment is not in the API response. In this case, andrenotgiant's reply is missing 😕

HN's front page RSS

Hey Kimonolabs guys (any github account ?), do you mind working with us to provide such features?

Love what you announced: http://kimonify.kimonolabs.com/kimload?url=http%3A%2F%2Fwww.kimonolabs.com%2Fwelcome.html (http://vimeo.com/82849382)

Issue searching for comments with points filter

Hi,

It appears the combination of tags=comment and numericFilters=points>X is no longer returning new results.

The latest results are from early October 2014: https://hn.algolia.com/api/v1/search_by_date?tags=comment&numericFilters=points%3E1

It seems to just be this combination. Searching for comments and filtering on created_at_i works fine. Searching for stories and filtering on points, num_comments, or created_at_i works fine.

It looks like a switch was made to the official HN API around early-to-mid October 2014 (#29). My best guess is this issue is related to that switch.

Thanks!

Comments count is only computed on stories, not polls

A Poll JSON Object does not have the value for num_comments, even though Hacker News itself shows it.
For example, https://news.ycombinator.com/item?id=7234822 has 9 comments but its JSON Object has a value of "null" for num_comments.
http://hn.algolia.com/api/v1/search_by_date?hitsPerPage=1&tags=poll&query=encrypt
https://hn.algolia.io/api/v1/items/7234822

Missing stories in Algolia

It appears that Algolia is not returning some of the stories that are valid and accessible on HN's main website.

Example:
https://news.ycombinator.com/item?id=90945

I converted the story's post time to a Unix epoch timestamp and then ran the following API query to get posts by the author created before that time (I added a couple of days to the timestamp just to be safe around UTC).

https://hn.algolia.com/api/v1/search_by_date?tags=story,author_felipe&hitsPerPage=10&numericFilters=created_at_i%3C1199080820

The result I get does have older stories but not the above story which indicates that Algolia itself doesn't have the story. The story is accessible on HN without any issues.
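
The check described above can be reproduced with a small helper that builds the same kind of query, adding a margin of a few days to the post time. It only constructs the URL. The helper name stories_before_uri, the default margin, and the sample date are assumptions made for this sketch.

```ruby
require 'uri'

SECONDS_PER_DAY = 86_400

# Build a search_by_date URL for an author's stories created before
# `posted_at` plus a safety margin (in days).
def stories_before_uri(author, posted_at, margin_days: 2)
  cutoff = posted_at.to_i + margin_days * SECONDS_PER_DAY
  query = URI.encode_www_form(
    tags: "story,author_#{author}",
    hitsPerPage: 10,
    numericFilters: "created_at_i<#{cutoff}"
  )
  URI::HTTPS.build(host: 'hn.algolia.com', path: '/api/v1/search_by_date', query: query)
end

# Hypothetical post time, used only to show the call.
puts stories_before_uri('felipe', Time.utc(2007, 12, 29))
```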

Home page typo

In the README, it has an extra , :

attribute :created_at, :title, :url, :author, :points, :story_text, :comment_text, :author, :num_comments, :story_id, :story_title, :

Optimization of loading time

Currently we perform the prefetch JavaScript query in all situations.

This prefetch query is useful when the page has no query parameter, because it resolves the DNS and initializes the keep-alive HTTP session. But when the URL carries a query parameter, we should not perform this prefetch query (using false as the 4th parameter of AlgoliaSearch).

Btw, we should also upgrade to version 2.5.4 of the JavaScript client to remove the need for OPTIONS CORS requests.

Is there JSONP or CORS support?

I have tried adding &callback=foo to queries but still get the same JSON output.

Improve spam deletion

Since there is no push notification, it turns out that we're missing a few item deletions while polling the HN API. Could we improve the refreshing rate/cycle?

Malformed <p> tags

Hey, sorry to open two issues in one day – I've been really digging into the output of the API and doing some comment bugfixes on my end. Hacker News has had some screwy paragraph tags in comment text for a long time, and the bug is present in your API. For example, a comment that should look like this:

<p>This is my first paragraph.</p>
<p>This is my second paragraph.</p>
<p>This is my third paragraph.</p>

Instead, looks like this:

This is my first paragraph.
<p>This is my second paragraph.
<p>This is my third paragraph.

A fix for this would be great, but I wouldn't mind this being marked as a "wontfix". I just wanted to at least make an issue about it so that others can find it in the future if they run into the same problem. I wrote solutions in Ruby and Javascript for anyone else trying to solve this problem quickly:

# ruby
fixed_text = '<p>' + text.gsub("<p>", "</p><p>") + "</p>"

// javascript
fixedText = '<p>' + text.split("<p>").join("</p><p>") + "</p>";

More query operators

Add new operators in the query syntax:

date
points

"search engine" -algolia points>42 date>1395440948

All recent comments have 1 point

It looks like older comments have the correct number of points, but comments younger than about 7 days have 1 point. For example, check out what I pulled on the thread about Project Naptha (which gets its info from this API call).

Is this a known limitation of the API, or a bug? If it's a limitation of the API, it would be ideal to be able to pull the comments in the order that they appear on Hacker News, but I'm not sure how your export works.

I noticed that this was mentioned in #14, but it looks like 3260ff0 might not have solved this completely. Thanks!