HackerNewsBooks Blog - Margin Notes

Migrating Elasticsearch 1.x → 2.x

We upgraded Hacker News Books from Elasticsearch 1.x to 2.x to land on a supported baseline, clean up legacy query/mapping quirks, and pave the way for future upgrades. Below is the exact playbook we followed and the code-level changes that mattered most.

What changed in our codebase

Centralized the ES client and trimmed ad-hoc calls

We removed scattered Elasticsearch usage in small utilities and funneled requests through one module with a single ES_HOST. That reduced surface area while we tested against a 2.x node and made DSL rewrites safer.
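A minimal sketch of what such a module could look like, assuming the elasticsearch Python client. The names get_client and search, the environment-variable lookup, and the default URL are illustrative assumptions, not our exact code.

```python
import os
from functools import lru_cache

# Single source of truth for the cluster address.
# The environment variable and the default URL are assumptions for this sketch.
ES_HOST = os.environ.get("ES_HOST", "http://localhost:9200")


@lru_cache(maxsize=1)
def get_client():
    """Build the shared client once; every caller reuses the same instance."""
    # Imported lazily so this module can be loaded without touching the network.
    from elasticsearch import Elasticsearch
    return Elasticsearch(ES_HOST)


def search(index, body, **params):
    """Single entry point for searches, so DSL rewrites happen in one place."""
    return get_client().search(index=index, body=body, **params)
```

Pointing the whole app at a 2.x test node then comes down to changing one value, and any query rewrite only has to be verified behind one function.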

Paused risky helpers during the flip

Where we previously used helpers.scan / helpers.bulk inline, we deferred those call sites until after query rewrites were in place, then reinstated them with scroll settings that match 2.x semantics. The helpers themselves are fine; what matters is the query and scroll details passed to them.

Mappings stayed “2.x-correct”

For new indices on 2.x, we kept "type": "string" with "index": "analyzed" or "not_analyzed", which is valid on both 1.x and 2.x, instead of jumping early to the text/keyword split introduced in 5.x. We’ll make that split when we move to 5.x+.
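As an illustration, a 2.x-compatible index body could be built like this. The field names title and type come from the queries in this post; the mapping type name "book" and the helper string_field are placeholders chosen for the sketch.

```python
def string_field(analyzed=True):
    """Return a 1.x/2.x-style string field definition."""
    return {
        "type": "string",
        "index": "analyzed" if analyzed else "not_analyzed",
    }


# Index body for a 2.x index; "book" as the mapping type is a placeholder.
HNBOOKS_INDEX_BODY = {
    "mappings": {
        "book": {
            "properties": {
                "title": string_field(analyzed=True),   # full-text matching
                "type": string_field(analyzed=False),   # exact-value term filter
            }
        }
    }
}
```

With the Python client this body would be passed as es.indices.create(index="hnbooks", body=HNBOOKS_INDEX_BODY). Keeping the analyzed / not_analyzed choice in one helper should also make the later move to text/keyword a one-function change.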

Avoided removed/moved features


Query DSL rewrites (1.x → 2.x)

filtered → bool with filter

Before (1.x style)

{
  "query": {
    "filtered": {
      "query":  { "match": { "title": "python" } },
      "filter": { "term":  { "type": "book" } }
    }
  }
}

After (2.x style)

{
  "query": {
    "bool": {
      "must":   { "match": { "title": "python" } },
      "filter": { "term":  { "type": "book" } }
    }
  }
}
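The same rewrite can be expressed as a small function, which is handy while old and new query bodies coexist. This is a sketch under the assumption that filtered sits directly under the top-level query key; rewrite_filtered is a name invented for this example.

```python
import copy


def rewrite_filtered(body):
    """Return a copy of a search body with a top-level filtered query
    rewritten as bool + filter. Other bodies are returned unchanged."""
    body = copy.deepcopy(body)
    filtered = body.get("query", {}).get("filtered")
    if filtered is None:
        return body
    bool_query = {
        # filtered falls back to match_all when no query is given
        "must": filtered.get("query", {"match_all": {}}),
    }
    if "filter" in filtered:
        bool_query["filter"] = filtered["filter"]
    body["query"] = {"bool": bool_query}
    return body
```

Because the function works on a copy, the original 1.x body stays available for side-by-side comparison of results on both clusters.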

and / or filters → bool

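Replace the legacy and / or filters with a bool query: and becomes bool must, and or becomes bool should. Both legacy filters are deprecated in 2.x, so the rewrite keeps behavior consistent and future-proof.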
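A rough sketch of that rewrite as a recursive helper. rewrite_bool_filters is a hypothetical name, the lang field in the usage example is made up, and the key-based detection is a heuristic: a field literally named "and" or "or" that holds a list would be misread.

```python
def rewrite_bool_filters(node):
    """Recursively replace legacy and/or filters with bool must/should."""
    if isinstance(node, list):
        return [rewrite_bool_filters(item) for item in node]
    if not isinstance(node, dict):
        return node
    for legacy, clause in (("and", "must"), ("or", "should")):
        if set(node) == {legacy}:
            inner = node[legacy]
            # Legacy filters accept a bare list or {"filters": [...]}.
            if isinstance(inner, dict) and "filters" in inner:
                inner = inner["filters"]
            if isinstance(inner, list):
                return {"bool": {clause: rewrite_bool_filters(inner)}}
    # Anything else is copied, with nested clauses rewritten on the way down.
    return {key: rewrite_bool_filters(value) for key, value in node.items()}
```

For example, {"and": [A, {"or": [B, C]}]} comes back as {"bool": {"must": [A, {"bool": {"should": [B, C]}}]}}.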


Scrolling at scale (replacing search_type=scan)

search_type=scan was deprecated in 2.1. Use the Scroll API with sort: "_doc" to get the same “fast, no-score” walk:

REST example

GET /hnbooks/_search?scroll=2m
{
  "sort": ["_doc"],
  "size": 1000,
  "query": { "match_all": {} }
}

Python helper example

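from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(ES_HOST)  # ES_HOST comes from the centralized client module

# helpers.scan drives the Scroll API and yields one hit at a time.
for hit in helpers.scan(
    es,
    index="hnbooks",
    query={"query": {"match_all": {}}},
    scroll="2m",  # keep-alive between batches, not for the whole walk
    size=1000     # batch size per scroll request
):
    process(hit["_source"])  # process() is the caller's per-document handler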
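For reference, here is roughly what the helper does under the hood, written by hand against the client's search, scroll, and clear_scroll calls. It is a sketch rather than the helper's actual implementation, and scroll_hits is a name made up for this example.

```python
def scroll_hits(es, index, query=None, scroll="2m", size=1000):
    """Yield every hit from `index` using Scroll sorted by _doc."""
    body = {
        "sort": ["_doc"],  # index order, no scoring
        "size": size,
        "query": query or {"match_all": {}},
    }
    page = es.search(index=index, scroll=scroll, body=body)
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit
            page = es.scroll(scroll_id=scroll_id, scroll=scroll)
            # The id can change between pages; always use the latest one.
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        if scroll_id:
            es.clear_scroll(scroll_id=scroll_id)  # free server-side resources
```

Writing the loop out makes the two knobs visible: scroll only has to cover the time needed to process one batch, and clearing the scroll at the end releases the search context instead of waiting for the keep-alive to expire.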

Mapping & index gotchas we checked


Ops playbook we used

  1. Snapshot first → restore to a test 2.x cluster. Safer rehearsal and easy rollbacks.
  2. Run migration checks on 1.x to flag incompatible mappings/settings.
  3. Apply DSL changes in the app (filtered → bool/filter, remove and/or, replace scan with Scroll + _doc).
  4. Full cluster restart for the major jump (1.x → 2.x requires downtime). Plan maintenance, disable allocation appropriately, verify green health after.
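The allocation and health parts of step 4 can be scripted with the Python client. The sketch below assumes a persistent setting (so it survives the restart) and uses helper names invented for this post; the timeout values are arbitrary.

```python
def set_allocation(es, mode):
    """Toggle shard allocation: "none" before shutdown, "all" after restart."""
    if mode not in ("all", "none"):
        raise ValueError("mode must be 'all' or 'none'")
    # Persistent, because transient settings are lost in a full cluster restart.
    return es.cluster.put_settings(
        body={"persistent": {"cluster.routing.allocation.enable": mode}}
    )


def wait_for_green(es, seconds=60):
    """Return True once the cluster is green, False if the wait timed out."""
    health = es.cluster.health(
        wait_for_status="green",
        timeout="%ds" % seconds,      # server-side wait
        request_timeout=seconds + 5,  # client-side timeout, slightly longer
    )
    return health["status"] == "green" and not health["timed_out"]
```

A typical sequence would be set_allocation(es, "none"), stop and upgrade every node, start the 2.x cluster, set_allocation(es, "all"), then wait_for_green(es) before sending traffic back.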

Before/After code we deployed

Search query (1.x style → 2.x style)

# before (1.x): filtered query
res = es.search(index="hnbooks", body={
  "query": {
    "filtered": {
      "query":  {"match": {"title": q}},
      "filter": {"term":  {"type":  "book"}}
    }
  }
})
# after (2.x): bool query with a non-scoring filter clause
res = es.search(index="hnbooks", body={
  "query": {
    "bool": {
      "must":   {"match": {"title": q}},
      "filter": {"term":  {"type":  "book"}}
    }
  }
})

Scroll export (replacement for scan)

from elasticsearch import helpers

for hit in helpers.scan(
    es,
    index="hnbooks",
    query={"query": {"match_all": {}}},
    scroll="2m",
    size=1000
):
    export(hit["_source"])

Results, not just process

#baseline #elasticsearch #engineering #migration