Migrating Elasticsearch 1.x → 2.x
We upgraded Hacker News Books from Elasticsearch 1.x to 2.x to land on a supported baseline, clean up legacy query/mapping quirks, and pave the way for future upgrades. Below is the exact playbook we followed and the code-level changes that mattered most.
What changed in our codebase
Centralized the ES client and trimmed ad-hoc calls
We removed scattered Elasticsearch usage in small utilities and funneled requests through one module with a single ES_HOST. That reduced surface area while we tested against a 2.x node and made DSL rewrites safer.
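A minimal sketch of what such a single-entry module could look like. The environment variable, the default URL, and the get_client / search names are hypothetical and only illustrate the idea of one ES_HOST and one code path for requests.

```python
import os

# Assumption: the host comes from an environment variable with a local default.
ES_HOST = os.environ.get("ES_HOST", "http://localhost:9200")
INDEX = "hnbooks"

_client = None


def get_client():
    """Return one shared Elasticsearch client, created on first use."""
    global _client
    if _client is None:
        # Imported lazily so the module can be loaded (and tested) without the package.
        from elasticsearch import Elasticsearch
        _client = Elasticsearch(ES_HOST)
    return _client


def search(body, client=None):
    """Run a search against the single configured index.

    Passing `client` explicitly makes it easy to point at a 2.x test node
    or to substitute a stub in tests.
    """
    client = client or get_client()
    return client.search(index=INDEX, body=body)
```

With every call site going through search(), a DSL rewrite only has to be verified in one place.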
Paused risky helpers during the flip
Where we previously used helpers.scan / helpers.bulk inline, we deferred those call sites until after query rewrites were in place, then reinstated them with scroll settings that match 2.x semantics. The helpers themselves are fine; it's the query and scroll details that matter.
Mappings stayed “2.x-correct”
For new indices on 2.x, we kept "type": "string" with "index": "analyzed" or "not_analyzed", exactly as on 1.x. The text/keyword split only arrives in 5.x, so we'll make that change when we move to 5.x+.
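As an illustration, a 2.x-style index body might look like the sketch below. The mapping type name "item" is hypothetical, and the two fields are simply the ones used in the queries in this post; the create_index helper is an assumption, not code from our repository.

```python
# Index body for Elasticsearch 2.x: string fields with an explicit "index" setting.
# Assumption: the mapping type is called "item"; rename it to match your data.
HNBOOKS_BODY_2X = {
    "mappings": {
        "item": {
            "properties": {
                # Full-text field, searched with match queries.
                "title": {"type": "string", "index": "analyzed"},
                # Exact-value field, filtered with term queries.
                "type": {"type": "string", "index": "not_analyzed"},
            }
        }
    }
}


def create_index(client, name="hnbooks"):
    """Create the index with the 2.x-compatible string mappings."""
    return client.indices.create(index=name, body=HNBOOKS_BODY_2X)
```

On 5.x+ the analyzed field would become text and the not_analyzed field keyword, which is the change being postponed here.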
Avoided removed/moved features
- Delete-by-Query (moved to a plugin in 2.0) wasn’t in use, so no extra steps.
- Deprecated filter constructs (filtered, and, or) were rewritten to bool queries with a filter clause.
Query DSL rewrites (1.x → 2.x)
filtered → bool with filter
Before (1.x style)
{
  "query": {
    "filtered": {
      "query": { "match": { "title": "python" } },
      "filter": { "term": { "type": "book" } }
    }
  }
}
After (2.x style)
{
  "query": {
    "bool": {
      "must": { "match": { "title": "python" } },
      "filter": { "term": { "type": "book" } }
    }
  }
}
and / or filters → bool
Replace legacy and / or filters with a bool query: and becomes must (or filter, when scoring is irrelevant) and or becomes should. That keeps behavior consistent and carries over to later versions.
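The rewrite is mechanical enough to sketch as a small function. This is a hypothetical helper rather than the code we ran; it assumes the legacy filters use either the list form ({"and": [...]}) or the wrapped form ({"and": {"filters": [...]}}), and it leaves every other clause untouched.

```python
def rewrite_legacy_filter(node):
    """Recursively turn legacy `and` / `or` filters into bool clauses.

    and -> bool.must, or -> bool.should. Any other clause is returned as-is.
    """
    if not isinstance(node, dict) or len(node) != 1:
        return node
    (key, value), = node.items()
    if key not in ("and", "or"):
        return node
    # Both {"and": [...]} and {"and": {"filters": [...]}} are handled.
    clauses = value["filters"] if isinstance(value, dict) else value
    occur = "must" if key == "and" else "should"
    return {"bool": {occur: [rewrite_legacy_filter(c) for c in clauses]}}


legacy = {
    "and": [
        {"term": {"type": "book"}},
        {"or": [{"term": {"lang": "en"}}, {"term": {"lang": "de"}}]},
    ]
}
rewritten = rewrite_legacy_filter(legacy)
```

A bool with only should clauses requires at least one of them to match, which is what preserves the original or semantics.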
Scrolling at scale (replacing search_type=scan)
search_type=scan was deprecated in 2.1. Use the Scroll API with sort: "_doc" to get the same fast, unscored walk over every document:
REST example
GET /hnbooks/_search?scroll=2m
{
  "sort": ["_doc"],
  "size": 1000,
  "query": { "match_all": {} }
}
Python helper example
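from elasticsearch import Elasticsearch, helpers

# ES_HOST is the single host setting from our shared client module.
es = Elasticsearch(ES_HOST)

for hit in helpers.scan(
    es,
    index="hnbooks",
    # Sorting on _doc mirrors the REST example above: no scoring, index order.
    query={"sort": ["_doc"], "query": {"match_all": {}}},
    scroll="2m",  # keep-alive between batches, not for the whole walk
    size=1000,
):
    process(hit["_source"])  # process() stands for your per-document handler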
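For readers who want to see what a scroll walk involves without the helper, here is a hand-rolled sketch. It is not the helper's actual implementation; the scroll_all name and its defaults are assumptions, and it relies only on the client's search, scroll, and clear_scroll calls.

```python
def scroll_all(client, index, query=None, scroll="2m", size=1000):
    """Yield every hit in `index` using the Scroll API with a _doc sort."""
    body = dict(query or {"query": {"match_all": {}}})
    body["sort"] = ["_doc"]  # cheapest order: no scoring, no sorting work
    resp = client.search(index=index, body=body, scroll=scroll, size=size)
    scroll_id = resp.get("_scroll_id")
    try:
        # An empty page of hits means the walk is finished.
        while resp["hits"]["hits"]:
            for hit in resp["hits"]["hits"]:
                yield hit
            resp = client.scroll(scroll_id=scroll_id, scroll=scroll)
            # Always continue with the most recent scroll id.
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side scroll context even if the caller stops early.
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)
```

The scroll value is a keep-alive that is renewed on every request, so it only needs to cover the time spent processing one batch.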
Mapping & index gotchas we checked
- Ancient indices: 2.x rejects indices originally created before 0.90; upgrade or reindex those first.
- Facets: removed in 2.0; use aggregations (we already did).
- _ttl / _timestamp: deprecated as of 2.0; prefer explicit date fields & retention patterns.
- Don’t half-jump mappings: keep string on 2.x; plan the text/keyword split when moving to 5.x+.
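One of these checks is easy to script. The sketch below looks for enabled _ttl / _timestamp in a get-mapping response; it assumes the 1.x/2.x response shape (index, then "mappings", then mapping type), and the function name is hypothetical.

```python
DEPRECATED_META_FIELDS = ("_ttl", "_timestamp")


def find_deprecated_meta_fields(mappings_response):
    """Return sorted (index, doc_type, field) tuples where _ttl/_timestamp is enabled.

    `mappings_response` is the dict returned by a get-mapping call, e.g.
    client.indices.get_mapping().
    """
    found = []
    for index, index_body in mappings_response.items():
        for doc_type, type_body in index_body.get("mappings", {}).items():
            for field in DEPRECATED_META_FIELDS:
                enabled = type_body.get(field, {}).get("enabled")
                # The flag may be stored as a boolean or as the string "true".
                if enabled in (True, "true"):
                    found.append((index, doc_type, field))
    return sorted(found)


sample = {
    "hnbooks": {
        "mappings": {
            "item": {
                "_ttl": {"enabled": True},
                "properties": {"title": {"type": "string"}},
            }
        }
    }
}
flagged = find_deprecated_meta_fields(sample)
```

An empty result means none of the inspected mappings rely on either field.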
Ops playbook we used
- Snapshot first → restore to a test 2.x cluster. Safer rehearsal and easy rollbacks.
- Run migration checks on 1.x to flag incompatible mappings/settings.
- Apply DSL changes in the app (filtered → bool/filter, remove and/or, replace scan with Scroll + _doc).
- Full cluster restart for the major jump (1.x → 2.x requires downtime). Plan a maintenance window, disable shard allocation before shutting nodes down, re-enable it once the upgraded nodes have rejoined, and verify green health afterwards.
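The snapshot and allocation steps around the restart can be sketched with the Python client. This is an outline under assumptions, not our runbook: the repository and snapshot names are supplied by the caller, the repository must already be registered, and the node shutdown and upgrade in between happen outside this code.

```python
ALLOCATION_SETTING = "cluster.routing.allocation.enable"


def pre_restart(client, repository, snapshot):
    """Snapshot the cluster, then stop shard allocation before shutdown."""
    client.snapshot.create(
        repository=repository, snapshot=snapshot, wait_for_completion=True
    )
    # Persistent, so the setting survives the full cluster restart.
    client.cluster.put_settings(body={"persistent": {ALLOCATION_SETTING: "none"}})


def post_restart(client, timeout="60s"):
    """Re-enable allocation and report whether the cluster reached green."""
    client.cluster.put_settings(body={"persistent": {ALLOCATION_SETTING: "all"}})
    health = client.cluster.health(wait_for_status="green", timeout=timeout)
    return health["status"] == "green"
```

Running pre_restart against the rehearsal cluster first gives you a snapshot to roll back to if the 2.x nodes reject anything.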
Before/After code we deployed
Search query (1.x style → 2.x style)
# before (1.x)
res = es.search(index="hnbooks", body={
    "query": {
        "filtered": {
            "query": {"match": {"title": q}},
            "filter": {"term": {"type": "book"}}
        }
    }
})
# after (2.x)
res = es.search(index="hnbooks", body={
    "query": {
        "bool": {
            "must": {"match": {"title": q}},
            "filter": {"term": {"type": "book"}}
        }
    }
})
Scroll export (replacement for scan)
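from elasticsearch import helpers

# es is the shared client; export() stands for the per-document export step.
for hit in helpers.scan(
    es,
    index="hnbooks",
    # Explicit _doc sort: an unscored walk in index order.
    query={"sort": ["_doc"], "query": {"match_all": {}}},
    scroll="2m",
    size=1000,
):
    export(hit["_source"])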
Results, not just process
- Zero feature changes surfaced to readers, but query consistency improved (no deprecated filtered/and/or).
- Ops confidence went up: fewer sharp edges (no core delete-by-query) and a rehearsed restart procedure.
- Upgrade runway: we're positioned for 5.x, where string becomes text/keyword, without re-plumbing everything twice.