Fast code search

2026-06-05 · search, code, tools, development

My first attempt at faster search in large codebases was Lucene, which handles indexing and search nicely. I wanted a small website that showed parts of a codebase and its documentation based on a search term. Later I found a Visual Studio extension from the Chromium engineers called VsChromium. It does real-time search as you type across multi-million-line codebases, and I got used to never waiting for a result.

This post is about where that speed comes from. There are two separate problems hiding under “code search”. One is finding an exact string or pattern, which is what grep does, and the interesting question is how to do it over gigabytes without reading every byte. The other is finding code by what it means, where the query shares no words with the answer, which no amount of regex will ever do. I will build a working piece of each, then benchmark everything against a real checkout of the Node.js source.

That tree is a good stress test. It is about 49,000 files and 17 million lines, but only a fraction of that is Node’s own code. The deps/ directory alone holds V8, ICU, OpenSSL and zlib, so “search the codebase” already means deciding what the codebase even is.

Scanning without an index

The baseline is to read every file and run the pattern against it. ripgrep is the fast version of that idea, and it is worth seeing why it wins before reaching for anything cleverer. Its author, Andrew Gallant, has written the design up in detail, and the short version is that nothing in the hot path is wasteful.

The regex engine is built on finite automata rather than backtracking, so matching is linear in the length of the text with no catastrophic blowup on adversarial patterns. Before that engine even runs, the library pulls the literal strings out of a pattern, including from alternations and optional groups, and uses them as a prefilter. A search for MakeCallback becomes a fast SIMD scan for the bytes MakeCallback, and the full automaton only wakes up where that substring is found. When there are many literals to look for at once, it reaches for a SIMD algorithm called Teddy that compares sixteen bytes of input at a time, an idea it borrows from Intel’s Hyperscan. UTF-8 decoding is folded into the automaton itself, so Unicode matching needs no separate decoding pass. On top of that, ripgrep walks directories in parallel and skips anything in .gitignore by default.

The result is that for a one-off search, an index is often not worth it. ripgrep scans the whole 0.8 GB Node tree for MakeCallback in under a tenth of a second, and git grep does the same over tracked files in about 0.3 seconds, slower because it prefilters less.
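
To make the prefilter idea concrete, here is a minimal Python sketch. It is not how ripgrep or its regex library is implemented: `required_prefix` and `search` are hypothetical names, and the literal extraction is deliberately timid, keeping only a leading run of plain characters and giving up on any alternation.

```python
import re


def required_prefix(pattern: str) -> str:
    """Leading literal that every match must contain, or "" if unsure.

    Deliberately conservative: any alternation makes us give up, and a
    character followed by ?, * or { is dropped because it may be absent.
    """
    if "|" in pattern:
        return ""
    i = 0
    while (i < len(pattern) and pattern[i].isascii()
           and (pattern[i].isalnum() or pattern[i] == "_")):
        i += 1
    if i < len(pattern) and pattern[i] in "?*{":
        i -= 1
    return pattern[:max(i, 0)]


def search(pattern: str, haystack: bytes) -> list[int]:
    """Start offsets of all matches; the regex only runs if the literal is present."""
    literal = required_prefix(pattern).encode()
    if literal and haystack.find(literal) == -1:
        return []  # the cheap substring scan already proved there is no match
    return [m.start() for m in re.finditer(pattern.encode(), haystack)]
```

With this sketch, `search("napi_create_[a-z]+", data)` first does a plain substring scan for `napi_create_` and only hands `data` to the regex engine when that scan finds something.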

The catch is the word “one-off”. A scanner re-reads everything on every query. If you are serving search to a team, or the corpus does not fit in cache, or you want results before the user finishes typing, paying once to build an index starts to pay off.

What a trigram index actually does

The trick that makes indexed regex search possible is due to Russ Cox, who built the engine behind the original Google Code Search and wrote up the technique. The idea is to index every three-byte substring, every trigram, and record which files contain it. Two-byte grams are too common to narrow anything down and four-byte grams are too numerous to store, so three is the sweet spot.

Given that index, a regular expression can be turned into a boolean query over trigrams that every match must satisfy. The literal hello can only appear in files that contain hel and ell and llo. You intersect those three posting lists, and only the survivors get the real regex. The genuinely clever part is that this composes through regex operators. To see it, the smallest thing worth building is the analysis itself.
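
Before any regex analysis, the literal case fits in a few lines. The sketch below is an in-memory illustration in Python, not the indexer described in this post: it works on characters rather than bytes, keeps posting lists as Python sets, and the names `build_index`, `candidates` and `search_literal` are made up for the example.

```python
from collections import defaultdict


def trigrams(s: str) -> set[str]:
    """Every three-character substring of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each trigram to the set of document names that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in docs.items():
        for t in trigrams(text):
            index[t].add(name)
    return index


def candidates(index, docs, literal: str) -> set[str]:
    """Documents containing every trigram of the literal (a superset of the matches)."""
    grams = trigrams(literal)
    if not grams:
        return set(docs)  # shorter than a trigram: the index cannot narrow anything
    return set.intersection(*(index.get(t, set()) for t in grams))


def search_literal(index, docs, literal: str) -> list[str]:
    """Run the real substring check only on the candidates."""
    return sorted(n for n in candidates(index, docs, literal) if literal in docs[n])


docs = {
    "a.txt": "say hello world",
    "b.txt": "help! a yellow llama",  # has hel, ell and llo, but not "hello"
    "c.txt": "goodbye",
}
index = build_index(docs)
```

Here `candidates(index, docs, "hello")` returns both a.txt and b.txt, because b.txt happens to contain all three trigrams in different places, and the final substring check keeps only a.txt. The index is allowed false positives; it is never allowed to drop a real match.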

I wrote a small trigram indexer in Rust to make this concrete. It is a teaching tool, not a ripgrep rival, and the whole thing is dependency-free so you can read it end to end. A trigram query is a tiny boolean algebra:

query.rs › query

/// A boolean query over trigrams. `All` matches every document (we learned
/// nothing), `None` matches no document (the regex is unsatisfiable).
#[derive(Clone, PartialEq, Eq)]
pub enum Query {
    All,
    None,
    Trigram(String),
    And(Vec<Query>),
    Or(Vec<Query>),
}

The analysis walks the regex and, for each subexpression, tracks the exact strings it can match (while that set stays small), how it can start and end, and the trigram query it implies. Concatenation is where the composition happens. When two literals sit next to each other their match queries combine, and the strings that straddle the boundary contribute their own trigrams. When a .* sits between them, it severs that boundary, and each side is forced to contribute its trigrams independently:

query.rs › concat

fn concat(x: Info, y: Info) -> Info {
    // Both sides fully known: the concatenation is just their cross product, and
    // it stays exact (until it grows too big and `simplify` cashes it in).
    if let (Some(xe), Some(ye)) = (&x.exact, &y.exact) {
        let mut z = Info::exact_set(cross(xe, ye));
        z.query = x.query.and(y.query);
        return z.simplify(false);
    }

    let xs = x.exact.clone().unwrap_or_else(|| x.suffix.clone());
    let yp = y.exact.clone().unwrap_or_else(|| y.prefix.clone());

    let mut query = x.query.clone().and(y.query.clone());
    // A known exact side that cannot extend into its neighbour must contribute
    // its own trigrams now: this is what "closes off" Google before `.*`.
    if x.exact.is_some() {
        query = query.and(query_for_set(&xs));
    }
    if y.exact.is_some() {
        query = query.and(query_for_set(&yp));
    }
    // Strings that straddle the boundary must appear too. An empty cross (one
    // side unknown, e.g. just after `.*`) adds no constraint.
    query = query.and(query_for_set(&cross(&xs, &yp)));

    let prefix = if x.exact.is_some() {
        let mut p = clamp_prefix(&cross(&xs, &yp));
        if x.can_empty {
            p = union(&p, &y.prefix);
        }
        p
    } else {
        x.prefix.clone()
    };
    let suffix = if y.exact.is_some() {
        let mut s = clamp_suffix(&cross(&xs, &yp));
        if y.can_empty {
            s = union(&s, &x.suffix);
        }
        s
    } else {
        y.suffix.clone()
    };

    Info {
        can_empty: x.can_empty && y.can_empty,
        exact: None,
        prefix,
        suffix,
        query,
    }
    .simplify(false)
}

That severing is what produces Russ Cox’s canonical example. The pattern Google.*Search cannot share a trigram across the wildcard, so it compiles to the eight trigrams of Google and Search, all required. Running the indexer’s explain command shows the translation directly:

run.sh

#!/bin/sh
# Build the toy indexer and show the trigram query each regex compiles to.
# CARGO_TARGET_DIR is set by the verifier; the deploy image never runs this.
set -e
cd "$(dirname "$0")"
cargo build --release --quiet --manifest-path ../../trigram/Cargo.toml
BIN="${CARGO_TARGET_DIR:-../../trigram/target}/release/trigram"

for re in 'hello' 'Google.*Search' '(Path|PathFragment).*=' 'napi_create_[a-z]+' 'ab'; do
  printf '%-26s  ->  %s\n' "$re" "$("$BIN" explain "$re")"
done

$ trigram explain '<regex>' captured at build, 2026-06-08

hello                       ->  ell AND hel AND llo
Google.*Search              ->  Goo AND Sea AND arc AND ear AND gle AND ogl AND oog AND rch
(Path|PathFragment).*=      ->  (Fra AND Pat AND agm AND ath AND ent AND gme AND hFr AND men AND rag AND thF OR Pat AND ath)
napi_create_[a-z]+          ->  _cr AND api AND ate AND cre AND eat AND i_c AND nap AND pi_ AND rea AND te_
ab                          ->  ALL

The last two lines show the limits. A wide character class like [a-z] carries no single trigram, so napi_create_[a-z]+ keeps only the trigrams of the fixed prefix. And ab is shorter than a trigram, so it compiles to ALL: the index cannot help, and you fall back to scanning. This is the rule that keeps the whole scheme correct. When the analysis is unsure, it returns ALL and lets the regex pass sort it out. It may scan a few files it did not need to, but it will never skip a file that matches.
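
The conservative rule is easy to see in a stripped-down version of the analysis. The Python sketch below is not the Rust analysis above: it only understands literals joined by `.*` and answers ALL for anything else, including the character class that the real tool partly handles. The function name `trigram_query` is invented for the example.

```python
METACHARS = set(r"\.^$*+?{}[]()|")


def trigram_query(pattern: str) -> str:
    """Required trigrams for literals joined by `.*`, or "ALL" when unsure."""
    pieces = pattern.split(".*")
    if any(c in METACHARS for piece in pieces for c in piece):
        return "ALL"  # something this sketch does not understand: do not guess
    grams: set[str] = set()
    for piece in pieces:
        # Each side of a `.*` contributes its own trigrams; nothing is
        # allowed to straddle the wildcard.
        grams |= {piece[i:i + 3] for i in range(len(piece) - 2)}
    return " AND ".join(sorted(grams)) if grams else "ALL"


for pattern in ("hello", "Google.*Search", "ab", "napi_create_[a-z]+"):
    print(f"{pattern:<20} -> {trigram_query(pattern)}")
```

For the first three patterns this agrees with the explain output above. For the last one it prints ALL where the real analysis keeps the prefix trigrams, which costs speed but not correctness.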

Evaluating the query is set algebra over the posting lists, each one binary-searched out of the memory-mapped index rather than parsed up front:

main.rs › evaluate

/// Evaluate a trigram query to the set of candidate doc ids.
fn candidates(&self, q: &query::Query) -> Vec<u32> {
    match q {
        query::Query::All => (0..self.docs.len() as u32).collect(),
        query::Query::None => Vec::new(),
        query::Query::Trigram(t) => self.posting(t),
        query::Query::And(parts) => parts
            .iter()
            .map(|p| self.candidates(p))
            .reduce(intersect)
            .unwrap_or_default(),
        query::Query::Or(parts) => parts
            .iter()
            .map(|p| self.candidates(p))
            .reduce(union)
            .unwrap_or_default(),
    }
}

/// Binary-search the sorted trigram table in the mapped file, then copy out
/// its posting list. Nothing is parsed up front: only the few pages these
/// reads land on are faulted in from disk.
fn posting(&self, t: &str) -> Vec<u32> {
    let key = t.as_bytes();
    if key.len() != 3 {
        return (0..self.docs.len() as u32).collect();
    }
    let data = self.map.as_slice();
    let (mut lo, mut hi) = (0usize, self.n_tri);
    while lo < hi {
        let mid = (lo + hi) / 2;
        let rec = self.tri + mid * 12; // 3 bytes trigram, pad, u32 offset, u32 count
        match data[rec..rec + 3].cmp(key) {
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid,
            std::cmp::Ordering::Equal => {
                let at = self.post + u32_at(data, rec + 4) as usize;
                let n = u32_at(data, rec + 8) as usize;
                return (0..n).map(|i| u32_at(data, at + i * 4)).collect();
            }
        }
    }
    Vec::new()
}
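
The same lookup can be sketched in Python to show the record layout on its own. This is an illustration under assumptions, not the on-disk format of the Rust tool: the 12-byte record follows the comment in the code above, but the little-endian byte order, the separate posting blob and the `build` helper are choices made for the example.

```python
import struct

# 3-byte trigram, 1 pad byte, u32 byte offset into the posting blob, u32 count.
# Little-endian is an assumption for this sketch.
REC = struct.Struct("<3sxII")


def build(postings: dict[bytes, list[int]]) -> tuple[bytes, bytes]:
    """Pack a sorted trigram table and the posting blob it points into."""
    table, blob = bytearray(), bytearray()
    for tri in sorted(postings):
        ids = postings[tri]
        table += REC.pack(tri, len(blob), len(ids))
        blob += struct.pack(f"<{len(ids)}I", *ids)
    return bytes(table), bytes(blob)


def lookup(table: bytes, blob: bytes, key: bytes) -> list[int]:
    """Binary-search the fixed-width table, then read one posting list."""
    lo, hi = 0, len(table) // REC.size
    while lo < hi:
        mid = (lo + hi) // 2
        tri, offset, count = REC.unpack_from(table, mid * REC.size)
        if tri < key:
            lo = mid + 1
        elif tri > key:
            hi = mid
        else:
            return list(struct.unpack_from(f"<{count}I", blob, offset))
    return []  # trigram not in the index: no document contains it


table, blob = build({b"hel": [0, 2], b"ell": [0, 2, 5], b"llo": [2]})
```

Because every record has the same width, the search touches only a handful of records per lookup, and `lookup(table, blob, b"ell")` reads exactly one posting list out of the blob.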

Built against the Node tree, this little indexer covers about 48,000 files in a 234 MB index. The payoff is the pruning. A search for MakeCallback turns into ten trigrams, intersects their posting lists in under half a millisecond, and hands back 119 candidate files out of 48,000. That is the whole point: the regex only ever runs on 0.25% of the tree, and the 87 files it confirms are exactly what ripgrep finds. Russ Cox measured a ~100x speedup on the Linux kernel, where the same trick cut one search from scanning roughly 37,000 files to 25.

The trigram AND does the cheap work, throwing out 99.75% of the tree before the real regex runs, so the expensive matcher only ever opens 119 files.

The production indexers

Two production systems descend from that article. The reference implementation is Russ Cox’s own google/codesearch, the cindex and csearch tools that the regexp article ships with. It uses the trigram index more or less as described, but stores the posting lists with variable-length integer encoding instead of raw offsets, which is why its index is roughly a third the size of my naive one and its selective queries land between 6 and 17 ms in the benchmark below.
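
To see where the space goes, here is a minimal Python sketch of the usual way to shrink a sorted posting list: store the gaps between document ids and write each gap as a variable-length integer. The exact byte layout of codesearch’s index is not reproduced here; `encode` and `decode` are illustrative names and the 7-bits-per-byte scheme is an assumption.

```python
def encode(ids: list[int]) -> bytes:
    """Delta-encode a sorted id list, 7 bits per byte, high bit = "more follows"."""
    out = bytearray()
    prev = 0
    for doc in ids:
        delta = doc - prev
        prev = doc
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode(data: bytes) -> list[int]:
    """Inverse of encode: rebuild the gaps, then the running sum."""
    ids, cur, shift, prev = [], 0, 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += cur
            ids.append(prev)
            cur, shift = 0, 0
    return ids


ids = [3, 4, 10, 300, 70000]
packed = encode(ids)
```

The five ids above take 20 bytes as raw u32 values and 8 bytes in this encoding. Common trigrams have dense posting lists, so most gaps are small and fit in a single byte.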

Zoekt, the engine Sourcegraph runs, builds a positional trigram index that records where in each file every trigram occurs. That costs more space but means a substring query only has to intersect two posting lists, the one for the opening trigram and the one for the closing trigram, with positions confirming the rest. Its design doc shows a query like (Path|PathFragment).*=.*/usr/local translating to exactly the boolean-over-substrings shape my toy produces.
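
A positional index can be sketched in Python as well. This is a toy in the spirit of that description, not Zoekt’s data structure: positions are character offsets held in plain lists, and `build_positional` and `find_substring` are hypothetical names.

```python
from collections import defaultdict


def build_positional(docs: dict[str, str]):
    """Map trigram -> document name -> list of offsets where it occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in docs.items():
        for i in range(len(text) - 2):
            index[text[i:i + 3]][name].append(i)
    return index


def find_substring(index, docs, needle: str) -> dict[str, list[int]]:
    """Match offsets per document, using only the first and last trigram."""
    if len(needle) < 3:
        raise ValueError("needles shorter than a trigram need a plain scan")
    firsts = index.get(needle[:3], {})
    lasts = index.get(needle[-3:], {})
    gap = len(needle) - 3  # distance between the two trigram start offsets
    hits = {}
    for name in firsts.keys() & lasts.keys():
        ends = set(lasts[name])
        starts = [p for p in firsts[name]
                  if p + gap in ends                    # positions line up
                  and docs[name].startswith(needle, p)]  # confirm the middle
        if starts:
            hits[name] = starts
    return hits


docs = {
    "a.cc": "MakeCallback(x); MakeCallback(y)",
    "b.cc": "Make a Callback",
}
index = build_positional(docs)
```

Here b.cc contains both Mak and ack, so a plain trigram index would hand it to the matcher. The position check rejects it without reading the file, because the two trigrams are not the right distance apart.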

Here is everything on the same Node checkout, indexes and scanners side by side.

Indexing the Node.js source tree: 48,948 files, 17.2M lines, 0.81 GB.

| index | build time | size on disk | index size vs source |
|---|---|---|---|
| trigram (ours) | 33.1 s | 234 MB | 29% |
| google/codesearch | 4.7 s | 70 MB | 9% |
| zoekt | 59.9 s | 1489 MB | 184% |

Warm query latency, best of 5. The fastest tool per query is emphasised.

| tool | MakeCallback | uv_async_init | napi_create_function | napi_create_[a-z]+ | function |
|---|---|---|---|---|---|
| ripgrep (no index) | 87.5 ms | 89.9 ms | 84.3 ms | 84.5 ms | **160 ms** |
| git grep (no index) | 291 ms | 289 ms | 286 ms | 1,135 ms | 384 ms |
| trigram (ours) | 88.5 ms | 22.9 ms | 33.8 ms | 39.1 ms | 912 ms |
| google/codesearch | **16.9 ms** | **6.1 ms** | **10.0 ms** | **9.8 ms** | 421 ms |
| zoekt | 65.6 ms | 64.1 ms | 61.7 ms | 69.9 ms | 273 ms |

On an AMD Ryzen 9 3900X 12-Core Processor (24 threads), captured 2026-06-05. Regenerate with npm run bench:codesearch.

What the table does not show is why. google/codesearch answers the selective queries in 6 to 17 ms because it reads from a compact index it memory-maps and never fully touches. My toy maps its index the same way, so a query faults in only the trigram table and the few posting lists it needs, and the time that is left goes into opening the surviving candidates and running the real regex. The function column is the other end: a term in almost every file gives the index nothing to prune, so every tool falls back to scan speed.

On a 0.8 GB tree that fits in cache, none of this is worth much. An index earns its keep when the corpus outgrows memory, when one index serves many queries a second, or when latency has to stay flat no matter how rare the term. Google’s Code Search, which evolved from trigrams through suffix arrays to a sparse n-gram index, indexes around 1.5 TB at sub-50ms median latency. That is the regime indexes are built for.

The other half: searching by meaning

Everything so far finds text you can describe exactly. None of it can answer “where do we validate a TLS certificate” unless those words appear in the code, and they usually do not. The code says X509_verify_cert and SSL_get_verify_result, not “validate certificate”. This is the gap embeddings fill.

The idea is to turn every chunk of code into a vector of numbers such that code doing similar things lands near similar vectors, do the same to the query, and return the nearest chunks. The query and the answer can share no words at all and still end up close. Three pieces matter, and each is swappable.

Every dot is a code chunk, placed by what it does. The question “validate a TLS certificate” lands among the certificate code and nowhere near the buffer or worker chunks, so its nearest neighbours are the answer, even though they share no words with it.
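
The geometry can be shown with a toy before any model is involved. In the Python sketch below the three-dimensional vectors are invented by hand purely for illustration; a real embedding model produces hundreds of dimensions, and `cosine` and `nearest` are names made up for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest(query: list[float], chunks: dict[str, list[float]], k: int = 2):
    """Brute-force top-k: score every chunk, keep the best."""
    scored = sorted(((cosine(query, v), name) for name, v in chunks.items()),
                    reverse=True)
    return scored[:k]


# Hand-made vectors, not model output: one axis each for "certificates",
# "buffers" and "workers".
chunks = {
    "X509_verify_cert wrapper": [0.9, 0.1, 0.0],
    "Buffer to string helper": [0.1, 0.9, 0.1],
    "Worker::Run": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for "validate a TLS certificate"
top = nearest(query, chunks)
```

The certificate chunk wins by a wide margin even though the query text and the chunk name share no words, which is the whole point: the comparison happens between vectors, not strings.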

The first is chunking. Splitting a file every N lines cuts functions in half and glues unrelated code together, which gives the model incoherent things to embed. Splitting on the syntax tree instead keeps each chunk a whole unit. I parse each file with tree-sitter, then walk the tree greedily, emitting coherent spans up to a size budget and recursing into anything too big, an approach the cAST paper shows measurably helps retrieval:

embed.py › chunk

def chunk_file(path: Path, text: str):
    """AST-aware chunking: walk the syntax tree, emit coherent units up to
    MAX_CHARS, recursing into anything too big (a long function) and packing
    small siblings (imports, one-line helpers) together. tree-sitter reports
    byte offsets, so we slice the UTF-8 buffer, not the str."""
    grammar = LANGS.get(path.suffix)
    if grammar is None:
        yield from chunk_lines(path, text)
        return
    data = text.encode("utf-8")
    root = get_parser(grammar).parse(text).root_node()
    line_of = lambda b: data.count(b"\n", 0, b) + 1
    children = lambda n: (n.child(i) for i in range(n.child_count()))
    pending: list[tuple[int, int]] = []  # (start_byte, end_byte) of packed siblings

    def emit():
        if not pending:
            return None
        a, b = pending[0][0], pending[-1][1]
        chunk = Chunk(path, line_of(a), line_of(b), data[a:b].decode("utf-8", "ignore"))
        pending.clear()
        return chunk

    def visit(node, depth=0):
        for child in children(node):
            size = child.end_byte() - child.start_byte()
            if size > MAX_CHARS and child.child_count() and depth < MAX_DEPTH:
                if (c := emit()):
                    yield c
                yield from visit(child, depth + 1)  # a long function: split it further
            else:
                span = pending[-1][1] - pending[0][0] if pending else 0
                if span + size > MAX_CHARS and (c := emit()):
                    yield c
                pending.append((child.start_byte(), child.end_byte()))

    yield from visit(root)
    if (c := emit()):
        yield c
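
The excerpt falls back to `chunk_lines` for files without a grammar, and neither that helper nor `Chunk` is shown. A minimal Python sketch of what such a fallback could look like follows. The field names of `Chunk`, the `max_lines` parameter and its default of 40 are assumptions made for this example, not taken from embed.py.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Chunk:
    # Hypothetical shape, matching the positional call in the excerpt above:
    # Chunk(path, first_line, last_line, text).
    path: Path
    start: int
    end: int
    text: str


def chunk_lines(path: Path, text: str, max_lines: int = 40):
    """Naive fallback: cut the file every max_lines lines, ignoring syntax."""
    lines = text.splitlines(keepends=True)
    for i in range(0, len(lines), max_lines):
        block = lines[i:i + max_lines]
        yield Chunk(path, i + 1, i + len(block), "".join(block))


sample = "".join(f"line {n}\n" for n in range(1, 96))
pieces = list(chunk_lines(Path("notes.txt"), sample))
```

This is exactly the splitting the syntax-aware walk is meant to avoid for code: the cut points land wherever the line count says, so a function that starts on line 38 ends up in two chunks.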

The second is the embedding model. Code-specific models are trained so that a natural-language description and the code it describes map to nearby vectors. I used jina-code-embeddings-0.5b. It is a small 2025 model built on a Qwen2.5-Coder backbone and runs comfortably on a desktop GPU. It is also instruction-tuned, so a query and a code chunk get different prefixes before embedding, which is what lets a question and its answer line up despite reading nothing alike.

The third is the index. A few thousand vectors you can compare one by one, but once a corpus runs to hundreds of thousands the brute-force scan gets wasteful, so the standard answer is an approximate nearest-neighbour structure. HNSW builds a layered graph where each node links to its near neighbours, with a few long-range links in sparse upper layers. A search starts at the top, greedily walks toward the query, and drops down a layer at a time, which reaches the neighbourhood in roughly logarithmic steps instead of scanning everything. FAISS has it built in, and it is the same structure whether the index holds ten thousand vectors or ten million:

embed.py › index

model = SentenceTransformer(MODEL, device="cuda", model_kwargs={"torch_dtype": "float16"})
model.max_seq_length = 512   # a chunk is ~400 tokens; bounds GPU memory
vecs = model.encode(
    [c.text for c in chunks],
    prompt_name=DOC_PROMPT,      # tag every chunk as a candidate code snippet
    batch_size=64,
    normalize_embeddings=True,   # cosine via inner product
    show_progress_bar=True,
    convert_to_numpy=True,
).astype(np.float32)
print(f"  {len(vecs)} vectors in {time.time() - t0:.1f}s")

print("building HNSW index ...")
t0 = time.time()
index = faiss.IndexHNSWFlat(DIM, 32, faiss.METRIC_INNER_PRODUCT)  # M = 32
index.hnsw.efConstruction = 200
index.add(vecs)
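
The greedy walk at the heart of HNSW is small enough to sketch in Python. The version below is a single-layer toy, not FAISS’s implementation: real HNSW stacks several layers and keeps a candidate list sized by efSearch instead of a single current node. The graph, the one long-range link and the name `greedy_search` are invented for the example.

```python
def greedy_search(vectors, neighbours, query, entry):
    """Hop to whichever neighbour is closest to the query until none improves."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(vectors[i], query))

    current, hops = entry, [entry]
    while True:
        best = min(neighbours[current], key=dist, default=current)
        if dist(best) >= dist(current):
            return current, hops  # local minimum: no neighbour is closer
        current = best
        hops.append(current)


# Ten points on a line. Each links to its immediate neighbours, and node 0
# also has one long-range link to node 5, like a sparse upper layer would.
vectors = [(float(i),) for i in range(10)]
neighbours = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
neighbours[0].append(5)

result = greedy_search(vectors, neighbours, query=(7.2,), entry=0)
```

Starting from node 0, the walk takes the long-range link to 5, then steps 6 and 7, and stops there because 8 is further from 7.2 than 7 is. Without the long link it would have needed seven hops instead of three, which is the intuition behind the layered structure.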

Querying is the mirror image. Embed the question with the query prefix, search the graph, map the hits back to files:

embed.py › search

def search(query: str, k: int = 8):
    index = faiss.read_index(str(CACHE / "index.faiss"))
    index.hnsw.efSearch = 64
    meta = [l.rstrip("\n").split("\t") for l in open(CACHE / "chunks.tsv")]
    model = SentenceTransformer(MODEL, device="cuda", model_kwargs={"torch_dtype": "float16"})

    t0 = time.time()
    q = model.encode(
        [query], prompt_name=QUERY_PROMPT, normalize_embeddings=True, convert_to_numpy=True
    ).astype(np.float32)
    scores, ids = index.search(q, k)
    dt = (time.time() - t0) * 1e3

    for score, i in zip(scores[0], ids[0]):
        path, start, end = meta[i]
        print(f"{score:.3f}  {path}:{start}-{end}")
    print(f"\nquery in {dt:.1f}ms (embed + HNSW search)")

I scoped this index to Node’s own source, the src/ and lib/ trees, and left the vendored dependencies in deps/ out. Embedding all of V8 and OpenSSL would be a couple of GPU-hours and mostly noise for a search like this, which is the same “what is the codebase” question from the start of the post, answered differently. Even scoped down it finds code that no trigram could. The results below are real queries against that index, and none of them share their key words with the files they return.

Node first-party source (src + lib): 861 files chunked on syntax into 13,741 pieces, embedded with jinaai/jina-code-embeddings-0.5b (896-d) on an NVIDIA GeForce RTX 5070 Ti in 66 s, into a 53 MB HNSW index.

“where do we validate a TLS certificate” · 515 ms

0.70 lib/tls.js:425–464

  if (net.isIP(hostname)) {
    valid = ips.includes(canonicalizeIP(hostname));
    if (!valid)
      reason = `IP: ${hostname} is not in the cert's list: ` + ips.join(', ');
  } else if (dnsNames.length > 0 || subject?.CN) {
    const hostParts = splitHost(hostname);
    const wildcard = (pattern) => check(hostParts, pattern, true);

    if (dnsNames.length > 0) {
      valid = dnsNames.some(wildcard);
      if (!valid)
        reason =
          `Host: ${hostname}. is not in the cert's altnames: ${altNames}`;
    } else {
      // Match against Common Name only if no supported identifiers exist.
      const cn = subject.CN;

      if (ArrayIsArray(cn))
        valid = cn.some(wildcard);
      else if (cn)
        valid = wildcard(cn);

      if (!valid)
        reason = `Host: ${hostname}. is not cert's CN: ${cn}`;
    }
  } else {
    reason = 'Cert does not contain a DNS name';
  }

  if (!valid) {
    return new ERR_TLS_CERT_ALTNAME_INVALID(reason, hostname, cert);
  }
};

exports.createSecureContext = tlsCommon.createSecureContext;
exports.SecureContext = tlsCommon.SecureContext;
exports.TLSSocket = tlsWrap.TLSSocket;
exports.Server = tlsWrap.Server;
exports.createServer = tlsWrap.createServer;
exports.connect = tlsWrap.connect;

0.69 src/quic/tlscontext.cc:413–439

                                          X509_STORE_CTX* ctx) {
  // This callback is invoked by OpenSSL for each certificate in the
  // client's chain during the TLS handshake. The preverify_ok
  // parameter reflects OpenSSL's own chain validation result for
  // the current certificate. Failures include:
  //   - Expired or not-yet-valid certificates
  //   - Self-signed certificates not in the trusted CA list
  //   - Broken chain (signature verification failure)
  //   - Untrusted CA (chain does not terminate at a configured CA)
  //   - Revoked certificates (if CRL is configured)
  //   - Invalid basic constraints or key usage
  //
  // If preverify_ok is 1, validation passed for this cert and we
  // always continue. If it is 0, the behavior depends on the
  // reject_unauthorized option:
  //   - true (default): return 0 to abort the handshake immediately,
  //     avoiding wasted work on an untrusted client.
  //   - false: return 1 to let the handshake complete. The validation
  //     error is still recorded by OpenSSL and will be reported to JS
  //     via VerifyPeerIdentity() in the handshake callback, allowing
  //     the application to make its own decision.
  //
  // Note that even when preverify_ok is 1 (chain validation passed),
  // the application may need to perform additional verification after
  // the handshake — for example, checking the certificate's common
  // name or subject alternative names against an allowlist, verifying
  // application-specific fields or extensions, or enforcing certificate

0.64 src/quic/tlscontext.cc:913–952

  {
    TransportParams tp(ngtcp2_conn_get_local_transport_params(*session_));
    // Preflight to get the encoded size.
    ssize_t size = tp.EncodedSize();
    if (size > 0) {
      MaybeStackBuffer<uint8_t, 512> buf(size);
      ssize_t written = tp.EncodeInto(buf.out(), size);
      if (written > 0) {
        ngtcp2_vec vec = {buf.out(), static_cast<size_t>(written)};
        if (!ossl_context_.set_transport_params(vec)) {
          validation_error_ = "Failed to set transport parameters";
          ossl_context_.reset();
          return;
        }
      }
    }
  }
}

std::optional<TLSSession::PeerIdentityValidationError>
TLSSession::VerifyPeerIdentity(Environment* env) {
  // We are just temporarily wrapping the ssl, not taking ownership.
  SSLPointerRef ssl(ossl_context_);
  int err = ssl->verifyPeerCertificate().value_or(X509_V_ERR_UNSPECIFIED);
  if (err == X509_V_OK) return std::nullopt;
  Local<Value> reason;
  Local<Value> code;
  if (!crypto::GetValidationErrorReason(env, err).ToLocal(&reason) ||
      !crypto::GetValidationErrorCode(env, err).ToLocal(&code)) {
    // Getting the validation error details failed. We'll return a value but
    // the fields will be empty.
    return PeerIdentityValidationError{};
  }
  return PeerIdentityValidationError{reason, code};
}

MaybeLocal<Object> TLSSession::cert(Environment* env) const {
  SSLPointerRef ssl(ossl_context_);
  return crypto::X509Certificate::GetCert(env, ssl);
}

0.63 lib/internal/tls/secure-context.js:129–179

function configSecureContext(context, options = kEmptyObject, name = 'options') {
  validateObject(options, name);

  const {
    allowPartialTrustChain,
    ca,
    cert,
    ciphers = getDefaultCiphers(),
    clientCertEngine,
    crl,
    dhparam,
    ecdhCurve = getDefaultEcdhCurve(),
    key,
    passphrase,
    pfx,
    privateKeyIdentifier,
    privateKeyEngine,
    sessionIdContext,
    sessionTimeout,
    sigalgs,
    ticketKeys,
  } = options;

  // Set the cipher list and cipher suite before anything else because
  // @SECLEVEL=<n> changes the security level and that affects subsequent
  // operations.
  if (ciphers !== undefined && ciphers !== null)
    validateString(ciphers, `${name}.ciphers`);

  // Work around an OpenSSL API quirk. cipherList is for TLSv1.2 and below,
  // cipherSuites is for TLSv1.3 (and presumably any later versions). TLSv1.3
  // cipher suites all have a standard name format beginning with TLS_, so split
  // the ciphers and pass them to the appropriate API.
  const {
    cipherList,
    cipherSuites,
  } = processCiphers(ciphers, `${name}.ciphers`);

  if (cipherSuites !== '')
    context.setCipherSuites(cipherSuites);
  context.setCiphers(cipherList);

  if (cipherList === '' &&
      context.getMinProto() < TLS1_3_VERSION &&
      context.getMaxProto() > TLS1_2_VERSION) {
    context.setMinProto(TLS1_3_VERSION);
  }

  // Add CA before the cert to be able to load cert's issuer in C++ code.
  // NOTE(@jasnell): ca, cert, and key are permitted to be falsy, so do not
  // change the checks to !== undefined checks.

0.63 src/quic/tlscontext.cc:579–609

  {
    ClearErrorOnReturn clear_error_on_return;
    for (const auto& cert : options_.certs) {
      uv_buf_t buf = cert;
      if (buf.len > 0) {
        BIOPointer bio = crypto::NodeBIO::NewFixed(buf.base, buf.len);
        CHECK(bio);
        cert_.reset();
        issuer_.reset();
        if (crypto::SSL_CTX_use_certificate_chain(
                ctx.get(), std::move(bio), &cert_, &issuer_) == 0) {
          validation_error_ = "Invalid certificate";
          return SSLCtxPointer();
        }
      }
    }
  }

  {
    ClearErrorOnReturn clear_error_on_return;
    for (const auto& key : options_.keys) {
      if (key.GetKeyType() != crypto::KeyType::kKeyTypePrivate) {
        validation_error_ = "Invalid key";
        return SSLCtxPointer();
      }
      if (!SSL_CTX_use_PrivateKey(ctx.get(), key.GetAsymmetricKey().get())) {
        validation_error_ = "Invalid key";
        return SSLCtxPointer();
      }
    }
  }

“convert a buffer to a string with an encoding” · 514 ms

0.58 lib/string_decoder.js:42–87

const {
  encodingsMap,
  normalizeEncoding,
} = require('internal/util');
const {
  ERR_INVALID_ARG_TYPE,
  ERR_INVALID_THIS,
  ERR_UNKNOWN_ENCODING,
} = require('internal/errors').codes;

const kNativeDecoder = Symbol('kNativeDecoder');

/**
 * StringDecoder provides an interface for efficiently splitting a series of
 * buffers into a series of JS strings without breaking apart multibyte
 * characters.
 * @param {string} [encoding]
 */
function StringDecoder(encoding) {
  this.encoding = normalizeEncoding(encoding);
  if (this.encoding === undefined) {
    throw new ERR_UNKNOWN_ENCODING(encoding);
  }
  this[kNativeDecoder] = Buffer.alloc(kSize);
  this[kNativeDecoder][kEncodingField] = encodingsMap[this.encoding];
}

/**
 * Returns a decoded string, omitting any incomplete multi-bytes
 * characters at the end of the Buffer, or TypedArray, or DataView
 * @param {string | Buffer | TypedArray | DataView} buf
 * @returns {string}
 * @throws {TypeError} Throws when buf is not in one of supported types
 */
StringDecoder.prototype.write = function write(buf) {
  if (typeof buf === 'string')
    return buf;
  if (!ArrayBufferIsView(buf))
    throw new ERR_INVALID_ARG_TYPE('buf',
                                   ['Buffer', 'TypedArray', 'DataView'],
                                   buf);
  if (!this[kNativeDecoder]) {
    throw new ERR_INVALID_THIS('StringDecoder');
  }
  return decode(this[kNativeDecoder], buf);
};

0.58 src/string_bytes.h:70–97

  // Write the bytes from the string or buffer into the char*
  // returns the number of bytes written, which will always be
  // <= buflen.  Use StorageSize/Size first to know how much
  // memory to allocate.
  static size_t Write(v8::Isolate* isolate,
                      char* buf,
                      size_t buflen,
                      v8::Local<v8::Value> val,
                      enum encoding enc);

  // Take the bytes in the src, and turn it into a Buffer or String.
  static v8::MaybeLocal<v8::Value> Encode(v8::Isolate* isolate,
                                          const char* buf,
                                          size_t buflen,
                                          enum encoding encoding);

  // Like Encode(..., UTF8) but does not re-validate. Input must be valid UTF-8.
  static v8::MaybeLocal<v8::Value> EncodeValidUtf8(v8::Isolate* isolate,
                                                   const char* buf,
                                                   size_t buflen);

  // Warning: This reverses endianness on BE platforms, even though the
  // signature using uint16_t implies that it should not.
  // However, the brokenness is already public API and can't therefore
  // be changed easily.
  static v8::MaybeLocal<v8::Value> Encode(v8::Isolate* isolate,
                                          const uint16_t* buf,
                                          size_t buflen);

0.56 src/string_bytes.h:99–114

  static v8::MaybeLocal<v8::Value> Encode(v8::Isolate* isolate,
                                          const char* buf,
                                          enum encoding encoding);

 private:
  static size_t WriteUCS2(v8::Isolate* isolate,
                          char* buf,
                          size_t buflen,
                          v8::Local<v8::String> str);
};

}  // namespace node

#endif  // defined(NODE_WANT_INTERNALS) && NODE_WANT_INTERNALS

#endif  // SRC_STRING_BYTES_H_

0.56 src/string_bytes.cc:773–801

MaybeLocal<Value> StringBytes::Encode(Isolate* isolate,
                                      const uint16_t* buf,
                                      size_t buflen) {
  if (buflen == 0) return String::Empty(isolate);
  CHECK_BUFLEN_IN_RANGE(buflen);

  // Node's "ucs2" encoding expects LE character data inside a
  // Buffer, so we need to reorder on BE platforms.  See
  // https://nodejs.org/api/buffer.html regarding Node's "ucs2"
  // encoding specification
  if constexpr (IsBigEndian()) {
    return EncodeTwoByteString(isolate, buflen, [buf, buflen](uint16_t* dst) {
      size_t nbytes = buflen * sizeof(uint16_t);
      memcpy(dst, buf, nbytes);
      CHECK(nbytes::SwapBytes16(reinterpret_cast<char*>(dst), nbytes));
    });
  } else {
    return ExternTwoByteString::NewFromCopy(isolate, buf, buflen);
  }
}

MaybeLocal<Value> StringBytes::Encode(Isolate* isolate,
                                      const char* buf,
                                      enum encoding encoding) {
  const size_t len = strlen(buf);
  return Encode(isolate, buf, len, encoding);
}

}  // namespace node

0.56 src/crypto/crypto_util.cc:424–462

Local<ArrayBuffer> ByteSource::ToArrayBuffer(Environment* env) {
  std::unique_ptr<BackingStore> store = ReleaseToBackingStore(env);
  return ArrayBuffer::New(env->isolate(), std::move(store));
}

MaybeLocal<Uint8Array> ByteSource::ToBuffer(Environment* env) {
  Local<ArrayBuffer> ab = ToArrayBuffer(env);
  return Buffer::New(env, ab, 0, ab->ByteLength());
}

ByteSource ByteSource::FromBIO(const BIOPointer& bio) {
  CHECK(bio);
  BUF_MEM* bptr = bio;
  auto out = DataPointer::Alloc(bptr->length);
  memcpy(out.get(), bptr->data, bptr->length);
  return ByteSource::Allocated(out.release());
}

ByteSource ByteSource::FromEncodedString(Environment* env,
                                         Local<String> key,
                                         enum encoding enc) {
  size_t length = 0;
  ByteSource out;

  if (StringBytes::Size(env->isolate(), key, enc).To(&length) && length > 0) {
    auto buf = DataPointer::Alloc(length);
    size_t actual = StringBytes::Write(
        env->isolate(), static_cast<char*>(buf.get()), length, key, enc);
    out = ByteSource::Allocated(buf.resize(actual).release());
  }

  return out;
}

ByteSource ByteSource::FromStringOrBuffer(Environment* env,
                                          Local<Value> value) {
  return IsAnyBufferSource(value) ? FromBuffer(value)
                                  : FromString(env, value.As<String>());
}

“where are worker threads created and started” 523 ms

0.74 src/node_messaging.cc:54
```
namespace worker {
```
0.74 src/node_worker.cc:55
```
namespace worker {
```
0.74 src/node_worker.h:17
```
namespace worker {
```
0.73 src/node_worker.cc:296
```
void Worker::Run() {
```

0.69 lib/internal/cluster/primary.js:161–213

cluster.fork = function(env) {
  cluster.setupPrimary();
  const id = ++ids;
  const workerProcess = createWorkerProcess(id, env);
  const worker = new Worker({
    id: id,
    process: workerProcess,
  });

  worker.on('message', function(message, handle) {
    cluster.emit('message', this, message, handle);
  });

  worker.process.once('exit', (exitCode, signalCode) => {
    /*
     * Remove the worker from the workers list only
     * if it has disconnected, otherwise we might
     * still want to access it.
     */
    if (!worker.isConnected()) {
      removeHandlesForWorker(worker);
      removeWorker(worker);
    }

    worker.exitedAfterDisconnect = !!worker.exitedAfterDisconnect;
    worker.state = 'dead';
    worker.emit('exit', exitCode, signalCode);
    cluster.emit('exit', worker, exitCode, signalCode);
  });

  worker.process.once('disconnect', () => {
    /*
     * Now is a good time to remove the handles
     * associated with this worker because it is
     * not connected to the primary anymore.
     */
    removeHandlesForWorker(worker);

    /*
     * Remove the worker from the workers list only
     * if its process has exited. Otherwise, we might
     * still want to access it.
     */
    if (worker.isDead())
      removeWorker(worker);

    worker.exitedAfterDisconnect = !!worker.exitedAfterDisconnect;
    worker.state = 'disconnected';
    worker.emit('disconnect');
    cluster.emit('disconnect', worker);
  });

  worker.process.on('internalMessage', internal(worker, onmessage));

“parse command line options and flags” 524 ms

0.61 lib/internal/test_runner/utils.js:267
```
function parseCommandLine() {
```

0.59 src/node.cc:980–1000

  if (!(flags & ProcessInitializationFlags::kDisableCLIOptions)) {
    // Parse the options coming from the config file.
    // This is done before parsing the command line options
    // as the cli flags are expected to override the config file ones.
    std::vector<std::string> extra_argv =
        per_process::config_reader.GetNamespaceFlags();
    // [0] is expected to be the program name, fill it in from the real argv.
    extra_argv.insert(extra_argv.begin(), argv->at(0));
    // Parse the extra argv coming from the config file
    ExitCode exit_code = ProcessGlobalArgsInternal(
        &extra_argv, nullptr, errors, kDisallowedInEnvvar);
    if (exit_code != ExitCode::kNoFailure) return exit_code;
    // Parse options coming from the command line.
    exit_code =
        ProcessGlobalArgsInternal(argv, exec_argv, errors, kDisallowedInEnvvar);
    if (exit_code != ExitCode::kNoFailure) return exit_code;
  }

  // Set the process.title immediately after processing argv if --title is set.
  if (!per_process::cli_options->title.empty())
    uv_set_process_title(per_process::cli_options->title.c_str());

0.59 lib/internal/util/parse_args/parse_args.js:364–408

  const tokens = argsToTokens(args, options);

  // Phase 2: process tokens into parsed option values and positionals
  const result = {
    values: { __proto__: null },
    positionals: [],
  };
  if (returnTokens) {
    result.tokens = tokens;
  }
  ArrayPrototypeForEach(tokens, (token) => {
    if (token.kind === 'option') {
      if (strict) {
        checkOptionUsage(parseConfig, token);
        checkOptionLikeValue(token);
      }
      storeOption(token, options, result.values, parseConfig.allowNegative);
    } else if (token.kind === 'positional') {
      if (!allowPositionals) {
        throw new ERR_PARSE_ARGS_UNEXPECTED_POSITIONAL(token.value);
      }
      ArrayPrototypePush(result.positionals, token.value);
    }
  });

  // Phase 3: fill in default values for missing args
  ArrayPrototypeForEach(ObjectEntries(options), ({ 0: longOption,
                                                   1: optionConfig }) => {
    const mustSetDefault = useDefaultValueOption(longOption,
                                                 optionConfig,
                                                 result.values);
    if (mustSetDefault) {
      storeDefaultOption(longOption,
                         objectGetOwn(optionConfig, 'default'),
                         result.values);
    }
  });


  return result;
};

module.exports = {
  parseArgs,
};

0.58 src/node_options.cc:2040–2056

void GetOptionsAsFlags(const FunctionCallbackInfo<Value>& args) {
  Isolate* isolate = args.GetIsolate();
  Local<Context> context = isolate->GetCurrentContext();
  Environment* env = Environment::GetCurrent(context);

  if (!env->has_run_bootstrapping_code()) {
    // No code because this is an assertion.
    THROW_ERR_OPTIONS_BEFORE_BOOTSTRAPPING(
        isolate, "Should not query options before bootstrapping is done");
  }
  env->set_has_serialized_options(true);

  Mutex::ScopedLock lock(per_process::cli_options_mutex);
  IterateCLIOptionsScope s(env);

  std::vector<std::string> flags;
  PerProcessOptions* opts = per_process::cli_options.get();

0.57 lib/internal/util/parse_args/parse_args.js:160–185

/**
 * Store the default option value in `values`.
 * @param {string} longOption - long option name e.g. 'foo'
 * @param {string
 *         | boolean
 *         | string[]
 *         | boolean[]} optionValue - default value from option config
 * @param {object} values - option values returned in `values` by parseArgs
 */
function storeDefaultOption(longOption, optionValue, values) {
  if (longOption === '__proto__') {
    return; // No. Just no.
  }

  values[longOption] = optionValue;
}

/**
 * Process args and turn into identified tokens:
 * - option (along with value, if any)
 * - positional
 * - option-terminator
 * @param {string[]} args - from parseArgs({ args }) or mainArgs
 * @param {object} options - option configs, from parseArgs({ options })
 * @returns {any[]}
 */

Scores are cosine similarity (higher is nearer). Each hit shows the code the model matched, taken from nodejs/node@be7ea27. The per-query time is almost all the cost of embedding the question through the model; the HNSW search itself is sub-millisecond. Generated by uv run embed.py, captured 2026-06-05.

The cost shows up on the other side. Building the index is not cheap the way a trigram index is: every chunk passes through the model, which costs GPU-minutes rather than seconds of I/O. It goes stale the moment code changes, since an edited function needs re-embedding. And it returns what is similar, not what is correct, so a search for an exact identifier will surface things that only look related.

Putting them together

Which is why the real answer is usually both. Lexical search nails identifiers, error strings and anything you can name precisely. Vector search handles intent and paraphrase. Hybrid search runs both and fuses the ranked lists, most often with reciprocal rank fusion, which scores each result by its rank in each list rather than trying to reconcile incompatible score scales. For code specifically, the exact half is not optional, because an identifier match is something embeddings routinely miss.
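
Reciprocal rank fusion is small enough to show in full. What follows is a minimal sketch rather than the code behind this post: the function name, the two input lists and the file paths in them are made up for illustration, and k = 60 is only the constant most implementations default to. Each result earns 1 / (k + rank) from every list it appears in, and the sums decide the final order, so the raw BM25 and cosine scores never need to be compared.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids; rank is 1-based, score = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties fall back to the id so output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists for one query, best hit first.
lexical = ["src/node_worker.cc", "src/node_worker.h", "lib/internal/worker.js"]
semantic = ["lib/internal/worker.js", "src/node_worker.cc", "src/node_messaging.cc"]

fused = reciprocal_rank_fusion([lexical, semantic])
for path in fused:
    print(path)
```

A file that both lists rank highly rises to the top, while a file only one list knows about still survives lower down, which is the behaviour you want when one half is exact and the other is fuzzy.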

The tools people actually use have sorted themselves along this same line. Sourcegraph’s Cody dropped embeddings as its backbone and moved to keyword search over Zoekt with BM25 ranking, citing the cost of sending code to a third party, the operational drag of keeping embeddings fresh, and poor scaling past a hundred thousand repositories. Cursor goes the other way and maintains a vector index of your codebase, syncing it with a Merkle tree so only changed chunks are re-embedded. Claude Code does neither: its creator said plainly that early versions used a local vector database and RAG, and that agentic search with plain grep (ripgrep, in fact) “generally works better” while sidestepping the staleness, privacy and reliability problems of an index. Not everyone agrees, and vector-database vendors will tell you grep burns too many tokens on large repos. Both are right about different repositories.
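
The Merkle-tree idea is easy to sketch, with the loud caveat that this is not Cursor's implementation: the names build and changed, the nested-dict layout and the file-level granularity are all assumptions for illustration. Every file is hashed, every directory hashes the names and hashes of its children, and a comparison of two snapshots only descends into subtrees whose hashes differ.

```python
import hashlib


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build(node):
    """node is bytes (a file) or a dict of name -> node (a directory).

    Returns (digest, children); children is None for a file.
    """
    if isinstance(node, bytes):
        return sha(node), None
    children = {name: build(child) for name, child in sorted(node.items())}
    joined = "".join(f"{name}:{digest};" for name, (digest, _) in children.items())
    return sha(joined.encode()), children


def changed(old, new, prefix=""):
    """Yield paths of files that are new or edited, skipping identical subtrees."""
    old_digest, old_children = old if old else (None, None)
    new_digest, new_children = new
    if old_digest == new_digest:
        return  # same hash: nothing underneath needs re-embedding
    if new_children is None:
        yield prefix  # a file whose content differs from the old snapshot
        return
    for name, child in new_children.items():
        old_child = (old_children or {}).get(name)
        yield from changed(old_child, child, f"{prefix}/{name}" if prefix else name)


# Two hypothetical snapshots of a tiny tree; only src/b.cc was edited.
v1 = build({"src": {"a.cc": b"int a;", "b.cc": b"int b;"}, "lib": {"c.js": b"let c;"}})
v2 = build({"src": {"a.cc": b"int a;", "b.cc": b"int b = 1;"}, "lib": {"c.js": b"let c;"}})

stale = list(changed(v1, v2))
print(stale)
```

The sketch reports only added and edited files. Deleted files would need a second pass over the old snapshot, and a real system would hash chunks rather than whole files.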

There is no single fast code search. There is a scanner for when you know the bytes you want, a trigram or suffix-array index for when you will ask many times, and an embedding index for when you do not know the words at all. Knowing which one a query needs is the whole game, and most serious systems just run more than one.