LeetCode #393 — MEDIUM

UTF-8 Validation

Move from brute-force thinking to an efficient approach using array strategy.

Solve on LeetCode
The Problem

Problem Statement

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

  1. For a 1-byte character, the first bit is a 0, followed by its Unicode code.
  2. For an n-bytes character, the first n bits are all one's, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.

This is how the UTF-8 encoding would work:

     Number of Bytes   |        UTF-8 Octet Sequence
                       |              (binary)
   --------------------+-----------------------------------------
            1          |   0xxxxxxx
            2          |   110xxxxx 10xxxxxx
            3          |   1110xxxx 10xxxxxx 10xxxxxx
            4          |   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

x denotes a bit in the binary form of a byte that may be either 0 or 1.

Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

Example 1:

Input: data = [197,130,1]
Output: true
Explanation: data represents the octet sequence: 11000101 10000010 00000001.
It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.

Example 2:

Input: data = [235,140,4]
Output: false
Explanation: data represented the octet sequence: 11101011 10001100 00000100.
The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character.
The next byte is a continuation byte which starts with 10 and that's correct.
But the second continuation byte does not start with 10, so it is invalid.

Constraints:

  • 1 <= data.length <= 2 * 104
  • 0 <= data[i] <= 255
Patterns Used

Roadmap

  1. Brute Force Baseline
  2. Core Insight
  3. Algorithm Walkthrough
  4. Edge Cases
  5. Full Annotated Code
  6. Interactive Study Demo
  7. Complexity Analysis
Step 01

Brute Force Baseline

Problem summary: Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters). A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules: For a 1-byte character, the first bit is a 0, followed by its Unicode code. For an n-bytes character, the first n bits are all one's, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10. This is how the UTF-8 encoding would work: Number of Bytes | UTF-8 Octet Sequence | (binary) --------------------+----------------------------------------- 1 | 0xxxxxxx 2 | 110xxxxx 10xxxxxx 3 | 1110xxxx 10xxxxxx 10xxxxxx 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx x denotes a bit in the binary form of a byte that may be either 0 or 1. Note: The input is an array of integers. Only the least significant 8 bits of each integer is

Baseline thinking

Start with the most direct exhaustive search. That gives a correctness anchor before optimizing.

Pattern signal: Array · Bit Manipulation

Example 1

[197,130,1]

Example 2

[235,140,4]
Step 02

Core Insight

What unlocks the optimal approach

  • Read the data integer by integer. When you read it, process the least significant 8 bits of it.
  • Assume the next encoding is 1-byte data. If it is not 1-byte data, read the next integer and assume it is 2-bytes data.
  • Similarly, if it is not 2-bytes data, try 3-bytes then 4-bytes. If you read four integers and it still does not match any pattern, return false.
Interview move: turn each hint into an invariant you can check after every iteration/recursion step.
Step 03

Algorithm Walkthrough

Iteration Checklist

  1. Define state (indices, window, stack, map, DP cell, or recursion frame).
  2. Apply one transition step and update the invariant.
  3. Record answer candidate when condition is met.
  4. Continue until all input is consumed.
Use the first example testcase as your mental trace to verify each transition.
Step 04

Edge Cases

Minimum Input
Single element / shortest valid input
Validate boundary behavior before entering the main loop or recursion.
Duplicates & Repeats
Repeated values / repeated states
Decide whether duplicates should be merged, skipped, or counted explicitly.
Extreme Constraints
Upper-end input sizes
Re-check complexity target against constraints to avoid time-limit issues.
Invalid / Corner Shape
Empty collections, zeros, or disconnected structures
Handle special-case structure before the core algorithm path.
Step 05

Full Annotated Code

Source-backed implementations are provided below for direct study and interview prep.

// Accepted solution for LeetCode #393: UTF-8 Validation
class Solution {
    public boolean validUtf8(int[] data) {
        int cnt = 0;
        for (int v : data) {
            if (cnt > 0) {
                if (v >> 6 != 0b10) {
                    return false;
                }
                --cnt;
            } else if (v >> 7 == 0) {
                cnt = 0;
            } else if (v >> 5 == 0b110) {
                cnt = 1;
            } else if (v >> 4 == 0b1110) {
                cnt = 2;
            } else if (v >> 3 == 0b11110) {
                cnt = 3;
            } else {
                return false;
            }
        }
        return cnt == 0;
    }
}
Step 06

Interactive Study Demo

Use this to step through a reusable interview workflow for this problem.

Press Step or Run All to begin.
Step 07

Complexity Analysis

Time
O(n)
Space
O(1)

Approach Breakdown

SORT + SCAN
O(n log n) time
O(n) space

Sort the array in O(n log n), then scan for the missing or unique element by comparing adjacent pairs. Sorting requires O(n) auxiliary space (or O(1) with in-place sort but O(n log n) time remains). The sort step dominates.

BIT MANIPULATION
O(n) time
O(1) space

Bitwise operations (AND, OR, XOR, shifts) are O(1) per operation on fixed-width integers. A single pass through the input with bit operations gives O(n) time. The key insight: XOR of a number with itself is 0, which eliminates duplicates without extra space.

Shortcut: Bit operations are O(1). XOR cancels duplicates. Single pass → O(n) time, O(1) space.
Coach Notes

Common Mistakes

Review these before coding to avoid predictable interview regressions.

Off-by-one on range boundaries

Wrong move: Loop endpoints miss first/last candidate.

Usually fails on: Fails on minimal arrays and exact-boundary answers.

Fix: Re-derive loops from inclusive/exclusive ranges before coding.