What is Tokenization?

Tokenization

Text Tokenization

Splitting text into meaningful units (tokens) such as words, sentences, or subword pieces for processing.

Technical Details

Tokenization operates on sequences of Unicode code points, where each character's properties (category, script, case, directionality) are defined by the Unicode standard. In the browser, TextEncoder and TextDecoder convert between strings and bytes (TextEncoder always emits UTF-8), and Intl.Segmenter provides locale-aware grapheme, word, and sentence boundary detection. JavaScript strings are indexed by UTF-16 code units, so a single user-perceived character can span several code units or code points. Understanding the distinction between bytes, code units, code points, and grapheme clusters is therefore essential for correct text manipulation.
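To make the distinction between bytes, code units, code points, and grapheme clusters concrete, the following minimal sketch counts the same string at each level. The helper name `measure` and the sample string are illustrative assumptions, not part of any API. The sketch relies only on the standard TextEncoder and Intl.Segmenter and assumes a runtime that ships Intl.Segmenter.

```javascript
// Hypothetical helper: count one string at four levels of granularity.
function measure(text) {
  // UTF-8 bytes: TextEncoder always encodes to UTF-8.
  const bytes = new TextEncoder().encode(text).length;
  // UTF-16 code units: what String.prototype.length reports.
  const codeUnits = text.length;
  // Unicode code points: string iteration yields whole code points.
  const codePoints = Array.from(text).length;
  // Grapheme clusters: user-perceived characters.
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  const graphemes = Array.from(segmenter.segment(text)).length;
  return { bytes, codeUnits, codePoints, graphemes };
}

// 'e' followed by a combining acute accent, then a lemon emoji.
const sample = 'e\u0301\u{1F34B}';
console.log(measure(sample));
// { bytes: 7, codeUnits: 4, codePoints: 3, graphemes: 2 }
```

The four counts differ because the accent is a separate code point that attaches to the preceding letter, and the emoji lies outside the Basic Multilingual Plane, so it takes two UTF-16 code units and four UTF-8 bytes.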

Example

```javascript
// Naive whitespace tokenization: trim the input, split on runs of
// whitespace, and drop empty strings (covers the empty-input case).
const input = 'Sample text for processing';
const result = input
  .trim()
  .split(/\s+/)
  .filter(Boolean);
console.log(result); // ['Sample', 'text', 'for', 'processing']
```
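Splitting on whitespace leaves punctuation attached to words and does not work for scripts written without spaces. As a hedged alternative, the sketch below uses Intl.Segmenter with word granularity and keeps only segments flagged as word-like. The function name `tokenizeWords` and its default locale are assumptions made for illustration.

```javascript
// Hypothetical helper: locale-aware word tokens via Intl.Segmenter.
function tokenizeWords(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const tokens = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (isWordLike) tokens.push(segment);
  }
  return tokens;
}

console.log(tokenizeWords('Sample text, for processing!'));
// ['Sample', 'text', 'for', 'processing']

// For text without spaces, boundaries come from the runtime's
// dictionary data, so the exact tokens may vary between environments.
console.log(tokenizeWords('今日は良い天気です', 'ja'));
```

Passing `{ granularity: 'sentence' }` instead yields sentence segments; in that mode `isWordLike` is not provided, so the filter would need to be removed.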

Related Tools

Word Counter, Case Converter, Line Sorter, Lorem Ipsum Generator, Slug Generator, Find and Replace, Remove Duplicate Lines, Base64 Encode/Decode, URL Encode/Decode, JSON Formatter, HTML Entity Encode/Decode, Text Reverser, Add/Remove Line Numbers, Text Compare, Text Extractor

Related Terms

BOM, Case Conversion, Escape Character, ASCII, Diacritics, Kerning, CJK, Grep