Implementing Unicode Properties: A Comprehensive Guide
Introduction
This article covers how to implement Unicode properties, specifically the \p{...} and \P{...} syntax, in a regular expression engine. Property support is required for UTS#18 Level 1 conformance, the baseline level of Unicode support for regular expression implementations. The discussion runs from the limitations of existing systems to the technical details of adding full Unicode property support.
Current Status: A Critical Gap in Implementation
Many regex engines offer only partial Unicode support, often limited to ASCII-based Perl classes: \d (ASCII digits), \s (ASCII whitespace), and \w (ASCII word characters) work, but Unicode properties such as \p{Letter} or \p{Script=Greek} are not handled at all. This is a critical gap, because Unicode conformance requires identifying and manipulating characters by their Unicode properties, not just by their ASCII equivalents.
The UTS#18 Requirement: Properties
The Unicode Technical Standard #18 (UTS#18) sets out guidelines for Unicode Regular Expressions. Requirement RL1.2, Properties, is especially important. To meet this requirement, an implementation must provide a minimal list of properties. These include fundamental categories and properties such as General_Category, Core Properties (Any, ASCII, Assigned), Script and Script_Extensions, Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, and Default_Ignorable_Code_Point. The syntax that must be supported includes \p{Property}, \p{Property=Value}, and \P{Property} for negated properties. Meeting this standard is more than just adding features; it's about ensuring that the regex engine can correctly handle the vast diversity of characters and scripts defined by Unicode.
Goals of Implementation
The primary goal of implementing Unicode property support is to enable regular expressions to work seamlessly across all Unicode characters. This involves several key objectives:
- Support for Unicode Letters: Implement \p{Letter} (or \p{L}) to match any Unicode letter, going beyond the ASCII limitations. This is crucial for handling text in various languages.
- Script Identification: Enable \p{Script=Greek} (or \p{sc=Grek}) to match characters from the Greek script. Similar support should extend to all Unicode scripts.
- Broader Character Matching: Introduce \p{Alphabetic} to match alphabetic characters, which includes a broader set of characters than just letters.
- Negated Properties: Implement \P{ASCII} to match any character that is not an ASCII character. This is vital for filtering and manipulating non-ASCII text.
- Integration with Existing Features: Ensure that Unicode properties can be used in conjunction with other regex features like character classes and quantifiers. This means [\p{Letter}]* should work as expected.
Acceptance Criteria for a Successful Implementation
To ensure a successful implementation of Unicode property support, several criteria must be met. These criteria span from the initial investigation phase to comprehensive testing and documentation:
- Investigation Phase Completion: Thoroughly document the capabilities of the UnicodeBasic library (or any chosen library) to understand its features, limitations, and suitability for the task.
- Data Structures: Introduce a UnicodeProperty type and update the existing Class type to accommodate Unicode properties within the regex engine's data structures.
- Parser Updates: Modify the parser to correctly accept the \p{...} and \P{...} syntax, ensuring that it can interpret Unicode property expressions.
- Property Matching: Implement all required properties as defined by UTS#18, ensuring that they function correctly and efficiently.
- Property Names: Support both long and short aliases for property names (e.g., L and Letter), providing flexibility and ease of use.
- Property Values: Handle syntax like \p{Script=Greek} and \p{gc=Lu}, allowing for specific property value matching.
- Case-Insensitive Matching: Ensure that property names are matched case-insensitively, as per Unicode standards.
- Comprehensive Testing: Develop and execute a suite of tests covering all properties and edge cases to guarantee the correctness and robustness of the implementation.
- Documentation: Provide clear and comprehensive usage guides and examples to help users understand and utilize the new Unicode property support.
Technical Details: Implementing Unicode Properties
Current Implementation: A Blank Slate
Currently, there is no Unicode property support in many existing regex engines. This represents a significant gap in functionality, limiting the ability to process text from diverse languages and scripts effectively. The existing systems are primarily geared towards ASCII characters, lacking the necessary mechanisms to handle the complexities of Unicode.
Evidence from Code
Examining the code reveals the limitations. For example, the parser often only recognizes Perl classes like \d, \s, and \w, with no provisions for \p{...} or \P{...}. Similarly, character classes are typically defined in terms of ASCII ranges, with no consideration for Unicode properties. These limitations underscore the need for a significant overhaul to incorporate Unicode support.
What Needs to Change: A Phased Approach
Implementing Unicode property support is a major feature that necessitates changes across multiple components of the regex engine. Given the complexity, a phased approach is recommended. This allows for incremental development, testing, and integration, minimizing the risk of introducing errors and ensuring that each component works correctly before moving on to the next.
Phase 1: Investigation – Understanding the Landscape
The initial phase is crucial for laying the groundwork. It involves a thorough investigation of available resources, particularly the UnicodeBasic library (or similar libraries), to determine their suitability and capabilities. This phase sets the direction for the entire implementation process.
Tasks
- Documentation of UnicodeBasic Capabilities: Create a detailed document outlining:
  a. The properties available in UnicodeBasic.
  b. The Unicode version it supports.
  c. API examples and usage patterns.
  d. The extent of property value alias support.
  e. Performance characteristics.
- Gap Analysis: Identify any missing RL1.2 properties and determine if these can be contributed to UnicodeBasic. Develop a fallback strategy if UnicodeBasic proves unsuitable.
Deliverable
The primary deliverable is a decision document that clearly articulates whether to use UnicodeBasic or an alternative approach, based on the findings of the investigation.
Phase 2: Data Structures – Laying the Foundation
This phase focuses on extending the type system to represent Unicode properties effectively. It involves defining new types and updating existing ones to accommodate Unicode properties within the regex engine's data structures.
Key Files
The primary file to modify is typically regex/Regex/Data/Classes.lean (or its equivalent in other languages), which defines the data structures for character classes and properties.
Add New Types
The following new types are needed:
- Enumerated Property Values: Define enums for property values like GeneralCategoryValue (Lu, Ll, etc.) and ScriptValue (Latin, Greek, etc.).
- Unicode Property Representation: Create a UnicodeProperty enum to represent Unicode properties like generalCategory, script, alphabetic, etc.
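As a rough illustration of the types listed above, a minimal Lean 4 sketch might look like the following. The constructor names and the small subsets of values are assumptions made for this example; the engine's real definitions would need every value from UAX #44 and UAX #24 and may be organized differently.

```lean
/-- Illustrative subset of General_Category values (assumed names). -/
inductive GeneralCategoryValue where
  | Lu | Ll | Lt | Lm | Lo   -- letter subcategories
  | Mn                       -- nonspacing mark
  | Nd                       -- decimal digit
  deriving Repr, DecidableEq

/-- Illustrative subset of Script values (assumed names). -/
inductive ScriptValue where
  | latin | greek | cyrillic
  deriving Repr, DecidableEq

/-- One constructor per property; enumerated properties carry their value. -/
inductive UnicodeProperty where
  | any | ascii | assigned
  | generalCategory (value : GeneralCategoryValue)
  | script (value : ScriptValue)
  | scriptExtensions (value : ScriptValue)
  | alphabetic | uppercase | lowercase | whiteSpace
  deriving Repr, DecidableEq

-- Prints the value through the derived `Repr` instance.
#eval UnicodeProperty.script .greek
```

Carrying the value inside the constructor keeps \p{Script=Greek} and \p{gc=Lu} representable without a separate constructor per script or category.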
Update Class Type
Modify the existing Class type to include a unicodeProperty variant, allowing it to represent Unicode properties along with other character classes.
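One possible shape for the updated type is sketched below. The `single` and `range` variants stand in for whatever the existing Class type already contains, and `matchesChar` is a hypothetical ASCII-only helper included only so the example is self-contained; neither is taken from the actual code base.

```lean
/-- Stand-in for the property type; the real one has many more cases. -/
inductive UnicodeProperty where
  | any | ascii
  deriving Repr, DecidableEq

/-- Hypothetical matcher covering only the two stand-in properties. -/
def UnicodeProperty.matchesChar : UnicodeProperty → Char → Bool
  | .any, _ => true
  | .ascii, c => decide (c.toNat < 0x80)

/-- Assumed shape of `Class`: existing variants plus the new `unicodeProperty`. -/
inductive Class where
  | single (c : Char)
  | range (lo hi : Char)
  | unicodeProperty (negated : Bool) (prop : UnicodeProperty)
  deriving Repr

/-- Membership test; `negated` flips the result, which is how `\P{...}` behaves. -/
def Class.mem (c : Char) : Class → Bool
  | .single x => c == x
  | .range lo hi => decide (lo.toNat ≤ c.toNat ∧ c.toNat ≤ hi.toNat)
  | .unicodeProperty negated prop => negated != prop.matchesChar c

#eval Class.mem 'A' (.unicodeProperty false .ascii)  -- true: \p{ASCII} on "A"
#eval Class.mem 'Σ' (.unicodeProperty true .ascii)   -- true: \P{ASCII} on "Σ"
```

Storing negation as a flag on the variant means \p{...} and \P{...} share one code path in the matcher.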
Property Matching Interface
Develop a UnicodeProperty.matches function (typically in a new file like regex/Regex/Data/UnicodeProperty.lean) to determine if a character matches a given Unicode property. This function is the core of the property matching logic.
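A minimal sketch of such a function follows, limited to three properties. It is named `matchesChar` here, an assumed name chosen to stay clear of Lean's built-in `matches` term syntax, and the White_Space ranges are transcribed by hand for illustration rather than generated from PropList.txt as a real implementation would do.

```lean
inductive UnicodeProperty where
  | any | ascii | whiteSpace
  deriving Repr, DecidableEq

/-- White_Space as inclusive code point ranges (hand-transcribed for this sketch). -/
def whiteSpaceRanges : List (Nat × Nat) :=
  [(0x09, 0x0D), (0x20, 0x20), (0x85, 0x85), (0xA0, 0xA0), (0x1680, 0x1680),
   (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F), (0x205F, 0x205F),
   (0x3000, 0x3000)]

/-- Does character `c` have property `p`? -/
def UnicodeProperty.matchesChar (p : UnicodeProperty) (c : Char) : Bool :=
  match p with
  | .any => true
  | .ascii => decide (c.toNat < 0x80)
  | .whiteSpace =>
    -- A scan over ten ranges is fine here; large tables need a faster lookup.
    whiteSpaceRanges.any fun r => decide (r.1 ≤ c.toNat ∧ c.toNat ≤ r.2)

#eval UnicodeProperty.whiteSpace.matchesChar '\t'                  -- true
#eval UnicodeProperty.whiteSpace.matchesChar (Char.ofNat 0x2028)   -- true (line separator)
#eval UnicodeProperty.ascii.matchesChar 'Σ'                        -- false
```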
Phase 3: Parser – Interpreting the Syntax
This phase involves modifying the parser to recognize and interpret the \p{...} and \P{...} syntax. It requires adding new parsing rules and logic to handle Unicode property expressions.
Key Files
The primary file to modify is typically regex/Regex/Syntax/Parser/Basic.lean (or its equivalent), which handles the parsing of regular expression syntax.
Add Property Parser
Implement a unicodeProperty parser to handle the \p{...} and \P{...} syntax. This parser should extract the property name and value (if any) from the expression.
Parse Property Specification
Create a propertySpec parser to handle property specifications like Name or Name=Value. This parser should normalize property names and values for consistent matching.
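The splitting step can be sketched independently of the engine's parser combinators. The version below works on a plain string holding the text between the braces; `PropertySpec`, `PropertySpec.step`, and `parsePropertySpec` are hypothetical names, and normalization is left out to keep the example short.

```lean
/-- Contents of the braces: a property name and an optional value. -/
structure PropertySpec where
  name : String
  value : Option String
  deriving Repr, DecidableEq

/-- Consume one character: before the first `=` it extends the name, after it the value. -/
def PropertySpec.step (spec : PropertySpec) (c : Char) : PropertySpec :=
  match spec.value with
  | some v => { spec with value := some (v.push c) }
  | none =>
    if c == '=' then { spec with value := some "" }
    else { spec with name := spec.name.push c }

/-- Split `Name` or `Name=Value` at the first `=`. -/
def parsePropertySpec (s : String) : PropertySpec :=
  s.foldl PropertySpec.step { name := "", value := none }

#eval parsePropertySpec "Script=Greek"   -- name "Script", value some "Greek"
#eval parsePropertySpec "Letter"         -- name "Letter", value none
```

A real parser would also reject an empty name or an empty value and report an error for them instead of returning a spec.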
Property Name Matching
Implement lenient property name matching, as per UTS#18. This involves normalizing property names (lowercase, removing spaces/hyphens/underscores) and matching against both short and long aliases.
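The normalization described above fits in a few lines. In this sketch the alias table is a tiny hand-written sample and the function names are assumptions; a full implementation would take its aliases from the Unicode property alias data instead.

```lean
/-- Loose matching: drop spaces, hyphens, and underscores, and lowercase the rest. -/
def normalizeName (s : String) : String :=
  s.foldl (fun acc c =>
    if c == ' ' || c == '-' || c == '_' then acc else acc.push c.toLower) ""

/-- Sample alias table: normalized alias paired with the canonical name. -/
def propertyAliases : List (String × String) :=
  [ ("l", "Letter"), ("letter", "Letter"),
    ("lu", "Uppercase_Letter"), ("uppercaseletter", "Uppercase_Letter"),
    ("gc", "General_Category"), ("generalcategory", "General_Category"),
    ("sc", "Script"), ("script", "Script") ]

/-- Resolve a user-written name, or report the normalized form that failed. -/
def resolveName (s : String) : Except String String :=
  match List.lookup (normalizeName s) propertyAliases with
  | some canonical => .ok canonical
  | none => .error s!"unknown property: {normalizeName s}"

#eval resolveName "UPPERCASE_LETTER"   -- resolves to "Uppercase_Letter"
#eval resolveName "uppercase letter"   -- resolves to "Uppercase_Letter"
#eval resolveName "InvalidProperty"    -- error mentioning "invalidproperty"
```

Normalizing the keys of the table once, rather than every alias on every lookup, keeps name resolution a single table probe.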
Phase 4: Property Implementation – Bringing Properties to Life
This phase is the heart of the implementation, where the actual Unicode properties are implemented. It involves defining the logic for each required property and integrating it with the data structures and matching interface.
Required Properties
The following properties need to be implemented, as per UTS#18:
- General_Category: Implement support for all general categories (L, M, N, P, S, Z, C) and their subcategories (Lu, Ll, Nd, etc.).
- Core Properties: Implement Any, ASCII, and Assigned properties.
- Script: Implement support for all Unicode scripts, using data from Scripts.txt.
- Script_Extensions: Handle script extensions, which allow characters to belong to multiple scripts, using data from ScriptExtensions.txt.
- Binary Properties: Implement Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, and Default_Ignorable_Code_Point.
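For General_Category, a major class such as L is the union of its subcategories. The sketch below shows that relationship on a handful of hand-picked characters; the type, the sample table, and the function names are illustrative assumptions, and real category data would come from UnicodeData.txt or a library.

```lean
inductive GeneralCategoryValue where
  | Lu | Ll | Lt | Lm | Lo | Mn | Nd
  deriving Repr, DecidableEq

/-- The major class `L` is the union of its five subcategories. -/
def GeneralCategoryValue.isLetter : GeneralCategoryValue → Bool
  | .Lu | .Ll | .Lt | .Lm | .Lo => true
  | _ => false

/-- A few sample characters with their categories (illustration only). -/
def sampleCategories : List (Char × GeneralCategoryValue) :=
  [('A', .Lu), ('a', .Ll), ('ʰ', .Lm), ('中', .Lo), ('5', .Nd)]

/-- `\p{L}` restricted to the sample table; unknown characters do not match. -/
def matchesLetter (c : Char) : Bool :=
  match List.lookup c sampleCategories with
  | some gc => gc.isLetter
  | none => false

#eval matchesLetter 'ʰ'   -- true  (Lm)
#eval matchesLetter '5'   -- false (Nd)
```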
Data Source Options
Consider the following data source options:
- Use UnicodeBasic library (if complete).
- Generate tables from Unicode data files at build time.
- Embed pre-generated tables.
Phase 5: Testing – Ensuring Correctness and Robustness
This phase is critical for ensuring that the implementation is correct, robust, and performs well. It involves developing and executing a comprehensive test suite that covers all aspects of the implementation.
Unit Tests
Add unit tests to cover:
- Parser syntax.
- Property matching for various properties and characters.
- Edge cases and boundary conditions.
Edge Cases
Test edge cases like:
- Category unions (e.g., \p{L} should match Lu, Ll, Lt, Lm, and Lo).
- Ranges with properties (e.g., [\p{L}&&[A-Z]]).
- Multiple properties (e.g., [\p{Lu}\p{Ll}]).
Corpus Tests
Enable Unicode tests in corpus test suites to ensure compatibility with real-world scenarios.
Performance Tests
Conduct performance tests to ensure that property checks are efficient and do not degrade overall regex performance.
Key Files to Modify: A Summary
- regex/Regex/Data/Classes.lean: Add UnicodeProperty types, update Class type, and update Class.mem function.
- regex/Regex/Data/UnicodeProperty.lean: (New file) Implement property matching logic, property value enums, and helper functions.
- regex/Regex/Syntax/Parser/Basic.lean: Add unicodeProperty parser, add propertySpec parser, and implement property name normalization.
- regex/Regex/Syntax/Parser/Error.lean: Add error variants for unknown properties and invalid property values.
- regex/lakefile.toml: Add UnicodeBasic dependency (if used).
- regex/tests/CorpusTest.lean: Enable more Unicode tests from testdata.
Testing: A Rigorous Approach
Testing is a cornerstone of any robust implementation. For Unicode property support, a multi-faceted testing strategy is essential to ensure correctness, performance, and adherence to standards. This includes unit tests, edge case testing, corpus tests, and performance evaluations.
Unit Tests: Validating Individual Components
Unit tests focus on verifying the behavior of individual components, such as the parser and property matching logic. These tests provide a granular view of the system's functionality, making it easier to identify and fix issues.
Parser Tests: Ensuring Correct Syntax Interpretation
Parser tests are crucial for validating that the regex engine correctly interprets Unicode property syntax. These tests cover various aspects of the syntax, including basic property syntax, properties with values, case insensitivity, and error handling.
Examples
#guard parseAst "\\p{L}" = .ok (.classes (Classes.mk false #[.unicodeProperty false .letter]))
#guard parseAst "\\p{Letter}" = .ok (...)
#guard parseAst "\\P{Letter}" = .ok (.classes (Classes.mk false #[.unicodeProperty true .letter]))
#guard parseAst "\\p{Script=Greek}" = .ok (...)
#guard parseAst "\\p{sc=Grek}" = .ok (...)
#guard parseAst "\\p{General_Category=Lu}" = .ok (...)
#guard parseAst "\\p{gc=Lu}" = .ok (...)
#guard parseAst "\\p{uppercase letter}" = .ok (...)
#guard parseAst "\\p{UPPERCASE_LETTER}" = .ok (...)
#guard parseAst "[\\p{L}\\p{N}]" = .ok (...)
#guard parseAst "[a-z\\p{Greek}]" = .ok (...)
#guard parseAst "\\p{InvalidProperty}" = .error (.unknownProperty "invalidproperty")
#guard parseAst "\\p{Script=FakeScript}" = .error (.invalidPropertyValue ...)
#guard parseAst "\\p{" = .error (.unexpectedEndOfInput)
Matching Tests: Verifying Property Matching Logic
Matching tests focus on verifying that the property matching logic correctly identifies characters that belong to specific Unicode properties. These tests cover various properties, including general categories, scripts, and binary properties.
Examples
#guard "\\p{Lu}" matches "A"
#guard "\\p{Lu}" matches "Σ" -- Greek uppercase sigma
#guard "\\p{Lu}" matches "Ж" -- Cyrillic uppercase zhe
#guard "\\p{Lu}" doesn't match "a"
#guard "\\p{Lu}" doesn't match "5"
#guard "\\p{Ll}" matches "a"
#guard "\\p{Ll}" matches "α" -- Greek lowercase alpha
#guard "\\p{Nd}" matches "5"
#guard "\\p{Nd}" matches "٥" -- Arabic-Indic digit five
#guard "\\p{Nd}" matches "५" -- Devanagari digit five
#guard "\\p{Script=Greek}" matches "Α" -- Greek capital alpha (U+0391)
#guard "\\p{Script=Greek}" matches "ω"
#guard "\\p{Script=Greek}" doesn't match "A" -- Latin capital A (U+0041)
#guard "\\p{sc=Cyrillic}" matches "Ж"
#guard "\\p{sc=Arab}" matches "ع"
#guard "\\p{sc=Hani}" matches "中"
#guard "\\p{scx=Hira}" matches "ー" -- Common script, but used with Hiragana
#guard "\\p{Alphabetic}" matches "A"
#guard "\\p{Alphabetic}" matches "α"
#guard "\\p{Alphabetic}" matches "中"
#guard "\\p{Alphabetic}" doesn't match "5"
#guard "\\p{White_Space}" matches " "
#guard "\\p{White_Space}" matches "\t"
#guard "\\p{White_Space}" matches "\u{2028}" -- Line separator
#guard "\\p{White_Space}" doesn't match "a"
#guard "\\p{Any}" matches any character
#guard "\\p{ASCII}" matches "A"
#guard "\\p{ASCII}" doesn't match "Σ"
#guard "\\p{Assigned}" doesn't match unassigned code points
#guard "\\P{ASCII}" doesn't match "A"
#guard "\\P{ASCII}" matches "Σ"
#guard "\\P{ASCII}" matches "中"
Edge Cases: Pushing the Boundaries
Edge case testing focuses on verifying the behavior of the implementation under unusual or boundary conditions. These tests help uncover subtle issues that may not be apparent in typical scenarios.
Examples
#guard "\\p{L}" matches "A" -- Lu
#guard "\\p{L}" matches "a" -- Ll
#guard "\\p{L}" matches "ǅ" -- Lt (titlecase, U+01C5)
#guard "\\p{L}" matches "ʰ" -- Lm (modifier, U+02B0)
#guard "\\p{L}" matches "中" -- Lo (other)
#guard "[\\p{L}&&[A-Z]]" matches "A" -- Letter AND A-Z
#guard "[\\p{Greek}--[α-ω]]" matches "Α" -- Greek except lowercase
#guard "[\\p{Lu}\\p{Ll}]" matches "A"
#guard "[\\p{Lu}\\p{Ll}]" matches "a"
#guard "\\p{Lu}" doesn't match "\u{10FFFF}" -- Last code point
#guard "\\p{ASCII}" matches "\u{7F}" -- DEL
#guard "\\p{ASCII}" doesn't match "\u{80}"
#guard "\\p{Mn}" matches "\u{0301}" -- Combining acute accent
#guard "e\\p{Mn}" matches "e\u{0301}" -- e + combining acute (decomposed form)
Corpus Tests: Real-World Validation
Corpus tests involve running the regex engine against a large and diverse set of real-world text samples. These tests provide a comprehensive evaluation of the implementation's behavior in realistic scenarios.
Key Files
- unicode.toml: General Unicode tests
- flags.toml: Case-insensitive with Unicode
- Update others to use \p{...} where appropriate
Performance Tests: Ensuring Efficiency
Performance tests are crucial for ensuring that the implementation is efficient and does not introduce performance bottlenecks. These tests focus on measuring the execution time of property checks and identifying areas for optimization.
Goals
- Ensure property checks are fast.
- Property lookups should use optimized tables (2-stage lookup).
- Avoid linear search through all code points!
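To make the two-stage idea concrete, here is a small sketch of such a table for a single binary property. The structure, the builder, and the block size are assumptions chosen for illustration; a production table would typically be generated at build time and would pack each block into bits rather than an array of Bool.

```lean
/-- Two-stage table: `stage1` maps a block number to a deduplicated block in `blocks`. -/
structure TwoStageTable where
  blockSize : Nat
  stage1 : Array Nat
  blocks : Array (Array Bool)

/-- Build a table for code points below `blockSize * numBlocks` from a predicate. -/
def TwoStageTable.build (blockSize numBlocks : Nat) (p : Nat → Bool) : TwoStageTable := Id.run do
  let mut stage1 : Array Nat := #[]
  let mut blocks : Array (Array Bool) := #[]
  for b in [0:numBlocks] do
    let block := (Array.range blockSize).map fun i => p (b * blockSize + i)
    match blocks.findIdx? (· == block) with
    | some i => stage1 := stage1.push i        -- identical block already stored
    | none =>
      stage1 := stage1.push blocks.size        -- new block goes at the end
      blocks := blocks.push block
  return { blockSize := blockSize, stage1 := stage1, blocks := blocks }

/-- Two array reads per lookup, however many ranges the property has. -/
def TwoStageTable.lookup (t : TwoStageTable) (cp : Nat) : Bool :=
  -- Code points past the end of the table fall back to `false`.
  let blockIdx := t.stage1.getD (cp / t.blockSize) t.blocks.size
  (t.blocks.getD blockIdx #[]).getD (cp % t.blockSize) false

/-- Demo: ASCII uppercase letters over the first 256 code points, 16 per block. -/
def upperAscii : TwoStageTable :=
  TwoStageTable.build 16 16 fun cp => decide (0x41 ≤ cp ∧ cp ≤ 0x5A)

#eval upperAscii.lookup 0x41   -- true  ('A')
#eval upperAscii.lookup 0x61   -- false ('a')
#eval upperAscii.blocks.size   -- 3: the all-false block plus two partially set blocks
```

The saving comes from deduplication: most blocks of the code space are uniformly true or false for a given property, so they share one stored block.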
Implementation Strategy: A Phased Approach
The phases described earlier map onto a week-by-week schedule. Working in this order keeps development, testing, and integration incremental, so each component is verified before the next one builds on it.
Recommended Approach: Step-by-Step Implementation
- Week 1: Investigation
  - Thoroughly research UnicodeBasic (or chosen library).
  - Create a detailed decision document.
  - Set up the testing framework.
- Week 2: Data Structures
  - Define necessary types for Unicode properties.
  - Implement basic scaffolding for property representation.
  - Focus on internal representation without parsing.
- Week 3: Parser
  - Implement parsing logic for property syntax.
  - Handle property name normalization.
  - Implement error handling for invalid syntax.
- Weeks 4-5: Properties Implementation
  - Start with simple properties (Any, ASCII, Assigned).
  - Implement binary properties (Alphabetic, Uppercase, etc.).
  - Implement enumerated properties (General_Category, Script).
  - Integrate with UnicodeBasic or data tables.
- Week 6: Testing & Proofs
  - Develop a comprehensive test suite.
  - Update correctness proofs to include property support.
Alternative if UnicodeBasic Incomplete: Handling Library Limitations
If UnicodeBasic (or the chosen library) does not provide all the necessary properties, consider the following alternatives:
- Contribute to UnicodeBasic: Add missing properties to the library (preferred).
- Custom Implementation: Generate all necessary tables from Unicode data files.
It's crucial to discuss these options with the library maintainers before choosing the best approach.
Resources: Navigating the Unicode Landscape
Implementing Unicode property support requires a deep understanding of Unicode standards, data files, and related libraries. Here are some essential resources to guide the implementation process.
Unicode Data Files: The Foundation of Unicode Properties
The Unicode Standard provides a wealth of data files that are essential for implementing Unicode properties. These files contain information about character properties, scripts, and other Unicode-related data.
- UnicodeData.txt: Contains general category and character properties.
- Scripts.txt: Contains script property information.
- ScriptExtensions.txt: Contains script extensions property information.
- DerivedCoreProperties.txt: Contains alphabetic, uppercase, lowercase, and other core properties.
- PropList.txt: Contains whitespace, noncharacter code point, and other properties.
All these files are available at: https://www.unicode.org/Public/UCD/latest/ucd/
Documentation: Understanding the Standards
Unicode Technical Standards (UTS) and Unicode Standard Annexes (UAX) provide detailed specifications for various aspects of Unicode, including regular expressions and character properties.
- UTS#18 Specification: Defines Unicode Regular Expressions (UTS #18_ Unicode Regular Expressions.html, lines 1024-1531).
- Compliance Analysis: Provides a compliance checklist for UTS#18 (uts18_compliance_check.md, lines 137-327).
- UAX#44 (UCD): Defines the Unicode Character Database: https://www.unicode.org/reports/tr44/
- UAX#24 (Script): Defines script property values: https://www.unicode.org/reports/tr24/
Libraries: Leveraging Existing Implementations
Several libraries provide Unicode support, including character properties and regular expression engines. These libraries can be valuable resources for implementing Unicode property support.
- UnicodeBasic: A library providing basic Unicode functionality: https://github.com/fgdorais/lean4-unicode-basic
- ICU (for reference): The International Components for Unicode library: http://site.icu-project.org/
Notes for Contributors: A Guide to Collaboration
Implementing Unicode property support is a complex task that often requires collaboration among multiple contributors. Here are some guidelines for contributors to ensure a smooth and successful implementation.
Getting Started: Steps to Contribution
- Claim the issue: Indicate your interest by leaving a comment on the issue.
- Start with investigation: Begin by thoroughly understanding the available resources and libraries.
- Ask questions early: Don't hesitate to ask questions and seek clarification on any aspect of the implementation.
- Incremental PRs: Break the implementation into smaller, manageable pull requests.
Common Pitfalls: Avoiding Common Mistakes
- ❌ Don't do linear search: Avoid linear searching through all code points for each match; use optimized data structures.
- ❌ Don't hardcode Unicode data: Do not hardcode Unicode data; use libraries or generated tables.
- ❌ Don't forget property name aliases: Remember to support both short and long forms of property names.
- ❌ Don't forget case insensitivity: Ensure case-insensitive property name matching.
- ✅ Do optimize for common cases: Optimize for common cases like ASCII and frequently used properties.
- ✅ Do test with real Unicode text: Test with real Unicode text, including various scripts and languages.
Questions? Seeking Clarification
Implementing Unicode property support is a complex undertaking, and questions are inevitable. Don't hesitate to:
- Ask questions in the issue tracker.
- Discuss design decisions before implementing.
- Request code reviews early and often.
Related Issues: Connecting the Dots
Implementing Unicode property support has implications for other areas of the regex engine. Here are some related issues to consider.
- Blocks: RL1.4 (Simple Word Boundaries), which requires the Alphabetic and Nd properties.
- Blocks: RL1.5 (Simple Loose Matches, i.e. case-insensitive matching), which requires case folding data.
- Enables: Much better Unicode support throughout the engine.
- After this: RL1.3 (Subtraction and Intersection) works much better with \p{...}.
Conclusion
Implementing Unicode properties support in a regular expression engine is a complex but essential task for achieving true Unicode compliance. By following a phased approach, conducting thorough testing, and leveraging available resources, developers can create a robust and efficient implementation that handles the diverse world of Unicode characters. This article has provided a detailed roadmap for this journey, from initial investigation to final testing, ensuring that the resulting regex engine can seamlessly handle Unicode text. For further information on Unicode regular expressions, consider exploring the Unicode Consortium's website.