// Copyright 2015 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package cases

// cccVal returns the bits of c selected by cccMask. If the exception bit is
// set, the value is read from the exceptions table at index c>>exceptionShift
// instead of from c itself.
func (c info) cccVal() info {
	if c&exceptionBit != 0 {
		return info(exceptions[c>>exceptionShift]) & cccMask
	}
	return c & cccMask
}

// cccType returns cccVal with every value at or below cccZero collapsed to
// cccZero.
func (c info) cccType() info {
	ccc := c.cccVal()
	if ccc <= cccZero {
		return cccZero
	}
	return ccc
}
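
// A minimal, self-contained sketch of the lookup performed by cccVal and
// cccType. The bit layout, the constant values and the two-entry exceptions
// table are assumptions made for illustration only; the package's real values
// come from its generated tables and may differ.

```go
package main

import "fmt"

// info stands in for the package's packed per-rune value.
type info uint16

// Assumed bit layout for this sketch only.
const (
	exceptionBit   info = 1 << 3 // set when the value indexes the exceptions table
	exceptionShift      = 4      // the table index is stored above this bit
)

// Assumed class values for this sketch only.
const (
	cccBreak info = iota << 4
	cccZero
	cccAbove
	cccOther

	cccMask = cccBreak | cccZero | cccAbove | cccOther
)

// exceptions is a hypothetical two-entry side table.
var exceptions = []byte{byte(cccZero), byte(cccAbove)}

// cccVal reads the class from c, or from the exceptions table when the
// exception bit is set.
func (c info) cccVal() info {
	if c&exceptionBit != 0 {
		return info(exceptions[c>>exceptionShift]) & cccMask
	}
	return c & cccMask
}

// cccType collapses every class at or below cccZero to cccZero.
func (c info) cccType() info {
	ccc := c.cccVal()
	if ccc <= cccZero {
		return cccZero
	}
	return ccc
}

func main() {
	inline := cccOther | 0x1                           // class stored directly in the value
	viaTable := info(1)<<exceptionShift | exceptionBit // class stored at exceptions[1]

	fmt.Println(inline.cccVal() == cccOther)   // true
	fmt.Println(viaTable.cccVal() == cccAbove) // true
	fmt.Println(cccBreak.cccType() == cccZero) // true
}
```

// In the sketch, viaTable would read as cccZero if its own bits were masked
// directly; the exception bit redirects the lookup to the side table instead.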

// TODO: Implement full Unicode breaking algorithm:
// 1) Implement breaking in separate package.
// 2) Use the breaker here.
// 3) Compare table size and performance of using the more generic breaker.
//
// Note that we can extend the current algorithm to be much more accurate. This
// only makes sense, though, if the performance and/or space penalty of using
// the generic breaker is big. Extra data will only be needed for non-cased
// runes, which means there are sufficient bits left in the caseType.
// Also note that the standard breaking algorithm doesn't always make sense
// for title casing. For example, a4a -> A4a, but a"4a -> A"4A (where " stands
// for modifier \u0308).
// ICU prohibits breaking in such cases as well.

// For the purpose of title casing we use an approximation of the Unicode Word
// Breaking algorithm defined in Annex #29:
// http://www.unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table.
//
// For our approximation, we group the Word Break types into the following
// categories, with associated rules:
//
// 1) Letter:
//    ALetter, Hebrew_Letter, Numeric, ExtendNumLet, Extend.
//    Rule: Never break between consecutive runes of this category.
//
// 2) Mid:
//    Format, MidLetter, MidNumLet, Single_Quote.
//    (Cf. case-ignorable: MidLetter, MidNumLet or cat is Mn, Me, Cf, Lm or Sk).
//    Rule: Don't break between Letter and Mid, but break between two Mids.
//
// 3) Break:
//    Any other category, including NewLine, CR, LF and Double_Quote. These
//    categories should always result in a break between two cased letters.
//    Rule: Always break.
//
// Note 1: the Katakana and MidNum categories can, in esoteric cases, result in
// preventing a break between two cased letters. For now we will ignore this
// (e.g. [ALetter] [ExtendNumLet] [Katakana] [ExtendNumLet] [ALetter] and
// [ALetter] [Numeric] [MidNum] [Numeric] [ALetter].)
//
// Note 2: the rule for Mid is very approximate, but works in most cases. To
// improve, we could store the categories in the trie value and use a finite
// automaton (FA) to manage breaks. See TODO comment above.
//
// Note 3: according to the spec, it is possible for the Extend category to
// introduce breaks between other categories grouped in Letter. However, this
// is undesirable for our purposes. ICU prevents breaks in such cases as well.

// isBreak returns whether this rune should introduce a break.
func (c info) isBreak() bool {
	return c.cccVal() == cccBreak
}

// isLetter returns whether the rune is of break type ALetter, Hebrew_Letter,
// Numeric, ExtendNumLet, or Extend.
func (c info) isLetter() bool {
	ccc := c.cccVal()
	if ccc == cccZero {
		return !c.isCaseIgnorable()
	}
	return ccc != cccBreak
}
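
// The Letter / Mid / Break rules described above can be sketched as a small
// standalone program. This is an illustration under stated assumptions, not
// the package's implementation: classify and titleApprox are hypothetical
// helpers, and the character sets in classify are a rough hand-picked stand-in
// for the Word Break property data that the package reads from its tables.

```go
package main

import (
	"fmt"
	"unicode"
)

// class is the coarse break category used by this sketch.
type class int

const (
	classBreak class = iota // always breaks
	classLetter             // never breaks next to another Letter
	classMid                // joins to a Letter, but two Mids break
)

// classify is a hypothetical approximation of the grouping in the comment
// above; it does not use real Word Break property data.
func classify(r rune) class {
	switch {
	case unicode.IsLetter(r), unicode.IsDigit(r), r == '_', unicode.Is(unicode.Mn, r):
		return classLetter
	case r == '\'', r == '.', r == ':', r == '\u2019':
		return classMid
	}
	return classBreak
}

// titleApprox upper-cases the first Letter rune of each word, where word
// boundaries follow the three rules: Letter-Letter and Letter-Mid never
// break, Mid-Mid breaks, and Break always breaks.
func titleApprox(s string) string {
	out := make([]rune, 0, len(s))
	prev := classBreak
	inWord := false
	for _, r := range s {
		c := classify(r)
		switch c {
		case classBreak:
			inWord = false
		case classMid:
			if prev == classMid {
				inWord = false // break between two Mids
			}
		case classLetter:
			if !inWord {
				r = unicode.ToUpper(r)
				inWord = true
			}
		}
		out = append(out, r)
		prev = c
	}
	return string(out)
}

func main() {
	for _, s := range []string{"don't stop", "a4a", "a..b"} {
		fmt.Printf("%q -> %q\n", s, titleApprox(s))
	}
}
```

// With these assumed classes, "don't stop" becomes "Don't Stop" (the single
// quote is a Mid and does not break), "a4a" becomes "A4a" (digits group with
// letters), and "a..b" becomes "A..B" (two consecutive Mids break).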