Вы не можете выбрать более 25 тем Темы должны начинаться с буквы или цифры, могут содержать дефисы(-) и должны содержать не более 35 символов.
 
 
 

179 строки
4.9 KiB

  1. // Copyright 2015 The Go Authors. All rights reserved.
  2. // Use of this source code is governed by a BSD-style
  3. // license that can be found in the LICENSE file.
  4. package colltab
  5. // An Iter incrementally converts chunks of the input text to collation
  6. // elements, while ensuring that the collation elements are in normalized order
  7. // (that is, they are in the order as if the input text were normalized first).
  8. type Iter struct {
  9. Weighter Weighter
  10. Elems []Elem
  11. // N is the number of elements in Elems that will not be reordered on
  12. // subsequent iterations, N <= len(Elems).
  13. N int
  14. bytes []byte
  15. str string
  16. // Because the Elems buffer may contain collation elements that are needed
  17. // for look-ahead, we need two positions in the text (bytes or str): one for
  18. // the end position in the text for the current iteration and one for the
  19. // start of the next call to appendNext.
  20. pEnd int // end position in text corresponding to N.
  21. pNext int // pEnd <= pNext.
  22. }
  23. // Reset sets the position in the current input text to p and discards any
  24. // results obtained so far.
  25. func (i *Iter) Reset(p int) {
  26. i.Elems = i.Elems[:0]
  27. i.N = 0
  28. i.pEnd = p
  29. i.pNext = p
  30. }
  31. // Len returns the length of the input text.
  32. func (i *Iter) Len() int {
  33. if i.bytes != nil {
  34. return len(i.bytes)
  35. }
  36. return len(i.str)
  37. }
  38. // Discard removes the collation elements up to N.
  39. func (i *Iter) Discard() {
  40. // TODO: change this such that only modifiers following starters will have
  41. // to be copied.
  42. i.Elems = i.Elems[:copy(i.Elems, i.Elems[i.N:])]
  43. i.N = 0
  44. }
  45. // End returns the end position of the input text for which Next has returned
  46. // results.
  47. func (i *Iter) End() int {
  48. return i.pEnd
  49. }
  50. // SetInput resets i to input s.
  51. func (i *Iter) SetInput(s []byte) {
  52. i.bytes = s
  53. i.str = ""
  54. i.Reset(0)
  55. }
  56. // SetInputString resets i to input s.
  57. func (i *Iter) SetInputString(s string) {
  58. i.str = s
  59. i.bytes = nil
  60. i.Reset(0)
  61. }
  62. func (i *Iter) done() bool {
  63. return i.pNext >= len(i.str) && i.pNext >= len(i.bytes)
  64. }
  65. func (i *Iter) appendNext() bool {
  66. if i.done() {
  67. return false
  68. }
  69. var sz int
  70. if i.bytes == nil {
  71. i.Elems, sz = i.Weighter.AppendNextString(i.Elems, i.str[i.pNext:])
  72. } else {
  73. i.Elems, sz = i.Weighter.AppendNext(i.Elems, i.bytes[i.pNext:])
  74. }
  75. if sz == 0 {
  76. sz = 1
  77. }
  78. i.pNext += sz
  79. return true
  80. }
  81. // Next appends Elems to the internal array. On each iteration, it will either
  82. // add starters or modifiers. In the majority of cases, an Elem with a primary
  83. // value > 0 will have a CCC of 0. The CCC values of collation elements are also
  84. // used to detect if the input string was not normalized and to adjust the
  85. // result accordingly.
  86. func (i *Iter) Next() bool {
  87. if i.N == len(i.Elems) && !i.appendNext() {
  88. return false
  89. }
  90. // Check if the current segment starts with a starter.
  91. prevCCC := i.Elems[len(i.Elems)-1].CCC()
  92. if prevCCC == 0 {
  93. i.N = len(i.Elems)
  94. i.pEnd = i.pNext
  95. return true
  96. } else if i.Elems[i.N].CCC() == 0 {
  97. // set i.N to only cover part of i.Elems for which prevCCC == 0 and
  98. // use rest for the next call to next.
  99. for i.N++; i.N < len(i.Elems) && i.Elems[i.N].CCC() == 0; i.N++ {
  100. }
  101. i.pEnd = i.pNext
  102. return true
  103. }
  104. // The current (partial) segment starts with modifiers. We need to collect
  105. // all successive modifiers to ensure that they are normalized.
  106. for {
  107. p := len(i.Elems)
  108. i.pEnd = i.pNext
  109. if !i.appendNext() {
  110. break
  111. }
  112. if ccc := i.Elems[p].CCC(); ccc == 0 || len(i.Elems)-i.N > maxCombiningCharacters {
  113. // Leave the starter for the next iteration. This ensures that we
  114. // do not return sequences of collation elements that cross two
  115. // segments.
  116. //
  117. // TODO: handle large number of combining characters by fully
  118. // normalizing the input segment before iteration. This ensures
  119. // results are consistent across the text repo.
  120. i.N = p
  121. return true
  122. } else if ccc < prevCCC {
  123. i.doNorm(p, ccc) // should be rare, never occurs for NFD and FCC.
  124. } else {
  125. prevCCC = ccc
  126. }
  127. }
  128. done := len(i.Elems) != i.N
  129. i.N = len(i.Elems)
  130. return done
  131. }
  132. // nextNoNorm is the same as next, but does not "normalize" the collation
  133. // elements.
  134. func (i *Iter) nextNoNorm() bool {
  135. // TODO: remove this function. Using this instead of next does not seem
  136. // to improve performance in any significant way. We retain this until
  137. // later for evaluation purposes.
  138. if i.done() {
  139. return false
  140. }
  141. i.appendNext()
  142. i.N = len(i.Elems)
  143. return true
  144. }
  145. const maxCombiningCharacters = 30
  146. // doNorm reorders the collation elements in i.Elems.
  147. // It assumes that blocks of collation elements added with appendNext
  148. // either start and end with the same CCC or start with CCC == 0.
  149. // This allows for a single insertion point for the entire block.
  150. // The correctness of this assumption is verified in builder.go.
  151. func (i *Iter) doNorm(p int, ccc uint8) {
  152. n := len(i.Elems)
  153. k := p
  154. for p--; p > i.N && ccc < i.Elems[p-1].CCC(); p-- {
  155. }
  156. i.Elems = append(i.Elems, i.Elems[p:k]...)
  157. copy(i.Elems[p:], i.Elems[k:])
  158. i.Elems = i.Elems[:n]
  159. }