`tdk.tools`

Various tools for working with Turkish text.

Module Contents

Functions

`hecele`	Split text into syllables.
`get_syllable_type`	Determine the type of a syllable according to aruz prosody rules.
`get_letter_type`	Determine the type of a letter.
`lowercase`	Remove all whitespace and punctuation from text and lowercase it.
`dictionary_order`	Get a tuple of indices that can be used as orthographic order.
`counter`	Find total number of occurrences of each element in targets.
`streaks`	Find streaks of consecutive targets in text.
`max_streak`	Find the maximum consecutive targets in word.
`distinct`	Get a copy of the sequence with each element appearing once in input order.

Data

__all__

API

tdk.tools.__all__: ['hecele', 'get_syllable_type', 'get_letter_type', 'lowercase', 'dictionary_order', 'counter', 'stre…

tdk.tools.hecele(text: str, /) → list[str]

Split text into syllables.

>>> hecele("merhaba")
["mer", "ha", "ba"]
>>> hecele("ortaokul")
["or", "ta", "o", "kul"]

tdk.tools.get_syllable_type(syllable: str, /) → tdk.enums.SyllableType

Determine the type of a syllable according to aruz prosody rules.

The type of the syllable is defined as follows, where C is a consonant, V is a short vowel, and L is a long vowel:

If the syllable is of the form LC, CLC, VCC, or CVCC, it is SyllableType.MEDLI.
If the syllable ends with a short vowel, it is SyllableType.OPEN.
Otherwise, it is SyllableType.CLOSED.
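The three rules above can be sketched as a pattern match over letter classes. This is a minimal illustration, not the library's implementation: the `SyllableType` enum below is a hypothetical stand-in for `tdk.enums.SyllableType`, the vowel sets are assumed, and every character that is not a vowel is treated as a consonant.

```python
from enum import Enum, auto


class SyllableType(Enum):
    """Hypothetical stand-in for tdk.enums.SyllableType."""
    OPEN = auto()
    CLOSED = auto()
    MEDLI = auto()


SHORT_VOWELS = frozenset("aeıioöuü")  # assumed short vowels
LONG_VOWELS = frozenset("âîû")        # assumed long (circumflexed) vowels
MEDLI_PATTERNS = {"LC", "CLC", "VCC", "CVCC"}


def syllable_type_sketch(syllable: str) -> SyllableType:
    # Map each letter to V (short vowel), L (long vowel), or C (anything else).
    pattern = "".join(
        "V" if ch in SHORT_VOWELS else "L" if ch in LONG_VOWELS else "C"
        for ch in syllable
    )
    if pattern in MEDLI_PATTERNS:
        return SyllableType.MEDLI
    if pattern.endswith("V"):
        return SyllableType.OPEN
    return SyllableType.CLOSED
```

Under these assumptions, "ba" (CV) is open, "mer" (CVC) is closed, and "kâr" (CLC) is medli.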

tdk.tools.get_letter_type(letter: str, /) → tdk.enums.LetterType

Determine the type of a letter.

If the letter is a vowel without a circumflex, it is a LetterType.SHORT_VOWEL.
If the letter is a vowel with a circumflex, it is a LetterType.LONG_VOWEL.
If the letter is a consonant, it is a LetterType.CONSONANT.

Raises: ValueError – If the letter is not in VOWELS, LONG_VOWELS, or CONSONANTS.
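The classification and its error case could look roughly like the sketch below. The `LetterType` enum is a hypothetical stand-in for `tdk.enums.LetterType`, and the contents of `VOWELS`, `LONG_VOWELS`, and `CONSONANTS` are assumptions about what those constants hold.

```python
from enum import Enum, auto


class LetterType(Enum):
    """Hypothetical stand-in for tdk.enums.LetterType."""
    SHORT_VOWEL = auto()
    LONG_VOWEL = auto()
    CONSONANT = auto()


# Assumed contents of the letter constants; sets avoid substring matches.
VOWELS = frozenset("aeıioöuü")
LONG_VOWELS = frozenset("âîû")
CONSONANTS = frozenset("bcçdfgğhjklmnprsştvyz")


def letter_type_sketch(letter: str) -> LetterType:
    if letter in VOWELS:
        return LetterType.SHORT_VOWEL
    if letter in LONG_VOWELS:
        return LetterType.LONG_VOWEL
    if letter in CONSONANTS:
        return LetterType.CONSONANT
    # Anything else (digits, "q", multi-character strings) is rejected.
    raise ValueError(f"{letter!r} is not a recognized letter")
```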

tdk.tools.lowercase(text: str, /, *, keep_nonletters: bool = False, remove_hats: bool = True) → str

Remove all whitespace and punctuation from text and lowercase it.

Parameters:

text – The text to be lowercased.

keep_nonletters –

If truthy, characters that are not in the Turkish alphabet, including whitespace and punctuation, are kept.

>>> lowercase("geçti Bor'un pazarı (sür eşeğini Niğde'ye)",
...           keep_nonletters=False)  # The default
"geçtiborunpazarısüreşeğininiğdeye"
>>> lowercase("geçti Bor'un pazarı (sür eşeğini Niğde'ye)",
...           keep_nonletters=True)
"geçti bor'un pazarı (sür eşeğini niğde'ye)"

remove_hats –
If truthy, letters with circumflexes are replaced with their non-circumflexed counterparts (e.g. "â" is replaced with "a").
```
>>> lowercase("İKAMETGÂH", remove_hats=True)
"ikametgah"
>>> lowercase("İKAMETGÂH", remove_hats=False)
"ikametgâh"
```

Returns:

A lowercase string.
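Turkish lowercasing differs from Python's default `str.lower()` for the dotted and dotless I: "İ" should become "i" and "I" should become "ı". The sketch below shows one way the documented behavior could be reproduced. It is an approximation under assumed alphabet contents, not the library's code, and the name `lowercase_sketch` is hypothetical.

```python
# Assumed 29-letter Turkish alphabet plus the circumflexed vowels.
ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"
HATTED = "âîû"

# Map the Turkish capital I variants before calling str.lower().
TURKISH_UPPER = str.maketrans("İI", "iı")
REMOVE_HATS = str.maketrans(HATTED, "aiu")


def lowercase_sketch(text: str, *, keep_nonletters: bool = False,
                     remove_hats: bool = True) -> str:
    result = text.translate(TURKISH_UPPER).lower()
    if remove_hats:
        result = result.translate(REMOVE_HATS)
    if not keep_nonletters:
        # Drop whitespace, punctuation, and any non-Turkish character.
        allowed = ALPHABET + HATTED
        result = "".join(ch for ch in result if ch in allowed)
    return result
```

The two translation steps are kept separate so that each keyword argument toggles exactly one transformation.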

tdk.tools.dictionary_order(word: str, /) → tuple[int, ...]

Get a tuple of indices that can be used as orthographic order.

Returns: A tuple of numbers suitable for use as a sort key for dictionary order.

assert dictionary_order("algarina") < dictionary_order("zamansızlık")
assert dictionary_order("yumuşaklık") > dictionary_order("beşik")

Invariant

If B comes after A in the dictionary, dictionary_order(B) > dictionary_order(A).
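The invariant matters because Python's default string comparison orders letters by code point, which places "ç", "ğ", "ı", "ö", "ş", and "ü" after "z". A minimal sketch of the idea, assuming lowercase input made up only of the 29 Turkish letters (the real function may handle more cases), maps each letter to its alphabet position:

```python
# Assumed Turkish alphabet in dictionary order.
ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"


def dictionary_order_sketch(word: str) -> tuple[int, ...]:
    # str.index raises ValueError for characters outside the alphabet.
    return tuple(ALPHABET.index(letter) for letter in word)


words = ["dal", "çay", "cam"]
by_code_point = sorted(words)                             # ['cam', 'dal', 'çay']
by_alphabet = sorted(words, key=dictionary_order_sketch)  # ['cam', 'çay', 'dal']
```

Tuples compare element by element, so a shorter word that is a prefix of a longer one sorts first, as in a dictionary.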

tdk.tools.counter(word: str, *, targets: str = VOWELS) → int

Find total number of occurrences of each element in targets.

>>> counter(word="aaaaaBBBc", targets="c")
1
>>> counter(word="aaaaaBBBc", targets="b")
3
>>> counter(word="aaaaaBBBc", targets="cb")
4

word is sanitized using lowercase(), but targets is not, so an uppercase target never matches:

>>> counter(word="aaaaaBBBc", targets="B")
0
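The examples above are consistent with a single pass over the sanitized word that counts letters found in `targets`. The sketch below reproduces them under two assumptions: plain `str.lower()` stands in for `lowercase()`, and `VOWELS` holds the eight short vowels.

```python
VOWELS = "aeıioöuü"  # assumed default value of targets


def counter_sketch(word: str, *, targets: str = VOWELS) -> int:
    # Simplified stand-in for lowercase(); targets is used as given.
    sanitized = word.lower()
    return sum(1 for letter in sanitized if letter in targets)
```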

tdk.tools.streaks(text: str, /, *, targets: str = CONSONANTS) → list[int]

Find streaks of consecutive targets in text.

>>> streaks("anapara")
[0, 1, 1, 1, 0]  # /a N /a P /a R /a /
>>> streaks("zorlanmak")
[1, 2, 2, 1]     # Z /o RL /a NM /a K /
>>> streaks("çözümlemek")
[1, 1, 2, 1, 1]  # Ç /ö Z /ü ML /e M /e K /
>>> streaks("tasdikletmek")
[1, 2, 2, 2, 1]  # T /a SD /i KL /e TM /e K /
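In these examples every non-target letter acts as a separator, and the run before the first separator and after the last one is reported even when it is empty. A sketch that reproduces the four results above, assuming `CONSONANTS` holds the 21 Turkish consonants, is shown below. How the library treats adjacent non-targets is not stated here; this sketch would report a 0 between them.

```python
CONSONANTS = "bcçdfgğhjklmnprsştvyz"  # assumed default value of targets


def streaks_sketch(text: str, *, targets: str = CONSONANTS) -> list[int]:
    runs = [0]
    for letter in text:
        if letter in targets:
            runs[-1] += 1   # extend the current streak
        else:
            runs.append(0)  # a non-target closes the streak and opens a new one
    return runs
```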

tdk.tools.max_streak(word: str, *, targets: str = CONSONANTS) → int

Find the length of the longest streak of consecutive targets in word.
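One way to compute such a maximum is to group adjacent letters by whether they are targets and take the longest target group. The sketch below uses `itertools.groupby` for this. It assumes `CONSONANTS` holds the 21 Turkish consonants and returns 0 when no target occurs, which may differ from the library's behavior.

```python
from itertools import groupby

CONSONANTS = "bcçdfgğhjklmnprsştvyz"  # assumed default value of targets


def max_streak_sketch(word: str, *, targets: str = CONSONANTS) -> int:
    run_lengths = (
        sum(1 for _ in group)
        for is_target, group in groupby(word, key=lambda ch: ch in targets)
        if is_target
    )
    # default=0 covers words with no target letters at all.
    return max(run_lengths, default=0)
```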

tdk.tools.distinct(seq: collections.abc.Sequence[T]) → collections.abc.Sequence[T]

Get a copy of the sequence with each element appearing once in input order.
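Order-preserving deduplication is commonly written with `dict.fromkeys`, since dictionaries keep insertion order. The sketch below takes that approach. It assumes the elements are hashable and always returns a list, whereas the library's return type may mirror the input sequence.

```python
from collections.abc import Hashable, Sequence
from typing import TypeVar

T = TypeVar("T", bound=Hashable)


def distinct_sketch(seq: Sequence[T]) -> list[T]:
    # dict keys are unique and remember first-insertion order.
    return list(dict.fromkeys(seq))
```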
