tdk.tools

Various tools for working with Turkish text.

Module Contents

Functions

hecele

Split text into syllables.

get_syllable_type

Determine the type of a syllable according to aruz prosody rules.

get_letter_type

Determine the type of a letter.

lowercase

Remove all whitespace and punctuation from text and lowercase it.

dictionary_order

Get a tuple of indices that can be used as orthographic order.

counter

Find the total number of occurrences in word of the characters in targets.

streaks

Find streaks of consecutive targets in text.

max_streak

Find the length of the longest streak of consecutive targets in word.

distinct

Get a copy of the sequence with each element appearing once in input order.

Data

API

tdk.tools.__all__

['hecele', 'get_syllable_type', 'get_letter_type', 'lowercase', 'dictionary_order', 'counter', 'stre…

tdk.tools.hecele(text: str, /) -> list[str]

Split text into syllables.

>>> hecele("merhaba")
["mer", "ha", "ba"]
>>> hecele("ortaokul")
["or", "ta", "o", "kul"]
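
The splitting rule can be illustrated with a minimal sketch. It is not the library's implementation: the function name `syllabify` and the vowel set are assumptions made for this example, and the rule used (every syllable has exactly one vowel, and a consonant directly before a vowel opens that vowel's syllable) is the usual textbook rule for Turkish, not something confirmed for hecele. The input is assumed to be a single lowercase word.

```python
VOWELS = "aeıioöuüâîû"  # assumed vowel set, circumflexed vowels included


def syllabify(word: str) -> list[str]:
    """Split a lowercase Turkish word into syllables (illustrative sketch)."""
    syllables = []
    end = len(word)  # exclusive end of the syllable currently being cut
    i = len(word) - 1
    while i >= 0:  # scan from the end of the word towards the start
        if word[i] in VOWELS:
            # A consonant directly before the vowel opens the syllable.
            start = i - 1 if i > 0 and word[i - 1] not in VOWELS else i
            syllables.append(word[start:end])
            end = start
            i = start - 1
        else:
            i -= 1
    if end > 0:
        # Leftover leading consonants (e.g. the "t" in "tren") join the
        # first syllable; a word with no vowel is returned as one piece.
        if syllables:
            syllables[-1] = word[:end] + syllables[-1]
        else:
            syllables.append(word[:end])
    return syllables[::-1]


print(syllabify("merhaba"))
print(syllabify("ortaokul"))
```

For the two documented inputs this sketch yields the same splits as the examples above; its behaviour on other inputs may differ from hecele.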

tdk.tools.get_syllable_type(syllable: str, /) -> tdk.enums.SyllableType

Determine the type of a syllable according to aruz prosody rules.

The type of the syllable is defined as follows, where C is a consonant, V is a short vowel, and L is a long vowel:

tdk.tools.get_letter_type(letter: str, /) -> tdk.enums.LetterType

Determine the type of a letter.

Raises:

ValueError – If the letter is not a valid letter in VOWELS, LONG_VOWELS, or CONSONANTS.
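
The classification and its error case can be sketched as follows. Everything here is a stand-in: the enum `LetterKind`, its member names, the function `classify_letter`, and the contents of the three constants are assumptions for illustration, and may not match tdk.enums.LetterType or the module's real VOWELS, LONG_VOWELS, and CONSONANTS.

```python
from enum import Enum, auto

# Assumed contents; the module's actual constants may differ.
VOWELS = "aeıioöuü"
LONG_VOWELS = "âîû"
CONSONANTS = "bcçdfgğhjklmnprsştvyz"


class LetterKind(Enum):  # hypothetical stand-in for tdk.enums.LetterType
    SHORT_VOWEL = auto()
    LONG_VOWEL = auto()
    CONSONANT = auto()


def classify_letter(letter: str) -> LetterKind:
    """Classify a single lowercase letter, or raise ValueError."""
    # Guard the length first: "" and multi-character substrings would
    # otherwise pass the `in` checks on the strings below.
    if len(letter) == 1:
        if letter in VOWELS:
            return LetterKind.SHORT_VOWEL
        if letter in LONG_VOWELS:
            return LetterKind.LONG_VOWEL
        if letter in CONSONANTS:
            return LetterKind.CONSONANT
    raise ValueError(f"not a valid letter: {letter!r}")


print(classify_letter("a"), classify_letter("â"), classify_letter("ş"))
```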

tdk.tools.lowercase(text: str, /, *, keep_nonletters: bool = False, remove_hats: bool = True) -> str

Remove all whitespace and punctuation from text and lowercase it.

Parameters:
  • text – The text to be lowercased.

  • keep_nonletters

    If truthy, characters outside the Turkish alphabet, including whitespace and punctuation, are kept.

    >>> lowercase("geçti Bor'un pazarı (sür eşeğini Niğde'ye)",
    ...           keep_nonletters=False)  # The default
    "geçtiborunpazarısüreşeğininiğdeye"
    >>> lowercase("geçti Bor'un pazarı (sür eşeğini Niğde'ye)",
    ...           keep_nonletters=True)
    "geçti bor'un pazarı (sür eşeğini niğde'ye)"
    

  • remove_hats

    If truthy, letters with a circumflex are replaced with their plain counterparts (e.g. “â” becomes “a”).

    >>> lowercase("İKAMETGÂH", remove_hats=True)
    "ikametgah"
    >>> lowercase("İKAMETGÂH", remove_hats=False)
    "ikametgâh"
    

Returns:

A lowercase string.
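
A dedicated function is needed because Python's built-in str.lower() is not Turkish-aware: it maps "I" to "i" instead of "ı", and "İ" to "i" followed by a combining dot. The sketch below shows one way to get the documented behaviour. It is not the library's implementation: the name `lowercase_sketch`, the helper tables, and the letter set are assumptions, and the input is assumed to use precomposed characters.

```python
# Assumed letter set: the 29-letter alphabet plus the circumflexed vowels.
LETTERS = "abcçdefgğhıijklmnoöprsştuüvyzâîû"
DOTTED = str.maketrans({"I": "ı", "İ": "i"})  # Turkish casing of I and İ
HATS = str.maketrans("âîû", "aiu")            # strip circumflexes


def lowercase_sketch(text: str, *, keep_nonletters: bool = False,
                     remove_hats: bool = True) -> str:
    # Handle I and İ first, then let str.lower() do the remaining letters.
    lowered = text.translate(DOTTED).lower()
    if remove_hats:
        lowered = lowered.translate(HATS)
    if not keep_nonletters:
        # Drop whitespace, punctuation, digits and foreign letters.
        lowered = "".join(ch for ch in lowered if ch in LETTERS)
    return lowered


print(lowercase_sketch("İKAMETGÂH"))
print(lowercase_sketch("ISPARTA"), "ISPARTA".lower())
```

The last line contrasts the sketch with plain str.lower(), which turns "ISPARTA" into "isparta" with a dotted "i".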

tdk.tools.dictionary_order(word: str, /) -> tuple[int, ...]

Get a tuple of indices that can be used as orthographic order.

Returns:

A tuple of numbers suitable to be used as a dictionary order.

assert dictionary_order("algarina") < dictionary_order("zamansızlık")
assert dictionary_order("yumuşaklık") > dictionary_order("beşik")

Invariant

If B comes after A in the dictionary, dictionary_order(B) > dictionary_order(A).
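
A key like this matters because sorting by code point misplaces letters such as "ç", "ı", and "ş". The sketch below shows one way such a key could be built and used with sorted(). It is not the library's implementation: the name `order_key` and the alphabet string are assumptions, and the sketch only accepts lowercase words made of the 29 letters of the alphabet (a circumflexed letter raises KeyError here).

```python
ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"  # assumed Turkish alphabet order
RANK = {letter: index for index, letter in enumerate(ALPHABET)}


def order_key(word: str) -> tuple[int, ...]:
    """Map each letter to its position in the alphabet."""
    return tuple(RANK[letter] for letter in word)


words = ["zamansızlık", "çay", "cuma", "algarina", "ırmak", "iğne"]

# Tuples compare element by element, so the key gives dictionary order.
print(sorted(words, key=order_key))

# Plain sorted() compares code points and puts "çay" and "ırmak" last.
print(sorted(words))
```

Because tuples compare lexicographically, a word that is a prefix of another (such as "ev" and "evli") also sorts first, as it would in a dictionary.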

tdk.tools.counter(word: str, *, targets: str = VOWELS) -> int

Find the total number of occurrences in word of the characters in targets.

>>> counter(word="aaaaaBBBc", targets="c")
1
>>> counter(word="aaaaaBBBc", targets="b")
3
>>> counter(word="aaaaaBBBc", targets="cb")
4

word is sanitized using lowercase() before counting, but targets is not, so an uppercase target never matches:

>>> counter(word="aaaaaBBBc", targets="B")
0
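
The behaviour shown in these examples can be reproduced with a short sketch. It is not the library's implementation: the name `count_targets` and the default vowel string are assumptions, and the inline lowering only approximates lowercase() (it handles I and İ but does not strip non-letters or circumflexes).

```python
TURKISH_I = str.maketrans({"I": "ı", "İ": "i"})


def count_targets(word: str, *, targets: str = "aeıioöuü") -> int:
    """Count the characters of `word` that appear in `targets`."""
    # Only `word` is lowercased; `targets` is used exactly as given.
    cleaned = word.translate(TURKISH_I).lower()
    return sum(1 for ch in cleaned if ch in targets)


print(count_targets("aaaaaBBBc", targets="cb"))
print(count_targets("aaaaaBBBc", targets="B"))
```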

tdk.tools.streaks(text: str, /, *, targets: str = CONSONANTS) -> list[int]

Find streaks of consecutive targets in text.

>>> streaks("anapara")
[0, 1, 1, 1, 0]  # /a N /a P /a R /a /
>>> streaks("zorlanmak")
[1, 2, 2, 1]     # Z /o RL /a NM /a K /
>>> streaks("çözümlemek")
[1, 1, 2, 1, 1]  # Ç /ö Z /ü ML /e M /e K /
>>> streaks("tasdikletmek")
[1, 2, 2, 2, 1]  # T /a SD /i KL /e TM /e K /

tdk.tools.max_streak(word: str, *, targets: str = CONSONANTS) -> int

Find the length of the longest streak of consecutive targets in word.
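
The relationship between streaks and max_streak can be sketched as below. It is not the library's implementation: the names `streak_lengths` and `longest_streak` and the consonant string are assumptions. The sketch closes a streak at every non-target character, which matches the documented streaks examples; how the library treats two adjacent non-targets (for example the double vowel in "saat") is not stated, so the zero-length streak this sketch emits there is a guess.

```python
CONSONANTS = "bcçdfgğhjklmnprsştvyz"  # assumed contents of the constant


def streak_lengths(text: str, *, targets: str = CONSONANTS) -> list[int]:
    """Lengths of the runs of target characters between non-targets."""
    lengths = []
    run = 0
    for ch in text:
        if ch in targets:
            run += 1
        else:
            lengths.append(run)  # a non-target closes the current run
            run = 0
    lengths.append(run)  # run after the last non-target (may be 0)
    return lengths


def longest_streak(text: str, *, targets: str = CONSONANTS) -> int:
    return max(streak_lengths(text, targets=targets))


print(streak_lengths("zorlanmak"))
print(longest_streak("tasdikletmek"))
```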

tdk.tools.distinct(seq: collections.abc.Sequence[T]) -> collections.abc.Sequence[T]

Get a copy of the sequence with each element appearing once in input order.
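
An order-preserving deduplication of this kind can be sketched with a dict, since dict keys keep insertion order. It is not the library's implementation: the name `unique_in_order` is an assumption, the sketch always returns a list (the real function may return another sequence type), and it requires hashable elements.

```python
from collections.abc import Hashable, Sequence
from typing import TypeVar

T = TypeVar("T", bound=Hashable)


def unique_in_order(seq: Sequence[T]) -> list[T]:
    """Keep the first occurrence of each element, in input order."""
    # dict.fromkeys() keeps the first position of every key it sees.
    return list(dict.fromkeys(seq))


print(unique_in_order([3, 1, 3, 2, 1]))
print(unique_in_order("banana"))
```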