tdk.tools
Various tools for working with Turkish text.
Module Contents
Functions
- hecele: Split text into syllables.
- get_syllable_type: Determine the type of a syllable according to aruz prosody rules.
- get_letter_type: Determine the type of a letter.
- lowercase: Remove all whitespace and punctuation from text and lowercase it.
- dictionary_order: Get a tuple of indices that can be used as orthographic order.
- counter: Find total number of occurrences of each element in targets.
- streaks: Find streaks of consecutive targets in text.
- max_streak: Find the maximum consecutive targets in word.
- distinct: Get a copy of the sequence with each element appearing once in input order.
Data
API
- tdk.tools.__all__
['hecele', 'get_syllable_type', 'get_letter_type', 'lowercase', 'dictionary_order', 'counter', 'stre…
- tdk.tools.hecele(text: str, /) -> list[str]
Split text into syllables.
>>> hecele("merhaba")
["mer", "ha", "ba"]
>>> hecele("ortaokul")
["or", "ta", "o", "kul"]
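The two examples above are consistent with the common Turkish rule that every syllable holds exactly one vowel and that a consonant directly before a vowel starts that vowel's syllable. The sketch below follows that rule only; it is not the library's implementation, and the name `syllabify` and the `VOWELS` string are assumptions made for illustration. It expects lowercase input containing only letters.

```python
# Assumed vowel set (short vowels plus circumflexed ones); not taken from tdk.
VOWELS = "aeıioöuüâîû"


def syllabify(word: str) -> list[str]:
    """Hypothetical helper: split a lowercase Turkish word into syllables."""
    if not word:
        return []
    vowel_positions = [i for i, ch in enumerate(word) if ch in VOWELS]
    cuts = [0]
    for pos in vowel_positions[1:]:
        # A consonant directly before the vowel opens the new syllable;
        # if the previous letter is a vowel, the new syllable starts here.
        cuts.append(pos if word[pos - 1] in VOWELS else pos - 1)
    cuts.append(len(word))
    return [word[a:b] for a, b in zip(cuts, cuts[1:])]


print(syllabify("merhaba"))   # ['mer', 'ha', 'ba']
print(syllabify("ortaokul"))  # ['or', 'ta', 'o', 'kul']
```

Loanwords with unusual consonant clusters may be split differently by the real function.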
- tdk.tools.get_syllable_type(syllable: str, /) -> tdk.enums.SyllableType
Determine the type of a syllable according to aruz prosody rules.
The type of the syllable is defined as follows, where C is a consonant, V is a short vowel, and L is a long vowel:
  - If the syllable is of the form LC, CLC, VCC, or CVCC, it is SyllableType.MEDLI.
  - If the syllable ends with a short vowel, it is SyllableType.OPEN.
  - Otherwise, it is SyllableType.CLOSED.
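The three rules can be sketched in plain Python as follows. This is an illustration of the rules as stated, not the library's code: the `SyllableType` enum here is a local stand-in for `tdk.enums.SyllableType`, the letter sets are assumptions, and `shape` and `classify` are hypothetical names.

```python
from enum import Enum

# Assumed letter sets; the library's own constants may differ.
SHORT_VOWELS = "aeıioöuü"
LONG_VOWELS = "âîû"


class SyllableType(Enum):
    """Local stand-in for tdk.enums.SyllableType."""
    OPEN = "open"
    CLOSED = "closed"
    MEDLI = "medli"


def shape(syllable: str) -> str:
    """Map each letter to L, V, or C (anything that is not a vowel counts as C)."""
    return "".join(
        "L" if ch in LONG_VOWELS else "V" if ch in SHORT_VOWELS else "C"
        for ch in syllable
    )


def classify(syllable: str) -> SyllableType:
    form = shape(syllable)
    if form in {"LC", "CLC", "VCC", "CVCC"}:
        return SyllableType.MEDLI
    if form.endswith("V"):
        return SyllableType.OPEN
    return SyllableType.CLOSED


print(classify("ba"))    # SyllableType.OPEN
print(classify("mer"))   # SyllableType.CLOSED
print(classify("türk"))  # SyllableType.MEDLI
```

Note that under these rules a syllable ending in a long vowel (form CL) falls through to CLOSED.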
- tdk.tools.get_letter_type(letter: str, /) -> tdk.enums.LetterType
Determine the type of a letter.
  - If the letter is a vowel without a circumflex, it is a LetterType.SHORT_VOWEL.
  - If the letter is a vowel with a circumflex, it is a LetterType.LONG_VOWEL.
  - If the letter is a consonant, it is a LetterType.CONSONANT.
- Raises:
ValueError – If the letter is not a valid letter in VOWELS, LONG_VOWELS, or CONSONANTS.
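A minimal sketch of this lookup, assuming the three constants are plain strings of lowercase letters. The values below are guesses based on the Turkish alphabet, `LetterType` is a local stand-in for `tdk.enums.LetterType`, and `letter_type` is a hypothetical name.

```python
from enum import Enum

# Assumed contents; the real VOWELS, LONG_VOWELS, and CONSONANTS may differ.
VOWELS = "aeıioöuü"
LONG_VOWELS = "âîû"
CONSONANTS = "bcçdfgğhjklmnprsştvyz"


class LetterType(Enum):
    """Local stand-in for tdk.enums.LetterType."""
    SHORT_VOWEL = "short_vowel"
    LONG_VOWEL = "long_vowel"
    CONSONANT = "consonant"


def letter_type(letter: str) -> LetterType:
    # Require exactly one character: "" and multi-letter substrings would
    # otherwise pass the `in` checks on strings below.
    if len(letter) == 1:
        if letter in VOWELS:
            return LetterType.SHORT_VOWEL
        if letter in LONG_VOWELS:
            return LetterType.LONG_VOWEL
        if letter in CONSONANTS:
            return LetterType.CONSONANT
    raise ValueError(f"not a valid letter: {letter!r}")
```

With these assumed sets, a letter outside the Turkish alphabet such as "q" raises ValueError.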
- tdk.tools.lowercase(text: str, /, *, keep_nonletters: bool = False, remove_hats: bool = True) -> str
Remove all whitespace and punctuation from text and lowercase it.
- Parameters:
text – The text to be lowercased.
keep_nonletters – If truthy, characters that are not in the Turkish alphabet are kept. This includes whitespace and punctuation.
>>> lowercase("geçti Bor'un pazarı (sür eşeğini Niğde'ye)",
...           keep_nonletters=False)  # The default
"geçtiborunpazarısüreşeğininiğdeye"
>>> lowercase("geçti Bor'un pazarı (sür eşeğini Niğde'ye)",
...           keep_nonletters=True)
"geçti bor'un pazarı (sür eşeğini niğde'ye)"
remove_hats – If truthy, characters with circumflexes are replaced with their non-circumflexed counterparts (e.g. "â" is replaced with "a").
>>> lowercase("İKAMETGÂH", remove_hats=True)
"ikametgah"
>>> lowercase("İKAMETGÂH", remove_hats=False)
"ikametgâh"
- Returns:
A lowercase string.
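One reason a dedicated function is useful: Python's built-in `str.lower()` is not Turkish-aware. It maps "I" to "i" rather than "ı", and "İ" to "i" followed by a combining dot. The sketch below shows one way the documented behaviour could be reproduced; it is not the library's implementation, and `lowercase_sketch`, `ALPHABET`, and `HATTED` are hypothetical names with assumed contents.

```python
# Assumed 29-letter Turkish alphabet and circumflex mapping.
ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"
HATTED = {"â": "a", "î": "i", "û": "u"}


def lowercase_sketch(
    text: str, *, keep_nonletters: bool = False, remove_hats: bool = True
) -> str:
    # Handle the dotted and dotless I explicitly before calling str.lower().
    text = text.replace("İ", "i").replace("I", "ı").lower()
    if remove_hats:
        for hatted, plain in HATTED.items():
            text = text.replace(hatted, plain)
    if keep_nonletters:
        return text
    # Keep only alphabet letters (and circumflexed vowels, if still present).
    allowed = ALPHABET + "".join(HATTED)
    return "".join(ch for ch in text if ch in allowed)


print(lowercase_sketch("İKAMETGÂH"))                     # ikametgah
print(lowercase_sketch("İKAMETGÂH", remove_hats=False))  # ikametgâh
print(lowercase_sketch("ISI"))                           # ısı
```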
- tdk.tools.dictionary_order(word: str, /) -> tuple[int, ...]
Get a tuple of indices that can be used as orthographic order.
- Returns:
A tuple of numbers suitable to be used as a dictionary order.
assert dictionary_order("algarina") < dictionary_order("zamansızlık")
assert dictionary_order("yumuşaklık") > dictionary_order("beşik")
Invariant
If B comes after A in the dictionary, dictionary_order(B) > dictionary_order(A).
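A tuple key of this kind matters because sorting Turkish words by Unicode code point puts letters such as "ç" and "ı" in the wrong place. The sketch below builds such a key from an assumed 29-letter alphabet string. `order_key` and `ALPHABET` are hypothetical names, and this simplified version silently skips any character outside that alphabet, which the real function may handle differently.

```python
# Assumed Turkish alphabet in dictionary order.
ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"


def order_key(word: str) -> tuple[int, ...]:
    """Hypothetical helper: one alphabet index per letter of the word."""
    return tuple(ALPHABET.index(ch) for ch in word if ch in ALPHABET)


words = ["dal", "çay", "cam", "iş", "ısı"]
print(sorted(words))                 # code-point order: ['cam', 'dal', 'iş', 'çay', 'ısı']
print(sorted(words, key=order_key))  # Turkish order:    ['cam', 'çay', 'dal', 'ısı', 'iş']
```

Because tuples compare element by element, the key can be passed directly to `sorted` or `list.sort`.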
- tdk.tools.counter(word: str, *, targets: str = VOWELS) -> int
Find the total number of occurrences in word of the elements in targets.
>>> counter(word="aaaaaBBBc", targets="c")
1
>>> counter(word="aaaaaBBBc", targets="b")
3
>>> counter(word="aaaaaBBBc", targets="cb")
4
word is sanitized using lowercase() before counting, so an uppercase target such as "B" finds no matches:
>>> counter(word="aaaaaBBBc", targets="B")
0
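The documented results can be reproduced with a short sketch. This is an assumption about the behaviour, not the library's code: `count_targets` is a hypothetical name, the default target string is an assumed vowel set, and plain `str.lower()` stands in for `lowercase()`, which is only equivalent for simple ASCII input like the examples here.

```python
def count_targets(word: str, *, targets: str = "aeıioöuü") -> int:
    """Hypothetical helper: count letters of `word` that appear in `targets`."""
    word = word.lower()  # stand-in for tdk.tools.lowercase()
    # `targets` is deliberately left untouched, as in the documented example.
    return sum(ch in targets for ch in word)


print(count_targets("aaaaaBBBc", targets="c"))   # 1
print(count_targets("aaaaaBBBc", targets="cb"))  # 4
print(count_targets("aaaaaBBBc", targets="B"))   # 0
```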
- tdk.tools.streaks(text: str, /, *, targets: str = CONSONANTS) -> list[int]
Find streaks of consecutive targets in text.
>>> streaks("anapara")
[0, 1, 1, 1, 0]  # /a N /a P /a R /a /
>>> streaks("zorlanmak")
[1, 2, 2, 1]  # Z /o RL /a NM /a K /
>>> streaks("çözümlemek")
[1, 1, 2, 1, 1]  # Ç /ö Z /ü ML /e M /e K /
>>> streaks("tasdikletmek")
[1, 2, 2, 2, 1]  # T /a SD /i KL /e TM /e K /
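In the examples above, every non-target letter closes the current streak and opens a new one, which is why "anapara" starts and ends with 0. A minimal sketch of that reading follows. `streak_lengths` is a hypothetical name, the `CONSONANTS` value is an assumption, and the handling of adjacent non-target letters (each one opens a new zero-length streak) is a guess, since the documentation does not show that case.

```python
# Assumed consonant set; the library's CONSONANTS constant may differ.
CONSONANTS = "bcçdfgğhjklmnprsştvyz"


def streak_lengths(text: str, *, targets: str = CONSONANTS) -> list[int]:
    """Hypothetical helper: lengths of target runs separated by non-targets."""
    result = [0]
    for ch in text:
        if ch in targets:
            result[-1] += 1   # extend the current streak
        else:
            result.append(0)  # a non-target letter starts a new streak
    return result


print(streak_lengths("anapara"))    # [0, 1, 1, 1, 0]
print(streak_lengths("zorlanmak"))  # [1, 2, 2, 1]
```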
- tdk.tools.max_streak(word: str, *, targets: str = CONSONANTS) -> int
Find the length of the longest streak of consecutive targets in word.
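No example is given for this function. Assuming it returns the length of the longest run of target letters, the idea can be sketched with `itertools.groupby`. `longest_run` is a hypothetical name and the `CONSONANTS` value is an assumption.

```python
from itertools import groupby

# Assumed consonant set; the library's CONSONANTS constant may differ.
CONSONANTS = "bcçdfgğhjklmnprsştvyz"


def longest_run(word: str, *, targets: str = CONSONANTS) -> int:
    """Hypothetical helper: length of the longest run of target letters."""
    return max(
        (
            sum(1 for _ in group)
            for is_target, group in groupby(word, key=lambda ch: ch in targets)
            if is_target
        ),
        default=0,  # no target letters at all
    )


print(longest_run("tasdikletmek"))  # 2
print(longest_run("kontrplak"))     # 5
```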
- tdk.tools.distinct(seq: collections.abc.Sequence[T]) -> collections.abc.Sequence[T]
Get a copy of the sequence with each element appearing once in input order.
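A common way to get this behaviour in Python relies on dictionaries preserving insertion order. The sketch below uses that idiom; it is not necessarily how the library does it. `unique_in_order` is a hypothetical name, it always returns a list (the real function may preserve the input type), and it requires hashable elements.

```python
from collections.abc import Hashable, Sequence
from typing import TypeVar

T = TypeVar("T", bound=Hashable)


def unique_in_order(seq: Sequence[T]) -> list[T]:
    """Hypothetical helper: drop repeats, keeping first occurrences in order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(seq))


print(unique_in_order([3, 1, 3, 2, 1]))       # [3, 1, 2]
print("".join(unique_in_order("merhaba")))    # merhab
```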