Inconsistent result with tokenization


I'm building a list of bigrams from my text data in Python. I want punctuation marks to be tokens as well; here's my code:

def get_all_bigrams():
    data = ""
    with open("data.txt", "r") as dataset:
        data = dataset.read()
    puncs: list = find_all_punctuations(data)
    print("".join(puncs))  # Output: `)*='[,-/:(&;]"!?.
    data = data.split()
    data = [str.lower(d) for d in data]
    print("Before punctuation separation: ", len(data), "tokens")

    # Separate punctuations into a separate string
    separated_puncs = []
    for d in data:
        puncd = False
        for p in puncs:
            if p in d:
                parted = d.partition(p)
                if parted[0] != '':
                    separated_puncs.append(parted[0])
                if parted[1] != '':
                    separated_puncs.append(parted[1])
                if parted[2] != '':
                    separated_puncs.append(parted[2])
                puncd = True
                break
        if not puncd:
            separated_puncs.append(d)
    print("After punctuation separation: ", len(separated_puncs), "tokens")
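For context, str.partition splits only at the first occurrence of the separator and always returns a 3-tuple of (head, separator, tail), with empty strings for any missing part. Combined with the break, each word is therefore split at most once, on whichever punctuation mark happens to come first in puncs. A quick illustration with a made-up word:

>>> "it's,fine".partition("'")
('it', "'", 's,fine')
>>> "it's,fine".partition(",")
("it's", ',', 'fine')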

Here are my inconsistent results:

# Iteration 1
Before punctuation separation: 992315 tokens
After punctuation separation: 1124139 tokens

# Iteration 2
Before punctuation separation: 992315 tokens
After punctuation separation: 1123467 tokens

# Iteration 3
Before punctuation separation: 992315 tokens
After punctuation separation: 1123831 tokens

I did some digging and saved the outputs of two runs to text files, and I'm confused:

# 1.txt
54172 ta
54173 '
54174 kul,

# 2.txt
54172 ta'kul
54173 ,
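Both layouts are reachable from the same word, depending on which punctuation mark partition is called with first (an illustration, not part of the saved output):

>>> "ta'kul,".partition("'")
('ta', "'", 'kul,')
>>> "ta'kul,".partition(",")
("ta'kul", ',', '')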

How did I get an inconsistent number of tokens after partitioning on punctuation?

I remember reading something about Python lists returning random orders every time a script is executed. Is that the case here?

Thank you for your time.

EDIT: Added the definition of find_all_punctuations, which, as the accepted answer points out, is the source of the inconsistency.

import string

def find_all_punctuations(text):
    all_punc = [char for char in text if char in string.punctuation]
    unique_punc = list(set(all_punc))
    return unique_punc
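Since list(set(...)) inherits the set's iteration order, and Python randomizes string hashes per process by default (PYTHONHASHSEED), the order of puncs can change on every run, which in turn changes which punctuation mark each word is partitioned on. A minimal sketch of a deterministic variant, assuming any stable order of punctuation marks is acceptable here:

import string

def find_all_punctuations(text):
    # sorted() pins the iteration order, so puncs is identical across runs
    return sorted(set(char for char in text if char in string.punctuation))

Alternatively, running the script with PYTHONHASHSEED set to a fixed value pins the hash seed without touching the code.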