Inconsistent result with tokenization


I'm building a list of bigrams from my text data in Python. I want punctuation marks to be tokens as well; here's my code:

def get_all_bigrams():
    data = ""
    with open("data.txt", "r") as dataset:
        data = dataset.read()
    puncs: list = find_all_punctuations(data)
    print("".join(puncs))  # Output: `)*='[,-/:(&;]"!?.
    data = data.split()
    data = [str.lower(d) for d in data]
    print("Before punctuation separation: ", len(data), "tokens")

    # Separate punctuations into a separate string
    separated_puncs = []
    for d in data:
        puncd = False
        for p in puncs:
            if p in d:
                parted = d.partition(p)
                if parted[0] != '':
                    separated_puncs.append(parted[0])
                if parted[1] != '':
                    separated_puncs.append(parted[1])
                if parted[2] != '':
                    separated_puncs.append(parted[2])
                puncd = True
                break
        if not puncd:
            separated_puncs.append(d)
    print("After punctuation separation: ", len(separated_puncs), "tokens")
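For context, str.partition splits only at the first occurrence of the separator and always returns a 3-tuple of (head, separator, tail), with empty strings for any missing part. Combined with the break, each word is therefore split at most once, on whichever punctuation mark happens to come first in puncs. A quick illustration with a made-up word:

>>> "it's,fine".partition("'")
('it', "'", 's,fine')
>>> "it's,fine".partition(",")
("it's", ',', 'fine')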

Here are my inconsistent results:

# Iteration 1
Before punctuation separation: 992315 tokens
After punctuation separation: 1124139 tokens

# Iteration 2
Before punctuation separation: 992315 tokens
After punctuation separation: 1123467 tokens

# Iteration 3
Before punctuation separation: 992315 tokens
After punctuation separation: 1123831 tokens

I did some digging and saved the outputs of two runs to text files, and I'm confused:

# 1.txt
54172 ta
54173 '
54174 kul,

# 2.txt
54172 ta'kul
54173 ,
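Both layouts are reachable from the same word, depending on which punctuation mark partition is called with first (an illustration, not part of the saved output):

>>> "ta'kul,".partition("'")
('ta', "'", 'kul,')
>>> "ta'kul,".partition(",")
("ta'kul", ',', '')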

How did I get an inconsistent number of tokens after partitioning on punctuation?

I remember reading something about Python lists returning random orders every time a script is executed. Is that the case here?

Thank you for your time.

EDIT: Added the definition of find_all_punctuations, which, as the accepted answer points out, is the source of the inconsistency.

import string

def find_all_punctuations(text):
    all_punc = [char for char in text if char in string.punctuation]
    unique_punc = list(set(all_punc))
    return unique_punc
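Since list(set(...)) inherits the set's iteration order, and Python randomizes string hashes per process by default (PYTHONHASHSEED), the order of puncs can change on every run, which in turn changes which punctuation mark each word is partitioned on. A minimal sketch of a deterministic variant, assuming any stable order of punctuation marks is acceptable here:

import string

def find_all_punctuations(text):
    # sorted() pins the iteration order, so puncs is identical across runs
    return sorted(set(char for char in text if char in string.punctuation))

Alternatively, running the script with PYTHONHASHSEED set to a fixed value pins the hash seed without touching the code.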