Remember Wordle?
I’m probably two or three years late on this post. I’m not sure how many people still play Wordle with their morning coffee. At one point my streak was over 100 days, but in 2025 it’s only an occasional thing.
Still, I think it would be fun to identify elite Wordle starting words. Which words give you the best chance of winning if you guess them first?
1. Wordle Theory.
From a theoretical perspective, I don’t think there can be a single “best” starting word. The optimal choice will depend on a player’s strategy and how they structure their thinking. Is it better to know two letters in the wrong position or a single letter in the correct position? Which letter position is the most valuable? Is it better to know a consonant or a vowel? These questions are at least partially player-dependent.
There are no perfect answers, so we can’t build a model that perfectly accounts for them. As usual, we have to make imperfect assumptions and interpret the results in context.
My approach will be to look at all five-letter words in the English language and count the frequency of each letter at each position. For example, the most common first letter is S. About 14% of five-letter words start with S. So when hunting for Wordle starting words, we will prioritize S-words. The most common final letter is E, so we’ll also prioritize words that end with E. The idea is to assign points to words depending on how frequently their letters show up at each position.
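To make the counting idea concrete, here is a minimal sketch on a five-word toy list. The list and the helper name raw_points are made up for illustration, and it uses collections.Counter rather than the pandas approach the post builds later.

```python
from collections import Counter

# Toy word list for illustration only; the post uses a full dictionary.
toy_words = ["slate", "saint", "sauce", "crane", "nymph"]

# zip(*toy_words) groups letters by position, so we get one Counter per
# position (index 0 = first letter, index 4 = last letter).
position_counts = [Counter(letters) for letters in zip(*toy_words)]

def raw_points(word):
    """Sum, over the five positions, how many toy words have that letter there."""
    return sum(position_counts[i][ch] for i, ch in enumerate(word))

for word in toy_words:
    print(word, raw_points(word))
```

In this toy list S leads three of the five words and E ends three of them, so SLATE and SAUCE collect the most points while NYMPH collects the fewest.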
This approach has a couple of limitations:
(1) The folks at Wordle Headquarters hand-select their words, which undoubtedly introduces some biases. Using the entire English language will fail to account for them.
(2) It doesn’t consider how the first guess feeds into the second guess. It essentially optimizes for the first turn. We could build a more complicated model to play the game with logic at each step, but that’s well beyond the scope of this post.
I’m not training for the Wordle World Championships, so I think this exercise will be useful anyway.
2. The model.
I’ll use english-words-py to load a word list. You’ll have to pip install the library before starting.
Setting the parameter lower to True converts every word to lower case. That’s good because there’s no capitalization in Wordle.
from english_words import get_english_words_set

word_list = get_english_words_set(sources=["gcide"], lower=True)
Use a list comprehension to filter out words that include punctuation. Convert each word string to a set and use the built-in issubset method. Any words that include non-alphabetic characters will be removed.
word_list = [word for word in word_list if set(word).issubset(list("abcdefghijklmnopqrstuvwxyz"))]
Use another list comprehension to filter word_list down to five-letter words.
word_list = [word for word in word_list if len(word) == 5]
Now let’s convert word_list into a five-column pandas DataFrame. We can do this by using the unpacking operator (*).
The code below creates five lists, each of which becomes a column in the DataFrame. I included a “hello world” example to make it clearer what’s happening.
import pandas as pd

# list(zip(*['hello', 'world'])) ==
# [('h', 'w'), ('e', 'o'), ('l', 'r'), ('l', 'l'), ('o', 'd')]

letter1, letter2, letter3, letter4, letter5 = [list(characters) for characters in zip(*word_list)]

df = pd.DataFrame({'first': letter1, 'second': letter2, 'third': letter3, 'fourth': letter4, 'fifth': letter5})
df.head() is shown below. Words read left to right but each letter is in a separate column. The structure will make it easy to tally letter frequencies.
  first second third fourth fifth
0     s      h     o      w     n
1     k      e     b      o     b
2     c      h     e      c     k
3     m      a     l      t     y
4     f      r     e      e     d
Before we create a score column, let’s measure letter frequencies using value_counts and store them in a dictionary keyed by column name. Each entry tells us how many times each letter appears in that column. We should do this once upfront rather than calling value_counts repeatedly during the apply in the next step.
counts_dict = {column: df[column].value_counts() for column in df.columns}
Now create a score column.
We’ll normalize the frequencies at each position by dividing each letter’s count by the count of the most common letter at that position, so the top letter in every column is worth exactly one point and a word can score at most five. To illustrate, E is very common in the fifth position. If we didn’t normalize scores, the model would reward matching E in the fifth position more than matching S in the first position. It’s debatable if that’s good model design. We would have slightly more matches overall, but they would be in what I consider a less valuable position. I think it would be a mistake to arbitrarily weight the model this way.
Regardless, because the function below weights letter positions equally, it would be easy to go back and introduce weighting to match someone’s preference. I’ve played around with this idea and, without fairly extreme weights, it has minimal impact on the suggested starting words.
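def get_score(row):
    # Each position contributes this letter's count divided by the count of
    # the most common letter at that position (a value between 0 and 1).
    score = 0
    # Loop over counts_dict so a re-run still works once 'score' is a column.
    for col in counts_dict:
        max_points = counts_dict[col].max()
        this_points = counts_dict[col][row[col]]
        score += this_points / max_points
    return score

df.loc[:, 'score'] = df.apply(get_score, axis=1)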
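For anyone who wants to experiment with position weights, here is a rough sketch of how that might look. It is not the code behind the results in this post: the toy word list, the weighted_score helper, and the weight values are all hypothetical, and it uses plain Python counters instead of the DataFrame.

```python
from collections import Counter

# Toy word list for illustration only.
toy_words = ["slate", "saint", "sauce", "crane", "nymph"]
position_counts = [Counter(letters) for letters in zip(*toy_words)]

# Hypothetical weights, one per position: the first letter counts extra and
# the last letter counts less. Equal weights reproduce the unweighted score.
position_weights = [1.5, 1.0, 1.0, 1.0, 0.5]

def weighted_score(word, weights=position_weights):
    """Normalized positional score with a per-position multiplier."""
    score = 0.0
    for i, ch in enumerate(word):
        max_count = max(position_counts[i].values())
        score += weights[i] * position_counts[i][ch] / max_count
    return score

equal = [1.0] * 5
for word in ("slate", "saint"):
    print(word, weighted_score(word, equal), weighted_score(word))
```

With only five words the ranking flips easily (SLATE leads with equal weights, SAINT leads with these weights), so this shows the mechanics only and says nothing about how sensitive the full word list is.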
That’s it. Sort the DataFrame by score, high to low, and print the results.
df = df.sort_values("score", ascending=False)
print(df.head(10))
3. The output.
[Disclaimer: Several of these words aren’t accepted by the official version of Wordle, for various reasons. I’m displaying the DataFrame as-is for completeness’ sake.]
     first second third fourth fifth     score
1304     s      a     r      e     e  4.844063
4570     s      o     r      e     e  4.734130
6243     s      e     t      e     e  4.283245
6038     s      a     r      s     e  4.256860
5722     c      o     o      e     e  4.252834
158      b      o     r      e     e  4.245192
1316     s      e     r      i     e  4.223717
707      s      o     o      t     e  4.092258
3721     s      o     r      e     l  4.078741
4958     d      o     r      e     e  4.070414
According to our model, SAREE is the best Wordle starting word.
Do I agree? Maybe, but it has three vowels, including a double E. I would probably tweak the code to require five unique letters, like this:
word_list = [word for word in word_list if len(set(word)) == 5]
That removes 35% of the list. Now the five best acceptable starting words, according to the model, are:
- SOREL
- SOAVE
- SAUTE
- SAUCE
- SAINT
I actually know some of these words! They all seem like very good choices to me. I usually start with SLATE but I think I’ll give SOREL a try and see what happens.
Just for fun, what does the model say are the worst acceptable starting words?
- ETHYL
- EMBOX
- EMBOW
- INDOW
- NYMPH
Those do seem like awful choices so thumbs up to the model.
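If you want to reproduce "best acceptable" and "worst acceptable" lists like these, one possible approach is to filter the ranked words against a list of guesses the game accepts. The sketch below assumes you have such a list; allowed_guesses here is a placeholder set, not the real Wordle list, and "abcde" is a made-up string standing in for a dictionary entry the game would reject.

```python
from collections import Counter

# Toy data for illustration only. "abcde" is a made-up stand-in for a
# dictionary entry that the game rejects.
toy_words = ["slate", "saint", "sauce", "crane", "nymph", "saree", "abcde"]

# Placeholder for a list of accepted guesses (NOT the real Wordle list).
allowed_guesses = {"slate", "saint", "sauce", "crane", "nymph", "saree"}

position_counts = [Counter(letters) for letters in zip(*toy_words)]

def score(word):
    """Normalized positional score: each position is worth at most 1 point."""
    return sum(
        position_counts[i][ch] / max(position_counts[i].values())
        for i, ch in enumerate(word)
    )

# Keep words with five unique letters that are also accepted guesses.
candidates = [w for w in toy_words if len(set(w)) == 5 and w in allowed_guesses]
ranked = sorted(candidates, key=score, reverse=True)

print("best:", ranked[0])
print("worst:", ranked[-1])
```

Here SAREE drops out because of its double E and the stand-in string drops out because it is not in the placeholder set.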
Full code:
from english_words import get_english_words_set
import pandas as pd

def get_score(row):
    # Each position contributes this letter's count divided by the count of
    # the most common letter at that position (a value between 0 and 1).
    score = 0
    for col in counts_dict:
        max_points = counts_dict[col].max()
        this_points = counts_dict[col][row[col]]
        score += this_points / max_points
    return score

word_list = get_english_words_set(sources=["gcide"], alpha=True, lower=True)
word_list = [word for word in word_list if set(word).issubset(list("abcdefghijklmnopqrstuvwxyz"))]
word_list = [word for word in word_list if len(word) == 5]
# Optional: uncomment to require five unique letters.
# word_list = [word for word in word_list if len(set(word)) == 5]

letter1, letter2, letter3, letter4, letter5 = [list(characters) for characters in zip(*word_list)]
df = pd.DataFrame({'first': letter1, 'second': letter2, 'third': letter3, 'fourth': letter4, 'fifth': letter5})

counts_dict = {column: df[column].value_counts() for column in df.columns}
df.loc[:, 'score'] = df.apply(get_score, axis=1)

df = df.sort_values("score", ascending=False)
print(df.head())