{"id":2004,"date":"2025-09-04T07:00:41","date_gmt":"2025-09-04T12:00:41","guid":{"rendered":"https:\/\/wollen.org\/blog\/?p=2004"},"modified":"2025-09-04T06:12:10","modified_gmt":"2025-09-04T11:12:10","slug":"remember-wordle","status":"publish","type":"post","link":"https:\/\/wollen.org\/blog\/2025\/09\/remember-wordle\/","title":{"rendered":"Remember Wordle?"},"content":{"rendered":"<p>I&#8217;m probably two or three years late on this post. I&#8217;m not sure how many people still play <a href=\"https:\/\/www.nytimes.com\/games\/wordle\" target=\"_blank\" rel=\"noopener\">Wordle<\/a> with their morning coffee. At one point my streak was over 100 days, but in 2025 it&#8217;s only an occasional thing.<\/p>\n<p>Still, I think it would be fun to identify elite Wordle starting words. Which words give you the best chance of winning if you guess them first?<\/p>\n<hr \/>\n<h4>1. Wordle Theory.<\/h4>\n<p>From a theoretical perspective, I don&#8217;t think there can be a single &#8220;best&#8221; starting word. The optimal choice will depend on a player&#8217;s strategy and how they structure their thinking. Is it better to know two letters in the wrong position or a single letter in the correct position? Which letter position is the most valuable? Is it better to know a consonant or a vowel? These questions are at least partially player-dependent.<\/p>\n<p>There are no perfect answers, so we can&#8217;t build a model that perfectly accounts for them. As usual, we have to make imperfect assumptions and interpret the results in context.<\/p>\n<p>My approach will be to look at all five-letter words in the English language and count the frequency of each letter at each position. For example, the most common first letter is S. About 14% of five-letter words start with S. So when hunting for Wordle starting words, we will prioritize S-words. The most common final letter is E, so we&#8217;ll also prioritize words that end with E. 
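Just to make the counting idea concrete, here&#8217;s a toy sketch using <code>collections.Counter<\/code> on a made-up four-word list (the model below does the same tally with pandas):<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">from collections import Counter\r\n\r\ntoy_words = [\"shown\", \"sores\", \"saree\", \"check\"]\r\n\r\n# Tally how often each letter appears in each of the five positions\r\nposition_counts = [Counter(word[i] for word in toy_words) for i in range(5)]\r\n\r\nprint(position_counts[0])  # Counter({'s': 3, 'c': 1})<\/pre>\n<p>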
The idea is to assign points to words depending on how frequently their letters show up at each position.<\/p>\n<p>This approach has a couple of limitations:<\/p>\n<p><strong>(1)<\/strong> The folks at Wordle Headquarters hand-select their words, which undoubtedly introduces some biases. Using the entire English language will fail to account for them.<\/p>\n<p><strong>(2)<\/strong> It doesn&#8217;t consider how the first guess feeds into the second guess. It essentially optimizes for the first turn. We could build a more complicated model to play the game with logic at each step, but that&#8217;s well beyond the scope of this post.<\/p>\n<p>I&#8217;m not training for the Wordle World Championships, so I think this exercise will be useful anyway.<\/p>\n<hr \/>\n<h4>2. The model.<\/h4>\n<p>I&#8217;ll use <a href=\"https:\/\/github.com\/mwiens91\/english-words-py\" target=\"_blank\" rel=\"noopener\">english-words-py<\/a> to load a word list. You&#8217;ll have to <code>pip install<\/code> the library before starting.<\/p>\n<p>Setting the parameter <code>lower<\/code> to <code>True<\/code> returns every word in lower case. That&#8217;s good because there&#8217;s no capitalization in Wordle.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">from english_words import get_english_words_set\r\n\r\nword_list = get_english_words_set(sources=[\"gcide\"], lower=True)\r\n<\/pre>\n<p>Use a list comprehension to filter out words that include punctuation. Convert each word string to a set and use the built-in <code>issubset<\/code> method. 
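For example, <code>issubset<\/code> accepts any iterable, including a plain string:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">alphabet = \"abcdefghijklmnopqrstuvwxyz\"\r\n\r\nprint(set(\"hello\").issubset(alphabet))  # True\r\nprint(set(\"can't\").issubset(alphabet))  # False\u2014the apostrophe isn't a letter<\/pre>\n<p>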
Any words that include non-alphabetic characters will be removed.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">word_list = [word for word in word_list if set(word).issubset(\"abcdefghijklmnopqrstuvwxyz\")]<\/pre>\n<p>Use another list comprehension to filter <code>word_list<\/code> down to five-letter words.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">word_list = [word for word in word_list if len(word) == 5]<\/pre>\n<p>Now let&#8217;s convert <code>word_list<\/code> into a five-column pandas DataFrame. We can do this with the unpacking operator (*).<\/p>\n<p>The code below creates five lists, each of which becomes a column in the DataFrame. I included a &#8220;hello world&#8221; example to make it clearer what&#8217;s happening.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import pandas as pd\r\n\r\n'''\r\nlist(zip(*['hello', 'world'])) == [('h', 'w'),\r\n                                   ('e', 'o'),\r\n                                   ('l', 'r'),\r\n                                   ('l', 'l'),\r\n                                   ('o', 'd')]\r\n'''\r\nletter1, letter2, letter3, letter4, letter5 = [list(characters) for characters in zip(*word_list)]\r\n\r\ndf = pd.DataFrame({'first': letter1,\r\n                   'second': letter2,\r\n                   'third': letter3,\r\n                   'fourth': letter4,\r\n                   'fifth': letter5})<\/pre>\n<p><code>df.head()<\/code> is shown below. Words read left to right but each letter is in a separate column. 
The structure will make it easy to tally letter frequencies.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">  first second third fourth fifth\r\n0     s      h     o      w     n\r\n1     k      e     b      o     b\r\n2     c      h     e      c     k\r\n3     m      a     l      t     y\r\n4     f      r     e      e     d<\/pre>\n<p>Before we create a <em>score<\/em> column, let&#8217;s measure letter frequencies using <code>value_counts<\/code> and store them in a dictionary. It tells us how many times a letter appears in each column. We should do this once upfront rather than calling <code>value_counts<\/code> repeatedly during the <code>apply<\/code> in the next step.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">counts_dict = {column: df[column].value_counts() for column in df.columns}<\/pre>\n<p>Now create a score column.<\/p>\n<p>We&#8217;ll normalize letter position frequencies by dividing each letter count by the maximum count in its column. To illustrate, E is very common in the fifth position. If we didn&#8217;t normalize scores, the model would reward matching E in the fifth position more than matching S in the first position. It&#8217;s debatable whether that&#8217;s good model design. We would have slightly more matches overall, but they would be in what I consider a less valuable position. I think it would be a mistake to arbitrarily weight the model this way.<\/p>\n<p>Regardless, once we&#8217;ve written a function that weights letter positions equally, it will be easy to go back and introduce weighting to match someone&#8217;s preference. 
I&#8217;ve played around with this idea and, without fairly extreme weights, it has minimal impact on the suggested starting words.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">def get_score(row):\r\n\r\n    score = 0\r\n\r\n    for col in df.columns:\r\n        max_points = counts_dict[col].max()\r\n        this_points = counts_dict[col][row[col]]\r\n        score += this_points \/ max_points\r\n\r\n    return score\r\n\r\n\r\ndf.loc[:, 'score'] = df.apply(get_score, axis=1)<\/pre>\n<p>That&#8217;s it. Sort the DataFrame by <em>score<\/em>, high to low, and print the results.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">df = df.sort_values(\"score\", ascending=False)\r\n\r\nprint(df.head(10))<\/pre>\n<hr \/>\n<h4>3. The output.<\/h4>\n<p><em>[Disclaimer: Several of these words aren&#8217;t accepted by the official version of Wordle, for various reasons. I&#8217;m displaying the DataFrame as-is for completeness&#8217; sake.]<\/em><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">     first second third fourth fifth     score\r\n1304     s      a     r      e     e  4.844063\r\n4570     s      o     r      e     e  4.734130\r\n6243     s      e     t      e     e  4.283245\r\n6038     s      a     r      s     e  4.256860\r\n5722     c      o     o      e     e  4.252834\r\n158      b      o     r      e     e  4.245192\r\n1316     s      e     r      i     e  4.223717\r\n707      s      o     o      t     e  4.092258\r\n3721     s      o     r      e     l  4.078741\r\n4958     d      o     r      e     e  4.070414<\/pre>\n<p>According to our model, <strong>SAREE<\/strong> is the best Wordle starting word.<\/p>\n<p>Do I agree? Maybe, but it has three vowels, including a double E. 
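Converting a word to a <code>set<\/code> collapses repeated letters, which makes duplicates easy to detect:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">print(len(set(\"saree\")))  # 4 (the double E collapses)\r\nprint(len(set(\"saint\")))  # 5 (all letters unique)<\/pre>\n<p>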
I would probably tweak the code to require five unique letters, like this:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">word_list = [word for word in word_list if len(set(word)) == 5]<\/pre>\n<p>That removes 35% of the list. Now the five best <em>acceptable<\/em> starting words, according to the model, are:<\/p>\n<ol>\n<li>SOREL<\/li>\n<li>SOAVE<\/li>\n<li>SAUTE<\/li>\n<li>SAUCE<\/li>\n<li>SAINT<\/li>\n<\/ol>\n<p>I actually know some of these words! They all seem like very good choices to me. I usually start with SLATE but I think I&#8217;ll give SOREL a try and see what happens.<\/p>\n<hr \/>\n<p>Just for fun, what does the model say are the worst acceptable starting words?<\/p>\n<ol>\n<li>ETHYL<\/li>\n<li>EMBOX<\/li>\n<li>EMBOW<\/li>\n<li>INDOW<\/li>\n<li>NYMPH<\/li>\n<\/ol>\n<p>Those do seem like awful choices, so thumbs up to the model.<\/p>\n<hr \/>\n<p><strong>Full code:<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">from english_words import get_english_words_set\r\nimport pandas as pd\r\n\r\n\r\ndef get_score(row):\r\n\r\n    score = 0\r\n\r\n    for col in df.columns:\r\n        max_points = counts_dict[col].max()\r\n        this_points = counts_dict[col][row[col]]\r\n        score += this_points \/ max_points\r\n\r\n    return score\r\n\r\n\r\nword_list = get_english_words_set(sources=[\"gcide\"], lower=True)\r\n\r\nword_list = [word for word in word_list if set(word).issubset(\"abcdefghijklmnopqrstuvwxyz\")]\r\n\r\nword_list = [word for word in word_list if len(word) == 5]\r\n\r\n# word_list = [word for word in word_list if len(set(word)) == 5]\r\n\r\nletter1, letter2, letter3, letter4, letter5 = [list(characters) for characters in zip(*word_list)]\r\n\r\ndf = pd.DataFrame({'first': letter1,\r\n                   'second': letter2,\r\n             
      'third': letter3,\r\n                   'fourth': letter4,\r\n                   'fifth': letter5})\r\n\r\ncounts_dict = {column: df[column].value_counts() for column in df.columns}\r\n\r\ndf.loc[:, 'score'] = df.apply(get_score, axis=1)\r\n\r\ndf = df.sort_values(\"score\", ascending=False)\r\n\r\nprint(df.head())<\/pre>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m probably two or three years late on this post. I&#8217;m not sure how many people still play Wordle with their morning coffee. At one point my streak was over 100 days, but in 2025 it&#8217;s only an occasional thing.<\/p>\n","protected":false},"author":1,"featured_media":3263,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[239],"tags":[410,39,22,425,424,411,422,429,428,403,44,407,401,402,426,431,432,412,423,30,406,25,421,413,416,420,419,415,418,414,417,430,409,408,427,405,400,404],"class_list":["post-2004","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-entertainment","tag-best","tag-code","tag-data","tag-embow","tag-embox","tag-english","tag-ethyl","tag-first-guess","tag-first-word","tag-game","tag-games","tag-model","tag-new-york-times","tag-nyt","tag-ogham","tag-opener","tag-opening","tag-optimal","tag-oxbow","tag-pandas","tag-puzzle","tag-python","tag-saint","tag-saree","tag-sarse","tag-sauce","tag-saute","tag-setee","tag-soave","tag-soree","tag-sorel","tag-starting-guess","tag-starting-word","tag-strategy","tag-unpack","tag-word","tag-wordle","tag-words"],"_links":{"self":[{"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/posts\/2004","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embedda
ble":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/comments?post=2004"}],"version-history":[{"count":36,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/posts\/2004\/revisions"}],"predecessor-version":[{"id":3266,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/posts\/2004\/revisions\/3266"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/media\/3263"}],"wp:attachment":[{"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/media?parent=2004"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/categories?post=2004"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/tags?post=2004"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}