In part I we talked about some simple features that can be extracted from just the date fields. We extracted 9 features. This part takes it to the next level. Using a public source for holiday information, features around distances of transactions to major events will be created.
Distance between Holidays
Retail is a very seasonal business. Often, people are buying for an occasion or near an occasion. Intuitively, people may be purchasing for Valentine’s day, Thanksgiving due to all the great sales, Christmas etc. To understand which customers are more driven by these special occurrences, a set of features need to be created that measures distances to these occurrences. Following steps are required to make this work:
- Pull a list of holidays/occurrences from a data source/API for a given geography
- Create a
pandas
data frame of these - Create new columns in the transaction data frame that computes the distance between transaction date and holiday date
Pull a list of holidays and create a DataFrame
There are a number of public sources like Wikipedia or sites that provides an API like timeanddate.com. For the purpose of this article, lets assume that all transactions are from U. S. Further, instead of using API, lets scrape this data from here. Code will use Beautiful Soup to extract the data and create a list of dictionary objects that can be loaded into pandas DataFrame
. You could also create a CSV and load the CSV into a data frame. Using this CSV approach lets you build a service that runs periodically to create this CSV and doesn’t slow down the actual processing / feature creation. You don’t want to hit the API every time you need holiday information as it is essentially static. For sake of simplicity, the list of dictionaries are loaded straight into a data frame.
[code lang=python]
from bs4 import BeautifulSoup
import urllib2
page=urllib2.urlopen("http://www.timeanddate.com/holidays/us/2015?hol=16#!hol=49")
holidays = BeautifulSoup(page.read())
print holidays.title
[/code]
Holidays and observances in United States in 2015
At this point, a representation of the page has been loaded in to memory. The structure of the page needs to be looked at to determine the right element to target. Fortunately, the page exposes the entire list of holidays in a table. Column titled Holiday Type will be used to filter all the values. For purpose of this article, rows with the following types will be used:
- National Holiday
- Observance
Now, to get to the data in the table, use Chrome to highlight the table element, right click to Inspect element, and then right click on the element in the HTML code and select Copy CSS Path
to get the reference of the table:
[code lang=python]
import string # for translation
table = string.maketrans("", "") # for removing punctuation etc
rows = holidays.select("body > div > div.main-content-div > div.fixed > table > tbody > tr")
holidays_list = []
for row in rows:
cols = list(row)
day = cols[0].string # first col is the day
holiday_type = cols[3].string
name = cols[2].string
if holiday_type is not None:
if "national holiday" in holiday_type.lower():
print day, name # purely to debug
holiday = {}
holiday['name'] = str(name).translate(table, string.punctuation+" ")
holiday['day'] = str(day) + ", 2001" # since all transactions are from 2001
holidays_list.append(holiday)
# now convert to data frame
holidays_frame = pd.DataFrame(holidays_list)
[/code]
To keep things simple, only national holidays were selected. This list could be expanded to other types of holidays – this is left as an exercise for the reader.
Create the distance features
Logic for creating these features is to take each row in the transaction table, compare the date of the transaction to every row in the holidays_frame
created above, and compute number of days ahead or behind that particular holiday.
[code lang=python]
new_frame = pd.DataFrame(holidays_list, index=holidays_frame['name']) # to help with the index selection
def compute_date_diff(x, y):
# convert x into date, y into date, compute date diff
date_x = datetime.strptime(x, "%Y-%m-%d %H:%M:%S" )
date_y = datetime.strptime(y, "%b %d, %Y")
return (date_y – date_x).days
for holiday in list(new_frame.index):
day = new_frame.loc[holiday, 'day']
print day
txs[holiday] = txs['date'].apply(compute_date_diff, args=(day,))
txs[['date'] + list(holidays_frame['name'])].head()
[/code]
This produces an output similar to one below (with some columns omitted):
idx | date | NewYearsDay | MartinLutherKingDay | PresidentsDay | MemorialDay | IndependenceDayobserved |
---|---|---|---|---|---|---|
0 | 2001-01-01 00:00:00 | 0 | 18 | 46 | 144 | 183 |
1 | 2001-01-01 00:00:00 | 0 | 18 | 46 | 144 | 183 |
2 | 2001-01-01 00:00:00 | 0 | 18 | 46 | 144 | 183 |
3 | 2001-01-01 00:00:00 | 0 | 18 | 46 | 144 | 183 |
It is easy to see how weather information from public APIs could also be added in, if there was location information, through customer address information or store information, was available.
If you were trying to build a regression model to predict the amount (dollar value) of purchases given a customer, product and date, you could train the model on 15 additional features that were just created!
Computing preferences for customers
Another interesting thing that could be done just with this data is to see which customers have a preference for a certain season, or around Valentines day or Christmas.
[code lang=python]
cust_xmas = txs.groupby('customer')['ChristmasDay'].mean()
cust_xmas.order()
[/code]
Customer | Days from Christmas |
---|---|
941891 | 328 |
1440263 | 328 |
420914 | 328 |
1638516 | 328 |
The data set used was not rich enough to have a wider variety of dates for transactions, but in a more real-world scenario, you would see how this would play out. Feel free to try with product class, or products. You can also combine columns like customer-product class to see if there is a specific preference for a customer for a given product class.
Conclusion
In this two part series we converted a simple date time column into over 15 columns!!! There is more information in that column that has not been teased out. For example, days between purchases per customer could be created. Then, this difference could be subtracted from a global average of days between purchases to determine if a customer purchases more often. A trendily could also be created for a given customer suggesting how often generally a customer purchases and calculate a probability they have churned if they don’t purchase for a given number of days. There are many such measures that could be still extracted. So go forth, and feature engineer!
One thought on “Feature engineering with Dates II – Merging data from public sources like holidays”