Hotel Booking Demand Analysis

17 min readFeb 22, 2021

Authors: Zheng Ruoming, Wu Yu Han, Ng Shilin, Tan Hee, Ashley Nee

Introduction

Manpower, depreciation maintenance, these are some of the many fixed costs that the hospitality industry incurs on a day to day basis. Along with inflation, labour costs are also rising, hence sharply increasing the bottom line of hotels. Depending on the season, demand fluctuates and occupancy rate is affected as well. Every night when hotel rooms are left empty results in a loss in contribution margin, and over prolonged periods might drive the hotel chain out of business.

Some hotel chains put measures in place to increase the topline and reduce the bottomline in order to maximise the profit margins. This is often done by leveraging on events and engaging online platforms to market bundled products. However, these strategies are not useful in reducing the bottom line in low seasons and fixed cost still remains high during these periods. Hence, we can conclude that the key to profitability is to be adaptable to market conditions.

Nevertheless, the goal of the hospitality industry is not necessarily to get a 100 percent occupancy rate and maximum profits (that would be ideal), but rather to optimise revenue. Thus, developing a way to accurately forecast daily demand is a key operational challenge for the hospitality industry (Skodol, 2019).

With the increase in available data and ways to track and analyse the hotel’s performance, it has provided many opportunities to adopt data based strategies for the business to optimise profits, one of which is to forecast demand. By forecasting demand accurately, hotels can anticipate the market movement and make changes to their topline or bottom line accordingly.

Problems faced by the hospitality industry

A cancellation situation occurs when a customer books the room early in advance but cancels the room before the stipulated date of arrival. This results in empty rooms and a loss of potential revenue. As the market for the hospitality industry is very competitive, hotel chains are often lax on their cancellation policies in order to attract customers to book a room with them. In the event when too many bookings are cancelled, hotels will have to sell the inventory on discounted channels and recover a portion of the initial reservation (Freed,2016). In a sample survey conducted on more than 1,000 hotel and casino properties, an average of about 16 canceled rooms per day on average between January and May was recorded, signifying that this is a pressing issue which needs to be addressed.

An overbooking situation occurs when hotels allow customers to continue booking rooms even though they are already full. In this situation, the hotel attempts to hedge the event when a certain percentage of existing reservations are cancelled by the customers or ends up as “no show”. This is risky as without proper data and forecasting, it might result in the forceful termination of customers’ bookings when the demand is higher than the actual vacancy on the day itself. (Berner, 2017)

Underbooking is a situation which occurs when a hotel has failed to fulfill its satisfactory booking rate. This can occur when separate inventories are allocated on different distribution platforms. Although this eliminates the potential of the same room to be sold twice, it could also easily result in unnecessary vacant rooms. To avoid such situations, the hotel could pool the inventory together for the different distribution channels and manage them altogether using a property management system. (Xotels, 2019)

Business opportunities

Our approach can help to mitigate the omnipresent problem faced by the hotel industry, and is especially useful during the holiday peak seasons where hotel rooms are in very high demand. When business is in full swing due to peak season, having empty rooms is not the best thing that could happen to any hotel.

A popular strategy to counter this problem is to adopt a fluid pricing strategy based on anticipated demand of the hotel: depending on the length of stay and whether it is low/high season. On top of that, instead of laying off workers during the low season, seasonal wages can be given to existing employees to offset the training cost required for new employees during the high season. (Eiselman, 2016) However, this will require an accurate forecasting system in place, which is often hard to develop due to unprecedented events, such as cancellation and no show.

With a proper forecasting system that can predict the “true” demand, hotels can in turn better allocate allowance for overbooking, and occupancy rate, and potential revenue can be maximized. (Siteminder, 2020) Proper allocation of inventory to different distribution channels can also be done to minimize underbooking situations and optimize revenue

What we aim to accomplish is to develop a model that enables hotels to predict “actual” reservations in real-time, both at an individual and an aggregated level. This would allow them to determine if they are already booked at maximal capacity and whether to take in any more bookings to allow for a certain level of overbooking.

Additionally, we want to allow these hotels to predict the number of guests over time. We understand that hotels all have their peak seasons, where the number of staying guests is higher than over other periods. As such, it is critical for hotels to ensure that there are sufficient manpower and supplies, and minimize disruptive events such as renovations during these periods. By having a rough gauge on the number of staying guests beforehand, the hotels can further enhance their management and planning activities.

By offering these hotels a better understanding of their current demands, a reliable method to predict if a booking will end up as up as “cancelled”, or “no-show”, as well as a unique opportunity to look into the future demand for their rooms, we hope to empower them into making brilliant, data-driven decisions for the future.

Ultimately, on top of developing an accurate forecasting model, we hope to provide other useful insights which could improve on the existing operations of the hospitality industry.

Data Cleaning & Preparation

To investigate the issues in question, we will be utilizing a data set on hotel booking demand, which contains a wide range of attributes that could provide insights into the business problems mentioned above.

This data set contains booking information of a city hotel and a resort hotel, and we assume that both are located in the same country, owned by the same hotel chain. As most attributes apply to both hotels, we would be able to compare the analysis between the two hotels and craft unique business strategies for each of them.

To better understand the dimensionality of the data, we have grouped our attributes into the following six categories:

Lead Time (How far in advance the booking is made)
Booked Dates (Year, Month, Week, Day, number of weekends/weekdays stayed)
Customer Profile (Country of Origin, Number of Adults/Children/Babies, Customer Type)
Booking Details (Meal Type, Room Type, Daily Rate, Car Parking, Special Requests)
Channel (Deposit Type, Market Segment, Distribution Channel, Agent, Company)
Booking History (Previous Cancellations/Bookings, Number of Changes to Booking)

A note to make is that these six variable groups are based on a total of 32 variables in the dataset, which consist of 31 predictive variables and one target variable “is_cancelled”, which denotes whether a booking is cancelled eventually. Another variable that indicates the state of the booking is “reservation_status”, which categorizes the booking as canceled, check-out or no-show. Also, for confidentiality purposes, all personally identifying information has been removed from the data.

Data Exploration

The data exploration in the form of visualization will be conducted on Microsoft Power BI, through which we can understand each variable and its relationship with the target variable visually. The exploration will be grouped based on the six categories mentioned above and the key visualizations are shown in the following sections.

Lead Time

The lead time refers to the number of days that elapsed between the date of the booking and the arrival date. The value ranges from 0 to 737 days, with the average lead time being 104 days. When we plot a line chart to see the distribution of the lead time for different reservation status, we see an interesting story (Figure 1). A significant number of check-out bookings have a lead time of 0 days, and some of them also stretch out to as much as a few hundred days. This indicates that most bookings where the customers show up and eventually check out are made in less than a day before their stay. But there are also a number of check-out cases where customers make their bookings months and even years in advance, probably because they could be certain of their stay very early on.

On the other hand, the cancelled bookings shows a much flatter liner, and notably much less cases of last minute booking (lead time = 0). The lead time for the canceled bookings also tend to be under a year, unlike for check-out bookings.

2. Booked Dates

Booked dates are captured by the year, month, week and day of arrival. We plotted booked dates attributes (day of month, month, week) against the number of bookings to see whether there are time-related trends in the bookings.

When we look at the month and the week of the booked dates (Figure 2 and Figure 3), we see a clear trend where the number of bookings is the highest during summer months (July and August), and the lowest during winter months (November, December and January). This observation holds for both City Hotel and Resort Hotel, even though City Hotel seems to be affected by this seasonality more.

3. Country

As seen in Figure 4, a large proportion of customers are based in Europe. Most notably, 40.7% of the bookings are made by customers from Portugal (PRT). We can reasonably deduce that the two hotels are based in Europe, if not in Portugese, as 80% of tourists travel within their own region, according to the World Tourism Organization (Blackall, 2019)

Interestingly, when we look at the reservation status by country (Figure 5), we see that despite contributing to close to half of the bookings, more than half of the customers from Portugal also end up cancelling their bookings. Assuming that the hotels are located in Portugal, it appears that domestic travelers have a much higher chance of cancelling booking compared to foreign travelers.

4. Average Daily Rates

As seen in Figure 6, the average daily rate (adr) is between 60 Euro to 120 Euro. Room type A is one of the most economical for both hotels, which explains why it is the most popular room type.

When we study the seasonality of the average daily rate across the hotel, we have some interesting observations (Figure 7 and Figure 8). The adr across the year shows a trend similar to that of the booking demands, where the rates peak in summer months (July and August) and drop to the lowest in winter months (November to January). We also observe that the adr for Resort Hotel are more responsive to the time of year, as seen in the sharper increase during summer months, where the adr is almost comparable to that of City Hotel.

5. Deposit

We also discovered that the reservation status differs according to the deposit type. As seen in Figure 9, a majority of the check-out cases do not require a deposit. About ⅔ of the cancellation cases require no deposit, and about ⅓ are non-refund. It is rather unusual that non-refund bookings would have such a high number of cancellations, while refundable bookings have very few cancellations; it would be more intuitive if many customers cancel when the booking is refundable, as they would not suffer a loss of booking fees. However, our analysis later on may provide a viable explanation to this observation.

6. Booking Channel

From Figure 10, we see that City Hotel has many bookings via online travel agents (Online TA), followed by offline travel agents or travel operators (TA/TO) and Groups. While Resort Hotel also has a high number of bookings from Online TA and Offline TA/TO, it has a proportionally high number of direct bookings and corporate bookings as well, suggesting that individual customers would also approach Resort Hotel directly or book under the name of a company. Figure 11 confirms this observation as we see that while both hotels rely heavily on TA/TO, Resort Hotel also has bookings via direct channels and corporate.

Data Modelling

1. Predicting cancellation with MARS

Firstly, we used Caret to find the best model.

Since there are two tuning parameters associated with our MARS model:

(1) the degree of interactions

(2) the number of retained terms

We performed a grid search for each type of hotel to identify the optimal combination of these hyperparameters that minimises prediction error (i.e. highest cross-validation accuracy).

Resort Hotel: The optimal model retains 45 terms and includes up to 3rd degree interactions.

City Hotel: The optimal model retains 34 terms and includes up to 3rd degree interactions.

The following are the key findings from Resort Hotel model:

As “lead_time” increases from 0 to 30 days, “is_canceled” gets closer to being categorised as “yes”, even though as long as “lead_time” is less than 30 days, “is_canceled” is predicted to be “no” i.e. less than 0.5.
Above 30 days, “is_canceled” is predicted to be “yes” and the two variables are also linearly related with a positive gradient, albeit less steeply than when “lead_time” is below 30.
The higher the number of “adults”, the greater the predicted value of “is_canceled” and the steeper the gradient of “is_canceled” against “lead_time”. This means that “lead_time” has a greater impact on “is_canceled” when there are more “adults”.

The following are the key findings from City Hotel model:

“lead_time” in the City model shows a similar trend to that in the Resort model.
“Market_segment” OnlineTA and “deposit_type” NonRefund show a distinctly higher “is_canceled” value than the other levels in the respective graphs.
As long as there are “previous_bookings_not_canceled”, “is_canceled” is predicted to be below 0.5 and this value decreases as “previous_bookings_not_canceled” increases.
The more “total_of_special_requests”, the lower the predicted “is_canceled” value.

2. Finding important variables of cancellation with RF

This Resort Hotel model has an accuracy of 85.52% based on a 0.7 train-test split.

The City Hotel model has an accuracy of 86.09% based on a 0.7 train-test split.

It can be noted that “arrival_date_week_number” and “arrival_date_day_of_month” are the top 2 most important variables. This indicates the presence of seasonality, which will be addressed in the section below with Time-Series Forecasting.

3. Time Series Forecasting

A time-series model was conducted to forecast the number of rooms occupied in the hotel daily in the future. This will enable the hotel to conduct more efficient manpower and resource planning. The dataset is prepared separately for time-series forecasting as the timing of check-in, such as the date, week, month and year of arrival are all required for forecasting.

“Reservation_status” is removed as it provides the same information as “is.canceled”.

“Assigned_room_type” is removed as it is only made known to the customers when they arrive.

Bookings which are cancelled are removed as this model aims to predict only the number of customers staying in the hotel.

Each data entry corresponds to one room booked.

For customers who stayed in the hotel, the “reservation_status_date” will refer to their check-out date as it is the date when the last status is set.

The check-in date was created by combining the “arrival_date_year”, “arrival_date_month” and “arrival_date_day_of_month”.

The number of days stayed can be calculated by either subtracting the check-in date from check-out date, which is the “reservation_status_date”, or by summing up the number of stays on weekday nights and weekend nights. Both methods are conducted and compared so that inconsistent data could be removed. There were a total of 27 inconsistent data entries.

The total guest number was calculated by summing up the number of adults, children and babies. 154 entries with no guests are removed as the hotel rooms are not occupied.

The time period of this dataset on a whole is measured from the first check-in date to the last check-out date. It starts from 1 July 2015 to 14 September 2017.

Given that this dataset only recorded customers who check-in and check-out in the time period of 1 July 2015 to 14 September 2017, customers who stayed in the hotel during this period but check-in before or check-out after this time period will not be recorded in the data. Thus, the dataset could be an inaccurate portrayal of the actual number of customers staying in the hotel during that time period. Hence, to ensure that the model is able to forecast the accurate number of customers staying in the hotel at a point in time, 14 days were shaved off the start and end of the dataset respectively as 99% of the customers stayed less than 14 days.

This would mean that the dataset will now include almost all customers who stayed during the time period of 15 July 2015 to 31 August 2017.

For Resort Hotel:

Time period used for modelling: 15 July 2015 to 31 August 2017

By conducting a forecast for the next 60 days as shown in the graph , it can be concluded that the hotel will have about an average 174 rooms occupied daily.

For City Hotel:

Time period used for modelling: 15 July 2015 to 24 August 2017

By conducting a forecast for the next 60 days as shown in Forecasts for City Hotel, it can be concluded that the hotel will have about an average 216 rooms occupied daily.

Overall, from forecast for Resort Hotel and forecast for City Hotel , it can be concluded that there are usually more rooms occupied at the beginning and end of the year, which corresponds to the holiday season and the school holiday period. Additionally, the peak in the number of rooms occupied in mid-year would also correspond to the school holiday period.

This would tie in with the important variables picked up by the Random Forest model for both hotels, which are “arrival_date_week_number” and “arrival_date_day_of_month”.

Recommendations based on the results

Recommendation based on Previous Cancellation

As shown from the results, the customers who had previous cancellations are very likely to cancel again. In order to address this issue, we can put the deposit type as ‘Non-Refundable’ for those customers who have a previous record of cancellation so as to deter them from cancelling. Also, in the case which they decide to cancel, the company will still be able to gain some revenue and minimise the loss from cancellation.

In addition, the hotel can also limit the number of cancellations made by a customer. If the customer has exceeded the limit, the hotel can bar him/ her from booking again for a period of time.

2. Recommendation based on Deposit Type

Since the Deposit Type appeared as a top variable for Neural Network, Random Forest and MARS. It is shown that Non-refundable deposit type has the highest likelihood of cancellation. This shows that the hotel is able to identify the customers who are more likely to cancel thus charging them a deposit. In addition to the Non-Refundable deposit, the hotel should charge a higher ADR for Bookings which are refundable than those with no deposits so as to ensure that the company does not make a loss when bookings are cancelled and to ensure that they are able to accurately gauge the capacity of the hotel.

3. Recommendation based on Lead Time

Based on the Neural Network and MARS models, we learnt that two important rules that predict cancellation is lead time being less than 9.

With this, a recommendation for Resort Hotel to help minimize loss of profits from cancellation would be to allow for a higher percentage of overbooking for bookings made less than 9 days as the customers are more likely to cancel than those who booked more than 9 days ahead.

4. Recommendation based on Market Segment

For Resort Hotel, the market segment “Groups” is a unique predictor of cancellation. Group bookings, such as those booked for conferences or events, could have a higher chance of being cancelled as more guests are concerned, and hence higher uncertainties. It could also be that when group bookings are cancelled, it would involve a large number of guests and rooms, thus resulting in a high cancellation number.

To address this issue, the hotel can implement specific payment terms for bookings involving large groups to minimise loss if cancellation does take place. For instance, the hotel group bookings can only be made when a partial or full deposit is paid for, and should cancellation happen, the nearer the cancellation occurs to the booked date, the higher proportion of the deposit would be forfeited as a penalty.

5. Recommendation based on Seasonality

Based on the results of Random Forest we can see that the week number is one of the top variables for cancellation and results from MARS have shown that customers who make a booking in week 31 from Online Market Segments are more likely to cancel. This aligns with the results from the time series prediction as July and August (Week 28–36) are the peak period for booking.

To reduce the number of cancellations during this period, the hotel can prioritise the bookings that are made offline during this period as they have a lower likelihood of cancellation. Also, the company can also adjust their pricing by charging a higher room rate for week 28–36 and offer promotion during the off-peak season (week 1–5). This allows the company to manage the demand so that they can maximise the capacity during the off-peak period and maximise their profit during the peak period.

6. Ensuring Recommendation Effectiveness and Sustainability

The effectiveness of the recommendations can be measured by comparing the number of booking cancellations over time and comparing the number of customers staying in the hotel per day over time. The hotels can closely monitor the incidents of cancellations periodically to ensure their new policies have yielded results. The dataset will also have to be continuously updated so that the models can provide the most relevant analysis on the current situation faced by the hotels.

Conclusion

To conclude, Random Forest is the most reliable model in predicting the variables that contribute to cancellation. It has a predictive accuracy of 85.5% for the Resort Hotel dataset and 86.1% for City Hotel dataset. We have identified a few variables that are important across the 3 models: previous cancellation records, the types of deposits, the market segment, lead time for each booking, Average Daily Rates and the arrival week number. These factors allow us to predict the likelihood of cancellation for each booking to increase the overbooking percentage for the bookings that are highly likely to cancel and to adjust the hotel policies such as deposit type and pricing to deter cancellation. Aside from the prediction for cancellation, we also want to predict the demand so as to match the capacity of the hotel and maximise the profit. To find out the demand, we used the Winter’s method of time-series forecast with seasonality to predict the demand for the next time period. We found out that the peak period is from July- August and it is forecasted that there will be an average of 174 and 216 rooms occupied daily for Resort Hotel and City Hotel respectively. By predicting both the demand and cancellation, we are able to adjust our pricings for the different time periods to reduce underbooking and to maximise our profit.