Functional approach to analysis of daily tax revenues

We present a functional data analysis approach to modeling and analyzing daily tax revenues. The main features of daily tax revenue that we need to extract are patterns within calendar months that can be used for prediction. Since standard seasonal time series techniques cannot be applied, due to the varying number of banking days per calendar month and the presence of seasonality both between and within months, we interpret monthly tax revenues as curves obtained from daily data. Standard smoothing techniques and registration, which accounts for time variability, are used for data preparation.


Introduction
The State Tax Inspectorate under the Ministry of Finance of the Republic of Lithuania (hereinafter referred to as STI) makes forecasts of daily tax revenues for the State budget according to historical data trends, based on expert experience. Predicting and assessing daily changes in revenue collection is a difficult task: the changes depend not only on the tax calendar but also on taxpayer behavior, especially when the payment date of an obligation coincides with a weekend or holiday.
The main goal of this research is to test some statistical methodologies that could improve tax revenue forecasts.
As far as we know, only a few papers are devoted to time-series models of daily tax revenues. Koopman and Ooms [1] provide a detailed discussion of this problem and suggest using a two-way mapping to transform irregular data into regular data. They use a state space model to forecast daily tax revenues. The later articles by Koopman and Ooms [2,3,4] present improved versions of the analysis of daily tax revenues.
In this paper we illustrate daily time series features with functional data analysis tools, using a series of Lithuanian aggregate tax revenues. Functional data are often characterized by both shape and phase variability. Tax revenue is a typical example where these two sources of variation are clearly identified and interpreted. An overall pattern is observed in which tax revenue accelerates around two fixed days. In this setting, phase variability is identified as variation in the calendar timing. Explicit treatment of phase variability is necessary in order to obtain consistent estimation of typical tax revenue patterns. This paper focuses on fitting a structural model and using it to forecast monthly tax revenue patterns; by a structural model we mean a model estimated from registered data. The main issue is to find appropriate warping functions. After adapting prediction techniques to tax revenue data, the most accurate predictions turn out to be those derived from functional principal component regression and exponential smoothing.
In Section 2 we present a preliminary analysis of the daily series of Lithuanian tax revenues, including smoothing and registration of the data. In Section 3 we discuss some prediction tools.

Preliminary analysis
We use a daily series of Lithuanian tax revenues, i.e. taxes, fees and other payments paid by taxpayers into the STI's budget revenue collection accounts. The data cover the period from January 2011 to February 2019 and have the form y_{k,j}, k = 1, . . . , n, j = 1, . . . , N_k, where k corresponds to regular time (months in our case) and the index j = 1, . . . , N_k corresponds to a time grid within period k.
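To make this layout concrete, the following minimal sketch (with purely synthetic values standing in for the actual STI series) splits a daily series into months with varying numbers N_k of banking days and places each month on its own rescaled grid j/N_k:

```python
import numpy as np
import pandas as pd

# Synthetic daily revenue series on business days (illustration only;
# a real application would also use the Lithuanian holiday calendar).
days = pd.bdate_range("2011-01-01", "2011-06-30")
rng = np.random.default_rng(1)
series = pd.Series(rng.uniform(1.0, 5.0, len(days)), index=days)

# Group by calendar month: month k yields observations y_{k,1}, ..., y_{k,N_k}
# on its own grid j/N_k, since N_k varies from month to month.
months = {
    period: (np.arange(1, len(y) + 1) / len(y), y.to_numpy())
    for period, y in series.groupby(series.index.to_period("M"))
}

for period, (grid, y) in months.items():
    print(period, "N_k =", len(y))
```

Each month's grid ends at 1, so all months live on the common interval [0, 1] despite their different numbers of banking days.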

Smoothing
We interpret the data as observations of random curves y_k = (y_k(s), s ∈ [0, 1]), k = 1, . . . , n. Moreover, we assume that the sample curves are observed at discrete instants of time with some noise, so that y_{k,j} = y_k(j/N_k) + ε_{k,j}, j = 1, . . . , N_k. Figure 1(a) shows the monthly patterns of the accumulated tax revenue data observed at discrete times (dots in Fig. 1(a); the interpolated lines are added for visualization, to make the trends within one month easier to recognize). Clearly, the number of banking days in a month is specific to each month, as is the calendar day on which each discrete data point is observed. We reconstruct each function y_k(s), s ∈ [0, 1], by smoothing, thus obtaining a functional sample ŷ_k(s), s ∈ [0, 1], k = 1, . . . , n, which we interpret as observations of random functions with values in the classical Hilbert space L²(0, 1). In order to obtain increasing and differentiable functions we need to choose an appropriate smoothing technique.
Since the process behind the raw accumulated tax revenue data is always increasing, a monotone transformation is used to smooth the data. Suppose W(t) is a conventional functional data object that is unconstrained except for W(t_0) = 0, where t_0 is the lower boundary of the interval over which we are smoothing.
Given the two clearly visible peaks in the middle and at the end of the month (see Fig. 1(a)), an examination of various smoothing techniques showed that splines can track such features with satisfactory accuracy.
In order to define the monotone smooth, a B-spline basis is used, and the constraint W(t_0) = 0 is easily achieved by fixing the first coefficient at zero. Each smoothed function then takes the form

ŷ_k(t) = β_0 + β_1 ∫_{t_0}^{t} exp{c′φ(u)} du,

where φ denotes the vector of B-spline basis functions and c is the vector of coefficients of the B-spline expansion of W. The coefficients c, together with β_0 and β_1, are estimated by minimizing the sum of squared errors. Figure 1(b) shows the accumulated tax revenue data smoothed in this way.
For k = 1, . . . , n, let x_k be the derivative of the function ŷ_k. We again interpret the sample x_1, . . . , x_n as observations of random curves, i.e. as random elements of the sample space L²(0, 1). Figure 2(a) shows the derivatives of the smoothed accumulated tax revenue data. Clear peaks are visible around the 15th and 25th days of the month, since the most important taxes have to be paid on these days, or on the next business day if a due date falls on a Saturday, Sunday or legal holiday. The peaks at the end and at the beginning of the month are due to the smoothing algorithm, as the information available near the boundaries is insufficient to estimate the smoothing parameters reliably.
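The due-date peaks of a derivative curve can be located automatically; a small sketch on a synthetic intensity with bumps near s = 0.5 and s = 0.83 (stand-ins for the 15th and 25th days):

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic intensity curve x_k(s): a flat baseline plus two bumps that
# stand in for the due-date peaks around the 15th and 25th days.
s = np.linspace(0.0, 1.0, 201)
x = (1.0
     + 5.0 * np.exp(-((s - 0.50) / 0.03) ** 2)
     + 4.0 * np.exp(-((s - 0.83) / 0.03) ** 2))

# Ignore the baseline: only local maxima above height 2 count as peaks.
peaks, _ = find_peaks(x, height=2.0)
peak_times = s[peaks]
```

The detected peak times can then serve directly as per-curve landmarks for the registration step described below.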

Registration
We assume that for each k = 1, . . . , n, there exists a (possibly random) time transformation v_k such that

X_k(s) = µ(v_k(s)) + η_k(v_k(s)) + ε_k(s), s ∈ [0, 1], (1)

where the non-random function µ(s), s ∈ [0, 1], can be interpreted as a structural mean, η_k(s), s ∈ [0, 1], accounts for the structural individual variation from µ, and ε_k is an error process. We assume that η_k and ε_k are independent and, moreover, that (ε_k) is a strong white noise. The function w_k = v_k^{-1} is called the warping function, and the random curve X*_k(s) = X_k(w_k(s)), s ∈ [0, 1], is the registered or aligned version of X_k. The random sample X*_1, . . . , X*_n is the structural sample corresponding to X_1, . . . , X_n and is the main object of analysis in this paper. Clearly, under model (1),

X*_k(s) = µ(s) + η_k(s) + ε_k(w_k(s)), s ∈ [0, 1].

Hence the techniques developed so far in functional data analysis can be applied to the statistical analysis of the structural sample. Several constructions of warping functions have been proposed in the literature (see, e.g., Ramsay et al. [5,6] and references therein). The first one we apply is the so-called landmark method, which seems to be the easiest. The method involves identifying the timings of specific features of the curves (the payment deadlines defined by law, in the tax revenue example) and then aligning the curves so that all these events occur at the same time for each curve.
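A minimal sketch of landmark registration with two landmarks per curve; the piecewise-linear warping used here is one simple choice, since the exact interpolation between landmarks is not specified above:

```python
import numpy as np

def landmark_register(curves, landmarks, grid):
    """Align each curve so its landmarks land on the average landmarks."""
    tau_bar = landmarks.mean(axis=0)              # average landmarks
    registered = []
    for x, (t1, t2) in zip(curves, landmarks):
        # Piecewise-linear warping w_i with w_i(0)=0, w_i(tau_bar_1)=t1,
        # w_i(tau_bar_2)=t2, w_i(1)=1; registered curve is x_i(w_i(s)).
        w = np.interp(grid, [0.0, tau_bar[0], tau_bar[1], 1.0],
                      [0.0, t1, t2, 1.0])
        registered.append(np.interp(w, grid, x))
    return registered, tau_bar

# Two synthetic intensity curves whose due-date peaks occur at slightly
# different times (as when a deadline falls on a weekend).
grid = np.linspace(0.0, 1.0, 401)
marks = np.array([[0.45, 0.80], [0.55, 0.86]])
curves = [np.exp(-((grid - t1) / 0.03) ** 2)
          + np.exp(-((grid - t2) / 0.03) ** 2)
          for t1, t2 in marks]

reg, tau_bar = landmark_register(curves, marks, grid)
```

After registration, both curves peak at the average landmark times, so cross-sectional statistics such as the mean no longer smear the peaks out.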
Consider the situation with two landmarks per curve, say τ_{i1} and τ_{i2} on the interval [0, 1], corresponding to the first and second obligations in the tax revenue calendar of month i. Let

τ̄_j = n^{-1} ∑_{i=1}^{n} τ_{ij}, j = 1, 2,

be the average landmarks. Each warping function w_i is then chosen so that it maps 0 to 0, τ̄_1 to τ_{i1}, τ̄_2 to τ_{i2} and 1 to 1.

Mean prediction
Consider the model X_t(s) = µ(v_t(s)) + ε_t(s), where µ ∈ L²(0, 1) and (ε_t) is a (strong) white noise. In this case E(X_{n+h}(s) | F_n) = µ(v_{n+h}(s)). Figure 3(a) shows the mean prediction for the derivative of the accumulated tax revenues. The 95% prediction interval is quite wide, and the peaks of the forecasts around the due dates are underestimated. On other days the tendency is captured, except at the end of the month: the true intensity of the accumulated tax revenue peaks before the last day, whereas the prediction shows the intensity growing over the last few days and spiking on the last day.
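A sketch of this mean prediction on synthetic registered curves: µ is estimated by the cross-sectional mean and then composed with a hypothetical calendar-derived warping v_{n+h} that moves the structural due date 0.5 to next month's due date 0.47 (both values are ours, for illustration):

```python
import numpy as np

# Estimate the structural mean from synthetic registered curves, then
# compose it with next month's time transformation v_{n+h}, here a
# piecewise-linear map with v(0.47) = 0.5, known from the tax calendar.
rng = np.random.default_rng(3)
grid = np.linspace(0.0, 1.0, 201)
mu = 1.0 + 3.0 * np.exp(-((grid - 0.5) / 0.04) ** 2)
registered = [mu + rng.normal(0.0, 0.1, grid.size) for _ in range(36)]

mu_hat = np.mean(registered, axis=0)          # estimate of mu

v_next = np.interp(grid, [0.0, 0.47, 1.0], [0.0, 0.5, 1.0])
forecast = np.interp(v_next, grid, mu_hat)    # mu_hat(v_{n+h}(s))
```

The forecast curve peaks at the calendar due date 0.47 even though the structural mean peaks at 0.5, which is the whole point of forecasting on the registered scale and warping back.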

Exponential smoothing
This method, which is widely used in practice, consists of assigning weights to the observations that decay to 0 at an exponential rate:

X̂_{n+1}(s) = c ∑_{j=0}^{n−1} q^j X_{n−j}(s),

where 0 < q < 1 and c is a normalization constant. Usually one chooses c = 1 − q and 0.7 ≤ q ≤ 0.95. After careful consideration c = 0.1 was chosen, and Fig. 3(b) shows the exponential smoothing prediction for the analyzed period. For the due date around the 25th day the prediction is very accurate; for the second largest peak it is underestimated, but still more accurate than the mean prediction. On other days the tendency is quite similar to the real curve, except at the end of the month.
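A sketch of this exponentially weighted forecast applied pointwise to registered curves (synthetic data; q = 0.9 so that c = 1 − q = 0.1, matching the value chosen above):

```python
import numpy as np

def exp_smooth_forecast(curves, q=0.9):
    """Pointwise forecast c * sum_j q**j * X_{n-j} with c = 1 - q."""
    X = np.asarray(curves)                      # shape (n_months, n_grid)
    w = (1.0 - q) * q ** np.arange(X.shape[0])  # weight q**0 on the newest
    w /= w.sum()                                # renormalize the finite sum
    return w @ X[::-1]                          # reverse: newest curve first

# Synthetic registered intensity curves: a fixed monthly shape plus noise.
rng = np.random.default_rng(2)
grid = np.linspace(0.0, 1.0, 101)
shape = 1.0 + 4.0 * np.exp(-((grid - 0.5) / 0.04) ** 2)
curves = [shape + rng.normal(0.0, 0.1, grid.size) for _ in range(24)]

forecast = exp_smooth_forecast(curves, q=0.9)
```

Renormalizing the finite weight sum keeps the forecast unbiased for short histories; for large n the weights approach the textbook c = 1 − q values.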

Conclusions
In this paper we have investigated several modelling strategies for daily time series in the functional data analysis context, with the objective of short-term forecasting. Since most parametric models cannot handle irregularly spaced data, or at least require the data to be transformed first, functional data analysis techniques are introduced to overcome the challenges associated with daily time series, such as a changing number of observations per month or year. The empirical results were based on a series of Lithuanian daily tax revenues and comprised three modelling stages: first, the choice of suitable basis functions, which transform the data from discrete to functional observations; second, the alignment of the monthly curves, since they may share the same shape while individual curves have been deformed by the tax calendar; and lastly, a comparison of several modelling techniques using a one-step out-of-sample forecasting procedure. Among all the strategies, exponential smoothing and functional principal component regression are the most accurate at the peaks around the 15th and 25th days of the month, when most taxes are collected, although the predictions in the other periods can still be improved. This work was intended as an attempt to motivate the public sector to improve daily tax revenue predictions using functional data analysis.