
Data preprocessing
We are now able to import datasets, even big and problematic ones. Now we need to learn the basic preprocessing routines that make a dataset ready for the next steps of the data science pipeline.
First, if you need to apply a function to a limited subset of rows, you can create a mask. A mask is a series of Boolean values (that is, True or False) that tells you whether a row is selected or not.
For example, let's say we want to select all the lines of the Iris dataset that have a sepal length greater than 6. We can simply do the following:
In: mask_feature = iris['sepal_length'] > 6.0
In: mask_feature
Out: 0 False
1 False
...
146 True
147 True
148 True
149 False
In the preceding simple example, we can immediately see which observations are True and which are False, and therefore which ones fit the selection query.
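Once built, a mask can be used directly to filter the DataFrame; rows marked True are kept and the others dropped. This is a minimal sketch reusing the iris DataFrame and the mask_feature mask defined previously:
In: iris[mask_feature].head()
Only the rows whose sepal length exceeds 6.0 are returned, since the mask and the DataFrame are aligned by their index.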
Now, let's see how to use a selection mask in another example. We want to substitute the Iris-virginica target label with the label New label. We can do this by using the following two lines of code:
In: mask_target = iris['target'] == 'Iris-virginica'
iris.loc[mask_target, 'target'] = 'New label'
You'll see that all occurrences of Iris-virginica are now replaced by New label. The .loc indexer used here is simply a way to access the data of the DataFrame with the help of row-column indexes (in this case, a Boolean mask for the rows and a column label).
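To make the row-column access pattern concrete, here is a quick, hedged sketch: .loc accepts a Boolean mask (or row labels) as its first index and a column label, or a list of labels, as its second:
In: iris.loc[mask_target, ['sepal_length', 'target']].head()
This returns, for the rows selected by mask_target, only the two requested columns.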
To see the new list of labels in the target column, we can use the unique() method. This method is very handy if you first want to evaluate the dataset:
In: iris['target'].unique()
Out: array(['Iris-setosa', 'Iris-versicolor', 'New label'],
dtype=object)
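If, besides the distinct labels, you are also interested in their frequencies, the value_counts method (a standard pandas method, shown here as a small addition to the original example) provides both at once:
In: iris['target'].value_counts()
Since the original dataset contains 50 observations per class, you should see each of the three labels appear 50 times.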
If you want to see some statistics about each feature, you can group each column accordingly; eventually, you can also apply a mask first. The pandas groupby method produces a result similar to the GROUP BY clause in a SQL statement; it should be followed by an aggregate method applied to one or more columns. For example, the mean() pandas aggregate method is the counterpart of the AVG() SQL function and computes the mean of the values in each group; var() calculates the variance, sum() the summation, count() the number of rows in the group, and so on. Note that the result is still a pandas DataFrame, so multiple operations can be chained together. Other commonly used aggregate methods are listed here (see the sketch right after the list for a chained example):
- count: The count of non-null (non-NaN) values
- median: The median; that is, the 50th percentile
- min: The lowest value
- max: The highest value
- mode: The mode, which is the most frequently occurring value
- var: The variance, which measures the dispersion of the values
- std: The standard deviation, which is the square root of the variance
- mad: The mean absolute deviation, which is a way to measure the dispersion of the values robust to outliers
- skew: The measure of skewness, indicative of the distribution symmetry
- kurt: The measure of kurtosis, indicative of the distribution shape
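As a quick illustration of chaining (a minimal sketch; the feature and the statistics are arbitrary choices), you can aggregate a single feature with several of these functions and then sort the grouped result:
In: (iris.groupby('target')['petal_length']
         .agg(['median', 'std'])
         .sort_values(by='median'))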
As a next step, we can try a couple of examples with groupby in action. By grouping observations by the target (that is, the label), we can check the difference between the average value and the variance of the features for each group:
In: grouped_targets_mean = iris.groupby(['target']).mean()
grouped_targets_mean
The output is a grouped Iris dataset and the grouping function is the mean:

In: grouped_targets_var = iris.groupby(['target']).var()
grouped_targets_var
Now the grouping function is the variance:

As you may need multiple statistics for each variable, instead of creating multiple aggregated datasets and joining them by concatenation, you can use the agg method directly and apply specific functions to each variable. You define the variables in a dictionary whose keys are the column labels and whose values are lists of functions to be applied, referenced either by a string (such as 'mean', 'std', 'min', 'max', 'sum', and 'prod'), by a predefined function, or even by a lambda function declared on the spot:
In: funcs = {'sepal_length': ['mean', 'std'],
             'sepal_width' : ['max', 'min'],
             'petal_length': ['mean', 'std'],
             'petal_width' : ['max', 'min']}
    grouped_targets_f = iris.groupby(['target']).agg(funcs)
    grouped_targets_f
Now each column has different grouping functions:
Later, if you need to sort the observations by the values of a column, you can use the .sort_values() method (the older .sort_index(by=...) form was deprecated and has since been removed from pandas), as follows:
In: iris.sort_values(by='sepal_length').head()
As an output, you get the top rows of the sorted dataset:

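sort_values also accepts a list of columns and an ascending parameter, in case you need a multi-key or a descending sort (a minimal sketch):
In: iris.sort_values(by=['sepal_length', 'petal_length'],
                     ascending=False).head()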
Finally, if your dataset contains a time series (for example, a numerical target recorded over time) and you need to apply a rolling operation to it (for instance, to smooth noisy data points), you can simply do the following:
In: smooth_time_series = time_series.rolling(window=5).mean()
This computes a rolling average of the values. Note that the pd.rolling_mean function found in older pandas code has been removed; the .rolling() method shown here is the current API. Alternatively, you can give the following command:
In: median_time_series = time_series.rolling(window=5).median()
This instead obtains a rolling median of the values. In both cases, the window has a size of five samples.
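Since time_series is only a placeholder in the preceding snippets, here is a fully self-contained sketch with synthetic, purely illustrative data:
In: import numpy as np
    import pandas as pd
    rng = np.random.default_rng(0)  # reproducible noise
    # A noisy sinusoid standing in for a real time series
    time_series = pd.Series(np.sin(np.linspace(0, 6, 100)) +
                            rng.normal(0, 0.3, size=100))
    # The first window-1 values are NaN, as their window is incomplete
    smooth_time_series = time_series.rolling(window=5).mean()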
More generally, the apply() pandas method can perform any row-wise or column-wise operation programmatically. apply() should be called directly on the DataFrame; the first argument is the function to be applied row-wise or column-wise, and the second argument is the axis to apply it on. Note that the function can be a built-in, a library-provided function, a lambda, or any other user-defined function.
As an example of this powerful method, let's try to count how many non-zero elements there are in each row. With the apply method, this is simple:
In: iris.apply(np.count_nonzero, axis=1).head()
Out: 0 5
1 5
2 5
3 5
4 5
dtype: int64
Similarly, to compute the non-zero elements feature-wise (that is, per column), you just need to change the second argument and set it to 0:
In: iris.apply(np.count_nonzero, axis=0)
Out: sepal_length 150
sepal_width 150
petal_length 150
petal_width 150
target 150
dtype: int64
Finally, to operate element-wise, the applymap() method should be used on the DataFrame. In this case, just one argument should be provided: the function to apply.
For example, let's say you're interested in the length of the string representation of each cell. To obtain that value, you should first cast each cell to a string value and then compute the length. With applymap, this operation is very easy:
In: iris.applymap(lambda x: len(str(x))).head()
The top rows of the transformed DataFrame are:

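Note that in recent pandas releases (2.1 and later), applymap has been deprecated in favor of the element-wise DataFrame.map method; if your pandas version warns about it, the equivalent call is:
In: iris.map(lambda x: len(str(x))).head()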
When applying transformations to your data, you don't actually need to apply the same function to every column. Using pandas apply methods, you can apply a transformation to a single variable or to several of them, either modifying the variables in place or creating new ones in addition:
In: def square(x):
        return x**2

    original_variables = ['sepal_length', 'sepal_width',
                          'petal_length', 'petal_width']
    squared_iris = iris[original_variables].apply(square)
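For instance, to create a new variable instead of overwriting the originals, simply assign the transformed values to a fresh column name (a minimal sketch; the sepal_area column is our own invention, not part of the dataset):
In: iris['sepal_area'] = iris['sepal_length'] * iris['sepal_width']
    iris[['sepal_length', 'sepal_width', 'sepal_area']].head()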
One weak point of such an approach is that transformations can take a long time because the pandas library does not leverage the multiprocessing capabilities of modern multi-core CPUs.
In order to shorten such computation latency, you can leverage the multiprocessing package by creating a parallel_apply function. Such a function takes a DataFrame, a function, and the function's arguments as input, and it creates a pool of workers (multiple copies of the Python process in memory, ideally each operating on a different CPU core of your system) that execute the required transformations in parallel:
In: import multiprocessing
    import numpy as np
    import pandas as pd

    def apply_df(args):
        # Unpack a (DataFrame chunk, function, kwargs) tuple and
        # apply the function to the chunk
        df, func, kwargs = args
        return df.apply(func, **kwargs)

    def parallel_apply(df, func, **kwargs):
        # The number of worker processes is passed as the 'workers' keyword
        workers = kwargs.pop('workers')
        pool = multiprocessing.Pool(processes=workers)
        # Split the DataFrame into one chunk per worker and map the
        # apply operation over the chunks in parallel
        df_split = np.array_split(df, workers)
        results = pool.map(apply_df, [(ds, func, kwargs)
                                      for ds in df_split])
        pool.close()
        # Reassemble the processed chunks in their original order
        return pd.concat(list(results))
When using this function, it is important to specify the correct number of workers (depending on your system) and the axis the computation will take place on (axis=1 applies the function to each row, which is the usual configuration when transforming observations):
In: squared_iris = parallel_apply(iris[['sepal_length', 'sepal_width',
                                        'petal_length', 'petal_width']],
                                  func=square,
                                  axis=1,
                                  workers=4)
    squared_iris
The Iris dataset is a tiny one, and in this case the execution may even take longer than a plain apply call (because of the overhead of spawning the worker processes), but on larger datasets the difference can be quite notable, especially if you can count on a large number of workers.
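One practical caveat: on platforms where multiprocessing spawns rather than forks its workers (Windows, and macOS on recent Python versions), the call must be protected by a __main__ guard; otherwise, every worker re-imports the script and tries to start a pool of its own:
In: if __name__ == '__main__':
        squared_iris = parallel_apply(iris[original_variables],
                                      func=square, axis=1, workers=4)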