Today, we use artificial intelligence in a huge range of areas, and it is very useful to us. Machine learning and deep learning models are being designed and developed by experts. Aside from the development of the models themselves, the data we feed into them must also be useful. The higher the quality of your data, the better your artificial intelligence will perform.

So, is data born with quality, or does quality need to be built in? The answer is both. First, the data must be relevant to the subject we are examining. At the same time, small but effective changes to the data can make it even more powerful. It cannot be guaranteed that the procedures I describe will always have a positive effect; they should be tried, and their results measured and reported. Data sets differ from one another, so you should judge the results by observation.

You will be able to prepare your data for an artificial intelligence model with the 11 steps I have put together. I generally relied on the scikit-learn documentation for Python (**sklearn.preprocessing**).

The features of a data set often span very different ranges. One feature might range from 10 to 20, while another ranges from 500 to 9000. This makes the learning process difficult, so we may need rescaling. Depending on the chosen scaling method, we can compress the data into a certain range, which might have a positive effect on learning.

## MinMax Scaler

The motivation for this scaling includes robustness to very small standard deviations of features and preserving zero entries in sparse data. The data is compressed into the range 0 to 1.

It is sensitive to outliers; that is, outliers reduce the performance of the model.

Before MinMax scaling:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x1": np.random.chisquare(8, 1000),
    "x2": np.random.beta(8, 2, 1000) * 40,
    "x3": np.random.normal(50, 3, 1000),
})
df.plot.kde()
```

After MinMax scaling:

```python
from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler()
data_tf = minmax.fit_transform(df)
df = pd.DataFrame(data_tf, columns=["x1", "x2", "x3"])
df.plot.kde()
```

As the plot shows, the data now lie between 0 and 1.

## MaxAbs Scaler

It works similarly to MinMaxScaler, but scales each feature by its maximum absolute value, producing a data range from -1 to 1.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
print(X_train_maxabs)
# output:
# array([[ 0.5, -1. ,  1. ],
#        [ 1. ,  0. ,  0. ],
#        [ 0. ,  1. , -0.5]])
```

## Robust Scaler

It is a good option for data with many outliers, because it uses statistics that outliers barely affect: it removes the median and scales by the interquartile range.
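The article mentions this scaler without code, so here is a minimal sketch; the data values are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with an extreme outlier (100.)
X = np.array([[1.], [2.], [3.], [4.], [100.]])

# RobustScaler centers on the median (3) and scales by the IQR (4 - 2 = 2),
# so the outlier does not dominate the scaling of the other values.
robust = RobustScaler()
X_tf = robust.fit_transform(X)
print(X_tf)
# [[-1. ]
#  [-0.5]
#  [ 0. ]
#  [ 0.5]
#  [48.5]]
```

Note that the four inlier values land neatly in [-1, 0.5]; a MinMaxScaler would instead have squashed them into a tiny sliver near 0 because of the outlier.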

## Standard Scaler

The Standard Scaler assumes your data is normally distributed within each feature and scales it so that the distribution is centered around 0 with a standard deviation of 1.

If the data is not normally distributed, this is not the best scaler to use.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[ 0., 1.,  2.],
                    [ 2., 2.,  0.],
                    [ 4., 1., -1.]])
scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
print(X_scaled)
# output:
# array([[-1.22474487, -0.70710678,  1.33630621],
#        [ 0.        ,  1.41421356, -0.26726124],
#        [ 1.22474487, -0.70710678, -1.06904497]])
```

Another example for the Standard Scaler: it removes the mean of each feature so that the feature is centered on zero. Mean removal helps remove bias from the features.

```python
import numpy as np
from sklearn.preprocessing import scale

input_data = np.array([[ 2. , -2.4, 3.1],
                       [ 2. ,  0. , 1.6],
                       [ 4.2, -1.1, 5.8]])
standardized = scale(input_data)
print("Mean:", standardized.mean(axis=0))
print("Std Deviation:", standardized.std(axis=0))
# >>> Mean: [3.70074342e-16 8.32667268e-17 0.00000000e+00]
# >>> Std Deviation: [1. 1. 1.]
```

## One Hot Encoding

Before explaining one hot encoding, I would like to talk about what nominal means.

Nominal is the scale of variables that cannot be ordered or measured. For example, car brands and colors cannot be measured or sorted. This type of data is handled with one hot encoding.

So what is one hot encoding?

One hot encoding is a representation of categorical variables as binary vectors. First, the categorical values are mapped to integer values. Then, each integer value is represented as a binary vector that is all zeros except at the index of the integer, which is marked with a 1. This is very useful for representing categorical values, and it enriches our data set before we apply machine learning algorithms.
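This section has no code in the article, so here is a minimal sketch with sklearn's `OneHotEncoder`; the color values are made up:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal data: car colors
X = [['red'], ['green'], ['blue'], ['green']]

enc = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes the vectors visible
X_tf = enc.fit_transform(X).toarray()
print(enc.categories_)  # categories are sorted: blue, green, red
print(X_tf)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```

Each row is a binary vector with a single 1 at the index of that row's category.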

## Normalizer

**Normalization** is the process of **scaling individual samples to have unit norm**: each sample is divided by its norm so that it lies on the unit sphere.
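A minimal sketch with sklearn's `Normalizer`; the values are made up (a 3-4-5 triangle makes the arithmetic easy to check):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3., 4.],
              [1., 0.]])

# Each sample (row) is divided by its L2 norm, so every row ends up with norm 1
norm = Normalizer(norm='l2')
X_tf = norm.fit_transform(X)
print(X_tf)
# [[0.6 0.8]
#  [1.  0. ]]
```

Unlike the scalers above, which operate column by column, the Normalizer works row by row, which is why it is applied per sample rather than per feature.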

## Ordinal Encoder

Before explaining the ordinal encoder, I would like to explain what ordinal means.

Ordinal is the scale of variables that cannot be measured but can be ordered. For example, apartment numbers can be arranged with an ordinal encoder.

As you can see in the example, male and female are not measurable, but we can encode them as 0 or 1.

```python
from sklearn import preprocessing

# male=1 female=0
# US=1 Europe=0
# Safari=1 Firefox=0
enc = preprocessing.OrdinalEncoder()
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
enc.transform([['female', 'from US', 'uses Safari']])
# >>> array([[0., 1., 1.]])
```

Another example:

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [11, 23, 41, 34, 56, 12, 32, 43],
     'Income': ["Low", "Medium", "High", "Medium", "High", "Low", "Medium", "High"]})
df = df.Income.map({"Low": 1, "Medium": 2, "High": 3})
df
# >>> 0    1
#     1    2
#     2    3
#     3    2
#     4    3
#     5    1
#     6    2
#     7    3
```

## Binarizer

It separates the data into 0s and 1s: values less than or equal to the threshold (0 by default) become 0, and values greater than the threshold become 1.
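A minimal sketch with sklearn's `Binarizer`; the values are made up:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[-1.5, 0., 2.]])

# With the default threshold of 0, only strictly positive values become 1
binarizer = Binarizer(threshold=0.0)
X_bin = binarizer.fit_transform(X)
print(X_bin)
# [[0. 0. 1.]]
```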

## Missing Values

We may encounter null values in the data, so we will either remove them or replace them with a value. This value can be the mode, median, or mean; the choice depends on the data set.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ['Alfred', 'John', 'Jimmy'],
                   "toy": [np.nan, 'Car', 'Baby'],
                   "born": [pd.NaT, pd.Timestamp("1999-02-23"), pd.NaT]})
print(df)
#      name   toy       born
# 0  Alfred   NaN        NaT
# 1    John   Car 1999-02-23
# 2   Jimmy  Baby        NaT

print(df.dropna())
#    name  toy       born
# 1  John  Car 1999-02-23
```

```python
import numpy as np
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[2, 4, 5], [1, np.nan, 6], [4, 2, 7]])
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(X))
# >>> [[ 2.33333333  2.          3.        ]
#      [ 4.          3.          6.        ]
#      [10.          3.          9.        ]]
```

In a prediction context, SimpleImputer usually performs poorly when paired with a weak learner. Alternatives are IterativeImputer and KNNImputer.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imp_mean = IterativeImputer(random_state=0)
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
imp_mean.transform(X)
# >>> array([[ 6.9584...,  2.        ,  3.        ],
#            [ 4.        ,  2.6000...,  6.        ],
#            [10.        ,  4.9999...,  9.        ]])
```
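Since KNNImputer is mentioned above without code, here is a minimal sketch with a small made-up matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = [[1, 2, np.nan],
     [3, 4, 3],
     [np.nan, 6, 5],
     [8, 8, 7]]

# Each missing value is replaced by the mean of that feature over the
# n_neighbors rows closest to the sample (using the observed features only)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```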

## Resampling

When you start dealing with time-indexed data, you will see that some time intervals hold very little data, so we will need to combine them. You can do this with pandas' `resample` method.

```python
import pandas as pd

index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)
series
# 2000-01-01 00:00:00    0
# 2000-01-01 00:01:00    1
# 2000-01-01 00:02:00    2
# 2000-01-01 00:03:00    3
# 2000-01-01 00:04:00    4
# 2000-01-01 00:05:00    5
# 2000-01-01 00:06:00    6
# 2000-01-01 00:07:00    7
# 2000-01-01 00:08:00    8
# Freq: T, dtype: int64

series.resample('3T').sum()
# 2000-01-01 00:00:00     3
# 2000-01-01 00:03:00    12
# 2000-01-01 00:06:00    21
# Freq: 3T, dtype: int64
```

## Binning

It is similar to the resampling example, but here we place the existing data into ranges or labels.
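A minimal sketch with pandas' `cut`; the ages, bin edges, and labels are made up:

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 43, 68])

# Place each age into one of three labeled ranges: (0, 18], (18, 40], (40, 100]
bins = pd.cut(ages, bins=[0, 18, 40, 100], labels=["child", "adult", "senior"])
print(bins.tolist())
# ['child', 'child', 'adult', 'senior', 'senior']
```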

## As Feature Engineering and Dimensionality Reduction are important and long topics, I will examine them separately soon.

## Dimensionality Reduction

Some data sets are large and complex. That complexity increases the time machine learning takes and can make it harder. Dimensionality reduction lets us examine the data better by creating a lower-dimensional representation of it. There are techniques such as PCA, LDA and GDA. In short, it simplifies the data.
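As a small illustration of the idea ahead of the fuller treatment, here is a PCA sketch; the synthetic data and all parameters are my own:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 samples with 5 correlated features, built from only 2 underlying factors
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5))

# Project onto the 2 directions of highest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (100, 2)
# Because the data really has rank 2, two components capture almost all variance
print(pca.explained_variance_ratio_.sum())
```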

## Feature Engineering

Feature engineering means obtaining new features from the information that already exists in the data set. In this way, the data grows richer and learning can become even more successful. First, it is necessary to analyze the data to be processed well; second, you should understand the data correctly. Otherwise, you might put a barrier in the way of learning.
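As a tiny, hypothetical illustration (the column names and values are my own): deriving a body-mass-index feature from height and weight hands the model a relationship it would otherwise have to learn on its own.

```python
import pandas as pd

# Hypothetical raw data
df = pd.DataFrame({"height_m": [1.70, 1.85],
                   "weight_kg": [68.0, 90.0]})

# Derive a new feature from the existing ones: BMI = weight / height^2
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
print(df)
```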

Artificial intelligence is a very important technology for today and for the future. To understand it well and get efficiency out of our models, it is necessary to prepare the data well. We have gone through some steps together to see the data better. I cannot guarantee that these steps will have a positive impact on your data; the results depend entirely on your data set.

Remember, just as an athlete prepares before going to a match, we must prepare our data before the learning process.