pandas

https://pandas.pydata.org/ https://www.kaggle.com/learn/pandas

pandas, conventionally imported as pd, is a Python library for working with data, especially tabular data.

reading data

https://www.kaggle.com/code/dansbecker/finding-your-files-in-kaggle-kernels/notebook

I have added the wine-reviews dataset to this notebook.

import pandas as pd

wines = pd.read_csv('/kaggle/input/wine-reviews/winemag-data-130k-v2.csv')
wines.head()

There is an id column we should index on. This is what the optional parameter index_col is for. Alternatively, we could call set_index after reading, but I prefer using index_col personally.

wines = pd.read_csv('/kaggle/input/wine-reviews/winemag-data-130k-v2.csv', index_col=0)
wines.head()
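
To see that the two approaches end up in the same place, here is a minimal sketch. It assumes a tiny made-up CSV held in a string (csv_text is hypothetical, not the real wine file) so it runs anywhere.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the real file; the unnamed first
# column plays the role of the id column.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n2,US,90\n"

# Option 1: choose the index while reading.
via_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: read first, then promote the first column to the index.
raw = pd.read_csv(io.StringIO(csv_text))
via_set_index = raw.set_index(raw.columns[0])

# Both hold the same rows and the same index values.
same_values = via_index_col.values.tolist() == via_set_index.values.tolist()
same_index = list(via_index_col.index) == list(via_set_index.index)
print(same_values, same_index)
```

The only difference I would expect is the index name: set_index keeps the column's name as the index name, while index_col=0 on an unnamed column leaves it empty.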

overview

There are a few ways to get a sense of your data. head and tail show you the beginning and end of the table, and by default, evaluating the dataframe on its own as the last line of a notebook cell displays the head and tail combined.

print(wines.head())
print(wines.tail())
wines

wines.columns will give you the column index, and wines.dtypes will give you column names and their types. object means str in this data set.

print(wines.columns)
wines.dtypes

describe is an overview of the table, and should be thoroughly studied, as it gives a great initial "shape" of the data. You'll notice that by default it skips string columns, because the numerical information it reports (mean, standard deviation, quartiles) is not defined for object, or str.

wines.describe()
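
If you do want a summary that covers the string columns too, describe accepts an include parameter. A minimal sketch, assuming a small made-up dataframe (toy is hypothetical) rather than the wine data:

```python
import pandas as pd

# Made-up data with one string column and one numeric column.
toy = pd.DataFrame({
    "country": ["Italy", "Italy", "US"],
    "points": [87, 90, 90],
})

# Default: only numeric columns are summarized.
numeric_summary = toy.describe()

# include="all" adds count/unique/top/freq rows for non-numeric columns.
all_summary = toy.describe(include="all")

print(numeric_summary.columns.tolist())
print(all_summary.loc[["unique", "top", "freq"], "country"])
```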

You can use value_counts for a more thorough inspection of columns, among other things. For example, the points distribution looks interesting. Let's inspect how many wines there are per point/rating.

wines.points.value_counts()
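
value_counts sorts by frequency, which is not always what you want for a rating scale. A small sketch with made-up scores (the points series here is hypothetical) showing two variations I find handy:

```python
import pandas as pd

# Made-up ratings standing in for wines.points.
points = pd.Series([87, 90, 87, 88, 90, 87], name="points")

counts = points.value_counts()                 # most frequent rating first
by_score = points.value_counts().sort_index()  # ordered by rating instead
shares = points.value_counts(normalize=True)   # fractions instead of counts

print(by_score)
```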

renaming

Once we understand the data better, we might prefer alternative naming schemes. Most commonly, we will want to rename columns. For example, in our data, province, region_1, and region_2 are all correlated, with region_2 seemingly indicating the smallest geographical zone. What I'll do is rename them to something I find more intuitive, personally.

renamed = wines.rename(columns={'province': 'state', 'region_1': 'county', 'region_2': 'city'})
renamed.dtypes

Note that the wines dataframe has not changed! This means you must either overwrite the variable, or continue with the new variable.

wines.dtypes

missing values

pd.isnull is an important utility. It takes a series/column as input and returns a boolean series. The same check is also available as a method on the series itself.

# equivalent outputs
print(pd.isnull(wines.price))
wines.price.isnull()

Both of these can be used to index into the data frame to view the rows with NaN values.

nan_priced_wines = wines[pd.isnull(wines.price)]
print(len(nan_priced_wines))
# price is all NaN now
nan_priced_wines.describe()
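
The boolean series is useful beyond indexing. A minimal sketch on made-up data (toy is hypothetical) that counts the missing values and inverts the mask to keep only the priced rows:

```python
import pandas as pd

# Made-up data; None in a float column becomes NaN.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "price": [15.0, None, 14.0],
})

mask = toy.price.isnull()   # True where price is missing
missing = toy[mask]         # rows without a price
present = toy[~mask]        # ~ inverts the mask: rows with a price
n_missing = int(mask.sum()) # True counts as 1

print(n_missing, len(present))
```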

You can then fill in missing values with fillna. Note that this is not an in-place update unless you pass the optional parameter inplace=True. So, either capture the new data frame / series or update in place.

# fill NA/NaN in a whole dataframe
print(wines.loc[0].price) # nan
print(wines.fillna(0).loc[0].price) # 0

# note: wines doesn't update "in place"
wines.loc[0].price # nan
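
Filling everything with 0 is rarely what you want for string columns. fillna also accepts a dict of per-column fill values. A minimal sketch on made-up data (toy and the fill values are assumptions for illustration):

```python
import pandas as pd

# Made-up data with a gap in each column.
toy = pd.DataFrame({
    "country": ["Italy", None],
    "price": [None, 15.0],
})

# One fill value per column; the result is a new dataframe.
filled = toy.fillna({"country": "Unknown", "price": 0})

print(filled)
print(toy.price.isnull().sum())  # the original still has its gap
```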

combining datasets

pd.concat is for combining datasets with the same columns. It essentially stacks them into one larger dataset containing the rows of all of them.
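
A minimal sketch of stacking, using two made-up dataframes (first and second are hypothetical):

```python
import pandas as pd

# Two made-up datasets with identical columns.
first = pd.DataFrame({"country": ["Italy"], "points": [87]})
second = pd.DataFrame({"country": ["US", "France"], "points": [90, 88]})

# Stack the rows; ignore_index=True renumbers the result 0..n-1
# instead of keeping each input's original index.
stacked = pd.concat([first, second], ignore_index=True)

print(stacked)
```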

DataFrame.join is for when you want a dataset that preserves the columns of both datasets, placed side by side, with rows matched on the index by default (or on a particular column of the calling dataframe via the on parameter). This is similar to a SQL join.

one = pd.read_csv('one.csv')
two = pd.read_csv('two.csv')
one.join(two, lsuffix='_ONE', rsuffix='_TWO')
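
Since one.csv and two.csv are not available here, this is a self-contained sketch with made-up dataframes sharing a column name. It assumes the rows should be matched on the index, which is what join does by default.

```python
import pandas as pd

# Made-up data: both frames have a "points" column, so suffixes are needed.
one = pd.DataFrame({"points": [87, 90]}, index=["wine_a", "wine_b"])
two = pd.DataFrame({"points": [85, 91]}, index=["wine_b", "wine_c"])

# Default is a left join: every row of `one` is kept, unmatched cells are NaN.
joined = one.join(two, lsuffix="_ONE", rsuffix="_TWO")

# how="inner" keeps only index labels present in both frames.
inner = one.join(two, lsuffix="_ONE", rsuffix="_TWO", how="inner")

print(joined)
print(inner)
```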


conditional selection

# Italian wines rated 90 points or higher
wines.loc[(wines.country == 'Italy') & (wines.points >= 90)]
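
A minimal sketch of combining conditions, on made-up data (toy is hypothetical). Each condition needs its own parentheses, and pandas uses & and | rather than Python's and / or.

```python
import pandas as pd

# Made-up reviews.
toy = pd.DataFrame({
    "country": ["Italy", "Italy", "US"],
    "points": [92, 85, 95],
})

# Both conditions must hold.
italian_and_good = toy.loc[(toy.country == "Italy") & (toy.points >= 90)]

# Either condition may hold.
italian_or_good = toy.loc[(toy.country == "Italy") | (toy.points >= 90)]

print(len(italian_and_good), len(italian_or_good))
```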

categorical data

When working with categorical data, it is common to do one-hot encoding. In pandas this is a two-step process. The first step is turning the categorical data (like "male" and "female" string values) into a categorical type, which represents each value with a number (like 1 and 0). This is done using pd.Categorical. See Pandas User Guide: Categorical data.

Once you have done that, you can then do one-hot encoding via pd.get_dummies(data_frame). This is a reshaping of the data. See Pandas User Guide: Reshaping and pivot tables, particularly the section on get_dummies. Also, see the API docs on get_dummies. This will one-hot encode each categorical column into n-many "dummy columns", one per category, which you then train on.
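
The two steps can be sketched on made-up data like this (df and its columns are hypothetical):

```python
import pandas as pd

# Made-up data with one string column and one numeric column.
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "age": [22, 38, 26],
})

# Step 1: mark the column as categorical. Categories are sorted,
# so "female" gets code 0 and "male" gets code 1.
df["sex"] = pd.Categorical(df["sex"])
codes = df["sex"].cat.codes

# Step 2: one-hot encode. Each category becomes its own dummy column,
# and the numeric column is left as it is.
dummies = pd.get_dummies(df)

print(codes.tolist())
print(dummies.columns.tolist())
```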

Most importantly, you can just call get_dummies without categorizing your data first. It will detect columns with object data types, assume they are categorical, and go from there.

From the Pandas User Guide: Reshaping and pivot tables: section on get_dummies

get_dummies() also accepts a DataFrame. By default, object, string, or categorical type columns are encoded as dummy variables with other columns unaltered.

So, if you have categorical data that is already a number, don't forget to force it to be categorical, e.g., df.cabin_class = pd.Categorical(df.cabin_class)
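
A minimal sketch of the difference this makes, on made-up data (the cabin_class and fare values are assumptions for illustration):

```python
import pandas as pd

# Made-up data where the category is stored as an integer.
df = pd.DataFrame({
    "cabin_class": [1, 3, 2, 3],
    "fare": [80.0, 7.5, 20.0, 8.0],
})

# Without forcing the type, get_dummies leaves the numeric column alone.
untouched = pd.get_dummies(df)

# After forcing it to be categorical, it is one-hot encoded.
df["cabin_class"] = pd.Categorical(df["cabin_class"])
encoded = pd.get_dummies(df)

print(untouched.columns.tolist())
print(encoded.columns.tolist())
```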

You can also use pd.cut() to "bin" continuous values into discrete values, which can then be treated as categorical values. See the reshaping user guide: cut section.
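
A minimal sketch of binning, assuming made-up prices and bin edges and labels that I picked arbitrarily for illustration:

```python
import pandas as pd

# Made-up continuous values.
prices = pd.Series([8, 15, 42, 120])

# Bins are (0, 10], (10, 50], (50, 1000]; the right edge is included.
bands = pd.cut(
    prices,
    bins=[0, 10, 50, 1000],
    labels=["budget", "mid", "premium"],
)

# The result is categorical, so it can go straight into get_dummies.
band_dummies = pd.get_dummies(bands)

print(bands.tolist())
print(band_dummies.columns.tolist())
```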