4 minute read

Pandas is a Python library that gives structure to data, it also makes it very easy to access this data as well as manipulate it.

First Steps

We’ll begin by installing Pandas with:

pip install pandas

Then we can go a head and import it with:

import pandas as pd

We will begin with some very simple data:

import pandas as pd
column = ["Mars", "Earth", "Jupiter"]
print(column)
['Mars', 'Earth', 'Jupiter']

When we print this list all we get is a plain text with no structure. We can pick this list and insert it into something called a DataFrame.

import pandas as pd
column = ["Mars", "Earth", "Jupiter"]
data = pd.DataFrame(column)
print(data )
        0
0     Mars
1    Earth
2  Jupiter

Our list is now organized into rows and columns which makes it much easier to view. We can name the column by doing to following step:

import pandas as pd
column = ["Mars", "Earth", "Jupiter"]
titled_column = {"name" : column}
data = pd.DataFrame(titled_column)
print(data)
     name
0     Mars
1    Earth
2  Jupiter

Insertion and Selections

Let’s add new columns to our DataFrame, so inside our dictionary we’ll create new keys:

import pandas as pd
column = ["Mars", "Earth", "Jupiter"]
titled_column = {"name" : column,
                 "hours": [24.37, 24, 9.56],
                 "days": [687, 365, 4333]}
data = pd.DataFrame(titled_column)
print(data)
     name  hours  days
0     Mars  24.37   687
1    Earth  24.00   365
2  Jupiter   9.56  4333

Now, let’s select values. If we want to select a column we can use:

select = data["name"]
print(select)
0       Mars
1      Earth
2    Jupiter

If we want just a single value, we can use:

select = data["name"][1]
print(select)
Earth

And if we want to select a row, we can use:

select = data.iloc[1]
print(select)
name     Earth
hours       24
days       365

Save And Import Data

Let’s say our DataFrame is complete, how can we save it into a file?

data.to_csv("planets.csv")

Once we run it, we’ll get a new file popping up in our file system. The CSV or Comma Separated Values is a delimited text file that uses a comma to separate values. But we can actually separate the data using other characters, for example:

data.to_csv("planets.csv", sep="\t")

We are not limited to CSV files only, if we’d like to change it to txt, we’ll just change it to this following code and when we’ll run it, we will have a new text file in our file system:

data.to_csv("planets.txt", sep="\t")

Good, so we learned how to save this data into a file, but how exactly can we load them from an existing file?. To do this we’ll make a new python file:

import pandas as pd
data = pd.read_csv("planets.csv", sep="\t")
print(data)
  Unnamed: 0     name  hours  days
0           0     Mars  24.37   687
1           1    Earth  24.00   365
2           2  Jupiter   9.56  4333

Now, if you’d like to get rid of this unnecessary column, we can simply go back to the code where we saved this file and we’ll add:

data.to_csv("planets.csv", index=False, sep="\t")

And now, when we re-run the new file, we can see that that unnecessary column is gone:

     name  hours  days
0     Mars  24.37   687
1    Earth  24.00   365
2  Jupiter   9.56  4333

Now, if we want to print only the first x rows, we can use:

import pandas as pd
data = pd.read_csv("planets.csv", sep="\t")
print(data.head(2))
   name  hours  days
0   Mars  24.37   687
1  Earth  24.00   365

Or we can print the last x rows:

print(data.tail(2))
     name  hours  days
1    Earth  24.00   365
2  Jupiter   9.56  4333

Queries

Let’s go ahead and filter our DataFrame entries. Let’s say we only want to select the rows where the name is ‘Earth’. In the first instance we are specifying a filtering command and with the second instance we are selecting an entire column, if the value that is found within the column equals “Earth”, only then, this command returns the entire row.

filtering = data[data["name"] == "Earth"]
print(filtering)
   name  hours  days
1  Earth   24.0   365

Replace and Remove

What if we want to replace every instance of “Earth” with “Saturn”?

replaced = data.replace("Earth", "Saturn")
print(replaced)
     name  hours  days
0     Mars  24.37   687
1   Saturn  24.00   365
2  Jupiter   9.56  4333

And yes, this data is now wrong, but it just was an example. Now let’s remove our data, we can get rid of the entire column by typing the following code. And by the way, axis=1 is our columns and axis=0 is our rows.

removed = data.drop("days", axis=1)
print(removed)
     name  hours
0     Mars  24.37
1    Earth  24.00
2  Jupiter   9.56

Now, what happens if we want to get rid of two columns? Instead of passing a string into our drop command, we’ll pass a list of strings which includes the names of the columns.

removed = data.drop(["days", "hours"], axis=1)
print(removed)
     name
0     Mars
1    Earth
2  Jupiter

Adding new rows

Now if we would like to add a brand new row, we can simply do the following code. As a second argument we would also specify ignore_index.

row = {"name": "Saturn",
      "hours": 10.42,
      "days": 10759}
new_row_data = data.append(row, ignore_index = True)
print(new_row_data)
     name  hours   days
0     Mars  24.37    687
1    Earth  24.00    365
2  Jupiter   9.56   4333
3   Saturn  10.42  10759

Remove Duplicates

What if we want to remove duplicate values? We can iterate over our data entries and this can take quite some time. Or we can use the pandas duplicates method.

data = data.drop_duplicates(subset=["name"])

Iterations

Now, what if we want to iterate over rows?

for index, value in data.iterrows():
    print(value)
name      Mars
hours    24.37
days       687
Name: 0, dtype: object
name     Earth
hours       24
days       365
Name: 1, dtype: object
name     Jupiter
hours       9.56
days        4333
Name: 2, dtype: object

Leave a comment