Pandas Tutorial - Python
Pandas is a Python library that gives structure to data, it also makes it very easy to access this data as well as manipulate it.
First Steps
We’ll begin by installing Pandas with:
pip install pandas
Then we can go a head and import it with:
import pandas as pd
We will begin with some very simple data:
import pandas as pd
column = ["Mars", "Earth", "Jupiter"]
print(column)
['Mars', 'Earth', 'Jupiter']
When we print this list all we get is a plain text with no structure. We can pick this list and insert it into something called a DataFrame.
import pandas as pd
column = ["Mars", "Earth", "Jupiter"]
data = pd.DataFrame(column)
print(data )
0
0 Mars
1 Earth
2 Jupiter
Our list is now organized into rows and columns which makes it much easier to view. We can name the column by doing to following step:
import pandas as pd
column = ["Mars", "Earth", "Jupiter"]
titled_column = {"name" : column}
data = pd.DataFrame(titled_column)
print(data)
name
0 Mars
1 Earth
2 Jupiter
Insertion and Selections
Let’s add new columns to our DataFrame, so inside our dictionary we’ll create new keys:
import pandas as pd
column = ["Mars", "Earth", "Jupiter"]
titled_column = {"name" : column,
"hours": [24.37, 24, 9.56],
"days": [687, 365, 4333]}
data = pd.DataFrame(titled_column)
print(data)
name hours days
0 Mars 24.37 687
1 Earth 24.00 365
2 Jupiter 9.56 4333
Now, let’s select values. If we want to select a column we can use:
select = data["name"]
print(select)
0 Mars
1 Earth
2 Jupiter
If we want just a single value, we can use:
select = data["name"][1]
print(select)
Earth
And if we want to select a row, we can use:
select = data.iloc[1]
print(select)
name Earth
hours 24
days 365
Save And Import Data
Let’s say our DataFrame is complete, how can we save it into a file?
data.to_csv("planets.csv")
Once we run it, we’ll get a new file popping up in our file system. The CSV or Comma Separated Values is a delimited text file that uses a comma to separate values. But we can actually separate the data using other characters, for example:
data.to_csv("planets.csv", sep="\t")
We are not limited to CSV files only, if we’d like to change it to txt, we’ll just change it to this following code and when we’ll run it, we will have a new text file in our file system:
data.to_csv("planets.txt", sep="\t")
Good, so we learned how to save this data into a file, but how exactly can we load them from an existing file?. To do this we’ll make a new python file:
import pandas as pd
data = pd.read_csv("planets.csv", sep="\t")
print(data)
Unnamed: 0 name hours days
0 0 Mars 24.37 687
1 1 Earth 24.00 365
2 2 Jupiter 9.56 4333
Now, if you’d like to get rid of this unnecessary column, we can simply go back to the code where we saved this file and we’ll add:
data.to_csv("planets.csv", index=False, sep="\t")
And now, when we re-run the new file, we can see that that unnecessary column is gone:
name hours days
0 Mars 24.37 687
1 Earth 24.00 365
2 Jupiter 9.56 4333
Now, if we want to print only the first x rows, we can use:
import pandas as pd
data = pd.read_csv("planets.csv", sep="\t")
print(data.head(2))
name hours days
0 Mars 24.37 687
1 Earth 24.00 365
Or we can print the last x rows:
print(data.tail(2))
name hours days
1 Earth 24.00 365
2 Jupiter 9.56 4333
Queries
Let’s go ahead and filter our DataFrame entries. Let’s say we only want to select the rows where the name is ‘Earth’. In the first instance we are specifying a filtering command and with the second instance we are selecting an entire column, if the value that is found within the column equals “Earth”, only then, this command returns the entire row.
filtering = data[data["name"] == "Earth"]
print(filtering)
name hours days
1 Earth 24.0 365
Replace and Remove
What if we want to replace every instance of “Earth” with “Saturn”?
replaced = data.replace("Earth", "Saturn")
print(replaced)
name hours days
0 Mars 24.37 687
1 Saturn 24.00 365
2 Jupiter 9.56 4333
And yes, this data is now wrong, but it just was an example. Now let’s remove our data, we can get rid of the entire column by typing the following code. And by the way, axis=1 is our columns and axis=0 is our rows.
removed = data.drop("days", axis=1)
print(removed)
name hours
0 Mars 24.37
1 Earth 24.00
2 Jupiter 9.56
Now, what happens if we want to get rid of two columns? Instead of passing a string into our drop command, we’ll pass a list of strings which includes the names of the columns.
removed = data.drop(["days", "hours"], axis=1)
print(removed)
name
0 Mars
1 Earth
2 Jupiter
Adding new rows
Now if we would like to add a brand new row, we can simply do the following code. As a second argument we would also specify ignore_index.
row = {"name": "Saturn",
"hours": 10.42,
"days": 10759}
new_row_data = data.append(row, ignore_index = True)
print(new_row_data)
name hours days
0 Mars 24.37 687
1 Earth 24.00 365
2 Jupiter 9.56 4333
3 Saturn 10.42 10759
Remove Duplicates
What if we want to remove duplicate values? We can iterate over our data entries and this can take quite some time. Or we can use the pandas duplicates method.
data = data.drop_duplicates(subset=["name"])
Iterations
Now, what if we want to iterate over rows?
for index, value in data.iterrows():
print(value)
name Mars
hours 24.37
days 687
Name: 0, dtype: object
name Earth
hours 24
days 365
Name: 1, dtype: object
name Jupiter
hours 9.56
days 4333
Name: 2, dtype: object
Leave a comment