import pandas as pdWhat bad columns looks like
Sometimes columns have extra spaces or are just plain odd, even if they look normal.
df = pd.read_csv("../Civil_List_2014.csv").head(3)
df| DPT | NAME | ADDRESS | TTL # | PC | SAL-RATE | |
|---|---|---|---|---|---|---|
| 0 | 868 | B J SANDIFORD | DEPARTMENT OF CITYWIDE ADM | 12702 | X | $5.00 |
| 1 | 868 | C A WIGFALL | DEPARTMENT OF CITYWIDE ADM | 12702 | X | $5.00 |
| 2 | 69 | A E A-AWOSOGBA | HRA/DEPARTMENT OF SOCIAL S | 52311 | A | $51955.00 |
df.columnsIndex(['DPT ', 'NAME ', 'ADDRESS ', 'TTL # ', 'PC ', 'SAL-RATE'], dtype='object')
Overwriting columns
In order to fix them, you have a few options. Once thing you can do is just overwrite them with new ones.
df = pd.read_csv("../Civil_List_2014.csv").head(3)
df| DPT | NAME | ADDRESS | TTL # | PC | SAL-RATE | |
|---|---|---|---|---|---|---|
| 0 | 868 | B J SANDIFORD | DEPARTMENT OF CITYWIDE ADM | 12702 | X | $5.00 |
| 1 | 868 | C A WIGFALL | DEPARTMENT OF CITYWIDE ADM | 12702 | X | $5.00 |
| 2 | 69 | A E A-AWOSOGBA | HRA/DEPARTMENT OF SOCIAL S | 52311 | A | $51955.00 |
df.columnsIndex(['DPT ', 'NAME ', 'ADDRESS ', 'TTL # ', 'PC ', 'SAL-RATE'], dtype='object')
In this case it might make sense to use a list comprehension to strip all of the extra spaces.
df.columns = [col.strip() for col in df.columns]df.columnsIndex(['DPT', 'NAME', 'ADDRESS', 'TTL #', 'PC', 'SAL-RATE'], dtype='object')
Mass renaming
You can also just pass in a new list of columns if you don’t like what they come in as.
df = pd.read_csv("../Civil_List_2014.csv").head(3)
df| DPT | NAME | ADDRESS | TTL # | PC | SAL-RATE | |
|---|---|---|---|---|---|---|
| 0 | 868 | B J SANDIFORD | DEPARTMENT OF CITYWIDE ADM | 12702 | X | $5.00 |
| 1 | 868 | C A WIGFALL | DEPARTMENT OF CITYWIDE ADM | 12702 | X | $5.00 |
| 2 | 69 | A E A-AWOSOGBA | HRA/DEPARTMENT OF SOCIAL S | 52311 | A | $51955.00 |
# Rename all of the columns, keeping them in order
df.columns = ['Department', 'Name', 'Address', 'Title', 'Pay Class', 'Salary Rate']df| Department | Name | Address | Title | Pay Class | Salary Rate | |
|---|---|---|---|---|---|---|
| 0 | 868 | B J SANDIFORD | DEPARTMENT OF CITYWIDE ADM | 12702 | X | $5.00 |
| 1 | 868 | C A WIGFALL | DEPARTMENT OF CITYWIDE ADM | 12702 | X | $5.00 |
| 2 | 69 | A E A-AWOSOGBA | HRA/DEPARTMENT OF SOCIAL S | 52311 | A | $51955.00 |
Rename on import
You also set their names when you’re reading in the csv.
# Read in the csv, passing names= to set the column names
df = pd.read_csv("../Civil_List_2014.csv", names=["Department", "Name", "Address", "Title", "Pay Class", "Salary Rate"]).head(3)
df/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
| Department | Name | Address | Title | Pay Class | Salary Rate | |
|---|---|---|---|---|---|---|
| 0 | DPT | NAME | ADDRESS | TTL # | PC | SAL-RATE |
| 1 | 868 | B J SANDIFORD | DEPARTMENT OF CITYWIDE ADM | 12702 | X | $5.00 |
| 2 | 868 | C A WIGFALL | DEPARTMENT OF CITYWIDE ADM | 12702 | X | $5.00 |
Unfortunately this leaves you with the first row of actual headers inside of
your data. When usings names= in read_csv, add skiprows=1 to skip the
first row (the header row).
# Pass in names=, but also skiprows=1
df = pd.read_csv("../Civil_List_2014.csv", skiprows=1, names=["Department", "Name", "Address", "Title", "Pay Class", "Salary Rate"]).head(3)
df| Department | Name | Address | Title | Pay Class | Salary Rate | |
|---|---|---|---|---|---|---|
| 0 | 868 | B J SANDIFORD | DEPARTMENT OF CITYWIDE ADM | 12702 | X | $5.00 |
| 1 | 868 | C A WIGFALL | DEPARTMENT OF CITYWIDE ADM | 12702 | X | $5.00 |
| 2 | 69 | A E A-AWOSOGBA | HRA/DEPARTMENT OF SOCIAL S | 52311 | A | $51955.00 |