Here are some snippets that come in handy when cleaning data with pandas. I found them useful while completing the coursework for a Python data science course.
Extract a subset of columns from the dataframe based on a regular expression:
Code:

    import pandas as pd

    persona1 = pd.Series({
        'Last Post On': '02/04/2017',
        'Friends-2015': 10,
        'Friends-2016': 20,
        'Friends-2017': 300
    })
    persona2 = pd.Series({
        'Last Post On': '02/04/2018',
        'Friends-2015': 100,
        'Friends-2016': 240,
        'Friends-2017': 560
    })
    persona3 = pd.Series({
        'Last Post On': '02/04/2014',
        'Friends-2015': 120,
        'Friends-2016': 120,
        'Friends-2017': 120
    })
    df = pd.DataFrame([persona1, persona2, persona3],
                      index=['Chris', 'Bella', 'Laura'])

    # Keep only the columns whose names match the pattern.
    # The raw string stops Python from treating \d as a string escape.
    df.filter(regex=r"Friends-\d{4}")

Output:
| | Friends-2015 | Friends-2016 | Friends-2017 |
|---|---|---|---|
| Chris | 10 | 20 | 300 |
| Bella | 100 | 240 | 560 |
| Laura | 120 | 120 | 120 |
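
One detail worth knowing about the snippet above: `DataFrame.filter(regex=...)` keeps every label in which the pattern is found anywhere (it uses `re.search`), so an unanchored pattern can pick up more columns than intended. The sketch below illustrates this with a made-up extra column, `Old-Friends-2015-backup`, which is not part of the original data and is only there to show the difference. It also shows a boolean-mask alternative built from `df.columns.str.match`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Last Post On': ['02/04/2017', '02/04/2018', '02/04/2014'],
        'Friends-2015': [10, 100, 120],
        'Friends-2016': [20, 240, 120],
        'Friends-2017': [300, 560, 120],
        # Hypothetical column, added only to show the effect of anchoring
        'Old-Friends-2015-backup': [0, 0, 0],
    },
    index=['Chris', 'Bella', 'Laura'],
)

# Unanchored: the pattern may match anywhere in the label,
# so 'Old-Friends-2015-backup' is selected as well
loose = df.filter(regex=r'Friends-\d{4}')

# Anchored with ^ and $: only labels that are exactly
# 'Friends-' followed by four digits
strict = df.filter(regex=r'^Friends-\d{4}$')

# The same selection as a boolean mask over the column labels
mask = df.columns.str.match(r'^Friends-\d{4}$')
masked = df.loc[:, mask]

print(list(loose.columns))
print(list(strict.columns))
```

If the column names are known to be clean, the unanchored form is fine. Anchoring is a cheap safeguard when they are not.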
Set a column based on the value of both the current row and adjacent rows:
For this example, we define gym regulars as those who visited the gym at least three months in a row last year:
Code:

    import datetime

    import pandas as pd

    # One row per person per month; nobody has visited yet
    df = pd.DataFrame({'Month': [datetime.date(2008, i, 1).strftime('%B')
                                 for i in range(1, 13)] * 3,
                       'visited': [False] * 36},
                      index=['Alice'] * 12 + ['Bob'] * 12 + ['Bridgett'] * 12)
    df = df.reset_index()  # the names move into a column called 'index'

    def make_regular(r, name):
        """Mark February, March and April as visited for the given person."""
        r['visited'] = (r['visited'] or
                        ((r['index'] == name) and
                         (r['Month'] in ('February', 'March', 'April'))))
        return r

    df = df.apply(make_regular, axis=1, args=('Alice',))
    df = df.apply(make_regular, axis=1, args=('Bob',))

    # A row starts a streak if it and the next two rows are all visits.
    # shift() ignores where one person's rows end and the next begin; that is
    # safe here only because no streak touches December or January.
    regular = ((df['visited'] == True) &
               (df['visited'].shift(-1) == True) &
               (df['visited'].shift(-2) == True))
    df.loc[regular, 'index'].tolist()

Output:

    ['Alice', 'Bob']

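
The `shift`-based check above compares each row with the rows below it, whoever they belong to. With other data, a streak could appear to continue from one person's last months into the next person's first months. The following is a minimal sketch of one way to avoid that by measuring streaks per person with `groupby`. The `visits` frame, its column names, the helper `longest_streak` and the person `Carol` are all made up for this illustration and are not part of the example above.

```python
import pandas as pd

# Hypothetical visit log: six months per person, in calendar order
visits = pd.DataFrame({
    'name': ['Alice'] * 6 + ['Bob'] * 6 + ['Carol'] * 6,
    'month': list(range(1, 7)) * 3,
    'visited': [False, True, True, True, False, False,    # Alice: 3 in a row
                False, False, False, False, True, True,   # Bob: 2, at the end
                True, False, False, False, False, False], # Carol: 1, at the start
})

def longest_streak(flags):
    """Return the length of the longest run of consecutive True values."""
    flags = flags.astype(bool)
    # Every False starts a new block, so the running count of False
    # values labels each block of rows
    run_id = (~flags).cumsum()
    # The number of True values inside a block is that block's streak length
    return int(flags.groupby(run_id).sum().max())

# Streaks are measured inside each person's own rows
streaks = visits.groupby('name')['visited'].apply(longest_streak)
regulars = streaks[streaks >= 3].index.tolist()
print(regulars)      # ['Alice']

# For contrast: the ungrouped shift check joins Bob's last two months
# with Carol's first month and reports Bob as a regular too
v = visits['visited']
naive = v & v.shift(-1, fill_value=False) & v.shift(-2, fill_value=False)
naive_names = visits.loc[naive, 'name'].unique().tolist()
print(naive_names)   # ['Alice', 'Bob']
```

Both approaches assume the rows are already sorted by person and month. If they are not, sort first.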