Description

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

Snippets

Standard import

   1 import numpy as np
   2 import pandas as pd

Read CSV file

   1 path = '/path/to/file(s)'
   2 df = pd.read_csv(path + 'name.csv', sep=';', header=0, dtype={'Force String Column': str}, 
   3                  parse_dates=['Date Column'] encoding='utf-8')

Variable	Meaning
sep	separator character, can be an array
header	row number that contains column titles. remove if none
dtype	force any column to be interpreted as specific datatype
parse_dates	parse any column as datetime
encoding	file encoding e.g. utf-8 or ansi

Renaming Columns

   1 dt.rename(columns={'File column name 1': 'Column1', 
   2                    'File column name 2': '2'
   3                   }, inplace=True)

Drop Column

   1 df.drop('Column A', axis=1, inplace=True)

Do Y for each value in column X

   1 df.loc[:, ['X']] = df.loc[:, ['X']].apply(lambda x: x.Y)

Concat DataFrames

   1 df = pd.concat([df1, df2])
   2 df.reset_index(drop=True, inplace=True)

reset_index() is called to get rid of duplicate indices

Filter by regex on column A

   1 df = df[df['A']str.match('regex')]

Merge DataFrames

   1 df = pd.merge(df1, df2, on='Column A', how='left', suffixes=('_1','_2'))
   2 df = df.fillna(0)                    # To replace NaN fields with 0

Sort DataFrame

   1 df = df.sort_values('Column A', ascending = False)

Pivot Table

   1 pvt_df = pd.pivot_table(df, 
   2                      values=['Column B', 'Column C'], 
   3                      index='Column A', 
   4                      aggfunc={'Column B': [np.sum, np.average],
   5                               'Column C' : lambda x: len(x)})
   6 pvt_df .reset_index(inplace=True)         # Make Column A a column again, instead of index
   7 pvt_df .columns = [' '.join(col).strip() for col in 
   8                    pvt_df.columns.values] # Flatten multi-index columns
   9 pvt_df .rename(columns={                  # Rename known numbers
  10                     'Column B average':'average_B',
  11                     'Column B sum':'sum_B',
  12                     'Column C <lambda>':'count_C'
  13                    }, inplace=True)

Move values from column A and B to column B and C (shift right) if column X is NaN

   1 mask = df.loc[:,'Column X'].isnull()
   2 df.loc[mask,['Column B', 'Column C']] = df.loc[mask,['Column A', 'Column B']].values

-  ⇤ ← Revision 2 as of 2020-04-08 15:53:19 → 
  Size: 641
  Editor: Burathar
  Comment:
+   ← Revision 7 as of 2020-04-08 17:13:18 → ⇥
  Size: 3112
  Editor: Burathar
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 11:
-== Standard import ==
+=== Standard import ===
-Line 18:
+Line 17:
-== Read CSV file ==
+=== Read CSV file ===
-Line 22:
+Line 20:
-import pandas as pd
+df = pd.read_csv(path + 'name.csv', sep=';', header=0, dtype={'Force String Column': str}, 
                 parse_dates=['Date Column'] encoding='utf-8')
}}}
|| '''Variable''' || '''Meaning'''                        ||
|| sep            || separator character, can be an array ||
|| header         || row number that contains column titles. remove if none ||
|| dtype          || force any column to be interpreted as specific [[https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes|datatype]] ||
|| parse_dates    || parse any column as datetime ||
|| encoding       || file encoding e.g. utf-8 or ansi ||

=== Renaming Columns ===
{{{#!highlight python
dt.rename(columns={'File column name 1': 'Column1', 
                   'File column name 2': '2'
                  }, inplace=True)
-Line 25:
+Line 37:
+=== Drop Column ===
-Line 26:
+Line 39:
-import numpy as np
df = pd.read_csv(path + 'name.csv', sep=';', header=0, dtype={'Force String Column': str}, parse_dates=['Date Column'] encoding='utf-8')
+df.drop('Column A', axis=1, inplace=True)
-Line 29:
+Line 41:
+=== Do Y for each value in column X ===
{{{#!highlight python
df.loc[:, ['X']] = df.loc[:, ['X']].apply(lambda x: x.Y)
}}}

=== Concat DataFrames ===
{{{#!highlight python
df = pd.concat([df1, df2])
df.reset_index(drop=True, inplace=True)
}}}
`reset_index()` is called to get rid of duplicate indices

=== Filter by regex on column A ===
{{{#!highlight python
df = df[df['A']str.match('regex')]
}}}

=== Merge DataFrames ===
{{{#!highlight python
df = pd.merge(df1, df2, on='Column A', how='left', suffixes=('_1','_2'))
df = df.fillna(0)                    # To replace NaN fields with 0
}}}

=== Sort DataFrame ===
{{{#!highlight python
df = df.sort_values('Column A', ascending = False)
}}}


=== Pivot Table ===
{{{#!highlight python
pvt_df = pd.pivot_table(df, 
                     values=['Column B', 'Column C'], 
                     index='Column A', 
                     aggfunc={'Column B': [np.sum, np.average],
                              'Column C' : lambda x: len(x)})
pvt_df .reset_index(inplace=True)         # Make Column A a column again, instead of index
pvt_df .columns = [' '.join(col).strip() for col in 
                   pvt_df.columns.values] # Flatten multi-index columns
pvt_df .rename(columns={                  # Rename known numbers
                    'Column B average':'average_B',
                    'Column B sum':'sum_B',
                    'Column C <lambda>':'count_C'
                   }, inplace=True)
}}}


=== Move values from column A and B to column B and C (shift right) if column X is NaN ===
{{{#!highlight python
mask = df.loc[:,'Column X'].isnull()
df.loc[mask,['Column B', 'Column C']] = df.loc[mask,['Column A', 'Column B']].values
}}}

Wiki

Page

User