Size: 641
Comment:
|
← Revision 7 as of 2020-04-08 17:13:18 ⇥
Size: 3112
Comment:
|
Deletions are marked like this. | Additions are marked like this. |
Line 11: | Line 11: |
== Standard import == |
=== Standard import === |
Line 18: | Line 17: |
== Read CSV file == |
=== Read CSV file === |
Line 22: | Line 20: |
import pandas as pd | df = pd.read_csv(path + 'name.csv', sep=';', header=0, dtype={'Force String Column': str}, parse_dates=['Date Column'] encoding='utf-8') }}} || '''Variable''' || '''Meaning''' || || sep || separator character, can be an array || || header || row number that contains column titles. remove if none || || dtype || force any column to be interpreted as specific [[https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes|datatype]] || || parse_dates || parse any column as datetime || || encoding || file encoding e.g. utf-8 or ansi || === Renaming Columns === {{{#!highlight python dt.rename(columns={'File column name 1': 'Column1', 'File column name 2': '2' }, inplace=True) |
Line 25: | Line 37: |
=== Drop Column === | |
Line 26: | Line 39: |
import numpy as np df = pd.read_csv(path + 'name.csv', sep=';', header=0, dtype={'Force String Column': str}, parse_dates=['Date Column'] encoding='utf-8') |
df.drop('Column A', axis=1, inplace=True) |
Line 29: | Line 41: |
=== Do Y for each value in column X === {{{#!highlight python df.loc[:, ['X']] = df.loc[:, ['X']].apply(lambda x: x.Y) }}} === Concat DataFrames === {{{#!highlight python df = pd.concat([df1, df2]) df.reset_index(drop=True, inplace=True) }}} `reset_index()` is called to get rid of duplicate indices === Filter by regex on column A === {{{#!highlight python df = df[df['A']str.match('regex')] }}} === Merge DataFrames === {{{#!highlight python df = pd.merge(df1, df2, on='Column A', how='left', suffixes=('_1','_2')) df = df.fillna(0) # To replace NaN fields with 0 }}} === Sort DataFrame === {{{#!highlight python df = df.sort_values('Column A', ascending = False) }}} === Pivot Table === {{{#!highlight python pvt_df = pd.pivot_table(df, values=['Column B', 'Column C'], index='Column A', aggfunc={'Column B': [np.sum, np.average], 'Column C' : lambda x: len(x)}) pvt_df .reset_index(inplace=True) # Make Column A a column again, instead of index pvt_df .columns = [' '.join(col).strip() for col in pvt_df.columns.values] # Flatten multi-index columns pvt_df .rename(columns={ # Rename known numbers 'Column B average':'average_B', 'Column B sum':'sum_B', 'Column C <lambda>':'count_C' }, inplace=True) }}} === Move values from column A and B to column B and C (shift right) if column X is NaN === {{{#!highlight python mask = df.loc[:,'Column X'].isnull() df.loc[mask,['Column B', 'Column C']] = df.loc[mask,['Column A', 'Column B']].values }}} |
Contents
Description
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
Snippets
Standard import
Read CSV file
Variable |
Meaning |
sep |
separator character, can be an array |
header |
row number that contains column titles. remove if none |
dtype |
force any column to be interpreted as specific datatype |
parse_dates |
parse any column as datetime |
encoding |
file encoding e.g. utf-8 or ansi |
Renaming Columns
Drop Column
1 df.drop('Column A', axis=1, inplace=True)
Do Y for each value in column X
1 df.loc[:, ['X']] = df.loc[:, ['X']].apply(lambda x: x.Y)
Concat DataFrames
reset_index() is called to get rid of duplicate indices
Filter by regex on column A
1 df = df[df['A']str.match('regex')]
Merge DataFrames
Sort DataFrame
1 df = df.sort_values('Column A', ascending = False)
Pivot Table
1 pvt_df = pd.pivot_table(df,
2 values=['Column B', 'Column C'],
3 index='Column A',
4 aggfunc={'Column B': [np.sum, np.average],
5 'Column C' : lambda x: len(x)})
6 pvt_df .reset_index(inplace=True) # Make Column A a column again, instead of index
7 pvt_df .columns = [' '.join(col).strip() for col in
8 pvt_df.columns.values] # Flatten multi-index columns
9 pvt_df .rename(columns={ # Rename known numbers
10 'Column B average':'average_B',
11 'Column B sum':'sum_B',
12 'Column C <lambda>':'count_C'
13 }, inplace=True)