[Pandas] 메모리 줄이기 read_csv, chunk, multiprocessing
pandas는 디스크말고 RAM에 데이터를 적재하는데, 이때 Contiguous Memory Allocation(연속 메모리 할당) 방식을 사용한다.
- 디스크 적재시(Reading from SSDs) : ~16000 nanoseconds
- 램 적재시(Reading from RAM) : ~100 nanoseconds
**Contiguous Memory Allocation(consecutive blocks are assigned) : logical address가 연속적이면 physical address도 연속적으로 배치된다.
**NonContiguous Memory Allocation(separate blocks at different locations)
멀티프로세싱과 GPU를 사용하기 전에, pd.read_csv()를 효과적으로 하는 방법 먼저 고려해보자.(근데 큰 효과는 없는듯..)
사용한 데이터 : https://www.kaggle.com/c/bluebook-for-bulldozers/overview
Summary
1. read_csv() 인자에 usecols 사용하기
2. 수치형(Numerical) 데이터에 올바른 dtype 사용하기
3. missing value, NANs를 로딩할 때 치환하는 converters 사용하기
4. 범주형(Categorical) 데이터에 올바른 dtype 사용하기
5. Sparse Series 사용하기
6. nrows, skip rows 사용하기
7. Chunks 사용하기
8. Multiprocessing 사용하기
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
%%time
df = pd.read_csv('./Train.csv')
CPU times: user 558 ms, sys: 20.5 ms, total: 578 ms
Wall time: 577 ms
df.head()
SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | saledate | ... | Undercarriage_Pad_Width | Stick_Length | Thumb | Pattern_Changer | Grouser_Type | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1139246 | 66000 | 999089 | 3157 | 121 | 3 | 2004 | 68.0 | Low | 11/16/2006 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Standard | Conventional |
1 | 1139248 | 57000 | 117657 | 77 | 121 | 3 | 1996 | 4640.0 | Low | 3/26/2004 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Standard | Conventional |
2 | 1139249 | 10000 | 434808 | 7009 | 121 | 3 | 2001 | 2838.0 | High | 2/26/2004 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1139251 | 38500 | 1026470 | 332 | 121 | 3 | 2001 | 3486.0 | High | 5/19/2011 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1139253 | 11000 | 1057373 | 17311 | 121 | 3 | 2007 | 722.0 | Medium | 7/23/2009 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 53 columns
df.info(verbose=False, memory_usage='deep')
# memory_usage
# We get bytes used by each variable, but this time it gives the memory use of object data types.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192500 entries, 0 to 192499
Columns: 53 entries, SalesID to Steering_Controls
dtypes: float64(1), int64(7), object(45)
memory usage: 390.6 MB
req_cols = ['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
'saledate', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
'Coupler', 'Coupler_System']
%%time
df = pd.read_csv('Train.csv', usecols=req_cols)
CPU times: user 490 ms, sys: 47.6 ms, total: 538 ms
Wall time: 537 ms
2. use correct dtypes for numerical data¶
- int8 can store integers from -128 to 127
- int16 can store integers from -32768 to 32767
- int64 can store integers from -9223372036854775808 to 9223372036854775807
- int64가 디폴트
df['YearMade'].memory_usage(index=False, deep=True)
1540000
df['YearMade'].min()
1000
df['YearMade'].max()
2010
df = pd.read_csv('Train.csv', dtype={"YearMade": "int16"})
df['YearMade'].memory_usage(index=False, deep=True)
385000
(1540000 - 385000)/ 1540000*100
75.0
%%time
df = pd.read_csv('Train.csv', dtype={"auctioneerID": "int8"})
CPU times: user 569 ms, sys: 35.4 ms, total: 604 ms
Wall time: 603 ms
%%time
def converter(val):
if val == np.nan:
return 0
return val
df = pd.read_csv('Train.csv',
converters={"auctioneerID": converter},
dtype={"auctioneerID": "int8"})
CPU times: user 634 ms, sys: 43.2 ms, total: 677 ms
Wall time: 676 ms
df['Thumb'].value_counts()
# Thumb 컬럼 디폴트로 string으로 파싱되었지만,
# fixed number과 파싱되지 않는 값들이 존재
None or Unspecified 33984
Manual 3897
Hydraulic 1833
Name: Thumb, dtype: int64
df['Thumb'].memory_usage(index=False, deep=True)
7838425
df = pd.read_csv("Train.csv", dtype={"Thumb": "category"})
df['Thumb'].memory_usage(index=False, deep=True)
192813
(7838425 - 192813)/7838425*100
97.54015634518413
df = pd.read_csv('Train.csv')
series = df['Scarifier']
series.memory_usage(index=False, deep=True)
6725532
len(series)
192500
len(series.dropna())
15605
sparse_series = series.astype("Sparse[str]")
len(sparse_series)
192500
sparse_series.memory_usage(index=False, deep=True)
5372792
(6725532-5372792)/6725532*100
20.11350180179055
6. nrows, skip rows¶
- 모든 데이터를 램에 로드하더라도 테스트는 작은 데이터로 하는 것이 좋다.
df = pd.read_csv('Train.csv', nrows=100) # 100개로우만
df = pd.read_csv('Train.csv', skiprows=[0,2,5]) # 0,2,5 행 로우는 빼고
# 효과적으로 nrows를 사용하는 방법은,
# 모든 컬럼마다 적절한 dtypes을 체크하고 정의하는 것이다.
sample = pd.read_csv("Train.csv", nrows=100) # Load Sample data
dtypes = sample.dtypes # Get the dtypes
cols = sample.columns # Get the columns
dtype_dictionary = {}
for c in cols:
"""
Write your own dtypes using
# rule 2
# rule 3
"""
if str(dtypes[c]) == 'int64':
dtype_dictionary[c] = 'float32' # Handle NANs in int columns
else:
dtype_dictionary[c] = str(dtypes[c])
# Load Data with increased speed and reduced memory.
df = pd.read_csv("Train.csv", dtype=dtype_dictionary,
keep_default_na=False,
error_bad_lines=False,
na_values=['na',''])
df = pd.read_csv('Train.csv', chunksize=1000)
total_len = 0
for chunk in df:
total_len += len(chunk)
total_len
192500
# concatnate each chunk one by one
tp = pd.read_csv('Train.csv', iterator=True, chunksize=1000)
df = pd.concat(tp, ignore_index=True)
len(df)
192500
8. Multiprocessing¶
- pandas는 multiprocessing을 위한 njobs variable을 가지고 있지 않다.
python utillze의 multiprocessing library를 사용해서 chunk size 만큼 핸들링
import multiprocessing as mp
%%time
df = pd.read_csv('Train.csv', chunksize=1000)
total_length = 0
for chunk in df:
total_length += len(chunk)
total_length
CPU times: user 1.22 s, sys: 12.5 ms, total: 1.24 s
Wall time: 1.23 s
192500
%%time
LARGE_FILE = "Train.csv"
CHUNKSIZE = 1000 # processing 100,000 rows at a time
def process_frame(df):
# process data frame
return len(df)
if __name__ == '__main__':
reader = pd.read_table(LARGE_FILE, chunksize=CHUNKSIZE)
pool = mp.Pool(4) # use 4 processes
funclist = []
for df in reader:
# process each data frame
f = pool.apply_async(process_frame,[df])
funclist.append(f)
result = 0
for f in funclist:
result += f.get(timeout=10) # timeout in 10 seconds
print (f"There are {result} rows of data")
There are 192500 rows of data
CPU times: user 537 ms, sys: 54.5 ms, total: 592 ms
Wall time: 531 ms
import dask.dataframe as dd # pip install dask 필요
data = dd.read_csv("Train.csv",dtype={'MachineHoursCurrentMeter': 'float64'},assume_missing=True)
data.compute()
SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | saledate | ... | Undercarriage_Pad_Width | Stick_Length | Thumb | Pattern_Changer | Grouser_Type | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1139246.0 | 66000.0 | 999089.0 | 3157.0 | 121.0 | 3.0 | 2004.0 | 68.0 | Low | 11/16/2006 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Standard | Conventional |
1 | 1139248.0 | 57000.0 | 117657.0 | 77.0 | 121.0 | 3.0 | 1996.0 | 4640.0 | Low | 3/26/2004 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Standard | Conventional |
2 | 1139249.0 | 10000.0 | 434808.0 | 7009.0 | 121.0 | 3.0 | 2001.0 | 2838.0 | High | 2/26/2004 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1139251.0 | 38500.0 | 1026470.0 | 332.0 | 121.0 | 3.0 | 2001.0 | 3486.0 | High | 5/19/2011 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1139253.0 | 11000.0 | 1057373.0 | 17311.0 | 121.0 | 3.0 | 2007.0 | 722.0 | Medium | 7/23/2009 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
76475 | 1629579.0 | 20000.0 | 1335822.0 | 4604.0 | 132.0 | 1.0 | 1997.0 | NaN | NaN | 11/27/2001 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
76476 | 1629581.0 | 21000.0 | 1286789.0 | 4604.0 | 132.0 | 1.0 | 1997.0 | NaN | NaN | 7/24/2008 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
76477 | 1629583.0 | 30000.0 | 1287576.0 | 4604.0 | 132.0 | 1.0 | 1997.0 | NaN | NaN | 4/8/1999 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
76478 | 1629585.0 | 34000.0 | 1150345.0 | 4806.0 | 132.0 | 1.0 | 1997.0 | NaN | NaN | 9/24/2004 0:00 | ... | NaN | NaN | NaN | NaN | NaN | None or Unspecified | PAT | None or Unspecified | NaN | NaN |
76479 | 1629594.0 | 30000.0 | 1467123.0 | 4604.0 | 132.0 | 1.0 | 1997.0 | NaN | NaN | 4/8/1999 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
192500 rows × 53 columns
References
https://towardsdatascience.com/️-load-the-same-csv-file-10x-times-faster-and-with-10x-less-memory-️-e93b485086c7