AI 개발

[Pandas] 메모리 줄이기 read_csv, chunk, multiprocessing

pandas는 디스크말고 RAM에 데이터를 적재하는데, 이때 Contiguous Memory Allocation(연속 메모리 할당) 방식을 사용한다.

디스크 적재시(Reading from SSDs) : ~16000 nanoseconds
램 적재시(Reading from RAM) : ~100 nanoseconds

**Contiguous Memory Allocation(consecutive blocks are assigned) : logical address가 연속적이면 physical address도 연속적으로 배치된다.

**NonContiguous Memory Allocation(separate blocks at different locations)

멀티프로세싱과 GPU를 사용하기 전에, pd.read_csv()를 효과적으로 하는 방법 먼저 고려해보자.(근데 큰 효과는 없는듯..)

사용한 데이터 : https://www.kaggle.com/c/bluebook-for-bulldozers/overview

Blue Book for Bulldozers

Predict the auction sale price for a piece of heavy equipment to create a "blue book" for bulldozers.

www.kaggle.com

Summary

1. read_csv() 인자에 usecols 사용하기

2. 수치형(Numerical) 데이터에 올바른 dtype 사용하기

3. missing value, NANs를 로딩할 때 치환하는 converters 사용하기

4. 범주형(Categorical) 데이터에 올바른 dtype 사용하기

5. Sparse Series 사용하기

6. nrows, skip rows 사용하기

7. Chunks 사용하기

8. Multiprocessing 사용하기

Jupyter Code

1. read_csv의 usecols 옵션¶

In [1]:

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [26]:

%%time
df = pd.read_csv('./Train.csv')

CPU times: user 558 ms, sys: 20.5 ms, total: 578 ms
Wall time: 577 ms

In [27]:

df.head()

Out[27]:

	SalesID	SalePrice	MachineID	ModelID	datasource	auctioneerID	YearMade	MachineHoursCurrentMeter	UsageBand	saledate	...	Undercarriage_Pad_Width	Stick_Length	Thumb	Pattern_Changer	Grouser_Type	Backhoe_Mounting	Blade_Type	Travel_Controls	Differential_Type	Steering_Controls
0	1139246	66000	999089	3157	121	3	2004	68.0	Low	11/16/2006 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Standard	Conventional
1	1139248	57000	117657	77	121	3	1996	4640.0	Low	3/26/2004 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Standard	Conventional
2	1139249	10000	434808	7009	121	3	2001	2838.0	High	2/26/2004 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	1139251	38500	1026470	332	121	3	2001	3486.0	High	5/19/2011 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	1139253	11000	1057373	17311	121	3	2007	722.0	Medium	7/23/2009 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 53 columns

In [3]:

df.info(verbose=False, memory_usage='deep')
# memory_usage
# We get bytes used by each variable, but this time it gives the memory use of object data types.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192500 entries, 0 to 192499
Columns: 53 entries, SalesID to Steering_Controls
dtypes: float64(1), int64(7), object(45)
memory usage: 390.6 MB

In [4]:

req_cols = ['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
       'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
       'saledate', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
       'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
       'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
       'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
       'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
       'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
       'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
       'Coupler', 'Coupler_System']

In [5]:

%%time
df = pd.read_csv('Train.csv', usecols=req_cols)

CPU times: user 490 ms, sys: 47.6 ms, total: 538 ms
Wall time: 537 ms

2. use correct dtypes for numerical data¶

int8 can store integers from -128 to 127
int16 can store integers from -32768 to 32767
int64 can store integers from -9223372036854775808 to 9223372036854775807
int64가 디폴트

In [6]:

df['YearMade'].memory_usage(index=False, deep=True)

Out[6]:

In [7]:

df['YearMade'].min()

Out[7]:

In [8]:

df['YearMade'].max()

Out[8]:

In [9]:

df = pd.read_csv('Train.csv', dtype={"YearMade": "int16"})
df['YearMade'].memory_usage(index=False, deep=True)

Out[9]:

In [10]:

(1540000 - 385000)/ 1540000*100

Out[10]:

75.0

3. use converters¶

missing value, NANs를 로딩할 때 다른 값으로 치환.

In [12]:

%%time
df = pd.read_csv('Train.csv', dtype={"auctioneerID": "int8"})

CPU times: user 569 ms, sys: 35.4 ms, total: 604 ms
Wall time: 603 ms

In [11]:

%%time
def converter(val):
    if val == np.nan:
        return 0
    return val

df = pd.read_csv('Train.csv',
                converters={"auctioneerID": converter},
                dtype={"auctioneerID": "int8"})

CPU times: user 634 ms, sys: 43.2 ms, total: 677 ms
Wall time: 676 ms

4. use correct dtypes for categorical columns¶

파싱되지 않는 값을 categorical values로 바꾸면, 메모리가 준다.

In [13]:

df['Thumb'].value_counts()
# Thumb 컬럼 디폴트로 string으로 파싱되었지만,
# fixed number과 파싱되지 않는 값들이 존재

Out[13]:

None or Unspecified    33984
Manual                  3897
Hydraulic               1833
Name: Thumb, dtype: int64

In [14]:

df['Thumb'].memory_usage(index=False, deep=True)

Out[14]:

In [15]:

df = pd.read_csv("Train.csv", dtype={"Thumb": "category"})
df['Thumb'].memory_usage(index=False, deep=True)

Out[15]:

In [16]:

(7838425 - 192813)/7838425*100

Out[16]:

97.54015634518413

5. Sparse Series 사용¶

컬럼의 값에 empty value나 missing value, NANs가 많으면, Sparse Series로 바꾸면 메모리가 준다.

In [19]:

df = pd.read_csv('Train.csv')
series = df['Scarifier']
series.memory_usage(index=False, deep=True)

Out[19]:

In [20]:

len(series)

Out[20]:

In [22]:

len(series.dropna())

Out[22]:

In [23]:

sparse_series = series.astype("Sparse[str]")
len(sparse_series)

Out[23]:

In [24]:

sparse_series.memory_usage(index=False, deep=True)

Out[24]:

In [25]:

(6725532-5372792)/6725532*100

Out[25]:

20.11350180179055

6. nrows, skip rows¶

모든 데이터를 램에 로드하더라도 테스트는 작은 데이터로 하는 것이 좋다.

In [ ]:

df = pd.read_csv('Train.csv', nrows=100) # 100개로우만
df = pd.read_csv('Train.csv', skiprows=[0,2,5]) # 0,2,5 행 로우는 빼고

In [29]:

# 효과적으로 nrows를 사용하는 방법은, 
# 모든 컬럼마다 적절한 dtypes을 체크하고 정의하는 것이다.
sample = pd.read_csv("Train.csv", nrows=100) # Load Sample data
dtypes = sample.dtypes # Get the dtypes
cols = sample.columns # Get the columns
dtype_dictionary = {} 
for c in cols:
    """
    Write your own dtypes using 
    # rule 2
    # rule 3 
    """
    if str(dtypes[c]) == 'int64':
        dtype_dictionary[c] = 'float32' # Handle NANs in int columns
    else:
        dtype_dictionary[c] = str(dtypes[c])
# Load Data with increased speed and reduced memory.
df = pd.read_csv("Train.csv", dtype=dtype_dictionary, 
                 keep_default_na=False, 
                 error_bad_lines=False,
                 na_values=['na',''])

7. Chunks¶

사실상 전체 데이터를 읽어들이는 것보다 느리고, 하나로 합칠 때는 concat이 필요하지만,
내 메모리보다 더 큰(10GB를 넘는) 데이터를 로드하게 해준다.

In [30]:

df = pd.read_csv('Train.csv', chunksize=1000)
total_len = 0
for chunk in df:
    total_len += len(chunk)
total_len

Out[30]:

In [32]:

# concatnate each chunk one by one
tp = pd.read_csv('Train.csv', iterator=True, chunksize=1000)
df = pd.concat(tp, ignore_index=True)
len(df)

Out[32]:

8. Multiprocessing¶

pandas는 multiprocessing을 위한 njobs variable을 가지고 있지 않다.
python utillze의 multiprocessing library를 사용해서 chunk size 만큼 핸들링

In [33]:

import multiprocessing as mp

In [34]:

%%time
df = pd.read_csv('Train.csv', chunksize=1000)
total_length = 0
for chunk in df:
    total_length += len(chunk)
total_length

CPU times: user 1.22 s, sys: 12.5 ms, total: 1.24 s
Wall time: 1.23 s

Out[34]:

In [35]:

%%time
LARGE_FILE = "Train.csv"
CHUNKSIZE = 1000 # processing 100,000 rows at a time

def process_frame(df):
        # process data frame
        return len(df)

if __name__ == '__main__':
        reader = pd.read_table(LARGE_FILE, chunksize=CHUNKSIZE)
        pool = mp.Pool(4) # use 4 processes

        funclist = []
        for df in reader:
                # process each data frame
                f = pool.apply_async(process_frame,[df])
                funclist.append(f)
                
        result = 0
        for f in funclist:
                result += f.get(timeout=10) # timeout in 10 seconds

        print (f"There are {result} rows of data")

There are 192500 rows of data
CPU times: user 537 ms, sys: 54.5 ms, total: 592 ms
Wall time: 531 ms

8. Dask Instead of Pandas¶

Dask는 pandas보다 preprocessing하기에 좋진 않지만, 병렬 계산과 데이터 로드가 판다스보다 빠르다.

In [37]:

import dask.dataframe as dd # pip install dask 필요

data = dd.read_csv("Train.csv",dtype={'MachineHoursCurrentMeter': 'float64'},assume_missing=True)
data.compute()

Out[37]:

	SalesID	SalePrice	MachineID	ModelID	datasource	auctioneerID	YearMade	MachineHoursCurrentMeter	UsageBand	saledate	...	Undercarriage_Pad_Width	Stick_Length	Thumb	Pattern_Changer	Grouser_Type	Backhoe_Mounting	Blade_Type	Travel_Controls	Differential_Type	Steering_Controls
0	1139246.0	66000.0	999089.0	3157.0	121.0	3.0	2004.0	68.0	Low	11/16/2006 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Standard	Conventional
1	1139248.0	57000.0	117657.0	77.0	121.0	3.0	1996.0	4640.0	Low	3/26/2004 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Standard	Conventional
2	1139249.0	10000.0	434808.0	7009.0	121.0	3.0	2001.0	2838.0	High	2/26/2004 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	1139251.0	38500.0	1026470.0	332.0	121.0	3.0	2001.0	3486.0	High	5/19/2011 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	1139253.0	11000.0	1057373.0	17311.0	121.0	3.0	2007.0	722.0	Medium	7/23/2009 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
76475	1629579.0	20000.0	1335822.0	4604.0	132.0	1.0	1997.0	NaN	NaN	11/27/2001 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
76476	1629581.0	21000.0	1286789.0	4604.0	132.0	1.0	1997.0	NaN	NaN	7/24/2008 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
76477	1629583.0	30000.0	1287576.0	4604.0	132.0	1.0	1997.0	NaN	NaN	4/8/1999 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
76478	1629585.0	34000.0	1150345.0	4806.0	132.0	1.0	1997.0	NaN	NaN	9/24/2004 0:00	...	NaN	NaN	NaN	NaN	NaN	None or Unspecified	PAT	None or Unspecified	NaN	NaN
76479	1629594.0	30000.0	1467123.0	4604.0	132.0	1.0	1997.0	NaN	NaN	4/8/1999 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

192500 rows × 53 columns

References

https://towardsdatascience.com/️-load-the-same-csv-file-10x-times-faster-and-with-10x-less-memory-️-e93b485086c7

저작자표시 비영리 변경금지

'AI 개발' 카테고리의 다른 글

[Python] 모듈 상대경로(ImportError: attempted relative import with no known parent package) (0)	2021.10.08
[Python] collections & itertools 의 유용한 함수들 (0)	2021.09.02
Pandas vs PySpark (0)	2021.06.22
같은 글자의 유니코드가 다를때, 정규식이 먹지 않을때 (0)	2021.05.17
[Pandas] pandas 꿀팁(?) (0)	2021.05.17

Contents

새소식

인기 검색어