-
[Pandas] 메모리 줄이기 read_csv, chunk, multiprocessing💫 Computer Science/Python & AI Framework 2021. 7. 1. 00:45
pandas는 디스크말고 RAM에 데이터를 적재하는데, 이때 Contiguous Memory Allocation(연속 메모리 할당) 방식을 사용한다.
- 디스크 적재시(Reading from SSDs) : ~16000 nanoseconds
- 램 적재시(Reading from RAM) : ~100 nanoseconds
**Contiguous Memory Allocation(consecutive blocks are assigned) : logical address가 연속적이면 physical address도 연속적으로 배치된다.
**NonContiguous Memory Allocation(separate blocks at different locations)
멀티프로세싱과 GPU를 사용하기 전에, pd.read_csv()를 효과적으로 하는 방법 먼저 고려해보자.(근데 큰 효과는 없는듯..)
사용한 데이터 : https://www.kaggle.com/c/bluebook-for-bulldozers/overview
Summary
1. read_csv() 인자에 usecols 사용하기
2. 수치형(Numerical) 데이터에 올바른 dtype 사용하기
3. missing value, NANs를 로딩할 때 치환하는 converters 사용하기
4. 범주형(Categorical) 데이터에 올바른 dtype 사용하기
5. Sparse Series 사용하기
6. nrows, skip rows 사용하기
7. Chunks 사용하기
8. Multiprocessing 사용하기
In [1]:import pandas as pd import numpy as np import warnings warnings.filterwarnings('ignore')
In [26]:%%time df = pd.read_csv('./Train.csv')
CPU times: user 558 ms, sys: 20.5 ms, total: 578 ms Wall time: 577 ms
In [27]:df.head()
Out[27]:SalesID SalePrice MachineID ModelID datasource auctioneerID YearMade MachineHoursCurrentMeter UsageBand saledate ... Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type Backhoe_Mounting Blade_Type Travel_Controls Differential_Type Steering_Controls 0 1139246 66000 999089 3157 121 3 2004 68.0 Low 11/16/2006 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN Standard Conventional 1 1139248 57000 117657 77 121 3 1996 4640.0 Low 3/26/2004 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN Standard Conventional 2 1139249 10000 434808 7009 121 3 2001 2838.0 High 2/26/2004 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3 1139251 38500 1026470 332 121 3 2001 3486.0 High 5/19/2011 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4 1139253 11000 1057373 17311 121 3 2007 722.0 Medium 7/23/2009 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5 rows × 53 columns
In [3]:df.info(verbose=False, memory_usage='deep') # memory_usage # We get bytes used by each variable, but this time it gives the memory use of object data types.
<class 'pandas.core.frame.DataFrame'> RangeIndex: 192500 entries, 0 to 192499 Columns: 53 entries, SalesID to Steering_Controls dtypes: float64(1), int64(7), object(45) memory usage: 390.6 MB
In [4]:req_cols = ['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource', 'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand', 'saledate', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor', 'ProductSize', 'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc', 'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control', 'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size', 'Coupler', 'Coupler_System']
In [5]:%%time df = pd.read_csv('Train.csv', usecols=req_cols)
CPU times: user 490 ms, sys: 47.6 ms, total: 538 ms Wall time: 537 ms
2. use correct dtypes for numerical data¶
- int8 can store integers from -128 to 127
- int16 can store integers from -32768 to 32767
- int64 can store integers from -9223372036854775808 to 9223372036854775807
- int64가 디폴트
In [6]:df['YearMade'].memory_usage(index=False, deep=True)
Out[6]:1540000
In [7]:df['YearMade'].min()
Out[7]:1000
In [8]:df['YearMade'].max()
Out[8]:2010
In [9]:df = pd.read_csv('Train.csv', dtype={"YearMade": "int16"}) df['YearMade'].memory_usage(index=False, deep=True)
Out[9]:385000
In [10]:(1540000 - 385000)/ 1540000*100
Out[10]:75.0
In [12]:%%time df = pd.read_csv('Train.csv', dtype={"auctioneerID": "int8"})
CPU times: user 569 ms, sys: 35.4 ms, total: 604 ms Wall time: 603 ms
In [11]:%%time def converter(val): if val == np.nan: return 0 return val df = pd.read_csv('Train.csv', converters={"auctioneerID": converter}, dtype={"auctioneerID": "int8"})
CPU times: user 634 ms, sys: 43.2 ms, total: 677 ms Wall time: 676 ms
In [13]:df['Thumb'].value_counts() # Thumb 컬럼 디폴트로 string으로 파싱되었지만, # fixed number과 파싱되지 않는 값들이 존재
Out[13]:None or Unspecified 33984 Manual 3897 Hydraulic 1833 Name: Thumb, dtype: int64
In [14]:df['Thumb'].memory_usage(index=False, deep=True)
Out[14]:7838425
In [15]:df = pd.read_csv("Train.csv", dtype={"Thumb": "category"}) df['Thumb'].memory_usage(index=False, deep=True)
Out[15]:192813
In [16]:(7838425 - 192813)/7838425*100
Out[16]:97.54015634518413
In [19]:df = pd.read_csv('Train.csv') series = df['Scarifier'] series.memory_usage(index=False, deep=True)
Out[19]:6725532
In [20]:len(series)
Out[20]:192500
In [22]:len(series.dropna())
Out[22]:15605
In [23]:sparse_series = series.astype("Sparse[str]") len(sparse_series)
Out[23]:192500
In [24]:sparse_series.memory_usage(index=False, deep=True)
Out[24]:5372792
In [25]:(6725532-5372792)/6725532*100
Out[25]:20.11350180179055
6. nrows, skip rows¶
- 모든 데이터를 램에 로드하더라도 테스트는 작은 데이터로 하는 것이 좋다.
In [ ]:df = pd.read_csv('Train.csv', nrows=100) # 100개로우만 df = pd.read_csv('Train.csv', skiprows=[0,2,5]) # 0,2,5 행 로우는 빼고
In [29]:# 효과적으로 nrows를 사용하는 방법은, # 모든 컬럼마다 적절한 dtypes을 체크하고 정의하는 것이다. sample = pd.read_csv("Train.csv", nrows=100) # Load Sample data dtypes = sample.dtypes # Get the dtypes cols = sample.columns # Get the columns dtype_dictionary = {} for c in cols: """ Write your own dtypes using # rule 2 # rule 3 """ if str(dtypes[c]) == 'int64': dtype_dictionary[c] = 'float32' # Handle NANs in int columns else: dtype_dictionary[c] = str(dtypes[c]) # Load Data with increased speed and reduced memory. df = pd.read_csv("Train.csv", dtype=dtype_dictionary, keep_default_na=False, error_bad_lines=False, na_values=['na',''])
In [30]:df = pd.read_csv('Train.csv', chunksize=1000) total_len = 0 for chunk in df: total_len += len(chunk) total_len
Out[30]:192500
In [32]:# concatnate each chunk one by one tp = pd.read_csv('Train.csv', iterator=True, chunksize=1000) df = pd.concat(tp, ignore_index=True) len(df)
Out[32]:192500
8. Multiprocessing¶
- pandas는 multiprocessing을 위한 njobs variable을 가지고 있지 않다.
python utillze의 multiprocessing library를 사용해서 chunk size 만큼 핸들링
In [33]:import multiprocessing as mp
In [34]:%%time df = pd.read_csv('Train.csv', chunksize=1000) total_length = 0 for chunk in df: total_length += len(chunk) total_length
CPU times: user 1.22 s, sys: 12.5 ms, total: 1.24 s Wall time: 1.23 s
Out[34]:192500
In [35]:%%time LARGE_FILE = "Train.csv" CHUNKSIZE = 1000 # processing 100,000 rows at a time def process_frame(df): # process data frame return len(df) if __name__ == '__main__': reader = pd.read_table(LARGE_FILE, chunksize=CHUNKSIZE) pool = mp.Pool(4) # use 4 processes funclist = [] for df in reader: # process each data frame f = pool.apply_async(process_frame,[df]) funclist.append(f) result = 0 for f in funclist: result += f.get(timeout=10) # timeout in 10 seconds print (f"There are {result} rows of data")
There are 192500 rows of data CPU times: user 537 ms, sys: 54.5 ms, total: 592 ms Wall time: 531 ms
In [37]:import dask.dataframe as dd # pip install dask 필요 data = dd.read_csv("Train.csv",dtype={'MachineHoursCurrentMeter': 'float64'},assume_missing=True) data.compute()
Out[37]:SalesID SalePrice MachineID ModelID datasource auctioneerID YearMade MachineHoursCurrentMeter UsageBand saledate ... Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type Backhoe_Mounting Blade_Type Travel_Controls Differential_Type Steering_Controls 0 1139246.0 66000.0 999089.0 3157.0 121.0 3.0 2004.0 68.0 Low 11/16/2006 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN Standard Conventional 1 1139248.0 57000.0 117657.0 77.0 121.0 3.0 1996.0 4640.0 Low 3/26/2004 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN Standard Conventional 2 1139249.0 10000.0 434808.0 7009.0 121.0 3.0 2001.0 2838.0 High 2/26/2004 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3 1139251.0 38500.0 1026470.0 332.0 121.0 3.0 2001.0 3486.0 High 5/19/2011 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4 1139253.0 11000.0 1057373.0 17311.0 121.0 3.0 2007.0 722.0 Medium 7/23/2009 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 76475 1629579.0 20000.0 1335822.0 4604.0 132.0 1.0 1997.0 NaN NaN 11/27/2001 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 76476 1629581.0 21000.0 1286789.0 4604.0 132.0 1.0 1997.0 NaN NaN 7/24/2008 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 76477 1629583.0 30000.0 1287576.0 4604.0 132.0 1.0 1997.0 NaN NaN 4/8/1999 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 76478 1629585.0 34000.0 1150345.0 4806.0 132.0 1.0 1997.0 NaN NaN 9/24/2004 0:00 ... NaN NaN NaN NaN NaN None or Unspecified PAT None or Unspecified NaN NaN 76479 1629594.0 30000.0 1467123.0 4604.0 132.0 1.0 1997.0 NaN NaN 4/8/1999 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 192500 rows × 53 columns
References
https://towardsdatascience.com/️-load-the-same-csv-file-10x-times-faster-and-with-10x-less-memory-️-e93b485086c7
'💫 Computer Science > Python & AI Framework' 카테고리의 다른 글
[Python] 모듈 상대경로(ImportError: attempted relative import with no known parent package) (0) 2021.10.08 [Python] collections & itertools 의 유용한 함수들 (0) 2021.09.02 Pandas vs PySpark (0) 2021.06.22 같은 글자의 유니코드가 다를때, 정규식이 먹지 않을때 (0) 2021.05.17 [Pandas] pandas 꿀팁(?) (0) 2021.05.17