modin.pandas.read_csv
- modin.pandas.read_csv(filepath_or_buffer: FilePath, *, sep: str | NoDefault | None = _NoDefault.no_default, delimiter: str | None = None, header: int | Sequence[int] | Literal['infer'] | None = 'infer', names: Sequence[Hashable] | NoDefault | None = _NoDefault.no_default, index_col: IndexLabel | Literal[False] | None = None, usecols: list[Hashable] | Callable | None = None, dtype: DtypeArg | None = None, engine: CSVEngine | None = None, converters: dict[Hashable, Callable] | None = None, true_values: list[Any] | None = None, false_values: list[Any] | None = None, skipinitialspace: bool | None = False, skiprows: int | None = None, skipfooter: int | None = 0, nrows: int | None = None, na_values: Sequence[Hashable] | None = None, keep_default_na: bool | None = True, na_filter: bool | None = True, verbose: bool | None = _NoDefault.no_default, skip_blank_lines: bool | None = True, parse_dates: None | bool | Sequence[int] | Sequence[Sequence[int]] | dict[str, Sequence[int]] = None, infer_datetime_format: bool | None = _NoDefault.no_default, keep_date_col: bool | None = _NoDefault.no_default, date_parser: Callable | None = _NoDefault.no_default, date_format: str | dict | None = None, dayfirst: bool | None = False, cache_dates: bool | None = True, iterator: bool = False, chunksize: int | None = None, compression: Literal['infer', 'gzip', 'bz2', 'brotli', 'zstd', 'deflate', 'raw_deflate', 'none'] = 'infer', thousands: str | None = None, decimal: str | None = '.', lineterminator: str | None = None, quotechar: str = '"', quoting: int | None = 0, doublequote: bool = True, escapechar: str | None = None, comment: str | None = None, encoding: str | None = None, encoding_errors: str | None = 'strict', dialect: str | csv.Dialect | None = None, on_bad_lines: str = 'error', delim_whitespace: bool | None = _NoDefault.no_default, low_memory: bool | None = True, memory_map: bool | None = False, float_precision: Literal['high', 'legacy'] | None = None, storage_options: StorageOptions = 
None, dtype_backend: DtypeBackend = _NoDefault.no_default) → pd.DataFrame [source] (https://github.com/snowflakedb/snowpark-python/blob/v1.26.0/snowpark-python/src/snowflake/snowpark/modin/plugin/extensions/io_overrides.py#L173-L241)
Read csv file(s) into a Snowpark pandas DataFrame. This API can read files stored locally or on a Snowflake stage.
Snowpark pandas stages files (unless they’re already staged) and then reads them using Snowflake’s CSV reader.
- Parameters:
filepath_or_buffer (str) – Local file location or staged file location to read from. Staged file locations start with a '@' symbol. To read a local file location with a name starting with @, escape it using a \@. For more information on staged files, see the Snowflake documentation on stages.
sep (str, default ',') – Delimiter to use to separate fields in an input file. Delimiters can be multiple characters in Snowpark pandas.
delimiter (str, default ',') – Alias for sep.
header (int, list of int, None, default 'infer') – Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed, the behavior is identical to header=0 and column names are inferred from the first line of the file; if column names are passed explicitly, the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. If a non-zero integer or a list of integers is passed, a NotImplementedError will be raised.
names (array-like, optional) – List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
index_col (int, str, sequence of int / str, or False, optional, default None) – Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used. Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
usecols (list-like or callable, optional) – Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). If names are given, the document header row(s) are not taken into account. For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved, use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order. If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD'].
dtype (Type name or dict of column -> type, optional) – Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
engine ({'c', 'python', 'pyarrow', 'snowflake'}, optional) – Changes the parser for reading CSVs. 'snowflake' will use the parser from Snowflake itself, which matches the behavior of the COPY INTO command.
converters (dict, optional) – This parameter is only supported on local files.
true_values (list, optional) – This parameter is only supported on local files.
false_values (list, optional) – This parameter is only supported on local files.
skiprows (list-like, int or callable, optional) – Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
skipfooter (int, default 0) – This parameter is only supported on local files.
nrows (int, optional) – This parameter is only supported on local files.
na_values (scalar, str, list-like, or dict, optional) – Additional strings to recognize as NA/NaN.
keep_default_na (bool, default True) – This parameter is only supported on local files.
na_filter (bool, default True) – This parameter is only supported on local files.
verbose (bool, default False) – This parameter is only supported on local files.
skip_blank_lines (bool, default True) – If True, skip over blank lines rather than interpreting as NaN values.
parse_dates (bool or list of int or names or list of lists or dict, default False) – This parameter is only supported on local files.
infer_datetime_format (bool, default False) – This parameter is only supported on local files.
keep_date_col (bool, default False) – This parameter is only supported on local files.
date_parser (function, optional) – This parameter is only supported on local files.
date_format (str or dict of column -> format, optional) – This parameter is only supported on local files.
dayfirst (bool, default False) – This parameter is only supported on local files.
cache_dates (bool, default True) – This parameter is not supported and will be ignored.
iterator (bool, default False) – This parameter is not supported and will raise an error.
chunksize (int, optional) – This parameter is not supported and will be ignored.
compression (str, default 'infer') – String (constant) that specifies the current compression algorithm for the data files to be loaded. Snowflake uses this option to detect how already-compressed data files were compressed so that the compressed data in the files can be extracted for loading. See the list of Snowflake standard compressions.
thousands (str, optional) – This parameter is only supported on local files.
decimal (str, default '.') – This parameter is only supported on local files.
lineterminator (str (length 1), optional) – This parameter is only supported on local files.
quotechar (str (length 1), optional) – The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
quoting (int or csv.QUOTE_* instance, default 0) – This parameter is only supported on local files.
doublequote (bool, default True) – This parameter is only supported on local files.
escapechar (str (length 1), optional) – This parameter is only supported on local files.
comment (str, optional) – This parameter is only supported on local files.
encoding (str, default 'utf-8') – Encoding to use for UTF when reading/writing (e.g. 'utf-8'). See the list of Snowflake standard encodings.
encoding_errors (str, optional, default "strict") – This parameter is only supported on local files.
dialect (str or csv.Dialect, optional) – This parameter is only supported on local files.
on_bad_lines ({'error', 'warn', 'skip'} or callable, default 'error') – This parameter is only supported on local files.
delim_whitespace (bool, default False) – This parameter is only supported on local files, not on files that have been uploaded to a Snowflake stage.
low_memory (bool, default True) – This parameter is not supported and will be ignored.
memory_map (bool, default False) – This parameter is not supported and will be ignored.
float_precision (str, optional) – This parameter is not supported and will be ignored.
dtype_backend ({'numpy_nullable', 'pyarrow'}, default 'numpy_nullable') – This parameter is not supported and will be ignored.
- Return type:
Snowpark pandas DataFrame
- Raises:
NotImplementedError – If a parameter is not supported.
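The header rules described under Parameters can be summarized in a few lines of plain Python. This is only an illustrative sketch: resolve_header is a hypothetical helper, not part of Snowpark pandas or Modin, and it only mirrors the documented behavior (inference to header=0 or header=None, and NotImplementedError for a non-zero integer or a list of integers).

```python
from typing import Optional, Sequence, Union


def resolve_header(
    header: Union[int, Sequence[int], str, None] = "infer",
    names: Optional[Sequence[str]] = None,
) -> Optional[int]:
    """Return the effective header row (0 or None) per the documented rules.

    Hypothetical helper for illustration only; not a Snowpark pandas API.
    """
    if header is None:
        return None
    if isinstance(header, str):
        if header != "infer":
            raise ValueError(f"unsupported header value: {header!r}")
        # No names passed: behave like header=0.
        # Names passed explicitly: behave like header=None.
        return 0 if names is None else None
    if header == 0:
        return 0
    # Non-zero integers and lists of integers are documented as unsupported.
    raise NotImplementedError(f"header={header!r} is not supported")
```

Under these assumptions, resolve_header() gives 0, resolve_header(names=['a', 'b']) gives None, and resolve_header(header=0, names=['a', 'b']) gives 0, which corresponds to replacing the names found in the file's first line.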
Notes
Both local files and files staged on Snowflake can be passed into filepath_or_buffer. A single file or a folder that matches a set of files can be passed into filepath_or_buffer. Local files will be processed locally by default using the standard pandas parser before they are uploaded to a staging location as parquet files. This behavior can be overridden by explicitly using the Snowflake engine with engine="snowflake".
If parsing the file using Snowflake, certain parameters may not be supported and the order of rows in the dataframe may be different than the order of records in an input file. When reading multiple files, there is no deterministic order in which the files are read.
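As a rough illustration of how the two kinds of paths accepted by filepath_or_buffer differ, the sketch below classifies a path string. It assumes that the escape for a local name beginning with @ is a leading backslash (\@). The helpers is_staged_location and local_name are hypothetical, written for this example only, and are not the library's implementation.

```python
def is_staged_location(path: str) -> bool:
    """Return True if `path` looks like a Snowflake stage location.

    Hypothetical helper: staged locations start with '@'. A path that
    starts with a backslash followed by '@' is assumed to be an escaped
    local file name, so it does not start with '@' and counts as local.
    """
    return path.startswith("@")


def local_name(path: str) -> str:
    """Strip the assumed backslash escape from a local path, if present."""
    if path.startswith("\\@"):
        return path[1:]
    return path
```

With these definitions, '@mytempstage/myprefix/data.csv' is classified as staged, while '/tmp/data.csv' and an escaped name such as r'\@data.csv' are classified as local.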
Examples
Read local csv file.
>>> import csv
>>> import tempfile
>>> temp_dir = tempfile.TemporaryDirectory()
>>> temp_dir_name = temp_dir.name
>>> with open(f'{temp_dir_name}/data.csv', 'w') as f:
...     writer = csv.writer(f)
...     writer.writerows([['c1','c2','c3'], [1,2,3], [4,5,6], [7,8,9]])
>>> import modin.pandas as pd
>>> import snowflake.snowpark.modin.plugin
>>> df = pd.read_csv(f'{temp_dir_name}/data.csv')
>>> df
   c1  c2  c3
0   1   2   3
1   4   5   6
2   7   8   9
Read staged csv file.
>>> _ = session.sql("create or replace temp stage mytempstage").collect()
>>> _ = session.file.put(f'{temp_dir_name}/data.csv', '@mytempstage/myprefix')
>>> df2 = pd.read_csv('@mytempstage/myprefix/data.csv')
>>> df2
   c1  c2  c3
0   1   2   3
1   4   5   6
2   7   8   9
Read csv files from a local folder.
>>> with open(f'{temp_dir_name}/data2.csv', 'w') as f:
...     writer = csv.writer(f)
...     writer.writerows([['c1','c2','c3'], [1,2,3], [4,5,6], [7,8,9]])
>>> df3 = pd.read_csv(f'{temp_dir_name}/data2.csv')
>>> df3
   c1  c2  c3
0   1   2   3
1   4   5   6
2   7   8   9
Read csv files from a staged location.
>>> _ = session.file.put(f'{temp_dir_name}/data2.csv', '@mytempstage/myprefix')
>>> df4 = pd.read_csv('@mytempstage/myprefix')
>>> df4
   c1  c2  c3
0   1   2   3
1   4   5   6
2   7   8   9
3   1   2   3
4   4   5   6
5   7   8   9
>>> temp_dir.cleanup()
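The usecols semantics described under Parameters (a list-like of positions or names with element order ignored, or a callable applied to column names) can be mimicked in plain Python. The helper select_columns below is hypothetical and written only for illustration; it is not the parser's actual implementation, and it does not enforce the documented requirement that a list be all positions or all names.

```python
from typing import Callable, Hashable, List, Sequence, Union


def select_columns(
    columns: Sequence[str],
    usecols: Union[Sequence[Hashable], Callable[[str], bool]],
) -> List[str]:
    """Return the selected column names in file order.

    Hypothetical helper mirroring the documented usecols rules.
    """
    if callable(usecols):
        # Keep the names for which the callable evaluates to True.
        return [name for name in columns if usecols(name)]
    wanted = set(usecols)
    # Integers are treated as positions, strings as column names.
    # Iterating over `columns` means the order of `usecols` is ignored.
    return [
        name
        for pos, name in enumerate(columns)
        if pos in wanted or name in wanted
    ]


columns = ["foo", "bar", "baz"]
by_name = select_columns(columns, ["bar", "foo"])
by_position = select_columns(columns, [1, 0])
by_callable = select_columns(
    ["AAA", "ccc", "ddd"], lambda x: x.upper() in ["AAA", "BBB", "DDD"]
)
```

Here by_name and by_position are both ['foo', 'bar'], which shows why an explicit reindex such as [['bar', 'foo']] is needed after reading when a particular column order matters.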