Package org.openrefine.model
Interface Grid
-
- All Known Implementing Classes:
LocalGrid, LoggedGrid, TestingGrid
public interface Grid
Immutable object which represents the state of the project grid at a given point in a workflow.
-
-
Nested Class Summary
Nested Classes (columns: Modifier and Type, Interface, Description)
static class
Grid.ApproxCount
Utility class to represent the outcome of a partial count: the number of records/rows processed, and how many of these fulfilled the condition.
static class
Grid.Metadata
Utility class to help with deserialization of the metadata without other attributes (such as number of rows).
static class
Grid.PartialAggregation<T extends Serializable>
Utility class to represent the result of a partial aggregation.
-
Field Summary
Fields (columns: Modifier and Type, Field, Description)
static String
GRID_PATH
static String
METADATA_PATH
-
Method Summary
Methods (columns: Modifier and Type, Method, Description)
<T extends Serializable> T aggregateRecords(RecordAggregator<T> aggregator, T initialState)
Computes the result of a record aggregator on the grid.
<T extends Serializable> Grid.PartialAggregation<T> aggregateRecordsApprox(RecordAggregator<T> aggregator, T initialState, long maxRecords)
Computes the result of a record aggregator on the grid, reading approximately at most a fixed number of records.
<T extends Serializable> T aggregateRows(RowAggregator<T> aggregator, T initialState)
Computes the result of a row aggregator on the grid.
<T extends Serializable> Grid.PartialAggregation<T> aggregateRowsApprox(RowAggregator<T> aggregator, T initialState, long maxRows)
Computes the result of a row aggregator on the grid, reading approximately at most a fixed number of rows.
boolean cache()
Attempt to cache this grid in memory.
ProgressingFuture<Boolean> cacheAsync()
Attempt to cache this grid in memory, asynchronously.
List<Record> collectRecords()
Returns all records in a list.
List<IndexedRow> collectRows()
Returns all rows in a list.
default Grid concatenate(List<Grid> otherGrids)
Concatenates this with other grids, in the given order.
Grid concatenate(Grid other)
Creates a new grid containing all rows in this grid, followed by all rows in the other grid supplied.
long countMatchingRecords(RecordFilter filter)
Returns the number of records which match this filter.
Grid.ApproxCount countMatchingRecordsApprox(RecordFilter filter, long limit)
Returns the number of records matching the given record filter, but by processing approximately at most a fixed number of records.
long countMatchingRows(RowFilter filter)
Count the number of rows which match a given filter.
Grid.ApproxCount countMatchingRowsApprox(RowFilter filter, long limit)
Returns the number of rows matching the given row filter, but by processing approximately at most a fixed number of rows.
default Grid dropRows(long rowsToDrop)
Drop the first rows.
Grid flatMapRows(RowFlatMapper mapper, ColumnModel newColumnModel)
Returns a new grid, where the rows have been mapped by the flat mapper.
ColumnModel getColumnModel()
Map<String,OverlayModel> getOverlayModels()
Record getRecord(long id)
Returns a record obtained by its id.
List<Record> getRecordsAfter(long start, int limit)
Returns a list of records, starting from a given index and defined by a maximum size.
List<Record> getRecordsAfter(RecordFilter filter, long start, int limit)
Among the filtered subset of records, returns a list of records, starting from a given index and defined by a maximum size.
List<Record> getRecordsBefore(long end, int limit)
Returns a list of consecutive records, ending at a given index (exclusive) and defined by a maximum size.
List<Record> getRecordsBefore(RecordFilter filter, long end, int limit)
Among the filtered subset of records, returns a list of records, ending at a given index (exclusive) and defined by a maximum size.
Row getRow(long id)
Returns a row by index.
default List<IndexedRow> getRows(List<Long> rowIndices)
Returns a list of rows corresponding to the row indices supplied.
List<IndexedRow> getRowsAfter(long start, int limit)
Returns a list of rows, starting from a given index and defined by a maximum size.
List<IndexedRow> getRowsAfter(RowFilter filter, long start, int limit)
Among the subset of filtered rows, returns a list of rows, starting from a given index and defined by a maximum size.
List<IndexedRow> getRowsBefore(long end, int limit)
Returns a list of consecutive rows, just before the given row index (not included) and up to a maximum size.
List<IndexedRow> getRowsBefore(RowFilter filter, long end, int limit)
Among the subset of filtered rows, returns a list of rows, just before the row with a given index (excluded) and defined by a maximum size.
Runner getRunner()
boolean isCached()
Is this grid cached in memory?
CloseableIterator<Record> iterateRecords(RecordFilter filter)
Iterate over records matched by a filter.
CloseableIterator<IndexedRow> iterateRows(RowFilter filter)
Iterate over rows matched by a filter, in the order determined by a sorting configuration.
<T> Grid join(ChangeData<T> changeData, RecordChangeDataJoiner<T> recordJoiner, ColumnModel newColumnModel)
Joins pre-computed change data with the current grid data, record by record.
<T> Grid join(ChangeData<T> changeData, RowChangeDataFlatJoiner<T> rowJoiner, ColumnModel newColumnModel)
Joins pre-computed change data with the current grid data, with a joiner function that can return multiple rows for a given original row.
<T> Grid join(ChangeData<T> changeData, RowChangeDataJoiner<T> rowJoiner, ColumnModel newColumnModel)
Joins pre-computed change data with the current grid data, row by row.
default Grid limitRows(long rowLimit)
Only keep the first rows.
<T> ChangeData<T> mapRecords(RecordFilter filter, RecordChangeDataProducer<T> recordMapper, Optional<ChangeData<T>> incompleteChangeData)
Extract change data by applying a function to each filtered record.
Grid mapRecords(RecordMapper mapper, ColumnModel newColumnModel)
Returns a new grid, where the records have been mapped by the mapper.
<T> ChangeData<T> mapRows(RowFilter filter, RowChangeDataProducer<T> rowMapper, Optional<ChangeData<T>> incompleteChangeData)
Extract change data by applying a function to each filtered row.
Grid mapRows(RowMapper mapper, ColumnModel newColumnModel)
Returns a new grid, where the rows have been mapped by the mapper.
<S extends Serializable> Grid mapRows(RowScanMapper<S> mapper, ColumnModel newColumnModel)
Returns a new grid where the rows have been mapped by the stateful mapper.
long recordCount()
Grid removeRecords(RecordFilter filter)
Removes all records selected by a filter.
Grid removeRows(RowFilter filter)
Removes all rows selected by a filter.
Grid reorderRecords(SortingConfig sortingConfig, boolean permanent)
Returns a new grid where records have been reordered according to the configuration supplied.
Grid reorderRows(SortingConfig sortingConfig, boolean permanent)
Returns a new grid where rows have been reordered according to the configuration supplied.
long rowCount()
void saveToFile(File file)
Saves the grid to a specified directory, following OpenRefine's format for grid storage.
ProgressingFuture<Void> saveToFileAsync(File file)
Saves the grid to a specified directory, in an asynchronous fashion.
void uncache()
Free up any memory used to cache this grid.
Grid withColumnModel(ColumnModel newColumnModel)
Grid withOverlayModels(Map<String,OverlayModel> overlayModel)
Returns a new grid where the overlay models have changed.
-
-
-
Field Detail
-
METADATA_PATH
static final String METADATA_PATH
- See Also:
- Constant Field Values
-
GRID_PATH
static final String GRID_PATH
- See Also:
- Constant Field Values
-
-
Method Detail
-
getRunner
Runner getRunner()
- Returns:
- the runner which created this grid
-
getColumnModel
ColumnModel getColumnModel()
- Returns:
- the column metadata at this stage of the workflow
-
withColumnModel
Grid withColumnModel(ColumnModel newColumnModel)
- Parameters:
newColumnModel
- the column model to apply to the grid
- Returns:
- a copy of this grid with a modified column model.
-
getRow
Row getRow(long id)
Returns a row by index. Repeatedly calling this method to obtain multiple rows might be inefficient compared to fetching them in batches, depending on the implementation.
- Parameters:
id
- the row index. This refers to the current position of the row in the grid, which corresponds to IndexedRow.getIndex().
- Returns:
- the row at the given index
- Throws:
IndexOutOfBoundsException
- if row id could not be found
-
getRowsAfter
List<IndexedRow> getRowsAfter(long start, int limit)
Returns a list of rows, starting from a given index and defined by a maximum size.
- Parameters:
start
- the first row id to fetch (inclusive)
limit
- the maximum number of rows to fetch
- Returns:
- the list of rows with their ids (if any)
-
getRowsAfter
List<IndexedRow> getRowsAfter(RowFilter filter, long start, int limit)
Among the subset of filtered rows, returns a list of rows, starting from a given index and defined by a maximum size.
- Parameters:
filter
- the subset of rows to paginate through. This object and its dependencies are required to be serializable.
start
- the first row id to fetch (inclusive)
limit
- the maximum number of rows to fetch
- Returns:
- the list of rows with their ids (if any)
- See Also:
getRowsBefore(long, int)
-
getRowsBefore
List<IndexedRow> getRowsBefore(long end, int limit)
Returns a list of consecutive rows, just before the given row index (not included) and up to a maximum size.
- Parameters:
end
- the last row id to fetch (exclusive)
limit
- the maximum number of rows to fetch
- Returns:
- the list of rows with their ids (if any)
- See Also:
getRowsAfter(long, int)
-
getRowsBefore
List<IndexedRow> getRowsBefore(RowFilter filter, long end, int limit)
Among the subset of filtered rows, returns a list of rows, just before the row with a given index (excluded) and defined by a maximum size.
- Parameters:
filter
- the subset of rows to paginate through. This object and its dependencies are required to be serializable.
end
- the last row id to fetch (exclusive)
limit
- the maximum number of rows to fetch
- Returns:
- the list of rows with their ids (if any)
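The inclusive and exclusive bounds of getRowsAfter and getRowsBefore can be illustrated with a minimal sketch. It does not use the real Grid: the class RowPagingSketch and its methods are hypothetical stand-ins which model a grid as the row ids 0 to rowCount - 1, and the ascending order of the rows returned by rowsBefore is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in: a grid is modelled as the row ids 0 .. rowCount-1. */
class RowPagingSketch {

    /** Mirrors getRowsAfter(start, limit): ids from start (inclusive), at most limit of them. */
    static List<Long> rowsAfter(long rowCount, long start, int limit) {
        List<Long> ids = new ArrayList<>();
        for (long id = Math.max(0, start); id < rowCount && ids.size() < limit; id++) {
            ids.add(id);
        }
        return ids;
    }

    /** Mirrors getRowsBefore(end, limit): the last ids strictly below end, at most limit of them. */
    static List<Long> rowsBefore(long rowCount, long end, int limit) {
        long upper = Math.min(end, rowCount);
        long lower = Math.max(0, upper - limit);
        List<Long> ids = new ArrayList<>();
        // Assumption: rows are returned in ascending order of id.
        for (long id = lower; id < upper; id++) {
            ids.add(id);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(rowsAfter(10, 7, 5));  // [7, 8, 9]
        System.out.println(rowsBefore(10, 7, 3)); // [4, 5, 6]
    }
}
```

With these semantics, a "next page" request can pass the id following the last row displayed as start, and a "previous page" request can pass the id of the first row displayed as end.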
-
getRows
default List<IndexedRow> getRows(List<Long> rowIndices)
Returns a list of rows corresponding to the row indices supplied. By default, this calls getRow(long) on all values, but implementations can override this with more efficient strategies if available.
- Parameters:
rowIndices
- the indices of the rows to look up
- Returns:
- the list of rows, which contains null values for the row indices which could not be found.
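A minimal sketch of the default strategy described above, one lookup per index with null for the indices that cannot be found. The class GetRowsSketch is hypothetical and models rows as strings in a list. Catching the IndexOutOfBoundsException to produce the null entries is an assumption about how the two documented behaviours fit together, not a description of the actual default method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in: rows are strings and the row index is the list position. */
class GetRowsSketch {

    /** Single-row lookup which throws when the index is unknown, like getRow(long). */
    static String getRow(List<String> rows, long id) {
        if (id < 0 || id >= rows.size()) {
            throw new IndexOutOfBoundsException("no row with index " + id);
        }
        return rows.get((int) id);
    }

    /** One lookup per index; indices which cannot be found yield null in the result. */
    static List<String> getRows(List<String> rows, List<Long> rowIndices) {
        List<String> result = new ArrayList<>(rowIndices.size());
        for (long id : rowIndices) {
            try {
                result.add(getRow(rows, id));
            } catch (IndexOutOfBoundsException e) {
                result.add(null); // assumption: a missing row becomes a null entry
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "b", "c");
        System.out.println(getRows(rows, Arrays.asList(2L, 5L, 0L))); // [c, null, a]
    }
}
```

Callers of such a method need to check each element for null before using it, since the result keeps one entry per requested index.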
-
iterateRows
CloseableIterator<IndexedRow> iterateRows(RowFilter filter)
Iterate over rows matched by a filter, in the order determined by a sorting configuration. This might not require loading all rows in memory at once, but might be less efficient than collectRows() if all rows are to be stored in memory downstream.
-
countMatchingRows
long countMatchingRows(RowFilter filter)
Count the number of rows which match a given filter.
- Parameters:
filter
- the row filter
- Returns:
- the number of rows for which this filter returns true
-
countMatchingRowsApprox
Grid.ApproxCount countMatchingRowsApprox(RowFilter filter, long limit)
Returns the number of rows matching the given row filter, but by processing approximately at most a fixed number of rows.
- Parameters:
filter
- counts the number of rows on which it returns true
limit
- maximum number of rows to process
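The idea of a partial count can be sketched as follows. This is only an illustration under assumptions: the class ApproxCountSketch and its nested Count class are hypothetical stand-ins for Grid.ApproxCount, the sketch processes exactly the first limit rows whereas a real implementation may process a slightly different number, and the extrapolation in estimateTotal is one possible use of the two numbers rather than something the interface provides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class ApproxCountSketch {

    /** Hypothetical mirror of Grid.ApproxCount: rows processed, and rows which matched. */
    static final class Count {
        final long processed;
        final long matched;

        Count(long processed, long matched) {
            this.processed = processed;
            this.matched = matched;
        }
    }

    /** Evaluates the filter on at most `limit` rows, then stops. */
    static <R> Count countApprox(List<R> rows, Predicate<R> filter, long limit) {
        long processed = 0;
        long matched = 0;
        for (R row : rows) {
            if (processed >= limit) {
                break;
            }
            processed++;
            if (filter.test(row)) {
                matched++;
            }
        }
        return new Count(processed, matched);
    }

    /** Extrapolates the matching ratio to the whole grid; this is only an estimate. */
    static long estimateTotal(Count count, long rowCount) {
        if (count.processed == 0) {
            return 0;
        }
        return Math.round((double) count.matched / count.processed * rowCount);
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        Count count = countApprox(rows, n -> n % 2 == 0, 4);
        System.out.println(count.processed + " processed, " + count.matched + " matched"); // 4 processed, 2 matched
        System.out.println(estimateTotal(count, rows.size())); // 5
    }
}
```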
-
collectRows
List<IndexedRow> collectRows()
Returns all rows in a list. This is inefficient for large datasets as it forces the entire grid to be loaded in memory.
-
getRecord
Record getRecord(long id)
Returns a record obtained by its id. Repeatedly calling this method to obtain multiple records might be inefficient depending on the implementation.
- Parameters:
id
- the row id of the first row in the record. This refers to the current position of the record in the grid, which corresponds to Record.getStartRowId().
- Returns:
- the corresponding record
- Throws:
IllegalArgumentException
- if record id could not be found
-
getRecordsAfter
List<Record> getRecordsAfter(long start, int limit)
Returns a list of records, starting from a given index and defined by a maximum size.
- Parameters:
start
- the first record id to fetch (inclusive)
limit
- the maximum number of records to fetch
- Returns:
- the list of records (if any)
- See Also:
getRecordsBefore(long, int)
-
getRecordsAfter
List<Record> getRecordsAfter(RecordFilter filter, long start, int limit)
Among the filtered subset of records, returns a list of records, starting from a given index and defined by a maximum size.
- Parameters:
filter
- the filter which defines the subset of records to paginate through. This object and its dependencies are required to be serializable.
start
- the first record id to fetch (inclusive)
limit
- the maximum number of records to fetch
- Returns:
- the list of records (if any)
-
getRecordsBefore
List<Record> getRecordsBefore(long end, int limit)
Returns a list of consecutive records, ending at a given index (exclusive) and defined by a maximum size.
- Parameters:
end
- the last record id to fetch (exclusive)
limit
- the maximum number of records to fetch
- Returns:
- the list of records (if any)
- See Also:
getRecordsAfter(long, int)
-
getRecordsBefore
List<Record> getRecordsBefore(RecordFilter filter, long end, int limit)
Among the filtered subset of records, returns a list of records, ending at a given index (exclusive) and defined by a maximum size.
- Parameters:
filter
- the filter which defines the subset of records to paginate through. This object and its dependencies are required to be serializable.
end
- the last record id to fetch (exclusive)
limit
- the maximum number of records to fetch
- Returns:
- the list of records (if any)
-
iterateRecords
CloseableIterator<Record> iterateRecords(RecordFilter filter)
Iterate over records matched by a filter. This might not require loading all records in memory at once, but might be less efficient than collectRecords() if all records are to be stored in memory downstream.
-
countMatchingRecords
long countMatchingRecords(RecordFilter filter)
Returns the number of records which match this filter.
- Parameters:
filter
- the filter to evaluate
- Returns:
- the number of records for which this filter evaluates to true
-
countMatchingRecordsApprox
Grid.ApproxCount countMatchingRecordsApprox(RecordFilter filter, long limit)
Returns the number of records matching the given record filter, but by processing approximately at most a fixed number of records.
- Parameters:
filter
- counts the number of records on which it returns true
limit
- maximum number of records to process
-
collectRecords
List<Record> collectRecords()
Returns all records in a list. This is inefficient for large datasets as it forces all records to be loaded in memory.
-
rowCount
long rowCount()
- Returns:
- the number of rows in the table
-
recordCount
long recordCount()
- Returns:
- the number of records in the table
-
getOverlayModels
Map<String,OverlayModel> getOverlayModels()
- Returns:
- the overlay models in this state
-
saveToFile
void saveToFile(File file) throws IOException
Saves the grid to a specified directory, following OpenRefine's format for grid storage.
- Parameters:
file
- the directory in which to save the grid
- Throws:
IOException
-
saveToFileAsync
ProgressingFuture<Void> saveToFileAsync(File file)
Saves the grid to a specified directory, in an asynchronous fashion.
- Parameters:
file
- the directory in which to save the grid
- Returns:
- a future which completes once the save is complete
-
aggregateRows
<T extends Serializable> T aggregateRows(RowAggregator<T> aggregator, T initialState)
Computes the result of a row aggregator on the grid.
-
aggregateRecords
<T extends Serializable> T aggregateRecords(RecordAggregator<T> aggregator, T initialState)
Computes the result of a record aggregator on the grid.
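Aggregation over rows or records amounts to folding every element into a running state, starting from initialState. The sketch below illustrates this with a hypothetical Aggregator interface and rows modelled as strings. It is not the actual RowAggregator or RecordAggregator contract, which may require additional operations.

```java
import java.util.Arrays;
import java.util.List;

class AggregationSketch {

    /** Hypothetical stand-in for an aggregator: folds one row into the running state. */
    interface Aggregator<T> {
        T withRow(T state, long rowId, String row);
    }

    /** Folds all rows into the state, starting from initialState. */
    static <T> T aggregateRows(List<String> rows, Aggregator<T> aggregator, T initialState) {
        T state = initialState;
        for (int i = 0; i < rows.size(); i++) {
            state = aggregator.withRow(state, i, rows.get(i));
        }
        return state;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "", "c", "");
        // Count the blank rows, starting from an initial state of 0.
        Long blanks = aggregateRows(rows, (state, id, row) -> row.isEmpty() ? state + 1 : state, 0L);
        System.out.println(blanks); // 2
    }
}
```

The approximate variants follow the same pattern but stop after a bounded number of elements, which is why their result is wrapped in a Grid.PartialAggregation.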
-
aggregateRowsApprox
<T extends Serializable> Grid.PartialAggregation<T> aggregateRowsApprox(RowAggregator<T> aggregator, T initialState, long maxRows)
Computes the result of a row aggregator on the grid, reading approximately at most a fixed number of rows. The rows read should be deterministic for a given implementation.
-
aggregateRecordsApprox
<T extends Serializable> Grid.PartialAggregation<T> aggregateRecordsApprox(RecordAggregator<T> aggregator, T initialState, long maxRecords)
Computes the result of a record aggregator on the grid, reading approximately at most a fixed number of records. The records read should be deterministic for a given implementation.
-
withOverlayModels
Grid withOverlayModels(Map<String,OverlayModel> overlayModel)
Returns a new grid where the overlay models have changed.
- Parameters:
overlayModel
- the new overlay models to apply to the grid
- Returns:
- the changed grid
-
mapRows
Grid mapRows(RowMapper mapper, ColumnModel newColumnModel)
Returns a new grid, where the rows have been mapped by the mapper.
- Parameters:
mapper
- the function used to transform rows. This object and its dependencies are required to be serializable.
newColumnModel
- the column model of the resulting grid
- Returns:
- the resulting grid
-
flatMapRows
Grid flatMapRows(RowFlatMapper mapper, ColumnModel newColumnModel)
Returns a new grid, where the rows have been mapped by the flat mapper.
- Parameters:
mapper
- the function used to transform rows. This object and its dependencies are required to be serializable.
newColumnModel
- the column model of the resulting grid
- Returns:
- the resulting grid
-
mapRows
<S extends Serializable> Grid mapRows(RowScanMapper<S> mapper, ColumnModel newColumnModel)
Returns a new grid where the rows have been mapped by the stateful mapper. This can be significantly less efficient than a stateless mapper, so only use this if you really need to rely on state.
- Type Parameters:
S
- the type of state kept by the mapper
- Parameters:
mapper
- the mapper to apply to the grid
newColumnModel
- the column model to apply to the new grid
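To illustrate what relying on state means here, the following sketch maps each row using a state accumulated from the rows before it. The class ScanMapSketch and its scanMap method are hypothetical and model rows as strings. They do not reproduce the RowScanMapper interface. The example is a "fill down" operation, which cannot be written as a stateless row mapper because the output for a blank row depends on earlier rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

class ScanMapSketch {

    /**
     * Maps each row with the state accumulated over the preceding rows.
     * feed: (state, row) -> next state; map: (state before the row, row) -> new row.
     */
    static <S> List<String> scanMap(List<String> rows, S initialState,
            BiFunction<S, String, S> feed, BiFunction<S, String, String> map) {
        List<String> result = new ArrayList<>();
        S state = initialState;
        for (String row : rows) {
            result.add(map.apply(state, row));
            state = feed.apply(state, row);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("x", "", "", "y", "");
        // Fill down: the state is the last non-blank value seen so far.
        List<String> filled = scanMap(rows, "",
                (state, row) -> row.isEmpty() ? state : row,
                (state, row) -> row.isEmpty() ? state : row);
        System.out.println(filled); // [x, x, x, y, y]
    }
}
```

Because each output depends on all preceding rows, such a mapping is harder to evaluate independently on separate parts of the grid, which is one plausible reason for the efficiency warning above.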
-
mapRecords
Grid mapRecords(RecordMapper mapper, ColumnModel newColumnModel)
Returns a new grid, where the records have been mapped by the mapper.
- Parameters:
mapper
- the function used to transform records. This object and its dependencies are required to be serializable.
newColumnModel
- the column model of the resulting grid
- Returns:
- the resulting grid
-
reorderRows
Grid reorderRows(SortingConfig sortingConfig, boolean permanent)
Returns a new grid where rows have been reordered according to the configuration supplied.
- Parameters:
sortingConfig
- the criteria to sort rows
permanent
- if true, forget the original row ids. If false, store them in the corresponding IndexedRow.getOriginalIndex().
- Returns:
- the resulting grid
-
reorderRecords
Grid reorderRecords(SortingConfig sortingConfig, boolean permanent)
Returns a new grid where records have been reordered according to the configuration supplied.
- Parameters:
sortingConfig
- the criteria to sort records
permanent
- if true, forget the original record ids. If false, store them in the corresponding Record.getOriginalStartRowId().
- Returns:
- the resulting grid
-
removeRows
Grid removeRows(RowFilter filter)
Removes all rows selected by a filter.
- Parameters:
filter
- which returns true when we should delete the row
- Returns:
- the grid where the matching rows have been removed
-
removeRecords
Grid removeRecords(RecordFilter filter)
Removes all records selected by a filter.
- Parameters:
filter
- which returns true when we should delete the record
- Returns:
- the grid where the matching records have been removed
-
limitRows
default Grid limitRows(long rowLimit)
Only keep the first rows. By default, this uses removeRows(RowFilter) to remove the last rows, but implementations can override this for efficiency.
- Parameters:
rowLimit
- the number of rows to keep
- Returns:
- the limited grid
-
dropRows
default Grid dropRows(long rowsToDrop)
Drop the first rows. By default, this uses removeRows(RowFilter) to remove the first rows, but implementations can override this for efficiency.
- Parameters:
rowsToDrop
- the number of rows to drop
- Returns:
- the grid consisting of the last rows
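The relationship between limitRows, dropRows and removeRows described above can be sketched with a filter on the row index. The class LimitDropSketch is a hypothetical stand-in where rows are strings and the filter is a LongPredicate on the row index. The actual default methods operate on RowFilter objects and may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.LongPredicate;

class LimitDropSketch {

    /** Like removeRows: drops every row whose index is selected by the filter. */
    static List<String> removeRows(List<String> rows, LongPredicate indexFilter) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            if (!indexFilter.test(i)) {
                kept.add(rows.get(i));
            }
        }
        return kept;
    }

    /** Keeps the first rowLimit rows by removing those at index rowLimit and beyond. */
    static List<String> limitRows(List<String> rows, long rowLimit) {
        return removeRows(rows, index -> index >= rowLimit);
    }

    /** Drops the first rowsToDrop rows by removing those below that index. */
    static List<String> dropRows(List<String> rows, long rowsToDrop) {
        return removeRows(rows, index -> index < rowsToDrop);
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(limitRows(rows, 2)); // [a, b]
        System.out.println(dropRows(rows, 2));  // [c, d, e]
    }
}
```

Expressed this way, both operations scan every row, which suggests why an implementation that can slice its storage directly would want to override them.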
-
mapRows
<T> ChangeData<T> mapRows(RowFilter filter, RowChangeDataProducer<T> rowMapper, Optional<ChangeData<T>> incompleteChangeData)
Extract change data by applying a function to each filtered row. The calls to the change data producer are batched if requested by the producer.
- Type Parameters:
T
- the type of change data that is serialized to disk for each row
- Parameters:
filter
- a filter to select which rows to map
rowMapper
- produces the change data for each row
incompleteChangeData
- a previous, incompletely fetched version of the same change data, from which the computation should be resumed, to avoid recomputing the items already in the incomplete change data
- Throws:
IllegalStateException
- if the row mapper returns a batch of results with a different size than the batch of rows it was called on
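The batching contract and the IllegalStateException above can be illustrated with a minimal sketch. The class BatchedProducerSketch is hypothetical: rows and change data are strings, the producer is a plain Function over a batch, and the result is a map from row index to change data rather than a ChangeData object. Resuming from incomplete change data is left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class BatchedProducerSketch {

    /** Calls the producer on consecutive batches and indexes the results by row index. */
    static Map<Long, String> mapRows(List<String> rows, int batchSize,
            Function<List<String>, List<String>> batchProducer) {
        Map<Long, String> changeData = new LinkedHashMap<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            List<String> batch = rows.subList(start, Math.min(start + batchSize, rows.size()));
            List<String> results = batchProducer.apply(batch);
            // The producer must return exactly one result per row of the batch.
            if (results.size() != batch.size()) {
                throw new IllegalStateException(
                        "expected " + batch.size() + " results, got " + results.size());
            }
            for (int i = 0; i < results.size(); i++) {
                changeData.put((long) (start + i), results.get(i));
            }
        }
        return changeData;
    }

    public static void main(String[] args) {
        Function<List<String>, List<String>> upperCase =
                batch -> batch.stream().map(String::toUpperCase).collect(Collectors.toList());
        System.out.println(mapRows(Arrays.asList("a", "b", "c"), 2, upperCase)); // {0=A, 1=B, 2=C}
    }
}
```

Batching of this kind is typically useful when the producer calls an external service that accepts several items per request.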
-
mapRecords
<T> ChangeData<T> mapRecords(RecordFilter filter, RecordChangeDataProducer<T> recordMapper, Optional<ChangeData<T>> incompleteChangeData)
Extract change data by applying a function to each filtered record. The calls to the change data producer are batched if requested by the producer.
- Type Parameters:
T
- the type of change data that is serialized to disk for each record
- Parameters:
filter
- a filter to select which records to map
recordMapper
- produces the change data for each record
incompleteChangeData
- a previous, incompletely fetched version of the same change data, from which the computation should be resumed, to avoid recomputing the items already in the incomplete change data
- Throws:
IllegalStateException
- if the record mapper returns a batch of results with a different size than the batch of records it was called on
-
join
<T> Grid join(ChangeData<T> changeData, RowChangeDataJoiner<T> rowJoiner, ColumnModel newColumnModel)
Joins pre-computed change data with the current grid data, row by row.
- Type Parameters:
T
- the type of change data that was serialized to disk for each row
- Parameters:
changeData
- the serialized change data
rowJoiner
- produces the new row by joining the old row with change data
newColumnModel
- the column model to apply to the new grid
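A row-by-row join can be pictured as looking up the change data for each row index and handing both to the joiner. The sketch below uses a hypothetical JoinSketch class with rows as strings and change data as a map keyed by row index. Passing null to the joiner for rows without change data is an assumption made for this illustration, not documented behaviour of RowChangeDataJoiner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

class JoinSketch {

    /** Produces one new row per original row by combining it with its change data. */
    static List<String> join(List<String> rows, Map<Long, String> changeData,
            BiFunction<String, String, String> rowJoiner) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            // Assumption: rows without change data are joined with null.
            String data = changeData.get((long) i);
            result.add(rowJoiner.apply(rows.get(i), data));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, String> changeData = new HashMap<>();
        changeData.put(0L, "X");
        changeData.put(2L, "Z");
        List<String> joined = join(Arrays.asList("a", "b", "c"), changeData,
                (row, data) -> data == null ? row : row + ":" + data);
        System.out.println(joined); // [a:X, b, c:Z]
    }
}
```

Separating the computation of the change data from the join allows expensive results, for instance those fetched from a remote service, to be stored once and reapplied without being recomputed.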
-
join
<T> Grid join(ChangeData<T> changeData, RowChangeDataFlatJoiner<T> rowJoiner, ColumnModel newColumnModel)
Joins pre-computed change data with the current grid data, with a joiner function that can return multiple rows for a given original row.
- Type Parameters:
T
- the type of change data that was serialized to disk for each row
- Parameters:
changeData
- the serialized change data
rowJoiner
- produces the new rows by joining the old row with change data
newColumnModel
- the column model to apply to the new grid
-
join
<T> Grid join(ChangeData<T> changeData, RecordChangeDataJoiner<T> recordJoiner, ColumnModel newColumnModel)
Joins pre-computed change data with the current grid data, record by record.
- Type Parameters:
T
- the type of change data that was serialized to disk for each record
- Parameters:
changeData
- the serialized change data
recordJoiner
- produces the new list of rows by joining the old record with change data
newColumnModel
- the column model to apply to the new grid
-
concatenate
Grid concatenate(Grid other)
Creates a new grid containing all rows in this grid, followed by all rows in the other grid supplied. The overlay models of this grid have priority over the others. The two grids are required to have the same number of columns.
- Parameters:
other
- the grid to concatenate to this one
- Returns:
- a new grid, union of the two
-
concatenate
default Grid concatenate(List<Grid> otherGrids)
Concatenates this with other grids, in the given order. This is a variant of concatenate(Grid) which implementations can override to make it more efficient than repeated calls to concatenate(Grid) (which is the default implementation).
- Parameters:
otherGrids
- the list of other grids to concatenate with this one.
- Returns:
- a new grid, union of all those grids
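The default behaviour described above, repeated pairwise concatenation, can be sketched as a simple loop. The class ConcatenateSketch is hypothetical and models a grid as a list of string rows. Overlay models and the column count requirement are ignored here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ConcatenateSketch {

    /** Pairwise variant: all rows of the first grid, followed by all rows of the other. */
    static List<String> concatenate(List<String> first, List<String> other) {
        List<String> result = new ArrayList<>(first);
        result.addAll(other);
        return result;
    }

    /** List variant, written as repeated calls to the pairwise variant. */
    static List<String> concatenateAll(List<String> first, List<List<String>> others) {
        List<String> result = first;
        for (List<String> other : others) {
            result = concatenate(result, other);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> others = Arrays.asList(Arrays.asList("b", "c"), Arrays.asList("d"));
        System.out.println(concatenateAll(Arrays.asList("a"), others)); // [a, b, c, d]
    }
}
```

In this sketch every pairwise step copies the accumulated rows again, which hints at why an implementation might override the list variant to build the result in a single pass.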
-
isCached
boolean isCached()
Is this grid cached in memory? If not, its contents are stored on disk.
-
uncache
void uncache()
Free up any memory used to cache this grid.
-
cache
boolean cache()
Attempt to cache this grid in memory. If the grid is too big, this can fail.
- Returns:
- whether the grid was actually cached in memory.
-
cacheAsync
ProgressingFuture<Boolean> cacheAsync()
Attempt to cache this grid in memory, asynchronously.
- Returns:
- a future to keep track of the status of the caching process. The future returns whether the caching succeeded.
-
-