Package org.openrefine.runners.local.pll
Class PairPLL<K,V>
- java.lang.Object
-
- org.openrefine.runners.local.pll.PLL<Tuple2<K,V>>
-
- org.openrefine.runners.local.pll.PairPLL<K,V>
-
- Type Parameters:
K
- the type of keys in the collectionV
- the type of values in the collection
public class PairPLL<K,V> extends PLL<Tuple2<K,V>>
Adds additional methods for PLLs of keyed collections. The supplied Partitioner determines in which partition an element should be found, depending on its key.- Author:
- Antonin Delpeuch
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.openrefine.runners.local.pll.PLL
PLL.LastFlush, PLL.PLLExecutionError
-
-
Field Summary
Fields Modifier and Type Field Description protected Optional<Partitioner<K>>
partitioner
protected PLL<Tuple2<K,V>>
pll
-
Fields inherited from class org.openrefine.runners.local.pll.PLL
cachedPartitions, context, id, name
-
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description static <T> PairPLL<Long,T>
assumeIndexed(PairPLL<Long,T> pairPLL, long totalRowCount)
Assuming that the keys of the PairPLL are indices, deduce the partition sizes from the first element of each partition and the total number of elements, creating an appropriate RangePartitioner and adding it to the PLL.static <T> PairPLL<Long,T>
assumeSorted(PairPLL<Long,T> pairPLL)
Assumes that a PLL is sorted by key and derive the appropriate partitioner for it.ProgressingFuture<Void>
cacheAsync()
Prevent double caching from this PairPLL wrapper and the underlying PLL by defering the caching to the child.protected CloseableIterator<Tuple2<K,V>>
compute(Partition partition)
Iterate over the elements of the given partition.io.vavr.collection.Array<Long>
computePartitionSizes()
PairPLL<K,V>
dropFirstElements(long n)
Drops the first n elements at the beginning of the collection.PairPLL<K,V>
dropLastElements(long n)
Drops the first n elements at the end of the collection.PairPLL<K,V>
filter(Predicate<? super Tuple2<K,V>> predicate)
Derives a new PLL by filtering the collection to only contain elements which match the supplied predicate.<W> PairPLL<K,Tuple2<V,W>>
fullJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the full (outer) join of both PLLs.protected static <K,V>
List<Tuple2<K,V>>gatherElementsBefore(K upperBound, int limit, CloseableIterator<Tuple2<K,V>> stream, Comparator<K> comparator)
Returns the last n elements whose key is strictly less than the supplied upper bound.io.vavr.collection.Array<V>
get(K key)
Returns the list of elements of the PLL indexed by the given key.io.vavr.collection.Array<Tuple2<K,V>>
getByKeys(Set<K> keys)
Returns the list of elements whose keys match one of the supplied keys.List<PLL<?>>
getParents()
Returns directly the parents of the underlying PLL of this PairPLL, so that the PairPLL wrapper is transparent in query trees.Optional<Partitioner<K>>
getPartitioner()
io.vavr.collection.Array<? extends Partition>
getPartitions()
QueryTree
getQueryTree()
List<Tuple2<K,V>>
getRangeAfter(K from, int limit, Comparator<K> comparator)
Returns the list of elements starting at the given key and for up to the given number of elements.List<Tuple2<K,V>>
getRangeBefore(K upperBound, int limit, Comparator<K> comparator)
Returns the list of elements ending at the given key (excluded) and for up to the given number of elements.boolean
hasCachedPartitionSizes()
Is this PLL aware of the size of its partitions?<W> PairPLL<K,Tuple2<V,W>>
innerJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the inner join of both PLLs.boolean
isCached()
Prevent double caching from this PairPLL wrapper and the underlying PLL by defering the caching to the child.CloseableIterator<Tuple2<K,V>>
iterate(Partition partition)
Iterate over the elements of the given partition.PLL<K>
keys()
Returns a PLL of the keys contained in this collection, in the same order.<W> PairPLL<K,Tuple2<V,W>>
leftJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the left join of both PLLs.PairPLL<K,V>
limitPartitions(long limit)
Limit each partition to contain only their first N elements.<W> PairPLL<K,W>
mapValues(BiFunction<K,V,W> mapFunction, String mapDescription)
Returns a PLL obtained by mapping each element and preserving the indexing.PairPLL<K,V>
retainPartitions(List<Integer> partitionIndices)
Only retain partitions designated by the given list of indices.<W> PairPLL<K,Tuple2<V,W>>
rightJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the right join of both PLLs.CloseableIterator<Tuple2<K,V>>
streamBetweenKeys(Optional<K> from, Optional<K> upTo, Comparator<K> comparator)
Iterates over the elements of this PLL between the given keys, where both boundaries are optional.CloseableIterator<Tuple2<K,V>>
streamBetweenKeys(K from, K upTo, Comparator<K> comparator)
Iterates over the elements of this PLL from the given key (inclusive) and up to the other given key (exclusive).CloseableIterator<Tuple2<K,V>>
streamFromKey(K from, Comparator<K> comparator)
Iterates over the elements of this PLL starting from a given key (inclusive).CloseableIterator<Tuple2<K,V>>
streamUpToKey(K upTo, Comparator<K> comparator)
Iterates over the elements of this PLL up to the given key (exclusive).PLL<Tuple2<K,V>>
toPLL()
Returns the underlying PLL, discarding any partitioner.String
toString()
void
uncache()
Unloads the partition contents from memoryPLL<V>
values()
Returns a PLL of the values contained in this collection, in the same order.PairPLL<K,V>
withCachedPartitionSizes(io.vavr.collection.Array<Long> newCachedPartitionSizes)
Returns a copy of this PairPLL with the given partition sizes, when they are externally known.PairPLL<K,V>
withPartitioner(Optional<Partitioner<K>> partitioner)
Returns a copy of this PairPLL with a changed partitioner.-
Methods inherited from class org.openrefine.runners.local.pll.PLL
aggregate, batchPartitions, collect, collectPartitionsAsync, concatenate, concatenate, count, flatMap, getContext, getId, getPartitionSizes, isEmpty, iterateFromPartition, iterator, map, mapPartitions, mapToPair, mapToPair, numPartitions, runOnPartitions, runOnPartitions, runOnPartitionsAsync, runOnPartitionsAsync, runOnPartitionsWithoutInterruption, runOnPartitionsWithoutInterruption, saveAsTextFile, saveAsTextFileAsync, scanMap, scanMapStream, sort, take, writeOriginalPartition, writePartition, writePlannedPartition, zipWithIndex
-
-
-
-
Method Detail
-
getPartitioner
public Optional<Partitioner<K>> getPartitioner()
- Returns:
- the partitioner used to locate elements by key
-
computePartitionSizes
public io.vavr.collection.Array<Long> computePartitionSizes()
- Overrides:
computePartitionSizes
in classPLL<Tuple2<K,V>>
-
hasCachedPartitionSizes
public boolean hasCachedPartitionSizes()
Description copied from class:PLL
Is this PLL aware of the size of its partitions?- Overrides:
hasCachedPartitionSizes
in classPLL<Tuple2<K,V>>
-
keys
public PLL<K> keys()
Returns a PLL of the keys contained in this collection, in the same order.- Returns:
-
values
public PLL<V> values()
Returns a PLL of the values contained in this collection, in the same order.- Returns:
-
mapValues
public <W> PairPLL<K,W> mapValues(BiFunction<K,V,W> mapFunction, String mapDescription)
Returns a PLL obtained by mapping each element and preserving the indexing. If a partitioner is set on this PLL, it will also be used by the returned PLL.- Type Parameters:
W
-- Parameters:
mapFunction
- the function to apply on each elementmapDescription
- a short description of the function, for debugging purposes- Returns:
-
get
public io.vavr.collection.Array<V> get(K key)
Returns the list of elements of the PLL indexed by the given key. This operation will be more efficient if a partitioner is available, making it possible to scan the relevant partition only. Furthermore, if the PLL is known to be sorted (which happens when it bears aRangePartitioner
), then we only scan the relevant partition up to the given key and do not scan the greater keys.
-
getRangeAfter
public List<Tuple2<K,V>> getRangeAfter(K from, int limit, Comparator<K> comparator)
Returns the list of elements starting at the given key and for up to the given number of elements. This assumes that the PLL is sorted by keys.- Parameters:
from
- the first key to returnlimit
- the maximum number of elements to returncomparator
- the ordering on K that is assumed on the PLL- Returns:
-
getRangeBefore
public List<Tuple2<K,V>> getRangeBefore(K upperBound, int limit, Comparator<K> comparator)
Returns the list of elements ending at the given key (excluded) and for up to the given number of elements. This assumes that the PLL is sorted by keys.- Parameters:
upperBound
- the least key not to returnlimit
- the maximum number of elements to returncomparator
- the ordering on K that is assumed on the PLL- Returns:
- TODO this could be optimized further, when the partitions are cached in memory, we can iterate from them in reverse
-
gatherElementsBefore
protected static <K,V> List<Tuple2<K,V>> gatherElementsBefore(K upperBound, int limit, CloseableIterator<Tuple2<K,V>> stream, Comparator<K> comparator)
Returns the last n elements whose key is strictly less than the supplied upper bound.- Parameters:
stream
- the stream to take the elements from, which is assumed to be in increasing order
-
getByKeys
public io.vavr.collection.Array<Tuple2<K,V>> getByKeys(Set<K> keys)
Returns the list of elements whose keys match one of the supplied keys.- Parameters:
keys
- the keys to look up- Returns:
- the list of elements in the order they appear in the PLL
-
streamFromKey
public CloseableIterator<Tuple2<K,V>> streamFromKey(K from, Comparator<K> comparator)
Iterates over the elements of this PLL starting from a given key (inclusive). This assumes that the PLL is sorted.- Parameters:
from
- the first key to start iterating fromcomparator
- the order used to compare the keys- Returns:
- a streams which starts on the first element whose key is greater or equal to the provided one
-
streamUpToKey
public CloseableIterator<Tuple2<K,V>> streamUpToKey(K upTo, Comparator<K> comparator)
Iterates over the elements of this PLL up to the given key (exclusive). This assumes that the PLL is sorted.- Parameters:
upTo
- the key to stop iterating atcomparator
- the order used to compare the keys- Returns:
-
streamBetweenKeys
public CloseableIterator<Tuple2<K,V>> streamBetweenKeys(K from, K upTo, Comparator<K> comparator)
Iterates over the elements of this PLL from the given key (inclusive) and up to the other given key (exclusive). This assumes that the PLL is sorted.- Parameters:
from
- the key to start iterating fromupTo
- the key to stop iterating atcomparator
- the order used to compare the keys- Returns:
-
streamBetweenKeys
public CloseableIterator<Tuple2<K,V>> streamBetweenKeys(Optional<K> from, Optional<K> upTo, Comparator<K> comparator)
Iterates over the elements of this PLL between the given keys, where both boundaries are optional. This assumes that the PLL is sorted.- Parameters:
from
-upTo
-comparator
-- Returns:
-
filter
public PairPLL<K,V> filter(Predicate<? super Tuple2<K,V>> predicate)
Description copied from class:PLL
Derives a new PLL by filtering the collection to only contain elements which match the supplied predicate. The predicate is evaluated lazily, so it can be called multiple times on the same element, depending on the actions called on the returned PLL.
-
limitPartitions
public PairPLL<K,V> limitPartitions(long limit)
Description copied from class:PLL
Limit each partition to contain only their first N elements.- Overrides:
limitPartitions
in classPLL<Tuple2<K,V>>
- Parameters:
limit
- the maximum number of items per partition- Returns:
-
retainPartitions
public PairPLL<K,V> retainPartitions(List<Integer> partitionIndices)
Description copied from class:PLL
Only retain partitions designated by the given list of indices. The other partitions are dropped from the PLL. The partitions in the resulting PLL are ordered according to the list of indices supplied.- Overrides:
retainPartitions
in classPLL<Tuple2<K,V>>
- Parameters:
partitionIndices
- the indices of the partitions to retain
-
dropFirstElements
public PairPLL<K,V> dropFirstElements(long n)
Drops the first n elements at the beginning of the collection. This also adapts any partitioner to work on the cropped collection.- Overrides:
dropFirstElements
in classPLL<Tuple2<K,V>>
- Parameters:
n
- the number of elements to remove- Returns:
-
dropLastElements
public PairPLL<K,V> dropLastElements(long n)
Drops the first n elements at the end of the collection. This also adapts any partitioner to work on the cropped collection.- Overrides:
dropLastElements
in classPLL<Tuple2<K,V>>
- Parameters:
n
- the number of elements to remove- Returns:
-
compute
protected CloseableIterator<Tuple2<K,V>> compute(Partition partition)
Description copied from class:PLL
Iterate over the elements of the given partition. This is the method that should be implemented by subclasses. As this method forces computation, ignoring any caching, consumers should not call it directly but rather usePLL.iterate(Partition)
. Once the iterator is not needed anymore, it should be closed. This makes it possible to release the underlying resources supporting it, such as open files or sockets.
-
iterate
public CloseableIterator<Tuple2<K,V>> iterate(Partition partition)
Description copied from class:PLL
Iterate over the elements of the given partition. If the contents of this PLL have been cached, this will iterate from the cache instead.
-
getPartitions
public io.vavr.collection.Array<? extends Partition> getPartitions()
- Specified by:
getPartitions
in classPLL<Tuple2<K,V>>
- Returns:
- the partitions in this list
-
assumeIndexed
public static <T> PairPLL<Long,T> assumeIndexed(PairPLL<Long,T> pairPLL, long totalRowCount)
Assuming that the keys of the PairPLL are indices, deduce the partition sizes from the first element of each partition and the total number of elements, creating an appropriate RangePartitioner and adding it to the PLL.
If the total row count is unknown (negative) then partition sizes are not inferred.
Note: this method is static for type-checking purposes (it can only apply to a PairPLL keyed by Long).- Parameters:
pairPLL
- the new PLL with the derived partitionertotalRowCount
- the known row count- Returns:
-
assumeSorted
public static <T> PairPLL<Long,T> assumeSorted(PairPLL<Long,T> pairPLL)
Assumes that a PLL is sorted by key and derive the appropriate partitioner for it.- Type Parameters:
T
-- Parameters:
pairPLL
-- Returns:
-
withPartitioner
public PairPLL<K,V> withPartitioner(Optional<Partitioner<K>> partitioner)
Returns a copy of this PairPLL with a changed partitioner.- Parameters:
partitioner
-- Returns:
-
withCachedPartitionSizes
public PairPLL<K,V> withCachedPartitionSizes(io.vavr.collection.Array<Long> newCachedPartitionSizes)
Returns a copy of this PairPLL with the given partition sizes, when they are externally known.- Overrides:
withCachedPartitionSizes
in classPLL<Tuple2<K,V>>
- Returns:
-
getParents
public List<PLL<?>> getParents()
Returns directly the parents of the underlying PLL of this PairPLL, so that the PairPLL wrapper is transparent in query trees. This is because the PairPLL class is a bare wrapper on top of PLL for type-checking purposes, such as to hold a partitioner, without adding any computation on top of the underlying PLL.- Specified by:
getParents
in classPLL<Tuple2<K,V>>
- See Also:
PLL.getQueryTree()
-
getQueryTree
public QueryTree getQueryTree()
- Overrides:
getQueryTree
in classPLL<Tuple2<K,V>>
- Returns:
- a tree-based representation of the dependencies of this PLL.
-
isCached
public boolean isCached()
Prevent double caching from this PairPLL wrapper and the underlying PLL by defering the caching to the child.
-
cacheAsync
public ProgressingFuture<Void> cacheAsync()
Prevent double caching from this PairPLL wrapper and the underlying PLL by defering the caching to the child.- Overrides:
cacheAsync
in classPLL<Tuple2<K,V>>
-
uncache
public void uncache()
Description copied from class:PLL
Unloads the partition contents from memory
-
innerJoinOrdered
public <W> PairPLL<K,Tuple2<V,W>> innerJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the inner join of both PLLs. This resulting PLL is partitioned with the same partitioner as the left PLL (the instance on which this method is called).
-
leftJoinOrdered
public <W> PairPLL<K,Tuple2<V,W>> leftJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the left join of both PLLs. This resulting PLL is partitioned with the same partitioner as the left PLL (the instance on which this method is called).
-
rightJoinOrdered
public <W> PairPLL<K,Tuple2<V,W>> rightJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the right join of both PLLs. This resulting PLL is partitioned with the same partitioner as the left PLL (the instance on which this method is called).
-
fullJoinOrdered
public <W> PairPLL<K,Tuple2<V,W>> fullJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the full (outer) join of both PLLs. This resulting PLL is partitioned with the same partitioner as the left PLL (the instance on which this method is called).
-
-